Creating A Module

Adam Hooper edited this page Apr 6, 2021 · 35 revisions

Workbench comes with many modules for loading data, cleaning it, visualizing it, etc. But it's also a "package manager" for all those little pieces of code that are necessary to do data work. You can create your own modules with Python, and they can optionally include JavaScript to produce embeddable visualizations or custom UI elements.

Quickstart

  • Clone the Hello Workbench module
  • Clone the main workbench repo into a sibling directory and set up a Workbench development environment
  • Fire up Workbench with bin/dev start
  • Watch the module directory with bin/dev develop-module hello-workbench. This will re-import the module whenever you make any changes.
  • Browse to http://localhost:8000 to use Workbench and try out your module
  • To publish your module, email adam@adamhooper.com with a link to your GitHub repository.

Module Development Workflow

First, set up a development environment.

Start your local Workbench server: bin/dev start

Next, create a new directory (at the same level as the cjworkbench directory, a sibling to it) called modulename. Add these files:

  • README.md -- optional but highly recommended
  • LICENSE -- optional but recommended
  • [modulename].py -- Python code, including def render(table, params) function
  • [modulename].json -- JSON file. Can also be YAML instead.
  • [modulename].html -- optional, needed if producing a visualization

There must be at least two files in your repo: a JSON (or YAML) configuration file which defines module metadata and parameters, and a Python script which does the actual data processing. You can also add an HTML file, optionally with inline JavaScript, which produces output in the right pane, as Workbench's built-in charts do.

We recommend you also write tests for your new module.

In a shell in the cjworkbench directory, start a process that watches that directory for changes and auto-imports the module into the running Workbench: bin/dev develop-module modulename

Now, edit the module's code. Every time you save, the module will reload in Workbench. To see changes to HTML and JSON, refresh the page. To see changes to Python, refresh the page and trigger a render() by changing a parameter.

When developing a module, we recommend running bin/dev develop-module as described above. You can also import directly from a GitHub repo into your local Workbench instance with the "Import Module" command in the upper right ("three dots") menu. This command only appears for admin users, which means you can't use it on our production server. See below for how to submit your module for inclusion on workbenchdata.com.

Here are some examples of existing Workbench modules:

Module API Reference

The Module Specification JSON (or YAML) file

The module specification is required for every module. It defines metadata including the module name, an internal unique identifier, and all parameters that are displayed in the module UI and provided to the fetch() and/or render() functions.

For example, if you were creating a "Search and Replace" module that modifies all values in a text column, you might write a replace.yaml file like this:

id_name: replace
name: Search and Replace
category: Clean
description: Search for text and replace it with something else.
help_url: "https://mymodules.com/docs/search-and-replace"
parameters:
- id_name: column
  type: column  # show the user a Column selector
  name: Search in Column  # The <label> text
  column_types: [ text ]
- id_name: search
  type: string  # show the user a text box
  name: Search for
- id_name: replace
  type: string
  name: Replace with

This module has three parameters: a column and two strings.

You could also write this file as JSON, in replace.json. YAML and JSON represent the same data in different ways. JSON is simpler; YAML is more terse and allows comments.
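To make the equivalence concrete, here is a minimal sketch that builds the same specification as a Python dict and prints the JSON you might save as replace.json. The dict mirrors the YAML example above; using json.dumps() to produce the file contents is just one convenient way to do it, not something Workbench requires.

```python
import json

# The same specification as the replace.yaml example, as a plain Python dict.
spec = {
    "id_name": "replace",
    "name": "Search and Replace",
    "category": "Clean",
    "description": "Search for text and replace it with something else.",
    "help_url": "https://mymodules.com/docs/search-and-replace",
    "parameters": [
        {
            "id_name": "column",
            "type": "column",
            "name": "Search in Column",
            "column_types": ["text"],
        },
        {"id_name": "search", "type": "string", "name": "Search for"},
        {"id_name": "replace", "type": "string", "name": "Replace with"},
    ],
}

# JSON has no comments, so the YAML comments are simply dropped.
replace_json = json.dumps(spec, indent=2)
print(replace_json)
```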

The module specification in detail

All modules must define the following keys

  • name - The user-visible name of the module
  • id_name - An internal unique identifier. It must never change, or currently applied modules will break.
  • category - The category the module appears under in the Module Library

The following keys are optional but recommended:

  • description - An optional one-line description used to help users search for modules
  • help_url - An optional link to a help page describing how to use the module
  • icon - Must be one of a set of internal icons; see other module JSON files for options
  • loads_data - when true (default is false), this module may appear as the first step in a Tab (and its render() function must handle None as an input)

Parameters

Each parameter must define the following keys:

  • name - User visible <label> text
  • id_name - Internal unique identifier. Must not change, or Workbench will think it's a brand new parameter. However, different modules can use the same id_name.

They can have several optional keys:

  • default - The initial value of the parameter.
  • placeholder - The text that appears when the parameter field is empty, or no column is selected.
  • visible_if - Hides or shows this parameter based on the value of a menu or checkbox parameter.

The visible_if key is a JSON object (its content is inside braces) which itself has the following keys:

  • id_name - Which parameter controls the visibility of this parameter. It must be a menu or checkbox.
  • values - A list of menu values separated by |, or true or false for a checkbox
  • invert - Optional. If set to true, the parameter is visible if the controlling parameter does not have one of the values.
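
The visible_if rules can be sketched in Python. This is only an illustration of the semantics described above, not Workbench's actual implementation; the function name is_visible and the "format" menu parameter are hypothetical.

```python
def is_visible(visible_if, params):
    """Illustrative only: decide whether a parameter should be shown."""
    controlling_value = params[visible_if["id_name"]]
    values = visible_if["values"]

    if isinstance(values, bool):
        # Checkbox: `values` is true or false
        matches = controlling_value == values
    else:
        # Menu: `values` is a string of menu values separated by |
        matches = controlling_value in values.split("|")

    # `invert` flips the result: visible when the value does NOT match
    if visible_if.get("invert", False):
        return not matches
    return matches


# Hypothetical rule: show this parameter only when the "format" menu is csv or tsv
rule = {"id_name": "format", "values": "csv|tsv"}
print(is_visible(rule, {"format": "csv"}))
print(is_visible(rule, {"format": "json"}))
print(is_visible({**rule, "invert": True}, {"format": "json"}))
```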

Some parameter types also support custom flags; see below.

Parameter types

Workbench currently supports the following parameter types:

  • string - A text value. Can have multiline set to true to allow newlines.
  • integer - An integer value.
  • float - A decimal value.
  • column - Allows the user to select a column. Can have column_types (Array of "text", "datetime" and/or "number") to help the user select correctly-typed columns; only columns of those types will be passed to render() and fetch().
  • multicolumn - Allows the user to select multiple columns. Can also have column_types.
  • menu and radio - Enumerated values. You must supply options, an Array of Objects with value and label keys; the user will see labels and select a value to pass to render().
  • checkbox - A boolean.
  • timezone - an IANA timezone database ID (String), such as "UTC" or "America/New_York"
  • statictext - Just shows the name as text, has no parameter value. Useful to explain to the user what to do.

The Module Python file

The Python file may contain a single function called render.

render()

Write a render function that accepts two parameters: a Pandas DataFrame and a dictionary of parameter values. It should return a DataFrame. For instance:

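def render(table, params):
    s = params['search']
    r = params['replace']
    col = params['column']

    if col is None:
        # No column selected: replace matching values across the whole table
        return table.replace(s, r)
    else:
        # Replace within the selected column only, then return the whole table
        table[col] = table[col].replace(s, r)
        return table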

Tips:

  • Your module will be rendered as soon as the user adds it. It's a good idea to return the input unchanged if no parameters are set and some parameters are needed, so the user isn't greeted with an error message.
  • You can produce an error message by returning a str. You can produce a warning by returning a tuple of (pd.DataFrame(...), str).
  • The null table (pd.DataFrame(), or None or just returning a str) is special: Workbench won't render it and won't feed it to any other modules in the workflow. This is different from a zero-row table (e.g., an empty "filter" module result), which Workbench will treat as a normal table.
  • It is safe for your module to modify the input table.
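
A minimal sketch of these tips, reusing the parameter names from the "Search and Replace" example. The specific checks (an unset column or search string, a missing column, no matching values) are assumptions chosen to illustrate each kind of return value.

```python
import pandas as pd


def render(table, params):
    col = params['column']
    search = params['search']

    # Parameters not set yet: return the input unchanged instead of an error
    if not col or not search:
        return table

    # Returning a str produces an error message
    if col not in table.columns:
        return 'Column "%s" does not exist' % col

    # Returning (DataFrame, str) produces a warning alongside the table
    if not (table[col] == search).any():
        return (table, 'No values matched "%s"' % search)

    # It is safe to modify the input table in place
    table[col] = table[col].replace(search, params['replace'])
    return table
```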

Optional render() keyword arguments

Workbench will supply more data when calling render() if you ask for it. At the moment you can get two additional pieces of data: the types and formats of all input columns, and the result of your previous fetch.

input_columns

def render(table, params, *, input_columns):
    # now, `input_columns` is a Dict mapping name to RenderColumn.
    return table  # placeholder body so the example is valid Python

Dictionary, keyed by table column name. Each column has these properties:

  • .name: the key
  • .type: one of "number", "text" or "datetime"
  • .format: (for "number" type only) A subset of the Python format specification. For instance: ${:,d}M means "dollar-sign prefix" ($), "thousands separator" (,), "cast as integer" (d) and "M suffix" (M).
    • The default format is {:,} -- which usually renders as decimal of arbitrary precision
    • Whole numbers -- even floating-point numbers -- are cast to int before being passed to Python's format() function, because format strings are supported in JavaScript and JavaScript has only one Number type.
    • Common, useful formats: {:,.1%} (percentage), {:,.2f} (fixed point notation).
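
To see what these formats produce, you can try them with Python's built-in str.format(). This only demonstrates the format strings themselves; how Workbench applies them internally is not shown here.

```python
# The formats mentioned above, applied to sample values
dollars = '${:,d}M'.format(1234)
default = '{:,}'.format(1234.5)
percent = '{:,.1%}'.format(0.1234)
fixed = '{:,.2f}'.format(1234.5)

# The `d` format rejects floats, which is why whole numbers
# are cast to int before formatting
whole = '${:,d}M'.format(int(3.0))

for text in (dollars, default, percent, fixed, whole):
    print(text)
```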

fetch_result

def render(table, params, *, fetch_result):
    # fetch_result will be whatever your fetch() previously returned
    return table  # placeholder body; a real module would use fetch_result here

This is useful if you want to do some parameter-dependent parsing on fetch results, e.g. letting the user specify what part of the retrieved data they want.
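
As a rough sketch, suppose fetch() returned a DataFrame and the module has a hypothetical integer parameter max_rows. Both are assumptions made for illustration; the shape of fetch_result depends entirely on what your own fetch() returns.

```python
import pandas as pd


def render(table, params, *, fetch_result):
    # Assumption: nothing has been fetched yet, so pass the input through
    if fetch_result is None:
        return table

    # Assumption: fetch() returned a DataFrame. Keep only as many rows
    # as the hypothetical `max_rows` parameter asks for.
    return fetch_result.head(params['max_rows'])
```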

Writing a module that imports data

Don't query remote APIs in your render method. Instead, do it in a fetch. The user controls when fetches happen. Workbench stores old versions so the user can revisit them.

For instance:

import pandas as pd
import datetime

def fetch(params):
    return pd.DataFrame({'time': [datetime.datetime.now()]}, dtype='datetime64[ns]')

If this Fetch module can be the first module in a Tab, add loads_data: true to its JSON/YAML file.

You will also need to add a version selection parameter to your JSON/YAML file, like this:

- id_name: version_select
  type: custom
  name: Update  # or whatever you want the button to say

Tips:

  • Fetch happens when the user requests it. The user may set up a timer to fetch periodically.
  • Workbench keeps previous fetch results, as long as they fit the user's storage quota.
  • You may return None to tell Workbench not to store any result.
  • You may accept some keyword arguments: for instance, def fetch(params, *, secrets, get_input_dataframe). The full listing:
    • secrets: dict of secrets
    • settings: namespace with maximum column lengths and such
    • get_input_dataframe: async callback that returns the output of the previous module, or None if the previous module isn't rendered or if we're the first module. (Example usage: input_dataframe = asyncio.run(get_input_dataframe()))
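
A minimal sketch of a fetch() that uses the get_input_dataframe keyword argument. The row-counting logic and the fake_get_input_dataframe stand-in are hypothetical; the stand-in exists only so the function can be tried outside Workbench.

```python
import asyncio

import pandas as pd


def fetch(params, *, get_input_dataframe):
    # get_input_dataframe is an async callback, so run it to completion
    input_dataframe = asyncio.run(get_input_dataframe())

    if input_dataframe is None:
        # No rendered previous module: tell Workbench not to store a result
        return None

    # Hypothetical processing: record how many rows the previous module produced
    return pd.DataFrame({'input_rows': [len(input_dataframe)]})


# Stand-in for the callback Workbench would supply (local experimentation only)
async def fake_get_input_dataframe():
    return pd.DataFrame({'A': [1, 2, 3]})


result = fetch({}, get_input_dataframe=fake_get_input_dataframe)
print(result)
```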

migrate_params()

Congratulations, people are using your module! Now you want to add a feature or fix a bug. The parameters you chose for version 1 of the plugin won't work any more. How do you deploy version 2 of your plugin, with version-2 parameters? You'll need a way to "migrate" the version-1 parameters that are out in the wild. Enter migrate_params().

This is a function you must create the first time you change, add or delete a parameter. It accepts a dict argument (the params as they are in the wild -- created by any previously-deployed version of your module) and it returns a dict (the params your module JSON/YAML specifies). Think long-term -- the params in the wild could be out there for years to come. We like this pattern because it makes code append-only:

def migrate_params(params):
    if _are_params_v1(params):
        params = _migrate_params_v1_to_v2(params)
    if _are_params_v2(params):
        params = _migrate_params_v2_to_v3(params)
    ...
    return params

# the helper functions might look like this:
def _are_params_v1(params):
    # In this example, we're nixing an old parameter and replacing it 
    return 'param_from_v1_but_not_v2' in params

def _migrate_params_v1_to_v2(params):
    # v1 had a 'param_from_v1_but_not_v2' that was an int.
    #
    # v2 has 'v2_value' instead, which is boolean.
    ret = dict(params)  # copy
    del ret['param_from_v1_but_not_v2']  # delete param that isn't in v2
    ret['v2_value'] = params['param_from_v1_but_not_v2'] != 0  # add v2 param
    return ret

This param-migration code must last forever; and it must handle all sets of parameters that may ever have been produced by users. Code your helpers such that you won't need to modify them later; unit-test all old incarnations of params so you won't feel scared when you add another migration later. And most of all, add comments in each migration describing the old format and the new format.
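
Unit tests for old param formats might look like the sketch below. It repeats the v1-to-v2 helpers from the example above so that it runs on its own; the test function names are hypothetical and written as plain functions that a runner such as pytest could collect.

```python
def _are_params_v1(params):
    return 'param_from_v1_but_not_v2' in params


def _migrate_params_v1_to_v2(params):
    # v1 had 'param_from_v1_but_not_v2' (int); v2 has 'v2_value' (bool) instead
    ret = dict(params)  # copy
    del ret['param_from_v1_but_not_v2']
    ret['v2_value'] = params['param_from_v1_but_not_v2'] != 0
    return ret


def migrate_params(params):
    if _are_params_v1(params):
        params = _migrate_params_v1_to_v2(params)
    return params


def test_v1_nonzero_becomes_true():
    assert migrate_params({'param_from_v1_but_not_v2': 3}) == {'v2_value': True}


def test_v1_zero_becomes_false():
    assert migrate_params({'param_from_v1_but_not_v2': 0}) == {'v2_value': False}


def test_v2_is_left_unchanged():
    # Params already in the newest format must pass through untouched
    assert migrate_params({'v2_value': True}) == {'v2_value': True}


def test_input_is_not_modified():
    v1 = {'param_from_v1_but_not_v2': 1}
    migrate_params(v1)
    assert v1 == {'param_from_v1_but_not_v2': 1}
```

Keeping one test per historical format means a later v2-to-v3 migration can be added without touching the existing tests.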

Better yet: choose ideal parameters in the first place to avoid needing migrations.

migrate_params() is run whenever the user views a module. If it raises an exception or returns a dict that doesn't match parameters in your JSON module description, the render() method will never be called; the user will see a Python-esque error message ("ValueError: ...") and the user will need to intervene to submit all-new params derived from defaults.

The Module HTML file

Your module can optionally produce a visualization or other HTML UI. Set "html_output": true in your module JSON file to create an HTML output pane. Add a [modulename].html file to your module's directory, and it will appear in that output pane.

Workbench will display your HTML page in an iframe whenever your module is selected in a workflow. The most common reason is to render a chart.

Your HTML page can include inline JavaScript. This is useful when passing JSON data into your html -- see below.

Passing data to your HTML code

Every Python module produces "embed data": JSON destined for the embedded iframe. By default, that data is null.

To return embed data, make your Python render method return a triplet of data in this exact order: (dataframe, error_str, json_dict). For instance:

def render(table, params):
    return (table, 'Code not yet finished', {'foo': 'bar'})

Workbench will encode json_dict as JSON, so it must be a dict that is compatible with json.dumps().
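
For instance, a chart module might pass one column's values to its HTML page. In this sketch the column parameter and the shape of the embed dict are hypothetical, and using an empty string for "no error" is an assumption. The call to .tolist() matters because json.dumps() does not accept NumPy scalar types such as int64.

```python
import json

import pandas as pd


def render(table, params):
    # Hypothetical parameter naming the column to chart
    col = params['column']

    embed_data = {
        'column': col,
        # .tolist() converts NumPy scalars to plain Python values
        'values': table[col].tolist(),
    }

    # Assumption: an empty error string means "no error"
    return (table, '', embed_data)


table = pd.DataFrame({'A': [1, 2, 3]})
_, _, embed_data = render(table, {'column': 'A'})
print(json.dumps(embed_data))
```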

Then you need to read this data inside the JavaScript in your HTML file. On page load, Workbench will inject a <script> tag with a global variable at the top of your HTML's <head>. You can access it by reading window.workbench.embeddata. For instance:

<!DOCTYPE html>
<html>
  <head><!-- You _must_ have a <head> element -->
    <title>Embeddata is set</title>
  </head>
  <body>
    <main></main>
    <script>
      document.querySelector('main').textContent = JSON.stringify(window.workbench.embeddata)
    </script>
  </body>
</html>

After page load, Workbench adds a #revision=N hash to your iframe's URL. That means the hashchange event will fire every time the JSON data is recomputed. You can query the embeddata API endpoint to load the new data.

<!DOCTYPE html>
<html>
  <head>
    <title>Let's query embeddata from the server</title>
  </head>
  <body>
    <main></main>
    <script>
      function renderData (data) {
        document.querySelector('main').textContent = JSON.stringify(data)
      }

      function reloadEmbedData () {
        const url = String(window.location).replace(/\/output.*/, '/embeddata')
        fetch(url, { credentials: 'same-origin' })
          .then(function(response) {
            if (!response.ok) {
              throw new Error('Invalid response code: ' + response.status)
            }
            return response.json()
          })
          .then(renderData)
          .catch(console.error)
      }

      // Reload data whenever it may have changed
      window.addEventListener('hashchange', reloadEmbedData)

      // Don't forget to render the data on page load, _before_ the first change
      renderData(window.workbench.embeddata)
      // (alternatively: `reloadEmbedData()`)
    </script>
  </body>
</html>

Importing a module

If you are an admin user, you can select “Import from GitHub” to add a module from GitHub. Workbench will ensure that your module is ready to load and let you know if it runs into any trouble. Once you fix the issue, and commit the changes to GitHub, you can attempt to import the module from GitHub once again.

All imported modules are versioned, by tying the imported code to the GitHub revisions. Currently applied modules are automatically updated to new module code versions (which can involve adding, removing, and resetting parameters).

Publishing your module

At this point, you've tested your module on your local machine. Now to publish it!

First, push all your code to a GitHub repository you control.

Email an introduction to adam@adamhooper.com with a link to the repository.

Workbench developers will fork your repository to a project within https://github.com/CJWorkbench/; then they'll review your code. They may submit pull requests to you to make the module fit Workbench's standards. Then they'll install the module on GitHub.

As users begin to use your module, Workbench developers will send pull requests to address bugs that pop up with data on production. (Any unhandled exception is an error. The Workbench team may need to deploy fixes first and notify you afterwards.)

To maintain your module, merge Workbench-requested pull requests. Then add features and send pull requests of your own back to the Workbench-owned module.

Your repository will have all the features you want; Workbench's repository will have all the bugfixes Workbench needs.