
Exploring and annotating datasets

Sal Hagen edited this page Dec 7, 2022 · 8 revisions

4CAT's 'Explorer' function lets you explore and annotate datasets within a customisable web page. It presents posts and annotation fields in a more convenient and legible manner than spreadsheet software. The Explorer also answers 4CAT's call for 'traceability' by letting you navigate between micro- and macro-views of a dataset.

To access the Explorer, click the Explore button on the dataset page. This will show the first posts from your dataset.

Annotate your data

The bottom-right of the Explorer page features annotation controls. Click Edit fields to add or change annotation fields.

Clicking New field in the field editor adds a new annotation field. The Label field sets its name, the Input type dropdown sets what type of annotation field it should be, and Options sets the options that can be chosen from.

Currently you can choose between a text field, a large text field (textarea), a dropdown, and checkboxes.

Clicking Apply will save your annotation fields to the database and apply them to the page. After this, you can safely close the page without the annotation fields disappearing. Changing the input type of existing annotation fields will delete any old values you might have already saved.

After applying, you can click Show annotations to show the annotation fields for every post. You can save the annotations to the database by clicking Save annotations. You can collaborate with others by annotating posts on different pages, but new changes to existing annotations will override the old ones.

If you check Also write to dataset, 4CAT will create a .csv file with the original data plus your annotations in new columns. This can take a while for large datasets, so be sure to only do this when you're done annotating.

Add custom fields

By default, the Explorer shows the author, thread ID, post ID, timestamp, and body for every post. However, it can be useful to show additional information. 4CAT therefore allows admins to customise what data is shown per data source.

To do so, make sure there is a directory called explorer in the data source's folder within datasources/ (e.g. datasources/reddit/explorer/). In it, make a JSON file named after the data source, followed by -explorer.json (e.g. datasources/reddit/explorer/reddit-explorer.json). This JSON file defines which custom values are shown and how. The general format looks like this:

{
    "original_key1": "{{ column_to_use1 }}",
    "original_key2": "Label: {{ column_to_use2 }}"
}

The keys become the titles of the fields (shown on mouse hover) and the values determine what is actually shown. With the double curly brackets you can retrieve a value from a specific column or key in your original dataset. For instance, {{ subreddit }} will retrieve the value of the subreddit column for the respective post. You can also insert HTML to add things like icons or anchor tags. The values also support basic string slicing (e.g. {{ author[:5] }}) and (custom) Jinja2 filters (e.g. {{ timestamp | timify_long }}).
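The file can also be generated from a script, which avoids syntax slips such as an unclosed quote. The following is a minimal sketch, not part of 4CAT: write_explorer_config is a hypothetical helper, the field names are illustrative, and a temporary directory stands in for the real datasources/ folder.

```python
import json
import tempfile
from pathlib import Path


def write_explorer_config(datasources_dir: Path, datasource: str, fields: dict) -> Path:
    """Write <datasource>-explorer.json into <datasources_dir>/<datasource>/explorer/."""
    explorer_dir = datasources_dir / datasource / "explorer"
    explorer_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    config_path = explorer_dir / f"{datasource}-explorer.json"
    # json.dumps always emits balanced quotes and brackets
    config_path.write_text(json.dumps(fields, indent=4), encoding="utf-8")
    return config_path


# Illustrative field definitions; the column names are assumptions.
fields = {
    "subreddit": "r/{{ subreddit }}",
    "author_short": "{{ author[:5] }}",
}

# A temporary directory stands in for 4CAT's real datasources/ folder.
root = Path(tempfile.mkdtemp()) / "datasources"
config_path = write_explorer_config(root, "reddit", fields)

# Reading the file back confirms that it is valid JSON.
loaded = json.loads(config_path.read_text(encoding="utf-8"))
print(config_path.relative_to(root.parent).as_posix())
# prints: datasources/reddit/explorer/reddit-explorer.json
```

In a real installation, datasources_dir would point at the datasources/ folder of your 4CAT checkout.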

Note: If the primary dataset is an NDJSON file, you must explicitly set the right keys for both the JSON and CSV types. You do this by nesting the dictionaries in another dictionary with the extension suffix as a key (in lowercase). For instance:

{
    "ndjson": {
        "image": "{{ attachments.media_keys.url }}",
        // more stuff..
    },
    "csv": {
        "image": "{{ images }}",
        // more stuff..
    }
}
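To make the nesting concrete, here is a rough sketch of how a lookup by file extension could work. It is an illustration under assumptions, not 4CAT's actual code: fields_for_file is a hypothetical helper and the file names are made up. Standard JSON does not allow comments, so the `// more stuff..` placeholders above would presumably need to be replaced with real entries in an actual file.

```python
from pathlib import Path


def fields_for_file(config: dict, dataset_file: str) -> dict:
    """Return the field templates that match the dataset file's extension."""
    suffix = Path(dataset_file).suffix.lstrip(".").lower()  # ".NDJSON" -> "ndjson"
    nested = config.get(suffix)
    # Nested layout: the extension is a key whose value is a dict of fields.
    if isinstance(nested, dict):
        return nested
    # Flat layout: the whole config applies regardless of file type.
    return config


config = {
    "ndjson": {"image": "{{ attachments.media_keys.url }}"},
    "csv": {"image": "{{ images }}"},
}

ndjson_fields = fields_for_file(config, "dataset.ndjson")
csv_fields = fields_for_file(config, "dataset.CSV")
print(ndjson_fields["image"])  # prints: {{ attachments.media_keys.url }}
print(csv_fields["image"])     # prints: {{ images }}
```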

You can add a sort_options field to let users sort the posts in a certain way. This item should contain a list of dictionaries with key and label items: key is the column to sort on and label is the text shown in the dropdown. There are two additional options, descending and force_int, which are enabled by setting them to true. descending reverses the sort order. This is useful when you want to sort posts from the highest to the lowest score, for instance, or from new to old. force_int converts the value to an integer before sorting. Integer values are stored as strings in the CSV datasets used by the Explorer, so without force_int they are sorted as text (9 will for instance be sorted after 7890). For instance, these are the sort options we used for Tumblr posts:

"sort_options": [
		{
			"key": "timestamp",
			"label": "Old to new"
		},
		{
			"key": "timestamp",
			"label": "New to old",
			"descending": true
		},
		{
			"key": "id",
			"label": "Post id"
		},
		{
			"key": "notes",
			"label": "Most notes",
			"descending": true,
			"force_int": true
		}
	]
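The effect of descending and force_int can be sketched in a few lines of Python. This is an illustration of the sorting behaviour described above rather than the Explorer's own implementation: sort_posts is a hypothetical function and the posts are made-up sample data.

```python
def sort_posts(posts: list, option: dict) -> list:
    """Sort posts according to one sort_options entry."""
    key = option["key"]
    force_int = option.get("force_int", False)
    descending = option.get("descending", False)

    def sort_value(post: dict):
        value = post[key]
        # CSV values are strings; convert only when force_int is enabled.
        return int(value) if force_int else value

    return sorted(posts, key=sort_value, reverse=descending)


# Made-up posts with note counts stored as strings, as they would be in a CSV.
posts = [
    {"id": "a", "notes": "9"},
    {"id": "b", "notes": "7890"},
    {"id": "c", "notes": "120"},
]

as_strings = sort_posts(posts, {"key": "notes", "label": "Notes"})
most_notes = sort_posts(
    posts,
    {"key": "notes", "label": "Most notes", "descending": True, "force_int": True},
)

print([p["notes"] for p in as_strings])  # prints: ['120', '7890', '9']
print([p["notes"] for p in most_notes])  # prints: ['7890', '120', '9']
```

Without force_int the values are compared character by character, which is why "9" ends up after "7890".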

Here is an example of the full custom fields JSON we added for Reddit datasets:

{
	"subreddit": "<a href='https://reddit.com/r/{{subreddit}}' target='_blank'>r/{{subreddit}}</a>",
	"score": "<i class='fas fa-arrow-up'></i> {{score}} <i class='fas fa-arrow-down'></i>",
	"external_url": "https://reddit.com/r/{{subreddit}}/comments/{{thread_id}}/comment/{{id}}",
	"image": "{{ image_file }}",
	"subject": "{{ subject }}",
	"subject_url": "<a href='{{ url }}'>{{ domain }}</a>",
	"sort_options": [
		{
			"key": "timestamp",
			"label": "Old to new"
		},
		{
			"key": "timestamp",
			"label": "New to old",
			"descending": true
		},
		{
			"key": "id",
			"label": "Post id"
		},
		{
			"key": "thread_id",
			"label": "Thread id"
		},
		{
			"key": "score",
			"label": "Score",
			"descending": true,
			"force_int": true
		}
	]
}

The values from the author, thread_id, id, and body columns are shown by default, but can be overridden or hidden. All fields with author in the key are hidden if the dataset is pseudonymised. Some fields also have special rules:

  • Adding an external_url field will add a link to the original post. This is disabled when the data is pseudonymised.
  • Adding an image field with an image URL as a value will add an image to the post.
  • Adding an images field with comma-separated image URLs as values will add multiple images to the post.
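The rules above can be summarised in a short Python sketch. It only mirrors the behaviour described here and makes no claim about how 4CAT implements it: visible_fields and split_images are hypothetical helpers, and the field names and URLs are placeholders.

```python
def visible_fields(fields: dict, pseudonymised: bool) -> dict:
    """Drop author-related fields and the external link for pseudonymised data."""
    if not pseudonymised:
        return dict(fields)
    return {
        key: template
        for key, template in fields.items()
        if "author" not in key and key != "external_url"
    }


def split_images(value: str) -> list:
    """Turn a comma-separated string of image URLs into a list of URLs."""
    return [url.strip() for url in value.split(",") if url.strip()]


# Placeholder field definitions for illustration only.
fields = {
    "author_fullname": "{{ author_fullname }}",
    "subreddit": "r/{{ subreddit }}",
    "external_url": "https://example.com/{{ id }}",
}

shown = visible_fields(fields, pseudonymised=True)
urls = split_images("https://example.com/a.png, https://example.com/b.png")

print(list(shown))  # prints: ['subreddit']
print(urls)         # prints: ['https://example.com/a.png', 'https://example.com/b.png']
```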

Customise the look

You can add custom CSS for different data sources. To do so, make sure there is a directory called explorer in the data source's folder within datasources/ (e.g. datasources/reddit/explorer/). In it, make a CSS file named after the data source, followed by -explorer.css (e.g. datasources/reddit/explorer/reddit-explorer.css). This CSS file can be edited to override the original formatting from webtool/static/css/explorer.css. This makes it possible to mimic the look and feel of the website the data is derived from. For instance, we created the following look for Reddit posts: