Exploring and annotating datasets
4CAT's 'Explorer' function lets you explore and annotate datasets within a customisable web page. It presents posts and annotation fields in a more convenient and legible manner than spreadsheet software. The Explorer also answers 4CAT's call for 'traceability' by letting you navigate between micro- and macro-views of a dataset.
To access the Explorer, click the Explore button on the dataset page. This will show the first posts from your dataset.
The bottom-right of the Explorer page features annotation controls. Click Edit fields to add or change annotation fields.
Clicking New field in the field editor adds a new annotation field. The Label field determines its name, the Input type dropdown determines what type of annotation field it is, and Options sets the possible pre-selected options.
Currently you can choose between a text field, a large text field (textarea), a dropdown, and checkboxes.
Clicking Apply will save your annotation fields to the database and apply them to the page. After this, you can safely close the page without the annotation fields disappearing. Note that changing the input type of an existing annotation field will delete any values you have already saved for it.
After applying, you can click Show annotations to show the annotation fields for every post. You can save the annotations to the database by clicking Save annotations. You can collaborate with others by annotating posts on different pages, but new changes to existing annotations will overwrite the old ones.
If you check Also write to dataset, 4CAT will create a .csv file with the original data plus your annotations in new columns. This can take a while for large datasets, so only do this when you're done annotating.
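As a rough illustration of what the resulting file looks like, the sketch below merges annotations into the original rows as extra columns. It is not 4CAT's implementation: the function name `write_annotated_csv`, the `id` key, and the annotation layout (a dictionary keyed by post ID) are assumptions made for the example.

```python
import csv
import io


def write_annotated_csv(rows, annotations, out, id_key="id"):
    """Write the original rows plus one extra column per annotation field.

    rows: list of dicts holding the original data.
    annotations: dict mapping post ID -> {field label: value} (assumed layout).
    """
    # Collect annotation labels in first-seen order so columns are stable.
    labels = []
    for fields in annotations.values():
        for label in fields:
            if label not in labels:
                labels.append(label)

    fieldnames = (list(rows[0].keys()) + labels) if rows else labels
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        merged = dict(row)
        post_annotations = annotations.get(row[id_key], {})
        for label in labels:
            # Posts without an annotation get an empty cell.
            merged[label] = post_annotations.get(label, "")
        writer.writerow(merged)


# Hypothetical sample data.
posts = [
    {"id": "1", "author": "alice", "body": "first post"},
    {"id": "2", "author": "bob", "body": "second post"},
]
notes = {"1": {"relevant": "yes", "theme": "politics"}}

buffer = io.StringIO()
write_annotated_csv(posts, notes, buffer)
print(buffer.getvalue())
```

Every row is rewritten once, which is consistent with the advice to export only when annotating is finished.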
By default, the Explorer shows the author, thread ID, post ID, timestamp, and body for every post. However, it can be useful to show additional information. 4CAT therefore allows admins to customise what data is shown per data source.
To do so, make sure there is a directory called explorer in the data source's folder within datasources/ (e.g. datasources/reddit/explorer/). In it, make a JSON file named after the data source followed by -explorer.json (e.g. datasources/reddit/explorer/reddit-explorer.json). This JSON file specifies which custom values are shown and how. The general format looks like this:
{
"original_key1": "{{ column_to_use1 }}",
"original_key2": "Label: {{ column_to_use2 }}"
}
The key is the title of the field (shown on mouse hover) and the value represents what is actually shown. With the curly brackets you can retrieve a value from a specific column or key in your original dataset. For instance, {{ subreddit }} will retrieve the value of the subreddit column for the respective post. You can also insert HTML to add things like icons or anchor tags. The values also support basic string slicing (e.g. {{ author[:5] }}) and (custom) Jinja2 filters (e.g. {{ timestamp | timify_long }}).
Note: If the primary dataset is an NDJSON file, you must explicitly set the right keys for both the NDJSON and CSV types. You do this by nesting the dictionaries in another dictionary with the extension suffix as a key (in lowercase). For instance:
{
"ndjson": {
"image": "{{ attachments.media_keys.url }}",
// more stuff..
},
"csv": {
"image": "{{ images }}",
// more stuff..
}
}
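To make the nesting concrete, here is a minimal sketch of how a mapping could be picked based on the dataset's file extension. The helper `select_field_config` is hypothetical and not part of 4CAT, and the fallback to a flat (non-nested) file is an assumption for illustration.

```python
import json
from pathlib import Path

# A nested config like the one above, with the placeholder comments left out.
explorer_config = json.loads("""
{
    "ndjson": {"image": "{{ attachments.media_keys.url }}"},
    "csv": {"image": "{{ images }}"}
}
""")


def select_field_config(config, dataset_path):
    """Return the field mapping that matches the dataset's file extension."""
    extension = Path(dataset_path).suffix.lstrip(".").lower()
    # Nested layout: the top-level keys are lowercase extensions.
    if extension in config and isinstance(config[extension], dict):
        return config[extension]
    # Flat layout (assumed fallback): the whole file applies as-is.
    return config


ndjson_fields = select_field_config(explorer_config, "dataset.ndjson")
csv_fields = select_field_config(explorer_config, "dataset.CSV")
print(ndjson_fields["image"])
print(csv_fields["image"])
```

Lowercasing the suffix before the lookup mirrors the requirement that the extension keys are written in lowercase.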
You can add a sort_options field to let users sort the posts in a certain way. This item should contain a list of dictionaries with key and label items, where key is the column to sort on and label is the dropdown label. There are two additional options, descending and force_int. descending reverses the sort order, which is useful when you want to sort posts from the highest to the lowest score or from new to old. force_int converts the value to an integer. Integer values are stored as strings in the CSV datasets used by the Explorer, which causes wonky sorting (9 will for instance be sorted after 7890); force_int fixes this. Set descending and force_int to true to enable these options. For instance, these are the sort options we used for Tumblr posts:
"sort_options": [
{
"key": "timestamp",
"label": "Old to new"
},
{
"key": "timestamp",
"label": "New to old",
"descending": true
},
{
"key": "id",
"label": "Post id"
},
{
"key": "notes",
"label": "Most notes",
"descending": true,
"force_int": true
}
]
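The effect of descending and force_int can be sketched in a few lines of Python. This is an illustration of the sorting behaviour described above, not the Explorer's actual code; `sort_posts` and the sample posts are made up for the example.

```python
def sort_posts(posts, key, descending=False, force_int=False):
    """Sort a list of post dicts according to one sort option."""
    def sort_value(post):
        value = post[key]
        # CSV values are strings; convert them so 9 sorts before 7890.
        return int(value) if force_int else value

    return sorted(posts, key=sort_value, reverse=descending)


# Hypothetical posts with numeric values stored as strings, as in a CSV.
posts = [
    {"id": "a", "notes": "9"},
    {"id": "b", "notes": "7890"},
    {"id": "c", "notes": "120"},
]

# Without force_int the values are compared character by character.
as_strings = [p["notes"] for p in sort_posts(posts, "notes")]
# With force_int they are compared as numbers.
as_integers = [p["notes"] for p in sort_posts(posts, "notes", force_int=True)]

# Applying a sort option dictionary; the label is only used for display.
option = {"key": "notes", "label": "Most notes", "descending": True, "force_int": True}
arguments = {name: value for name, value in option.items() if name != "label"}
most_notes = [p["id"] for p in sort_posts(posts, **arguments)]

print(as_strings)   # ['120', '7890', '9']
print(as_integers)  # ['9', '120', '7890']
print(most_notes)   # ['b', 'c', 'a']
```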
Here is an example of the full custom fields JSON we added for Reddit datasets:
{
"subreddit": "<a href='https://reddit.com/r/{{subreddit}}' target='_blank'>r/{{subreddit}}</a>",
"score": "<i class='fas fa-arrow-up'></i> {{score}} <i class='fas fa-arrow-down'></i>",
"external_url": "https://reddit.com/r/{{subreddit}}/comments/{{thread_id}}/comment/{{id}}",
"image": "{{ image_file }}",
"subject": "{{ subject }}",
"subject_url": "<a href='{{ url }}'>{{ domain }}</a>",
"sort_options": [
{
"key": "timestamp",
"label": "Old to new"
},
{
"key": "timestamp",
"label": "New to old",
"descending": true
},
{
"key": "id",
"label": "Post id"
},
{
"key": "thread_id",
"label": "Thread id"
},
{
"key": "score",
"label": "Score",
"descending": true,
"force_int": true
}
]
}
The values from the author, thread_id, id, and body columns will be shown by default, but can be overwritten or hidden. All fields with author in the key will be hidden if the dataset is pseudonymised. Some fields moreover have special rules:
- Adding an external_url field will add a link to the original post. This is disabled when the data is pseudonymised.
- Adding an image field with an image URL as a value will add an image to the post.
- Adding an images field with comma-separated image URLs as values will add multiple images to the post.
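The rules above can be summarised in a small sketch. It assumes the field values have already been rendered into plain strings; the helpers `visible_fields` and `image_urls` are hypothetical names used for illustration and do not reflect how 4CAT implements these rules internally.

```python
def visible_fields(fields, pseudonymised):
    """Drop fields that should be hidden for pseudonymised datasets."""
    if not pseudonymised:
        return dict(fields)
    return {
        key: value
        for key, value in fields.items()
        # Hide anything author-related and the link to the original post.
        if "author" not in key and key != "external_url"
    }


def image_urls(fields):
    """Collect image URLs from the image and images fields."""
    urls = []
    if fields.get("image"):
        urls.append(fields["image"])
    if fields.get("images"):
        # The images field holds comma-separated URLs.
        urls.extend(u.strip() for u in fields["images"].split(",") if u.strip())
    return urls


# Hypothetical rendered fields for a single post.
fields = {
    "author": "alice",
    "author_fullname": "Alice A.",
    "external_url": "https://example.com/post/1",
    "images": "https://example.com/a.png, https://example.com/b.png",
    "score": "12",
}

shown = visible_fields(fields, pseudonymised=True)
urls = image_urls(fields)
print(sorted(shown))
print(urls)
```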
You can add custom CSS for different data sources. To do so, make sure there is a directory called explorer in the data source's folder within datasources/ (e.g. datasources/reddit/explorer/). In it, make a CSS file named after the data source followed by -explorer.css (e.g. datasources/reddit/explorer/reddit-explorer.css). This CSS file can be edited to override the original formatting from webtool/static/css/explorer.css. As such, it is possible to mimic the look and feel of the website the data is derived from. For instance, we created the following look for Reddit posts: