Conceptualization of datasets and outputs #103

kasnerz · 2024-09-25T14:43:30Z

This pull request brings another package of major changes to factgenie workflows.

👉️ You should be able to safely integrate the changes into your existing installation. Factgenie will detect that it has been updated and will migrate your custom configuration.

❗️ Careful: You should still back up your files before performing this update.

Changes:

All the configuration files are now stored in the factgenie/config directory. That includes:
- the main configuration file config.yml,
- the list of local datasets datasets.yml,
- the new list of external resources resources.yml (see below),
- the configuration files for the campaigns (no change here).
We no longer store the example resources (datasets, model outputs, annotations) directly in the repository. Instead, the workflow is the following:
- The available resources are listed in factgenie/config/resources.yml.
- We still provide the loaders for datasets in factgenie/loaders. What is new is that the loaders for the external datasets provide an additional download() method. This method can download the example dataset along with all the related resources, i.e. model outputs and annotations.
- The Data Management page contains a tab External resources where the user can download these resources.
- Once the resource is downloaded, all the related data (examples, model outputs, annotations) can be managed locally.
- A user can still add a local dataset manually or through the web interface as before. (In these cases, the download() method does not need to be implemented.)
- This resolves Make example datasets optional #73
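
To illustrate the idea of a loader with a download() method, here is a minimal sketch. It does not use the actual factgenie classes: the class name `ExternalDatasetSketch`, the `files` mapping, and the injectable `fetch` callable are all assumptions made for illustration, and the real loaders in factgenie/loaders may look different.

```python
from pathlib import Path
from typing import Callable, Dict, List


class ExternalDatasetSketch:
    """Hypothetical stand-in for a loader of an external dataset."""

    def __init__(self, dataset_id: str, files: Dict[str, str]):
        self.dataset_id = dataset_id
        # Maps a local file name to the URL it is fetched from (illustrative only).
        self.files = files

    def download(self, target_dir: Path, fetch: Callable[[str], bytes]) -> List[Path]:
        """Fetch every listed file into target_dir/<dataset_id> and return the saved paths."""
        out_dir = Path(target_dir) / self.dataset_id
        out_dir.mkdir(parents=True, exist_ok=True)
        saved = []
        for name, url in self.files.items():
            path = out_dir / name
            # `fetch` is passed in so the sketch does not depend on a particular HTTP client.
            path.write_bytes(fetch(url))
            saved.append(path)
        return saved
```

A local dataset would simply not define download(), which matches the note above that the method is only needed for external resources.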
We now store the model outputs in a JSONL format instead of the JSON format:
- The format is equivalent to the output format produced by the generation campaigns.
- It allows us to be more flexible. Each line has an example_idx field which relates it to the example in the dataset. Therefore, we no longer require a full set of model outputs for a particular split.
- It allows us to export the generation outputs any time during the generation campaign, not only when the campaign is finished.
- This resolves Index model outputs using ids #86
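
As a rough sketch of why per-line `example_idx` makes partial outputs possible, the snippet below indexes JSONL records by that field. Only `example_idx` comes from the description above; the `out` field and the function name `index_outputs` are assumptions for illustration and may not match the actual file schema.

```python
import json
from typing import Dict, Iterable


def index_outputs(lines: Iterable[str]) -> Dict[int, dict]:
    """Build a lookup from example_idx to the output record on that line."""
    outputs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        outputs[record["example_idx"]] = record
    return outputs


# Two outputs for a split with more examples: indices 1 and 2 are simply absent.
jsonl = [
    '{"example_idx": 0, "out": "First output"}',
    '{"example_idx": 3, "out": "Fourth output"}',
]
outputs = index_outputs(jsonl)
print(sorted(outputs))   # [0, 3]
print(outputs.get(1))    # None
```

Because each record carries its own index, a file written halfway through a generation campaign is still a valid file.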
We made some progress towards supporting more crowdsourcing services (Support other crowdsourcing services #18):
- It is now possible to specify that the campaign is intended for local annotators: in that case, the annotation page expects only a single parameter ANNOTATOR_ID.
- We still support Prolific with its three parameters PROLIFIC_PID, SESSION_ID, STUDY_ID.
- Experimentally, we added a similar possibility for Amazon Mechanical Turk. Here, we expect the parameters workerId, assignmentId, hitId. (However, this is completely untested.)
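
The parameter sets listed above could be checked roughly as follows. The parameter names are taken from the list; the service keys (`local`, `prolific`, `mturk`) and the function `extract_annotator_params` are hypothetical and not necessarily how factgenie names or implements this.

```python
from typing import Dict, List, Mapping

# Parameter names come from the PR description; the service keys are assumed.
EXPECTED_PARAMS: Dict[str, List[str]] = {
    "local": ["ANNOTATOR_ID"],
    "prolific": ["PROLIFIC_PID", "SESSION_ID", "STUDY_ID"],
    "mturk": ["workerId", "assignmentId", "hitId"],
}


def extract_annotator_params(service: str, query: Mapping[str, str]) -> Dict[str, str]:
    """Return the URL parameters expected for the service, or raise if any is missing."""
    expected = EXPECTED_PARAMS[service]
    missing = [name for name in expected if name not in query]
    if missing:
        raise ValueError(f"Missing parameters for {service}: {', '.join(missing)}")
    # Ignore any unrelated query parameters.
    return {name: query[name] for name in expected}
```

For example, `extract_annotator_params("local", {"ANNOTATOR_ID": "a1"})` returns `{"ANNOTATOR_ID": "a1"}`, while a Prolific request without `STUDY_ID` raises a `ValueError`.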
We started using ast.literal_eval for parsing model arguments from the YAML file, hopefully covering the majority of cases even for arguments we do not know about.
- This resolves Robust parsing of int and float model arguments #93
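
A minimal sketch of this kind of parsing, assuming the arguments may arrive as strings: `ast.literal_eval` is the standard-library call named above, while the helper `parse_model_arg` and its fallback behaviour are illustrative assumptions, not the code from this PR.

```python
import ast
from typing import Any


def parse_model_arg(value: Any) -> Any:
    """Convert a string such as "0.7" or "100" to a Python literal where possible."""
    if not isinstance(value, str):
        return value  # already typed, e.g. by the YAML parser
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Not a Python literal (e.g. a plain word), so keep it as a string.
        return value


print(parse_model_arg("0.7"))    # 0.7
print(parse_model_arg("100"))    # 100
print(parse_model_arg("stop"))   # stop
```

Since `ast.literal_eval` only accepts literals, it does not execute arbitrary expressions, which makes it a safer choice than `eval` for configuration values.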

We provide a utils.migrate() function that tries to convert old files to the new format, move configuration files to the new directories, etc. The function is invoked only if the main configuration file is detected in the old location.
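
The "migrate only if the old config is found" check could look roughly like this. This is a simplified sketch under stated assumptions: the function name `migrate_config_sketch` and both path arguments are hypothetical, and the actual utils.migrate() also converts old files, which is not shown here.

```python
import shutil
from pathlib import Path


def migrate_config_sketch(old_config: Path, new_config_dir: Path) -> bool:
    """Move the main config into the new directory; return True if a migration happened."""
    if not old_config.exists():
        # Nothing in the old location: either a fresh install or already migrated.
        return False
    new_config_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(old_config), str(new_config_dir / old_config.name))
    return True
```

Calling the function a second time returns `False`, because the file is no longer in the old location, so the migration does not repeat.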

On top of these major changes:

🪲 We implemented many small bugfixes, some of which may be crucial for factgenie to work properly.
🏀 We created a set of detailed tutorials on the factgenie wiki, showing how to add a subset of the Rotowire dataset to factgenie and how to generate outputs and annotations on this dataset.
🌈 We did minor graphic updates, including a new favicon, colorful badges in campaign overview, etc.
📚️ We added a page for managing annotations on the Data management page.
📊 We are now using Bootstrap Table on the Data management page, so the tables are more flexible and powerful.
📜 We updated the example configuration files to provide more sensible defaults.

@oplatek I guess you don't have time for a proper review, so I am merging this update myself. It would be great if you could test the update on any of your instances that you are still using (but make sure to back up everything first).

kasnerz added 21 commits September 24, 2024 15:05

Add a page for no available datasets

063f544

Temporarily track config files

7bb3210

Do not track campaign configs

4adaa7e

Move main config to /config

3b81546

Update favicon

7002d15

Split data configuration files between local (empty by default) and available for download

ebee4e2

Ignore the /data directory and the list of local datasets

26f1156

Update config paths, create an empty local dataset config ad hoc

c8545d6

Remove example outputs

d9f826a

Add bootstrap table resources

3127ef3

Stop tracking /data

79d708e

Add bootstrap icons

0028924

Move path constants to __init__.py to unify their location and prevent cyclic dependencies, unify the usage of pathlib

31ce4fd

Split datasets between local and to-be-downloaded, enable downloading datasets

199c628

Modify external datasets config

d35dee1

Use bootstrap tables

29f189a

Unify favicon color with primary button color

b425180

Update logicnlg loader

9f31f49

Rename base.py to basic.py

3d8ea42

Minor fixes

ab1fdc1

Add remaining datasets, enable default download implementation using data-link and outputs-link

828d77e

kasnerz changed the title ~~WIP: Dataset management~~ WIP: External and local datasets Sep 26, 2024

kasnerz added 8 commits September 26, 2024 08:08

Move the default download logic under BasicDataset

95505b4

Update README

249a580

Enable annotations to be external

3d1fb57

Update datasets descriptions

f38de97

Start outsourcing self-contained JS functions to utils.js

f375641

Minor updates

16548c9

Prepare for downloading annotations alongside the dataset

66c91d8

Update data management page

bd973e9

kasnerz added 13 commits September 27, 2024 20:44

Minor usability improvements

4462faf

Use ast.literal_eval() to determine the type of model arguments

9b7e454

Comment out extra settings

a3c37d4

Use gpt4-o in example config files

0df5ada

Add generation through text-generation-webui API

73a5132

Rename the 'rstrip' parameter to a more appropriate 'remove_suffix'

b450ad2

Minor label updates

81cc891

Remove example outputs

44a2253

Remove example annotations

3585af4

Store model outputs in JSONL format analogically to annotations and generations, provide backward compatibility methods

64437b8

Remove unused parameter

59d5989

Fix loading LogicNLG outputs

0691ff5

Minor fixes

f51ef4c

kasnerz changed the title ~~WIP: External and local datasets~~ WIP: Conceptualization of datasets and outputs Oct 2, 2024

kasnerz added 13 commits October 4, 2024 14:18

Fix external campaign stats

6e8e6cb

Hardcode annotator prompt

7c8f0b0

Add instructions about annotations not overlapping

da12514

Slugify campaign id when duplicated

06e1422

Show llm campaign status on overview page

e52f08d

Specify crowdsourcing service, enable local annotators

188fe83

Update README

4e40c38

Add contribution guidelines

8b444e4

Updates and fixes for migration

94fdb72

Migrate also base -> basic

e6da56a

Do not preserve old logicnlg and xsum

f580e60

Fix: list vs set

e72d05c

Minor fixes and updates

68ca289

kasnerz changed the title ~~WIP: Conceptualization of datasets and outputs~~ Conceptualization of datasets and outputs Oct 6, 2024

kasnerz merged commit 4844853 into main Oct 6, 2024

oplatek deleted the dataset-management branch November 13, 2024 15:01
