Conceptualization of datasets and outputs #103

Merged
merged 64 commits into from
Oct 6, 2024

kasnerz (Collaborator) commented on Sep 25, 2024

This pull request brings another package of major changes to factgenie workflows.


👉️ You should be able to safely integrate the changes into your existing installation. Factgenie will detect that it has been updated and will migrate your custom configuration.

❗️ Careful: You should still back up your files before performing this update.


Changes:

  • All the configuration files are now stored in the factgenie/config directory. That includes:
    • the main configuration file config.yml,
    • the list of local datasets datasets.yml,
    • the new list of external resources resources.yml (see below),
    • the configuration files for the campaigns (no change here).
  • We no longer store the example resources (datasets, model outputs, annotations) directly in the repository. Instead, the workflow is the following:
    • The available resources are listed in factgenie/config/resources.yml.
    • We still provide the loaders for datasets in factgenie/loaders. What is new is that the loaders for the external datasets provide an additional download() method. This method can download the example dataset along with all the related resources, i.e. model outputs and annotations.
    • The Data Management page contains a tab External resources where the user can download these resources.
    • Once the resource is downloaded, all the related data (examples, model outputs, annotations) can be managed locally.
    • A user can still add a local dataset manually or through the web interface as before. (In these cases, the download() method does not need to be implemented.)
    • This resolves #73 (Make example datasets optional).
  • We now store the model outputs in a JSONL format instead of the JSON format:
    • The format is equivalent to the output format produced by the generation campaigns.
    • It allows us to be more flexible. Each line has an example_idx field that links it to the corresponding example in the dataset. Therefore, we no longer require a full set of model outputs for a particular split.
    • It allows us to export the generation outputs any time during the generation campaign, not only when the campaign is finished.
    • This resolves #86 (Index model outputs using ids).
  • We made some progress towards supporting more crowdsourcing services (#18, Support other crowdsourcing services):
    • It is now possible to specify that the campaign is intended for local annotators: in that case, the annotation page expects only a single parameter ANNOTATOR_ID.
    • We still support Prolific with its three parameters PROLIFIC_PID, SESSION_ID, STUDY_ID.
    • Experimentally, we added a similar possibility for Amazon Mechanical Turk. Here, we expect the parameters workerId, assignmentId, hitId. (However, this is completely untested.)
  • We started using ast.literal_eval for parsing model arguments from the YAML file, hopefully covering the majority of cases even for arguments we do not know about.
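
To illustrate the JSONL format for model outputs described in the list above, here is a minimal sketch. Only the example_idx key comes from this description; the output key, the sample records, and the helper load_outputs are assumptions made for illustration and are not factgenie's actual schema or API.

```python
import io
import json


def load_outputs(lines):
    """Map example_idx -> record for every non-empty JSONL line."""
    outputs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        outputs[record["example_idx"]] = record
    return outputs


# Hypothetical file content: outputs exist only for examples 0 and 3,
# e.g. because the generation campaign is still running.
jsonl_file = io.StringIO(
    '{"example_idx": 0, "output": "First generated text."}\n'
    '{"example_idx": 3, "output": "Another generated text."}\n'
)

outputs = load_outputs(jsonl_file)
dataset_size = 5  # assumed number of examples in the split
missing = [idx for idx in range(dataset_size) if idx not in outputs]
print(sorted(outputs))  # indices that already have an output
print(missing)          # indices without an output yet
```

Because each line carries its own index, a partially filled file is still valid, which is what makes exporting in the middle of a campaign possible.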
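
The ast.literal_eval item above can be sketched as follows. The helper parse_arg, the sample arguments, and the fallback to the raw string are assumptions for illustration, not factgenie's actual code; ast.literal_eval itself is the standard-library function, which evaluates Python literals only and never executes arbitrary code.

```python
import ast


def parse_arg(value):
    """Turn a string such as "0.7" or "['END']" into a Python value.

    Hypothetical helper: values that are not valid Python literals
    (for example a bare word) are returned unchanged as strings.
    """
    if not isinstance(value, str):
        return value  # already typed by the YAML parser
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value


# Hypothetical model arguments as they might arrive from a YAML file.
raw_args = {
    "temperature": "0.7",
    "top_k": "50",
    "stop": "['END']",
    "seed": 42,
    "format": "json",
}
model_args = {key: parse_arg(val) for key, val in raw_args.items()}
print(model_args)
```

The point of this approach is that unknown arguments get a sensible type (float, int, list) without a hand-written table of argument names.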

We provide a utils.migrate() function that tries to convert old files to the new format, move configuration files to the new directories, etc. The function is invoked only if the main configuration file is detected in the old location.
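
The trigger condition can be sketched as below. The function name migrate_if_needed, both path arguments, and the single move step are assumptions for illustration; the actual utils.migrate() is described as doing more (for example converting old files to the new format), and its signature is not given here.

```python
import shutil
import tempfile
from pathlib import Path


def migrate_if_needed(old_config: Path, new_config_dir: Path) -> bool:
    """Move the main config file if it still sits in the old location.

    Returns True when a migration was performed, False otherwise.
    """
    if not old_config.exists():
        return False  # nothing in the old location: already migrated or fresh install
    new_config_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(old_config), str(new_config_dir / old_config.name))
    return True


# Demonstration in a temporary directory with hypothetical paths.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    old_file = root / "config.yml"
    old_file.write_text("placeholder: true\n")
    first_run = migrate_if_needed(old_file, root / "config")
    second_run = migrate_if_needed(old_file, root / "config")
    moved_exists = (root / "config" / "config.yml").exists()
```

Checking for the file in the old location makes the step idempotent: a second start after a successful migration finds nothing to move and does nothing.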

On top of these major changes:

  • 🪲 We implemented many small bugfixes, some of which may even be crucial for factgenie to work properly.
  • 🏀 We created a set of detailed tutorials on the factgenie wiki, showing how to add a subset of the Rotowire dataset to factgenie and how to generate outputs and annotations on this dataset.
  • 🌈 We did minor graphic updates, including a new favicon, colorful badges in campaign overview, etc.
  • 📚️ We added a page for managing annotations on the Data management page.
  • 📊 We are now using Bootstrap Table on the Data management page, so the tables are more flexible and powerful.
  • 📜 We updated the example configuration files to provide more sensible defaults.

@oplatek I guess you don't have time for a proper review, so I am merging this update myself. It would be great if you could test the update on any of your instances that you are still using (but make sure to back up everything first).

kasnerz changed the title from "WIP: Dataset management" to "WIP: External and local datasets" on Sep 26, 2024
kasnerz changed the title from "WIP: External and local datasets" to "WIP: Conceptualization of datasets and outputs" on Oct 2, 2024
kasnerz changed the title from "WIP: Conceptualization of datasets and outputs" to "Conceptualization of datasets and outputs" on Oct 6, 2024
kasnerz merged commit 4844853 into main on Oct 6, 2024
oplatek deleted the dataset-management branch on November 13, 2024 15:01