peichins edited this page Aug 30, 2018 · 14 revisions

The role of datasets is to group arbitrary segments of audio. This audio could belong to different audio recordings, sites or even projects. Datasets will allow other tables to track history of when these segments of audio have been viewed, played, or annotated.

Schema

A diagram of the schema is found here: https://www.lucidchart.com/invitations/accept/2f1588db-769a-4418-98d9-0d1c8ad3a2b9

Datasets for Citizen Science Projects

One important use for datasets will be for citizen science projects. A citizen science project will be associated with a dataset. The dataset will define which segments of audio are shown to citizen scientists. The schema diagram also includes the proposed schema for citizen science projects (in green), although at the time of writing this is still in draft and subject to change.

Citizen scientists' responses will not be saved as progress events, but will instead be saved elsewhere. Progress events may eventually be used by the algorithm that selects dataset items to be shown to citizen scientists.

"Default" dataset for history tracking

There will be a dataset called "default". Most or all user interactions with any audio will be saved via this default dataset. For example, when a user views a section of audio on the /listen page, that segment of audio will be added (if not already there) as a dataset item to the default dataset, and a "played" progress_event will be associated with it.

Details

  • The default dataset will have items added to it as needed when users browse through audio on the /listen page. Currently there are no other pages in baw-client that will result in additions to the default dataset. Listening to audio in the annotation library will not be recorded as progress events, and citizen science projects will have their own dedicated datasets.
  • Dataset items should be uniquely identifiable on the tuple audio_recording_id, start_offset and end_offset. That is, if a user visits the same audio segment twice, they will create two sets of progress_events, but they will be associated with the same dataset_item in the default dataset. Therefore, the process by which dataset_items and progress_events are added for browsing history must check for an existing dataset_item before adding the new item. This should not slow down the response for audio, since the api requests for progress_events will be made asynchronously by the client after any media responses.
  • Users are able to request arbitrary start and end times for the audio that they listen to. If, for example, they listen from 300 to 330, this will result in the creation of a dataset item and associated progress event(s). If they later listen to 310 to 320, this will be stored as a new (overlapping) dataset item.

    ("ooooohhhoooohhooo", ghost in the machine here: we originally supported the notion of compacting overlapping segments; however, I think this is now a very complex implementation. We'll stick to the notion that progress_events are an event stream (and thus are always distinct). If we want to produce aggregate reports, we'll do that (and deal with the problems) later on.)
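
The find-or-create behaviour described above can be sketched roughly as follows. This is an illustrative in-memory model in Python, not the actual server code: the `DefaultDataset` class, its method names and the id scheme are all hypothetical. It only shows that a repeat visit reuses the existing dataset item while still appending a new progress event, and that an overlapping segment becomes a separate item.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# (audio_recording_id, start_offset, end_offset) identifies a dataset item.
ItemKey = Tuple[int, float, float]


@dataclass
class DefaultDataset:
    """Hypothetical stand-in for the dataset_items and progress_events tables."""
    items: Dict[ItemKey, int] = field(default_factory=dict)           # key -> dataset_item id
    events: List[Tuple[int, int, str]] = field(default_factory=list)  # (item id, user id, activity)

    def find_or_create_item(self, audio_recording_id: int, start: float, end: float) -> int:
        """Return the id of the matching dataset item, creating it only if absent."""
        key = (audio_recording_id, start, end)
        if key not in self.items:
            self.items[key] = len(self.items) + 1
        return self.items[key]

    def record_event(self, user_id: int, audio_recording_id: int,
                     start: float, end: float, activity: str) -> int:
        """Append a progress event; events are a stream and are never merged."""
        item_id = self.find_or_create_item(audio_recording_id, start, end)
        self.events.append((item_id, user_id, activity))
        return item_id


history = DefaultDataset()
first = history.record_event(7, 42, 300, 330, "played")
second = history.record_event(7, 42, 300, 330, "played")  # same segment: item reused
third = history.record_event(7, 42, 310, 320, "played")   # overlapping segment: new item
```

After these three calls the sketch holds two dataset items and three progress events, which matches the "event stream" decision in the note above.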

Implementation specifics

Immutability

  • Datasets can have their name and description edited. They should therefore store the updated_at time and updater id.
  • Datasets are not deletable. Progress events will be associated with dataset_items, and rather than having to consider the implications of cascading a delete through all the dependents, it will be better to have datasets live forever once created. We envisage there being a few large datasets, which means that there should be no great need to clear away old unused datasets.
  • Progress events are not updatable or deletable. They record an event that happened at one moment in time, and therefore it is conceptually incorrect to allow them to be modified.
  • Dataset items will be mutable, with the intention that their creator may want to amend them with a correction soon after creation. It is undesirable for a dataset item to be updated after a progress event has been associated with it. Therefore the condition of updating is that the dataset item has no child progress events.
  • Dataset items are deletable. This will cascade to children, e.g. progress events.
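
As a rough illustration of the dataset item rules in this list, the sketch below models them in plain Python. The `DatasetItemStore` class and `ImmutableRecordError` are hypothetical names invented for the example; the real implementation would presumably enforce these rules in the server's models or controllers.

```python
class ImmutableRecordError(Exception):
    """Raised when an update conflicts with the immutability rules."""


class DatasetItemStore:
    """Hypothetical in-memory model of dataset items and their progress events."""

    def __init__(self):
        self.items = {}            # item id -> dict of attributes
        self.progress_events = []  # list of (event id, item id)

    def add_item(self, item_id, **attrs):
        self.items[item_id] = dict(attrs)

    def add_progress_event(self, event_id, item_id):
        # Progress events are only ever appended; there is no update method.
        self.progress_events.append((event_id, item_id))

    def has_progress_events(self, item_id):
        return any(owner == item_id for _, owner in self.progress_events)

    def update_item(self, item_id, **changes):
        # An item may be amended only while it has no child progress events.
        if self.has_progress_events(item_id):
            raise ImmutableRecordError("dataset item already has progress events")
        self.items[item_id].update(changes)

    def delete_item(self, item_id):
        # Deleting an item cascades to its progress events.
        del self.items[item_id]
        self.progress_events = [
            (event, owner) for event, owner in self.progress_events if owner != item_id
        ]
```

In this model a correction made straight after creation succeeds, while the same call made after the first progress event has been recorded raises `ImmutableRecordError`.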

Endpoints and permissions

Permissions are applied to projects and their descendants. Dataset items are descendants of a project through the chain dataset_item, audio_recording, site, project. Datasets can contain dataset items from more than one project, and therefore permissions are not applicable to datasets themselves.

The following permissions only apply to non-admins, as admins have full access.

Datasets endpoints

  • index, filter and show:
    • Any user including guest
    • There are no permissions defined for datasets, so all users see all datasets
  • create and update:
    • Any logged in user can create a dataset.
    • Only the creator can update a dataset.
  • Note: there is no destroy action
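
The dataset rules above are small enough to express as a single check. The following is only a sketch: the function name, its parameters and the use of `None` for a guest are assumptions made for illustration, and admins are treated as having full access to every action that exists.

```python
from typing import Optional

READ_ACTIONS = {"index", "filter", "show"}
KNOWN_ACTIONS = READ_ACTIONS | {"create", "update", "destroy"}


def dataset_action_allowed(action: str, user_id: Optional[int],
                           is_admin: bool = False,
                           creator_id: Optional[int] = None) -> bool:
    """Return True if the user may perform the action on a dataset.

    user_id is None for a guest; creator_id is the dataset's creator
    (only relevant for update).
    """
    if action not in KNOWN_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if action == "destroy":
        return False  # datasets have no destroy action at all
    if is_admin:
        return True
    if action in READ_ACTIONS:
        return True   # no permissions on datasets, so everyone sees them
    if action == "create":
        return user_id is not None
    # update: only the creator
    return user_id is not None and user_id == creator_id
```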

Dataset items

  • index and filter
    • Any user including guest can request the index or filter
    • Dataset items will only be included in the results if the user has read access on them via dataset item -> audio_recording -> site -> project
    • If the user doesn’t have permission on any of the matching dataset items, they will get a zero item count
  • new
    • any logged in user can access new
  • show:
    • As with index and filter, users can only view a dataset item if its audio recording is associated with a project that the user has permission for.
    • If they don’t have permission for it, the response is forbidden
  • Create, update and delete
    • Only admins can create, update or delete dataset items.
    • The exception to this is for the Default Dataset. Any user can create a dataset item in the default dataset as long as they have read access on the dataset item's audio recording.
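
The dataset item rules can be sketched as a read check that walks from the item to its projects, plus a create check with the default dataset exception. Everything here is hypothetical: the `User` and `DatasetItem` classes, the `RECORDING_PROJECTS` lookup (standing in for the audio_recording -> site -> project joins), and the assumption that a guest (`None`) has read access to no projects.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class User:
    id: int
    is_admin: bool = False
    readable_projects: frozenset = frozenset()


@dataclass(frozen=True)
class DatasetItem:
    id: int
    dataset_name: str
    audio_recording_id: int


# Hypothetical lookup: audio_recording_id -> ids of projects reached via its site.
RECORDING_PROJECTS: Dict[int, Set[int]] = {1: {10}, 2: {20}}


def can_read(user: Optional[User], item: DatasetItem) -> bool:
    """Read access follows dataset item -> audio_recording -> site -> project."""
    if user is not None and user.is_admin:
        return True
    readable = user.readable_projects if user is not None else frozenset()
    return bool(RECORDING_PROJECTS.get(item.audio_recording_id, set()) & readable)


def visible_items(user: Optional[User], items: List[DatasetItem]) -> List[DatasetItem]:
    """index/filter: anyone may ask, but only readable items are returned."""
    return [item for item in items if can_read(user, item)]


def can_create(user: Optional[User], item: DatasetItem) -> bool:
    """Admins may always create; others only in the default dataset, with read access."""
    if user is not None and user.is_admin:
        return True
    return item.dataset_name == "default" and can_read(user, item)
```

A user with no permission on any matching item simply gets an empty list from `visible_items`, which corresponds to the zero item count described for index and filter.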

Progress Events

  • index and filter
    • Any user including guest can request the index or filter
    • Progress Events will only be included in the results if the user has read access on them via progress event -> dataset item -> audio_recording -> site -> project
  • new
    • any logged in user can access new
  • show:
    • As with index and filter, users can only view a progress event if its dataset item is associated with a project that the user has permission for.
    • If they don’t have permission for it, the response is forbidden
  • create
    • Users can create a progress event for any dataset_item that they have read access on.
  • update and delete
    • Only admins can update or delete progress events.