
Metadata lifecycle

Cristian Vasquez edited this page Oct 18, 2024 · 6 revisions


Operational Metadata is collected to enable Data Operations.

When is the Operational Metadata collected?

Metadata can be generated at different points in the process:

  1. Before the transformation: Capturing what is already known about the pipeline and mappings.
  2. During transformation: Logging performance metrics or key events.
  3. After transformation: Recording the results and outcomes of the job.

When to collect the metadata depends on the transformation method. For example, in a streaming scenario, part-whole relations and count statistics may be captured during the transformation (step 2), while in a batch process these statistics may be recorded afterwards (step 3).
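The three collection points can be sketched as follows. This is a minimal illustration, not the pipeline's actual code; all function and field names here are hypothetical.

```python
import time

def run_job(batch_id, notices, transform):
    """Capture Operational Metadata before, during, and after a transformation."""
    # 1. Before the transformation: what is already known about the job
    metadata = {
        "batch_id": batch_id,
        "notice_count": len(notices),
        "started_at": time.time(),
        "events": [],
    }
    results = []
    for notice in notices:
        t0 = time.time()
        results.append(transform(notice))
        # 2. During the transformation: log performance metrics or key events
        metadata["events"].append({"notice": notice, "duration": time.time() - t0})
    # 3. After the transformation: record the results and outcomes
    metadata["finished_at"] = time.time()
    metadata["succeeded"] = len(results)
    return results, metadata

results, meta = run_job("batch-001", ["n1", "n2"], str.upper)
print(meta["succeeded"])  # 2
```

In a streaming setup the "during" events would be emitted continuously, while a batch job could compute the same counts in the final step.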

How is the Metadata persisted?

Data operations happen at two levels (see: Data Operations#Granularity): batches and notices. The simplest approach is to maintain one JSON document per batch and per notice, containing its Operational Metadata.

Note: the metadata can be stored in the current databases, simplifying https://github.com/OP-TED/ted-rdf-conversion-pipeline/issues/553. Alternatively, metadata can be logged as quads in a TriG file, using named graphs.
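The one-JSON-document-per-batch-or-notice approach can be sketched as below. The directory layout and function name are assumptions for illustration, not the pipeline's actual storage scheme.

```python
import json
import tempfile
from pathlib import Path

def persist_metadata(store_dir, kind, identifier, metadata):
    """Write one JSON document per batch or notice (hypothetical layout:
    <store_dir>/<kind>/<identifier>.json)."""
    path = Path(store_dir) / kind / f"{identifier}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metadata, indent=2))
    return path

# Usage: persist the metadata of a single notice
store = tempfile.mkdtemp()
p = persist_metadata(store, "notices", "notice-123", {"status": "transformed"})
print(p.name)  # notice-123.json
```

One document per entity keeps reads and writes trivially simple; the TriG alternative would instead serialize the same facts as quads inside a named graph per batch or notice.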

How is the Metadata consumed?

It should be possible to query the metadata to support Data Operations. Additionally, the stored metadata of batches and notices must be accessible to downstream systems through a URL, making it easy to consume. The metadata can later be transformed into RDF to be linked or included in a Data Catalog.
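A query over the stored documents might look like the sketch below. The in-memory list stands in for documents that downstream systems would fetch through a URL; the document fields and the `query` helper are hypothetical.

```python
# Stand-in for stored notice metadata documents; in practice each would be
# retrievable through a URL (exact endpoint layout is an assumption).
documents = [
    {"notice_id": "n1", "status": "transformed"},
    {"notice_id": "n2", "status": "failed"},
    {"notice_id": "n3", "status": "transformed"},
]

def query(docs, **criteria):
    """Return the documents matching every key/value criterion."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]

failed = query(documents, status="failed")
print([d["notice_id"] for d in failed])  # ['n2']
```

The same filters could be expressed as database queries over the JSON documents, or as SPARQL once the metadata is lifted to RDF.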

How is the metadata updated?

There is exactly one metadata document for each Batch or Notice. A proposed approach for updating it is as follows:

  • On Success: The metadata document is upserted, ensuring that the most recent information is always available.
  • On Failure: Failure events are appended to the existing metadata document. This maintains a history of failures until the job succeeds.
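The two update rules above can be sketched as a single function. This is an illustrative sketch against a file-per-document store; the document shape and field names are assumptions.

```python
import json
import tempfile
from pathlib import Path

def update_metadata(path, result):
    """On success, upsert the document; on failure, append the failure
    event so the history is kept until the job succeeds."""
    path = Path(path)
    if result["status"] == "success":
        # Upsert: replace the document so the most recent info wins
        doc = {"status": "success", "result": result, "failures": []}
    else:
        # Append: keep the existing document and add the failure event
        doc = json.loads(path.read_text()) if path.exists() else {}
        doc.setdefault("failures", []).append(result)
        doc["status"] = "failed"
    path.write_text(json.dumps(doc, indent=2))
    return doc

path = Path(tempfile.mkdtemp()) / "notice-123.json"
update_metadata(path, {"status": "failure", "error": "timeout"})
doc = update_metadata(path, {"status": "failure", "error": "bad mapping"})
print(len(doc["failures"]))  # 2
doc = update_metadata(path, {"status": "success"})
print(doc["status"])  # success
```

Resetting the failure list on success reflects the rule that the history is only kept "until the job succeeds"; a variant could instead archive past failures in the successful document.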