
Add a new field descriptor "relationship" in TableSchema #803

Closed
loco-philippe opened this issue Jul 29, 2022 · 9 comments

Comments

@loco-philippe
Contributor

loco-philippe commented Jul 29, 2022

In many cases, data are inconsistent between two fields (see the extract from the IRVE example).

To specify and check the relationship between two fields (i.e. the cardinality in a data model), we propose to add a new field descriptor "relationship" to TableSchema.

The proposal is detailed in the attached PDF file.

If you want more details, or if you would like me to translate the linked example, please ask.

Regards

relationship_property.pdf

@loco-philippe
Contributor Author

loco-philippe commented Aug 23, 2022

I have translated the example above into English (extract from the IRVE example).
If you are interested in the principles of the tool used in this example, you can look at this presentation.

Thank you for your consideration of this proposal !

@bzg

bzg commented Oct 12, 2022

It would help in studying the proposal to have an example within the issue itself (not just a reference to the PDF).

Can you write down a minimal example of how to use such a new field descriptor "relationship" in a schema?

Can you explain what value it provides for describing and processing (consistency checks, etc.) the data?

@loco-philippe
Contributor Author

Thank you Bastien for your comment.

Below is a note that offers a simple example, a rationale, and links to detailed examples.

From my point of view, the value is very clear:

  • relationships between entities are at the heart of the entity-relationship (ER) models that describe open-data files
  • a tabular file therefore cannot be considered validated if the relationships themselves are not validated

Methodology for taking into account relations between fields in tabular representations

This note proposes an evolution of the methodology used in several open-data projects (e.g. the French guide Préparation des données à l'ouverture et la circulation).

Only the additions that could be made to the existing approach are discussed below.

0 - Introduction

0.1 - Objective

Data schema definition tools (e.g. TableSchema) define, on the one hand, descriptive and explanatory information about a data structure and, on the other hand, the rules to be respected when documenting that structure.

The rules currently defined mainly concern fields taken separately; they do not cover the relationships between the fields that make up the structure.

Examples of rules not currently handled:

  • a "person" is associated with a single "social security number" (and vice versa)
  • a "student" belongs to a single "class"

Relationships between fields are important for the consistency of a dataset, and they are very often expressed in the data models that describe it.
However, they are neither included in data schemas nor checked when documenting datasets.

The proposed evolution therefore consists in taking these relationships between fields into account both in the preparation phase and in the operating phase.

0.2 - Example

To facilitate understanding of the subject, an example is treated throughout this presentation. It concerns electric vehicle charging infrastructure (IRVE), which is the subject of a detailed schema and a large dataset (link: data.gouv.fr).

An analysis of the complete IRVE dataset is available on this link.

The example presented is also detailed on this link.

1 - Preparation: Establishing the table schema

1.1 Description of the conceptual data model

The conceptual data model describes the structure of the information that makes up the datasets.
The modelling most widely used, and best suited to tabular datasets, is "entity-relationship" modelling. It allows you to describe:

  • the entities,
  • associations and dependencies between entities,
  • the identifiers (primaryKeys) and attributes that describe the entities.

The initial modeling does not take implementation constraints into account; it is a tool for dialogue between the various stakeholders.

IRVE example:

In the simplified example, we consider two main entities:

  • stations: uniquely identified by an "Id" and characterized by a name,
  • charging points (points de charge, pdc): the equipment associated with a station that connects the vehicles to be charged. They are also identified by an "Id".

Two other entities are also present:

  • a location (localisation): identified by geographical coordinates and described by an address
  • an operator (opérateur): the operator of the infrastructure, identified by a name

1.2 Description of the logical data model

The logical data model adapts the conceptual model to the envisaged information system (e.g. relational database, object modelling, etc.).

In the case of a tabular implementation, the logical model represents each future field as an entity, with the following rules:

  • the entities of the conceptual model are replaced by the entities of the identifiers
  • the attributes of the conceptual model are transformed into new entities
  • attribute entities are associated with identifier entities by 1-n relationships

The logical model is therefore directly deduced from the conceptual model.

Note:

1 - The 1-n relationship between attributes and identifiers expresses the fact that an attribute describes a given object. It can be reinforced into a 1-1 relationship if the considered attribute must be unique for a given identifier.
2 - In a relational database implementation, the notion of attributes can remain attached to the notion of entity when a "table" includes both identifiers and attributes.

IRVE example:

The logical model deduced from the previous conceptual model is as follows:

If the name of the station must be unique, the 1-n relationship between id_station_itinerance and nom_station can be reinforced into a 1-1 relationship.

1.3 Physical model

The physical model consists, on the one hand, of describing the fields in the schema and, on the other, of specifying the division into files.

1.3.1 Field structure

The fields are defined in the schema (not detailed here). It is therefore appropriate to add to this schema the relationships expressed at the level of the logical model.

To do this, a relationship property with two possible values is added to the schema for the fields concerned:

  • "derived" which expresses a 1 - n relationship
  • "coupled" which expresses a 1 - 1 relationship

Note:

1 - The "coupled" property is symmetric, so it can be carried by either of the two fields (unlike the "derived" property).
2 - A 0-n (or 0-1) cardinality in a tabular representation is equivalent to indicating that the field is optional (an undefined value - null, NaN, None or other - is allowed in the corresponding field).
3 - This new property is the subject of a TableSchema upgrade request (issue 803, under examination).

IRVE example:

Properties replace cardinalities:

With the TableSchema syntax, the structure of the fields is as follows (in addition to the existing properties):

"fields": [
  {
    "name": "nom_operateur",
    "relationship": {
        "parent": "id_station_itinerance",
        "link": "derived"
    }
  },
  {
    "name": "id_station_itinerance",
    "relationship": {
        "parent": "id_pdc_itinerance",
        "link": "derived"
    }
  },
  {
    "name": "nom_station",
    "relationship": {
        "parent": "id_station_itinerance",
        "link": "derived"
    }
  },
  {
    "name": "adresse_station",
    "relationship": {
        "parent": "coordonneesXY",
        "link": "derived"
    }
  },
  {
    "name": "coordonneesXY",
    "relationship": {
        "parent": "id_station_itinerance",
        "link": "coupled"
    }
  }
]
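For tooling purposes, the new property can be read back from a schema like any other field property. A minimal Python sketch (the parsing code is illustrative; only the structure of "relationship" comes from the proposal):

```python
import json

# A fragment of the schema above, as a JSON document (illustrative subset).
schema_json = """
{"fields": [
  {"name": "nom_station",
   "relationship": {"parent": "id_station_itinerance", "link": "derived"}},
  {"name": "coordonneesXY",
   "relationship": {"parent": "id_station_itinerance", "link": "coupled"}}
]}
"""

schema = json.loads(schema_json)
# Collect (field, parent, link) triples for every field that declares one.
rules = [(f["name"], f["relationship"]["parent"], f["relationship"]["link"])
         for f in schema["fields"]
         if "relationship" in f]
print(rules)
# → [('nom_station', 'id_station_itinerance', 'derived'),
#    ('coordonneesXY', 'id_station_itinerance', 'coupled')]
```

Fields without a "relationship" entry are simply skipped, so the property stays optional and backward compatible.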

1.3.2 Splitting into files

Several strategies are possible:

  • minimize the number of files: this makes the data easier to access (e.g. direct use in a spreadsheet)
  • create one file per main entity: this allows the defined structure to be enforced "physically"

In the multi-file case, the separation is necessarily carried out at the level of the identifier entities.

IRVE example:

An example of a two-file implementation is shown below:

An example of a documented single file with 4 PDC is given below:

| nom_operateur | id_station_itinerance | nom_station | adresse_station | coordonneesXY | id_pdc_itinerance |
|---|---|---|---|---|---|
| SEVDEC | FRSEVP1SCH01 | SCH01 | 151 Rue d'Uelzen 76230 Bois-Guillaume | [1.106329, 49.474202] | FRSEVE1SCH0101 |
| SEVDEC | FRSEVP1SCH03 | SCH03 | 151 Rue d'Uelzen 76230 Bois-Guillaume | [1.106329, 49.474202] | FRSEVE1SCH0301 |
| SEVDEC | FRSEVP1SCH02 | SCH02 | 151 Rue d'Uelzen 76230 Bois-Guillaume | [1.106329, 49.474202] | FRSEVE1SCH0201 |
| Sodetrel | FRS35PSD35711 | RENNES - PLACE HONORE COMMEREUC | 13 Place Honoré Commeurec 35000 Rennes | [-1.679739, 48.108482] | FRS35ESD357111 |
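The relationships declared in the schema can be checked mechanically against rows like these. Below is a minimal sketch using pandas (an assumed tool choice; `check_relationship` is an illustrative name, not part of TableSchema), following one reading of the definitions in 1.3.1: "derived" means each parent value determines a single field value, and "coupled" means the mapping holds in both directions.

```python
import pandas as pd

def check_relationship(df: pd.DataFrame, field: str, parent: str, link: str) -> bool:
    """Check one declared relationship on a dataset.

    "derived" (1-n): every value of `parent` maps to a single value of `field`;
    "coupled" (1-1): the same, and the reverse direction also holds.
    """
    forward = bool((df.groupby(parent)[field].nunique() <= 1).all())
    if link == "derived":
        return forward
    if link == "coupled":
        reverse = bool((df.groupby(field)[parent].nunique() <= 1).all())
        return forward and reverse
    raise ValueError(f"unknown link: {link}")

# The four sample rows above (address column omitted for brevity).
df = pd.DataFrame({
    "nom_operateur": ["SEVDEC", "SEVDEC", "SEVDEC", "Sodetrel"],
    "id_station_itinerance": ["FRSEVP1SCH01", "FRSEVP1SCH03",
                              "FRSEVP1SCH02", "FRS35PSD35711"],
    "nom_station": ["SCH01", "SCH03", "SCH02",
                    "RENNES - PLACE HONORE COMMEREUC"],
    "coordonneesXY": ["[1.106329, 49.474202]"] * 3 + ["[-1.679739, 48.108482]"],
    "id_pdc_itinerance": ["FRSEVE1SCH0101", "FRSEVE1SCH0301",
                          "FRSEVE1SCH0201", "FRS35ESD357111"],
})

# Each station id has a single name: the "derived" rule holds.
assert check_relationship(df, "nom_station", "id_station_itinerance", "derived")
# Three distinct stations share one coordinate: the "coupled" rule fails.
assert not check_relationship(df, "coordonneesXY", "id_station_itinerance", "coupled")
```

Note that this four-row sample itself breaks the coupled rule declared for coordonneesXY, since three SEVDEC stations share the same coordinates; this is exactly the kind of deviation the property is meant to surface.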

2 - Operation: Documentation and assembly of data sets

Dataset documentation consists of documenting a set of rows according to the defined file structure.
The main expectation at this stage is to detect, and correct as early and as simply as possible, any deviations from the defined rules.

Four levels of analysis must be taken into account:

  • unit validation of data for a field
  • validation of a record (multi-fields)
  • internal validation of a single data set (multi records)
  • external validation of the global dataset (multiple data sets)

The first two levels are handled by existing tools (not detailed here).
The third level consists in validating, on the dataset, the rules defined by the relationship property before requesting its integration into the corresponding file.
The fourth level is functionally identical to the third but can only be performed on the aggregation of all the datasets.

Errors can be reported simply by adding Boolean control fields associated with each checked property.
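A minimal sketch of this idea, again assuming pandas (the `check_…` column naming is illustrative): one Boolean column per checked property marks, row by row, whether the declared link is respected, so offending rows can be located directly.

```python
import pandas as pd

def add_control_field(df: pd.DataFrame, field: str, parent: str, link: str) -> pd.DataFrame:
    """Add a Boolean control column marking rows that respect `link`."""
    # Forward direction: each parent value maps to a single field value.
    ok = df.groupby(parent)[field].transform("nunique") <= 1
    if link == "coupled":
        # Coupled (1-1) also requires the reverse direction to hold.
        ok &= df.groupby(field)[parent].transform("nunique") <= 1
    df[f"check_{field}_{link}"] = ok
    return df

# Tiny illustration: stations S1 and S2 wrongly share coordinate "A".
df = pd.DataFrame({
    "coordonneesXY": ["A", "A", "B"],
    "id_station_itinerance": ["S1", "S2", "S3"],
})
add_control_field(df, "coordonneesXY", "id_station_itinerance", "coupled")
print(df["check_coordonneesXY_coupled"].tolist())
# → [False, False, True]: the first two rows carry the error.
```

Unlike a global pass/fail check, the control column points at the exact rows involved in each violation, which addresses the localization requirement in the note below.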

Note:

1 - To be easily operable, the control tool must make it possible to precisely locate the errors identified.
2 - The tool indicated in this link is a simple example of such a control, but it does not allow errors to be localized.
3 - The aggregation of multiple validated datasets does not necessarily result in a valid file. The external validation of a dataset can therefore identify errors relating to datasets that had already been validated.
4 - To integrate the fourth level as soon as data is entered, the global control must be able to be activated on request (e.g. via a service made available by the holder of the dataset aggregation file). This can be expensive in terms of resources and response time.

IRVE example:

The activation and validation of these rules on a dataset are presented at this link.
In particular, it presents the implementation of a tool allowing both the detection and the localization of errors.

@roll
Member

roll commented Dec 4, 2023

Thanks a lot for the analysis!

Usually, new concepts are added to the Standard as a pattern, for example, #844. Would you be interested in working on one?

@loco-philippe
Contributor Author

loco-philippe commented Dec 4, 2023

Thank you @roll for this positive feedback.

It's a good idea and I would be very happy to add this topic to the list of patterns.

Shall I prepare a first version and share it as a PR (which we can then discuss and complete)?

Another question: on GitHub, to share a data model I use Mermaid. Is this compatible with the https://specs.frictionlessdata.io/patterns page?

@roll
Member

roll commented Dec 5, 2023

Hi @loco-philippe,

Shall I prepare a first version and share it as a PR (which we can then discuss and complete)?

Yes 🚀

The v1 website supports Mermaid (https://specs.frictionlessdata.io/), but v2 might not. I think starting from a Mermaid version makes sense, and if it's not supported in v2 (for interoperability reasons) we can just convert to text + images later.

@loco-philippe
Contributor Author

loco-philippe commented Dec 9, 2023

Thank you @roll for your answers. I will create the PR towards the end of the year (I'm too busy to do it before!)

@roll
Member

roll commented Apr 12, 2024

DONE

https://datapackage.org/recipes/relationship-between-fields/

Thanks a lot @loco-philippe !

@roll closed this as completed Apr 12, 2024
@bzg

bzg commented Apr 12, 2024

Wow, glad to see this happening, congrats @loco-philippe and thanks to the whole frictionless team!
