Registry of metadata identifier entities such as UUID, GUID, person full name, address, and so on. It is linked with other sources and is intended to be converted to an ontology in the future.
It is built from the list of identifiers in the Metacrafter tool and the list of data classes in the Datacrafter data catalog.
- data/datatypes/ - list of all known semantic data types as separate YAML files
- data/patterns/ - list of known patterns as separate YAML files
- data/categories.yaml - list of all data contexts. The Metacrafter tool uses categories to apply only the rules relevant to the situations, data, or contexts selected by the user
- data/countries.yaml - list of all countries with available rules
- data/langs.yaml - list of all languages with available rules
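The category filter described above can be pictured with a short Python sketch. It assumes each datatype record carries a category list and keeps only the records that share at least one context with the user's selection. The records, the context names, and the `filter_by_contexts` helper are hypothetical and do not reproduce Metacrafter's actual logic.

```python
def filter_by_contexts(records, selected_contexts):
    """Keep records whose category list overlaps the user-selected contexts."""
    selected = set(selected_contexts)
    return [r for r in records if selected & set(r.get("category", []))]


# Hypothetical records and context names, for illustration only.
records = [
    {"id": "uuid", "category": ["identifiers"]},
    {"id": "person_fullname", "category": ["persons", "pii"]},
    {"id": "address", "category": ["geo"]},
]

selected_ids = [r["id"] for r in filter_by_contexts(records, ["pii", "geo"])]
print(selected_ids)
```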
A semantic data type is a primary data class that describes a distinct type of data: either an identifier defined in some way or a commonly used data type.
Each semantic data type YAML file object has the following structure.
- id - unique identifier of the entity
- name - name of the entity
- category - list of contexts associated with entity.
- country - list of countries where this identifier is used
- doc - English documentation/short description of this entity.
- langs - list of languages
- is_pii - true if this data is personally identifiable information (PII), false otherwise. PII can also be detected from contexts
- links - list of associated links with type as link type and url as url. Supported link types: wikipedia, wikidata, other
- regexp - regular expression that matches this data type
- wikidata_property - property in Wikidata if applicable
- examples - list of examples with value and description for each one
- parent_type - name of the parent semantic type
- translations - name and doc translated to selected language.
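As a rough illustration of the structure above, the following Python sketch models one record as a dictionary and checks that every example value matches the record's regexp. The record contents (the `uuid` entry, its regular expression, and the example value) are hypothetical and not copied from the registry, and `check_examples` is an illustrative helper, not part of the project's scripts.

```python
import re

# Hypothetical record; the values are illustrative, not taken from the registry.
datatype = {
    "id": "uuid",
    "name": "UUID",
    "category": ["identifiers"],
    "doc": "Universally unique identifier",
    "is_pii": False,
    "regexp": r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}",
    "examples": [
        {"value": "123e4567-e89b-12d3-a456-426614174000", "description": "Sample UUID"},
    ],
}


def check_examples(record):
    """Return the example values that do NOT fully match the record's regexp."""
    pattern = re.compile(record["regexp"])
    return [ex["value"] for ex in record["examples"] if not pattern.fullmatch(ex["value"])]


failed = check_examples(datatype)
print(failed)  # an empty list means every example matches
```

A check like this could run before a commit to catch a regexp that no longer agrees with its own examples.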
Patterns are extensions: additional helpers that identify particular ways of representing semantic data types. They can differ by usage type, country, language, and so on. Patterns have no category because they inherit it from their semantic data type.
Each pattern YAML file object has the following structure.
- id - unique identifier of the entity
- name - name of the entity
- doc - English documentation/short description of this entity.
- country - list of countries where this identifier is used
- langs - list of languages
- links - list of associated links with type as link type and url as url. Supported link types: wikipedia, other
- regexp - regular expression that matches this data type
- wikidata_property - property in Wikidata if applicable
- examples - list of examples with value and description for each one
Tools are software libraries and open source or proprietary software that support semantic data types. Each tool YAML file object has the following structure.
- id - unique identifier of the entity
- name - name of the entity
- category - category of the tool. It could be one of: detector, pii, etl, other
- doc - English documentation/short description of this entity.
- website - URL of the primary web resource about this tool
- supported_types - array of strings with id of datatype or pattern for each string
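A minimal sketch of how a tool record could be sanity-checked against the structure above: the category must be one of the four listed values, and every entry in supported_types should be the id of a known datatype or pattern. The tool record, the set of known ids, and the `validate_tool` helper are hypothetical.

```python
# Allowed tool categories, as listed in the structure above.
TOOL_CATEGORIES = {"detector", "pii", "etl", "other"}


def validate_tool(tool, known_ids):
    """Return a list of human-readable problems found in a tool record."""
    problems = []
    if tool.get("category") not in TOOL_CATEGORIES:
        problems.append("unknown category: {}".format(tool.get("category")))
    for type_id in tool.get("supported_types", []):
        if type_id not in known_ids:
            problems.append("unknown supported type: {}".format(type_id))
    return problems


# Hypothetical tool record and known ids, for illustration only.
tool = {
    "id": "exampletool",
    "name": "Example Tool",
    "category": "detector",
    "doc": "Illustrative tool entry",
    "website": "https://example.org",
    "supported_types": ["uuid", "email", "nosuchtype"],
}
known_ids = {"uuid", "email"}

problems = validate_tool(tool, known_ids)
print(problems)
```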
Metadata for this registry is collected from and interlinked with multiple metadata sources.
Each source link is defined in the links property, and its type sub-property can be one of:
- wikipedia - Wikipedia page URL
- wikidata - Wikidata property URL; it should also be defined as an id only, not a URL, in the wikidata_property property
- schema.org - URL of a Schema.org property, like https://schema.org/boolean
- datadrivendiscovery - D3M metadata registry https://metadata.datadrivendiscovery.org
- other - any other URL
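Because a Wikidata link is stored as a full URL while wikidata_property holds only the id, one could derive the id from the URL. The sketch below assumes the URL ends in the usual `Property:P<number>` form used by Wikidata property pages; the helper name `wikidata_property_id` and the sample link are hypothetical.

```python
import re


def wikidata_property_id(link_url):
    """Extract an id such as 'P31' from a Wikidata property URL, or return None."""
    match = re.search(r"Property:(P\d+)$", link_url)
    return match.group(1) if match else None


# Hypothetical link entry in the type/url shape described above.
link = {"type": "wikidata", "url": "https://www.wikidata.org/wiki/Property:P31"}
property_id = wikidata_property_id(link["url"])
print(property_id)
```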
Under development
Identification rules are regular expressions, other pattern-matching algorithms, and code that help identify a semantic data type directly or through a pattern.
- scripts/ - list of scripts to convert and process data types and related registry data
- src/ - minimal server-side code to run the metadata server
Under development
Current data update procedure:
- Edit YAML files in data directory
- Run the builder.py script. It rebuilds the data/datatypes_latest.json and data/datatypes_latest.jsonl files from the YAML files
- Run src/registry.py to see the changes locally at https://127.0.0.1:8089
- Add, commit and push changed files
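The build step above produces a JSON Lines file, that is, one JSON object per line. The sketch below only illustrates that output format with in-memory records; it is not the code of builder.py, and the `to_jsonl` and `from_jsonl` helpers and the sample records are hypothetical.

```python
import json


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False, sort_keys=True) + "\n" for r in records)


def from_jsonl(text):
    """Parse JSON Lines text back into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical records standing in for the parsed YAML files.
records = [
    {"id": "uuid", "name": "UUID"},
    {"id": "email", "name": "Email"},
]

jsonl_text = to_jsonl(records)
print(jsonl_text, end="")
```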
TODO: Add GitHub Actions for automatic registry build, version control, release, and validation.
The server uses the data/datatypes_latest.jsonl file to produce HTML for the datatypes list
- Go to "src" directory
- Run "python registry.py"
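To give a feel for how a datatypes list could be rendered from JSON Lines input, here is a small standard-library sketch. It does not reproduce registry.py, whose implementation is not shown here; the `render_datatype_list` helper, the markup, and the sample data are assumptions for illustration.

```python
import html
import json


def render_datatype_list(jsonl_text):
    """Render an HTML list with one item per JSON Lines record."""
    items = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        # Escape values so that registry text cannot inject markup.
        items.append("<li>{} ({})</li>".format(
            html.escape(record["name"]), html.escape(record["id"])))
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


# Hypothetical JSON Lines content, for illustration only.
sample = '{"id": "uuid", "name": "UUID"}\n{"id": "lt", "name": "a < b"}\n'
page_fragment = render_datatype_list(sample)
print(page_fragment)
```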
Maintainer - Ivan Begtin (ivan@begtin.tech)