nenad1002/CDM-trait-semantic-analysis

CDMTraitSemanticAnalysis project

The purpose of this project is to find traits for entities and attributes inside CDM (Common Data Model) schema documents. The schema documents folder can be found in the project.

Proposed traits are found by running NLP analysis on the name and description of every entity.

The Jaccard index between the set of generated traits and the set of sample traits is above 0.7.
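For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The snippet below is a minimal sketch of that calculation, not the project's own evaluation code; the function name `jaccard_index` and the two example trait lists are made up for illustration.

```python
def jaccard_index(generated, expected):
    """Return |A & B| / |A | B| for two collections of trait names."""
    a, b = set(generated), set(expected)
    if not a and not b:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical trait lists, only to show the arithmetic.
generated = ["means.identity", "means.demographic.age", "means.idea.company"]
sample = ["means.identity", "means.demographic.age", "means.idea.organization"]

# 2 shared traits out of 4 distinct traits overall.
print(jaccard_index(generated, sample))  # 0.5
```

A score of 1.0 means the generated and sample sets match exactly; 0.0 means they share no traits.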

The project uses both NLTK and spaCy as NLP libraries to tokenize, stem, and lemmatize the description sentences and to compare them using word vectors. To run it, install the requirements, then run main.py and follow the instructions it gives.
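To give a rough idea of what a vector-based comparison of descriptions looks like, here is a simplified sketch that uses only the Python standard library. It swaps in plain bag-of-words counts and cosine similarity for the NLTK and spaCy processing the project actually relies on, so it performs no stemming or lemmatization and uses no word embeddings. The helper names and the trait hint texts are hypothetical and do not come from the repository.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only (a crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the bag-of-words count vectors of two texts."""
    va, vb = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical short descriptions attached to traits, for illustration only.
trait_hints = {
    "means.demographic.age": "age of a person or subsidiary",
    "means.identity": "unique id that identifies a record",
    "means.location.city": "city where something is located",
}

description = "Represents the subsidiary age ID"

# Rank the candidate traits by similarity to the description.
ranked = sorted(
    trait_hints,
    key=lambda trait: cosine_similarity(description, trait_hints[trait]),
    reverse=True,
)
print(ranked)
```

With real word vectors, semantically related words that do not match literally (for example "company" and "organization") can also contribute to the score, which plain word counts cannot capture.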

Example:

Attribute name: agingId

Description: Represents the Microsoft's subsidiary age ID that have positive ROI every year.

Proposed traits: ['means.demographic.age', 'means.measurement.age', 'means.identity', 'means.idea.company', 'means.idea.organization', 'means.idea.organization.unit', 'means.identity.company.name']

As the proposed set of traits shows, the analyzer looks for relevant features in the description and ignores the words that do not help identify the correct traits.
