Changes to output format #12

Open
cmungall opened this issue Aug 12, 2021 · 5 comments

Comments

@cmungall
Member

Related to #11.

Refer to sssom for good practice

  • use lowercase
  • split entity into two columns
    • ID
    • Property (synonym, label, etc)
  • split origin
  • what is zone?
  • sentence id:
    • this is a bit opaque
    • should there be an intermediate file created that is one line per sentence with document ids and sentence ids as the first two columns?
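A minimal sketch of what such an intermediate file could look like, assuming the sentences have already been split per document. The function name `write_sentence_table`, the column names, and the sample documents are hypothetical and not part of the existing code; the sentence ids follow the S1, S2 style used in the output.

```python
import csv
import io


def write_sentence_table(docs, out):
    """Write one row per sentence, with document id and sentence id first.

    `docs` maps a document id to its list of sentences (hypothetical input shape).
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["document_id", "sentence_id", "sentence"])
    for document_id, sentences in docs.items():
        # Sentence ids restart at S1 for every document.
        for index, sentence in enumerate(sentences, start=1):
            writer.writerow([document_id, f"S{index}", sentence])


# Hypothetical sample data, written to an in-memory buffer instead of a file.
docs = {
    "doc1": ["First sentence.", "Second sentence."],
    "doc2": ["Only sentence."],
}
buffer = io.StringIO()
write_sentence_table(docs, buffer)
sentence_table = buffer.getvalue()
print(sentence_table)
```

With a table like this, a sentence id in the main output can be resolved to its text by joining on the first two columns.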
@hrshdhgd
Collaborator

  • what is zone?

This is something OGER spits out in the output. I have no clue what it represents. The documentation does not specify its significance either. I'll keep looking.

  • sentence id:

So when the text has multiple sentences, S1 is sentence 1, S2 is sentence 2, and so on. It basically splits the text into sentences at the separator ("." for example) and assigns these IDs to the sentences.
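A rough illustration of that numbering scheme, not OGER's actual implementation. The function name `assign_sentence_ids` and the sample text are made up, and splitting on a bare separator is a simplification that would mishandle abbreviations and decimals.

```python
def assign_sentence_ids(text, separator="."):
    """Split text at the separator and label the pieces S1, S2, ...

    Simplified sketch: the separator is dropped and empty pieces are skipped.
    """
    parts = [part.strip() for part in text.split(separator)]
    sentences = [part for part in parts if part]
    return [(f"S{index}", sentence) for index, sentence in enumerate(sentences, start=1)]


# Hypothetical input text.
example = assign_sentence_ids("Soil was sampled. DNA was extracted.")
print(example)
```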

@hrshdhgd
Collaborator

I looked in the code for OGER and it seems that 'zone' is represented by section_type in the code. This is relevant to clinical notes. I have seen in the past that when clinical text is recorded in EHRs, there are sections in the text marked by headings in all caps (e.g. DIAGNOSIS, TREATMENT PLAN, etc.). This field basically captures that. In our case it will always be blank.

@cmungall
Member Author

there may be analogs: a typical journal article is structured too, so maybe zone could also hold the structure/section heading, e.g. methods, abstract, ...?

@hrshdhgd
Collaborator

That makes sense.

hrshdhgd added a commit that referenced this issue Aug 13, 2021
@cmungall
Member Author

currently the match_field is sometimes empty, sometimes filled

let's change to reuse sssom data dictionary where possible

  • object_id: (currently entity_id). The ontology term id that was matched
  • object_label: (currently sometimes this is in match field). This is the primary label of the object_id, regardless of whether the match was on the label or synonym
  • object_category (currently "type")
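A sketch of how one output row might be converted to these names, under stated assumptions: the uppercase input headers, the helper name `to_sssom_style`, the `primary_labels` lookup, and the `EX:0000001` term are all hypothetical. Only the three target column names come from the list above.

```python
# Renames taken from the list above: entity_id -> object_id, type -> object_category.
COLUMN_RENAMES = {"entity_id": "object_id", "type": "object_category"}


def to_sssom_style(row, primary_labels):
    """Lowercase the column names, apply the renames, and fill object_label.

    `primary_labels` is a hypothetical mapping from term id to primary label,
    so object_label is set even when the match was on a synonym.
    """
    renamed = {}
    for key, value in row.items():
        key = key.strip().lower().replace(" ", "_")
        renamed[COLUMN_RENAMES.get(key, key)] = value
    renamed["object_label"] = primary_labels.get(renamed.get("object_id"), "")
    return renamed


# Hypothetical row and label lookup, for illustration only.
row = {"ENTITY_ID": "EX:0000001", "TYPE": "ExampleCategory", "MATCH_FIELD": ""}
primary_labels = {"EX:0000001": "example term"}
converted = to_sssom_style(row, primary_labels)
print(converted)
```

Filling object_label from a lookup rather than from the match field keeps it populated regardless of whether the match was on the label or on a synonym.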

hrshdhgd added a commit that referenced this issue Oct 22, 2021