Dumb Question - overview #12736
Hey standeman,

If you would like to extract pieces of information from texts, it's good to think about these as spans: each span of text is a contiguous sequence of characters. For example, let's consider:

    import spacy
    nlp = spacy.load("en_core_web_lg")
    doc = nlp("Jerry walked past the Fish Square where Sandra Dune was waiting.")
    print(doc.ents)

This prints the entity spans that the model found in the sentence.
When you annotate your data for named entity recognition (or span recognition problems in general), what you ultimately do is label these character spans in your raw text. When creating the training data, you should annotate all spans that you would like your model to extract. It's always good to start by marking a bunch of spans yourself to get a good feeling for what kinds of patterns exist in your data set. Then I would recommend going through our tutorial for rule-based matching to get some inspiration for some rules as well: https://spacy.io/usage/rule-based-matching
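To make the idea of character spans a bit more concrete, here is a minimal sketch of what a few hand-written rules could look like for the "Visit: ##/##/####" style of dates from the question. It uses only Python's standard `re` module rather than spaCy's matcher, and the cue phrases, the `TREATMENT_DATE` label, and the `find_treatment_dates` helper are all assumptions made up for illustration, not anything spaCy defines.

```python
import re

# Hypothetical cue phrases, taken from the examples in the question; extend as needed.
CUE_PHRASES = ["Date of Visit", "Visit Date", "Visit", "Encounter"]

# A cue phrase, a colon, optional whitespace, then a date such as 03/14/2021.
# The date sits in a named group so that its own character offsets can be reported.
DATE_RULE = re.compile(
    r"\b(?:" + "|".join(re.escape(cue) for cue in CUE_PHRASES) + r"):\s*"
    r"(?P<date>\d{2}/\d{2}/\d{4})\b"
)


def find_treatment_dates(text):
    """Return (start_char, end_char, label) for each date that follows a cue phrase."""
    return [
        (match.start("date"), match.end("date"), "TREATMENT_DATE")  # label is an assumption
        for match in DATE_RULE.finditer(text)
    ]


note = "Encounter: 03/14/2021\nPatient reports pain since 01/02/2021."
spans = find_treatment_dates(note)
for start, end, label in spans:
    # Only the date after "Encounter:" is reported; the second date has no cue phrase.
    print(start, end, label, repr(note[start:end]))
```

The `(start_char, end_char, label)` triples are the same kind of character-span annotation described above. If you want them as spaCy spans, `doc.char_span(start, end, label=label)` can convert them; it returns `None` when the offsets do not line up with token boundaries, so it is worth checking for that.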
I often tell my legal clients there are no dumb questions, but when it comes to my creating an NLP app (which is what I am trying to do), I may be breaking that rule. I have done a lot of reading and studying and just don't quite get the framework. It seems, however, that spaCy and/or medspaCy are good choices. So if I could, I would like to use just one task of my project in the pre-processing area as an example.
I want to get and tag the treatment date for a variety of medical treatment notes. They come in all shapes and sizes depending on the EHR system. These are PDFs; I have no ability to access the actual EHR system that created them. Sometimes we have "Visit: ##/##/####", sometimes "Encounter: ##/##/####", other times "Date of Visit: ##/##/####". And sometimes we simply have a plain date with no identifying text, but the location (at the top) or the font (bold and bigger than other text) is the clue to that date being in fact the date of treatment.
Now I know I could use a rules-based approach and try to cover as many instances as possible of "introductory text" that point to a treatment date. Is this the same thing as going through "training data" and hand-marking each instance of a correct treatment date? And if I train a model this way, would it be without question that, if in training I had marked "Visit Date: ##/##/####" as a "treatment date", my model is going to work on new PDF records with "Visit Date: ##/##/####"?
But would I also be correct that if I mark a date in my training data that is near the top of the page, in bold text, the AI is going to use those criteria in the search for treatment dates?
And is it true that the real power of the AI is that it is going to come up with different criteria for finding "treatment date" based upon my training data - factors I did not even pick up on?