Extract Only Year from text #4

Open
swathimithran opened this issue Jul 26, 2017 · 5 comments
Comments

@swathimithran

Thanks for this great project.
Currently I am able to extract dates, but when the text contains only a year, for example "In year 2011 the incident happened.", the program returns "2011-01-01 00:00:00+00".

But we need it returned as "2011-01-01 12:14:12+00".
Can you please let me know what I should change in the library to achieve this?

The basic aim is to differentiate the original "1st Jan 2011" from "2011".

Thanks

@DanielJDufour
Owner

Great question. Give me 24 hours and I'll have a solution for you :)

@swathimithran
Author

Thanks man!!!!!!
waiting for your reply :)

@DanielJDufour
Owner

Hey, I thought about it a lot and this is what I came up with. You can set return_precision to True and the functions will return a tuple of (date, precision). Precision can be "year", "month", or "day". So precision is "day" for "1st Jan 2011" and "year" for "2011". Consult the Readme for a full example. Let me know if this doesn't work for you and we can work on another solution! Thanks for your interest!
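To illustrate the idea of returning a precision alongside each date, here is a minimal standalone sketch. It does not use date_extractor's own code; the pattern, the `extract_with_precision` name, and the month table are assumptions made up for this example, and the month matching is deliberately loose.

```python
import re
from datetime import datetime, timezone

MONTH_NAMES = ["jan", "feb", "mar", "apr", "may", "jun",
               "jul", "aug", "sep", "oct", "nov", "dec"]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

# Hypothetical pattern: optional "<day> <month>" or "<month>" before a 4-digit year.
# A day is only accepted when a month name follows it.
DATE_PATTERN = re.compile(
    r"\b(?:(?:(?P<day>\d{1,2})(?:st|nd|rd|th)?\s+)?"
    r"(?P<month>(?:" + "|".join(MONTH_NAMES) + r")[a-z]*)\s+)?"
    r"(?P<year>\d{4})\b",
    re.IGNORECASE,
)

def extract_with_precision(text):
    """Return a list of (datetime, precision) tuples found in text."""
    results = []
    for match in DATE_PATTERN.finditer(text):
        day, month, year = match.group("day"), match.group("month"), match.group("year")
        if day:
            precision = "day"
        elif month:
            precision = "month"
        else:
            precision = "year"
        date = datetime(
            int(year),
            MONTHS[month[:3].lower()] if month else 1,  # missing month defaults to January
            int(day) if day else 1,                     # missing day defaults to the 1st
            tzinfo=timezone.utc,
        )
        results.append((date, precision))
    return results
```

With this sketch, "1st Jan 2011" and "2011" both produce a datetime for 1 January 2011, but the first is tagged "day" and the second "year", which is the distinction asked for above.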

@swathimithran
Author

swathimithran commented Jul 27, 2017

Hi Daniel,
Thanks for the immediate update. I tried it and it's working perfectly for our requirement. One more issue I am facing is the normalisation of two-digit years in the format "98".

For example, my text is:

He was selected by the Sacramento Kings in the 2nd round (48th overall) of the 2004 NBA_Draft. A 6'4' guard from Morehead State University, Minard was signed by the Kings in July 2004, but they waived him in November the same year, and so far he has never appeared in an NBA game.

Output:

[(datetime.datetime(1948, 1, 1, 0, 0, tzinfo=<UTC>), 'year'), (datetime.datetime(2004, 1, 1, 0, 0, tzinfo=<UTC>), 'year'), (datetime.datetime(2004, 7, 1, 0, 0, tzinfo=<UTC>), 'month')]

Here, the year 1948 should not have been extracted; it comes from "48th".

I think we can solve this issue if we implement logic to normalise only those two-digit numbers that are preceded by "-" and not followed by "th".
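As a rough sketch of that rule (the regular expression and the `two_digit_year_candidates` name are my own assumptions, not the library's actual pattern), it could look like this:

```python
import re

# Two digits that are preceded by "-", are not part of a longer number,
# and are not followed by "th".
TWO_DIGIT_YEAR = re.compile(r"(?<=-)\d{2}(?!\d)(?!th)")

def two_digit_year_candidates(text):
    """Return the two-digit strings that the proposed rule would accept."""
    return TWO_DIGIT_YEAR.findall(text)
```

With this rule, "48th" in the text above is no longer picked up, while something like "Jan-98" still yields "98".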

Please let me know if you have any other solution to resolve this.

Thanks & Regards,
M Swathi Mithran

@DanielJDufour
Owner

DanielJDufour commented Jul 28, 2017

@swathimithran, thanks for the example. Your help is sincerely appreciated! As a quick fix, I made it so it won't capture ordinal numbers that end in th. You can view the change here: 68deab4

However, we will need more discussion on what rule should be used for what precedes the two digits. Here are a few examples of two-digit years:

  • 12/23/09
  • 15-11-21
  • 9/1/99 22:00
  • paper_170120 (in a filename)
  • taxes_16.docx

Basically, I'm afraid that if we make the rule too strict, people won't be able to parse dates out of filenames.
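To make that concern concrete, here is a small sketch that checks a candidate rule against the examples listed above. The strict rule shown (two digits preceded by "-" and not followed by "th") and the `missed_examples` helper are assumptions for illustration only, not code from the library.

```python
import re

# Hypothetical strict rule: two digits preceded by "-" and not followed by "th".
STRICT_RULE = re.compile(r"(?<=-)\d{2}(?!\d)(?!th)")

EXAMPLES = [
    "12/23/09",
    "15-11-21",
    "9/1/99 22:00",
    "paper_170120",
    "taxes_16.docx",
]

def missed_examples(rule, samples):
    """Return the samples in which the rule finds no two-digit candidate at all."""
    return [sample for sample in samples if not rule.search(sample)]
```

Calling `missed_examples(STRICT_RULE, EXAMPLES)` returns every example except "15-11-21", so a rule this strict would drop the slash-separated dates and both filenames.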

Here are a few possible solutions. Which would you like?

Option 1

Add a parameter source_type, which could be filename, filepath, text, html, or javascript. This would let you apply stricter rules to some types of sources (such as running text) and looser rules to others (such as filenames). Here's an example of what this could look like:

from date_extractor import extract_date
string = "I went to my first basketball game in 1990.  My favorite player had number 34."
date = extract_date(string, source_type="text")

string = "my_resume_091216.pdf"
date = extract_date(string, source_type="filename")
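Internally, this could be as simple as a lookup table of rules keyed by source type. The sketch below is only an assumption about how it might work; the `RULES` table, its patterns, and `find_candidates` are hypothetical and return raw candidate strings rather than parsed dates.

```python
import re

# Hypothetical rules per source type (only two of the proposed types are shown).
RULES = {
    # In running text, accept only four-digit years starting with 19 or 20.
    "text": re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)"),
    # In filenames, also accept six-digit date blocks and bare two-digit years.
    "filename": re.compile(r"(?<!\d)(?:\d{6}|(?:19|20)\d{2}|\d{2})(?!\d)"),
}

def find_candidates(string, source_type="text"):
    """Return the digit strings that the rule for source_type accepts."""
    if source_type not in RULES:
        raise ValueError("unsupported source_type: %r" % source_type)
    return RULES[source_type].findall(string)
```

Under these assumed rules the jersey number 34 is ignored for source_type="text", while "091216" and "16" are still picked up for source_type="filename".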

Option 2

The second option could be allowing users to override existing patterns.

import date_extractor
date_extractor.patterns['y'] = r"\d{4}"  # raw string, so \d is not treated as a string escape
string = "The author asserts that the earliest encounter never happened (43)"
date_extractor.extract_dates(string)
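For this to work, the library would have to read the pattern at call time rather than compile it once at import. A minimal sketch of that design, assuming a module-level `patterns` dict and a `find_years` function (both hypothetical, with a made-up default pattern):

```python
import re

# Hypothetical module-level table that users may override.
patterns = {"y": r"\d{4}|\d{2}"}

def find_years(text):
    """Find year candidates using whatever patterns['y'] currently holds."""
    # Compile on each call so that a user's override takes effect immediately.
    regex = re.compile(r"(?<!\d)(?:%s)(?!\d)" % patterns["y"])
    return [match.group(0) for match in regex.finditer(text)]
```

With the assumed default, the citation "(43)" is reported as a candidate; after setting patterns["y"] to r"\d{4}", the same text yields nothing.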

Option 3

The third option could be returning a confidence level (low, medium, or high) with each date. You could then filter the results depending on your need.

from date_extractor import extract_dates
string = "He was selected by the Sacramento Kings in the 2nd round (48th overall) of the 2004 NBA_Draft. A 6'4' guard from Morehead State University, Minard was signed by the Kings in July 2004, but they waived him in November the same year, and so far he has never appeared in an NBA game."
dates = extract_dates(string, return_confidence=True)
# dates = [(datetime.datetime(1948, 1, 1, 0, 0, tzinfo=<UTC>), 'low'), (datetime.datetime(2004, 1, 1, 0, 0, tzinfo=<UTC>), 'medium'), (datetime.datetime(2004, 7, 1, 0, 0, tzinfo=<UTC>), 'high')]
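Filtering on the caller's side could then look something like the sketch below. It assumes the (date, confidence) tuple shape shown above; the `LEVELS` ordering and `filter_by_confidence` are hypothetical helper names, and the sample list is hard-coded rather than produced by the library.

```python
from datetime import datetime, timezone

# Assumed ordering of the three proposed confidence levels.
LEVELS = {"low": 0, "medium": 1, "high": 2}

def filter_by_confidence(dates, minimum="medium"):
    """Keep only the dates whose confidence is at least `minimum`."""
    threshold = LEVELS[minimum]
    return [date for date, confidence in dates if LEVELS[confidence] >= threshold]

# Sample data mirroring the proposed output above.
dates = [
    (datetime(1948, 1, 1, tzinfo=timezone.utc), "low"),
    (datetime(2004, 1, 1, tzinfo=timezone.utc), "medium"),
    (datetime(2004, 7, 1, tzinfo=timezone.utc), "high"),
]
```

Here `filter_by_confidence(dates)` would drop the 1948 entry and keep the two 2004 dates.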

Option 4

I'm open to suggestions as long as it doesn't prevent users from extracting years out of filenames.

Which do you prefer? What do you think?
