Extract Only Year from text #4

swathimithran · 2017-07-26T10:24:00Z

Thanks for this great project.
Currently I am able to extract dates, but when the text contains only a year, for example "In year 2011 the incident happened.", the program retrieves "2011-01-01 00:00:00+00".

But we need to retrieve it as "2011-01-01 12:14:12+00".
Can you please let me know what I should change in the library to achieve this?

The basic aim is to differentiate the original "1st Jan 2011" from "2011".

Thanks

DanielJDufour · 2017-07-26T12:21:57Z

Great question. Give me 24 hours and I'll have a solution for you :)

swathimithran · 2017-07-26T12:29:25Z

Thanks man!!!!!!
waiting for your reply :)

DanielJDufour · 2017-07-27T04:33:44Z

Hey, I thought about it a lot and this is what I came up with. You can set return_precision to True and the functions will return a tuple of (date, precision). Precision can be "year", "month", or "day". So precision is "day" for "1st Jan 2011" and "year" for "2011". Consult the Readme for a full example. Let me know if this doesn't work for you and we can work on another solution! Thanks for your interest!
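
To illustrate the (date, precision) idea on its own, here is a minimal standalone sketch. It does not call date_extractor: `parse_with_precision`, `PATTERN`, and `MONTHS` are hypothetical names, and the library's real matching rules likely differ. The point is only that "1st Jan 2011" and "2011" can map to the same datetime while carrying different precision labels.

```python
import re
from datetime import datetime

# Hypothetical lookup of three-letter month prefixes (not taken from the library).
MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

# Hypothetical pattern: optional day, optional month word, required 4-digit year.
PATTERN = re.compile(
    r"(?:(?P<day>\d{1,2})(?:st|nd|rd|th)?\s+)?"
    r"(?:(?P<month>[A-Za-z]{3})[a-z]*\s+)?"
    r"(?P<year>\d{4})"
)

def parse_with_precision(text):
    """Return (datetime, precision) for the first match, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    year = int(match.group("year"))
    month_word = match.group("month")
    # A preceding word that is not a month name (e.g. "year") yields None here.
    month = MONTHS.get(month_word.lower()) if month_word else None
    if month is None:
        return datetime(year, 1, 1), "year"
    day = match.group("day")
    if day is None:
        return datetime(year, month, 1), "month"
    return datetime(year, month, int(day)), "day"

print(parse_with_precision("1st Jan 2011"))
print(parse_with_precision("In year 2011 the incident happened."))
```

Both calls return a datetime of 1 January 2011; only the second element of the tuple ("day" versus "year") tells them apart.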

swathimithran · 2017-07-27T07:43:47Z

Hi Daniel,
Thanks for the immediate update. I tried it and it's working perfectly for our requirement. One more issue I am facing is the normalisation of dates in two-digit formats such as "98".

For example, my text is:

He was selected by the Sacramento Kings in the 2nd round (48th overall) of the 2004 NBA_Draft. A 6'4' guard from Morehead State University, Minard was signed by the Kings in July 2004, but they waived him in November the same year, and so far he has never appeared in an NBA game.

Output:

[(datetime.datetime(1948, 1, 1, 0, 0, tzinfo=), 'year'), (datetime.datetime(2004, 1, 1, 0, 0, tzinfo=), 'year'), (datetime.datetime(2004, 7, 1, 0, 0, tzinfo=), 'month')]

So here the year 1948 should not have been extracted.

I think we can solve this issue if we implement logic to normalise only those 2-digit numbers which are preceded by "-" and not followed by "th".

Please let me know if you have any other solution to resolve this.

Thanks & Regards,
M Swathi Mithran

DanielJDufour · 2017-07-28T03:19:39Z

@swathimithran, thanks for the example. Your help is sincerely appreciated! As a quick fix, I made it so it won't capture ordinal numbers that end in th. You can view the change here: 68deab4
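
For readers who want to see the ordinal rule in isolation, here is a rough sketch using a negative lookahead. It is not the code from the commit above: `TWO_DIGIT_YEAR` and `two_digit_candidates` are hypothetical names, and rejecting "st", "nd", and "rd" in addition to "th" is an assumption that goes beyond the quick fix described here.

```python
import re

# Hypothetical pattern: exactly two digits, not part of a longer number,
# and not followed by an ordinal suffix such as "th".
TWO_DIGIT_YEAR = re.compile(r"(?<!\d)\d{2}(?!\d|st|nd|rd|th)")

def two_digit_candidates(text):
    """Return every two-digit token that could still be read as a year."""
    return TWO_DIGIT_YEAR.findall(text)

print(two_digit_candidates("the 2nd round (48th overall) of the 2004 NBA_Draft"))
# → []
print(two_digit_candidates("taxes_16.docx"))
# → ['16']
```

The suffix check alone removes "48th" but does nothing about other stray two-digit numbers, which is why the rule for what precedes the digits still needs discussion.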

However, we will need more discussion on what rule should be used for what precedes the 2 digits. Here are a few examples of 2-digit years:

12/23/09
15-11-21
9/1/99 22:00
paper_170120 (in a filename)
taxes_16.docx

Basically, I'm afraid that if we make the rule too strict, people won't be able to parse dates out of filenames.

Here are a few possible solutions. Which would you like?

Option 1

Add a parameter source_type, which could be filename, filepath, text, html, or javascript. This way you could restrict the rules to certain types of sources. Here's an example of what this could look like:

from date_extractor import extract_date
string = "I went to my first basketball game in 1990.  My favorite player had number 34."
date = extract_date(string, source_type="text")

string = "my_resume_091216.pdf"
date = extract_date(string, source_type="filename")
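
One way such a source_type switch might work internally is a lookup table of patterns per source. The sketch below is only an illustration under that assumption: `TOKEN_PATTERNS` and `find_date_tokens` are hypothetical names, the two regexes are invented for the example, and it returns raw matched tokens rather than parsed dates.

```python
import re

# Hypothetical per-source rules; none of these names exist in date_extractor.
TOKEN_PATTERNS = {
    # Prose: accept only four-digit years, so a jersey number like 34 is ignored.
    "text": re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)"),
    # Filenames: additionally accept compact six-digit dates such as 091216.
    "filename": re.compile(r"(?<!\d)(?:\d{6}|(?:19|20)\d{2})(?!\d)"),
}

def find_date_tokens(string, source_type="text"):
    """Return the raw date-like tokens allowed for the given source type."""
    try:
        pattern = TOKEN_PATTERNS[source_type]
    except KeyError:
        raise ValueError(f"unsupported source_type: {source_type!r}")
    return pattern.findall(string)

prose = "I went to my first basketball game in 1990. My favorite player had number 34."
print(find_date_tokens(prose, source_type="text"))
# → ['1990']
print(find_date_tokens("my_resume_091216.pdf", source_type="filename"))
# → ['091216']
```

Lookarounds are used instead of `\b` because an underscore counts as a word character, so `\b` would not match between "_" and a digit in a filename.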

Option 2

The second option could be allowing users to override existing patterns.

import date_extractor
# Restrict the year pattern to four digits; the raw string avoids an invalid "\d" escape
date_extractor.patterns['y'] = r"\d{4}"
string = "The author asserts that the earliest encounter never happened (43)"
date_extractor.extract_dates(string)
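
The effect of such an override can be tried out with the standard library alone. In this sketch the `patterns` dict and `find_years` are stand-ins written for the example, not the library's own objects, and the default year pattern shown is an assumption.

```python
import re

# Stand-in for an overridable pattern table (assumed default: 4 or 2 digits).
patterns = {"y": r"\d{4}|\d{2}"}

def find_years(text):
    """Match the current year pattern, refusing digits that touch other digits."""
    regex = re.compile(r"(?<!\d)(?:" + patterns["y"] + r")(?!\d)")
    return regex.findall(text)

string = "The author asserts that the earliest encounter never happened (43)"

before = find_years(string)   # the citation number is picked up as a year
patterns["y"] = r"\d{4}"      # override: four-digit years only
after = find_years(string)    # nothing matches any more

print(before, after)
# → ['43'] []
```

The pattern is compiled inside the function so that an override takes effect on the next call; a design that compiles once at import time would need an explicit refresh step.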

Option 3

The third option could be returning a confidence level: low, medium, or high. You could then filter depending on your need.

from date_extractor import extract_dates
string = "He was selected by the Sacramento Kings in the 2nd round (48th overall) of the 2004 NBA_Draft. A 6'4' guard from Morehead State University, Minard was signed by the Kings in July 2004, but they waived him in November the same year, and so far he has never appeared in an NBA game."
dates = extract_dates(string, return_confidence=True)
# dates = [(datetime.datetime(1948, 1, 1, 0, 0, tzinfo=), 'low'), (datetime.datetime(2004, 1, 1, 0, 0, tzinfo=), 'medium'), (datetime.datetime(2004, 7, 1, 0, 0, tzinfo=), 'high')]
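
Filtering on the caller's side could then be a few lines. This sketch assumes the (date, confidence) tuple shape proposed above; `CONFIDENCE_RANK` and `filter_by_confidence` are hypothetical names, and the sample data uses naive datetimes because the timezone in the output above is not shown.

```python
from datetime import datetime

# Hypothetical ordering of the three proposed confidence labels.
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}

def filter_by_confidence(dates, minimum="medium"):
    """Keep only the dates whose confidence is at least `minimum`."""
    threshold = CONFIDENCE_RANK[minimum]
    return [date for date, confidence in dates
            if CONFIDENCE_RANK[confidence] >= threshold]

# Sample data shaped like the proposed return value.
dates = [
    (datetime(1948, 1, 1), "low"),
    (datetime(2004, 1, 1), "medium"),
    (datetime(2004, 7, 1), "high"),
]

kept = filter_by_confidence(dates, minimum="medium")
print(kept)
# → [datetime.datetime(2004, 1, 1, 0, 0), datetime.datetime(2004, 7, 1, 0, 0)]
```

With a "medium" threshold the spurious 1948 entry is dropped while both 2004 dates survive.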

Option 4

I'm open to suggestions as long as it doesn't prevent users from extracting years out of filenames.

Which do you prefer? What do you think?
