py_rule_based is a rule-based approach, implemented with regular expressions, that detects the following date patterns:
- yyyy(./-)mm(./-)dd
- dd(./-)mm(./-)yyyy
- yyyy(./-)yyyy
- yyyys
- yyyy
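The patterns above can be approximated with a single regular expression. The following is a minimal sketch, not the expressions actually used inside py_rule_based: the names `SEP`, `DATE_RE` and `find_date_candidates` are hypothetical, and the sketch does not check that months or days fall in a valid range.

```python
import re

# Hypothetical separator class covering ".", "/" and "-".
SEP = r"[./-]"

# Alternatives are ordered from most to least specific so that a full date
# is matched before the bare year it contains. Illustrative only.
DATE_RE = re.compile(
    rf"\b(?:\d{{4}}{SEP}\d{{1,2}}{SEP}\d{{1,2}}"   # yyyy(./-)mm(./-)dd
    rf"|\d{{1,2}}{SEP}\d{{1,2}}{SEP}\d{{4}}"       # dd(./-)mm(./-)yyyy
    rf"|\d{{4}}{SEP}\d{{4}}"                       # yyyy(./-)yyyy
    rf"|\d{{4}}s"                                  # yyyys
    rf"|\d{{4}})\b"                                # yyyy
)


def find_date_candidates(text):
    """Return every substring of text that matches one of the patterns."""
    return [match.group(0) for match in DATE_RE.finditer(text)]


print(find_date_candidates("1939-1945, the 1980s and 2011/01/02 or 1937"))
```

Ordering the alternatives this way matters because the regex engine takes the first alternative that succeeds at a given position.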
It has been developed by Jorge Mendes under the supervision of Professor Ricardo Campos in the scope of the Final Project of the Computer Science degree at the Polytechnic Institute of Tomar, Portugal.
https://github.com/JMendes1995/py_rule_based
pip install git+https://github.com/JMendes1995/py_rule_based.git
from py_rule_based import py_rule_based
text = "The start of the war in Europe is generally held to be 1 September 1939,"\
"beginning with the German invasion of Poland; the United Kingdom and France declared war on Germany two days later."\
"The dates for the beginning of war in the Pacific include the start of the Second Sino-Japanese War on 7 July 1937,"\
"or even the Japanese invasion of Manchuria on 19-09-1931."
The default date_granularity is "full" (the highest granularity detected is retrieved), begin_date is 0 and end_date is 2100, which means that all the dates within this range are retrieved. The following two calls are therefore equivalent:
results = py_rule_based(text)
results = py_rule_based(text, date_granularity='full', begin_date=0, end_date=2100)
The output is a list of three elements, or an empty list [] if no temporal expression is found in the text. The three elements are:
- a list of two-element tuples (e.g., ('2011-01-02', '2011-01-02')). The first element is the temporal expression as normalized by py_rule_based; the second is the temporal expression as it was found in the text. The two may differ when date_granularity is not "full";
- a normalized version of the text, where each temporal expression is wrapped in a <d> tag;
- the execution time of the algorithm, as a dictionary with two entries: rule_based_processing (the time spent by the rule-based model in extracting temporal expressions) and rule_based_text_normalization (the time spent by the program in labelling the temporal expressions found in the text with a <d> tag).
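Because the result is either a three-element list or an empty list, callers may want to guard against the empty case before indexing. The sketch below assumes only the structure described above; `unpack_results` is a hypothetical helper, not part of py_rule_based, and `sample` is hand-written data shaped like a real result.

```python
def unpack_results(results):
    """Split a py_rule_based-style result into its three parts.

    The empty-list case (no temporal expression found) is mapped to
    ([], None, {}) so that callers can always unpack three values.
    """
    if not results:
        return [], None, {}
    expressions, tagged_text, timings = results
    return expressions, tagged_text, timings


# Hand-written sample with the documented shape (text shortened).
sample = [
    [("1939", "1939"), ("19-09-1931", "19-09-1931")],
    "... 1 September <d>1939</d> ... on <d>19-09-1931</d>.",
    {"rule_based_processing": 0.001, "rule_based_text_normalization": 0},
]

expressions, tagged_text, timings = unpack_results(sample)
for normalized, original in expressions:
    print(f"{original!r} -> {normalized!r}")
```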
TempExpressions = results[0]
TempExpressions
[('1939', '1939'), ('1937', '1937'), ('19-09-1931', '19-09-1931')]
TextNormalized = results[1]
TextNormalized
'The start of the war in Europe is generally held to be 1 September <d>1939</d>,beginning with the German invasion of Poland; the United Kingdom and France declared war on Germany two days later.The dates for the beginning of war in the Pacific include the start of the Second Sino-Japanese War on 7 July <d>1937</d>,or even the Japanese invasion of Manchuria on <d>19-09-1931</d>.'
ExecutionTime = results[2]
ExecutionTime
{'rule_based_processing': 0.000993490219116211, 'rule_based_text_normalization': 0}
Besides running py_rule_based with the default parameters, users can also specify some advanced options. These are:
- date_granularity: "full" (highest possible granularity detected will be retrieved); "year" (YYYY will be retrieved); "month" (YYYY-MM will be retrieved); "day" (YYYY-MM-DD will be retrieved);
- begin_date: an integer (default is 0) that defines the beginning of the time period to consider;
- end_date: an integer (default is 2100) that defines the end of the time period to consider.
results = py_rule_based(text, date_granularity='year', begin_date=1930, end_date=1935)
The output follows the same patterns as described above.
TempExpressions = results[0]
TempExpressions
[('1931', '19-09-1931')]
TextNormalized = results[1]
TextNormalized
'The start of the war in Europe is generally held to be 1 September 1939,beginning with the German invasion of Poland; the United Kingdom and France declared war on Germany two days later.The dates for the beginning of war in the Pacific include the start of the Second Sino-Japanese War on 7 July 1937,or even the Japanese invasion of Manchuria on <d>1931</d>.'
ExecutionTime = results[2]
ExecutionTime
{'rule_based_processing': 0.0, 'rule_based_text_normalization': 0.0}
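The combined effect of date_granularity, begin_date and end_date on the example above can be illustrated with a small sketch. This is not the library's implementation: `to_granularity`, `year_of` and `in_range` are hypothetical helpers, and treating both bounds of the range as inclusive is an assumption.

```python
import re


def to_granularity(expression, date_granularity="full"):
    """Reduce a three-part date to the requested granularity (illustrative)."""
    if date_granularity == "full":
        return expression
    parts = re.split(r"[./-]", expression)
    if len(parts) != 3:
        return expression  # bare years and other forms are left untouched here
    if len(parts[0]) == 4:
        year, month, day = parts   # yyyy-mm-dd
    else:
        day, month, year = parts   # dd-mm-yyyy
    keep = {"year": 1, "month": 2, "day": 3}[date_granularity]
    return "-".join([year, month, day][:keep])


def year_of(expression):
    """Return the first four-digit run in the expression as an int, or None."""
    match = re.search(r"\d{4}", expression)
    return int(match.group(0)) if match else None


def in_range(expression, begin_date=0, end_date=2100):
    """Assumption: both bounds are inclusive."""
    year = year_of(expression)
    return year is not None and begin_date <= year <= end_date


detected = ["1939", "1937", "19-09-1931"]
filtered = [
    (to_granularity(d, "year"), d)
    for d in detected
    if in_range(d, begin_date=1930, end_date=1935)
]
print(filtered)
```

Under these assumptions, only the 1931 date survives the 1930 to 1935 range, and it is reduced to its year.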
py_rule_based --help
Make sure that the input parameters are within quotes.
Default Parameters:
py_rule_based -t "2011 Haiti Earthquake Anniversary."
All the Parameters:
py_rule_based -t "2011 Haiti Earthquake Anniversary." -dg "year" -bd "2000" -ed "2015"
[required]: either specify a text or an input_file path.
----------------------------------------------------------------------------------------------------------------------------------
-t, --text TEXT Input text.
Example: "2011 Haiti Earthquake Anniversary."
-i, --input_file TEXT Text path.
Example: "C:\\text.txt"
[not required]
-----------------------------------------------------------------------------------------------------------------------------------
-dg, --date_granularity TEXT Date granularity
Default: "full"
Options:
"full" - (means that all types of granularity will be retrieved, from the coarsest to the finest granularity).
"day" - (means that for the date YYYY-MM-DD-HH:MM:SS only the YYYY-MM-DD will be retrieved).
"month" - (means that for the date YYYY-MM-DD-HH:MM:SS only the YYYY-MM will be retrieved).
"year" - (means that for the date YYYY-MM-DD-HH:MM:SS only the YYYY will be retrieved).
-bd, --begin_date TEXT begin date.
Options:
any integer > 0
-ed, --end_date TEXT end date.
Options:
any integer > 0
--help - Show this message and exit.
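When the command line tool is driven from a script, building the arguments as a list avoids manual quoting. The sketch below uses only the flags listed in the help text above; `build_command` is a hypothetical helper, and it assumes the `py_rule_based` executable is on the PATH if the command is actually run.

```python
import shlex


def build_command(text, date_granularity=None, begin_date=None, end_date=None):
    """Assemble the argument list for the py_rule_based command line tool."""
    args = ["py_rule_based", "-t", text]
    if date_granularity is not None:
        args += ["-dg", date_granularity]
    if begin_date is not None:
        args += ["-bd", str(begin_date)]
    if end_date is not None:
        args += ["-ed", str(end_date)]
    return args


command = build_command("2011 Haiti Earthquake Anniversary.", "year", 2000, 2015)

# shlex.join (Python 3.8+) renders the list as a safely quoted shell string.
print(shlex.join(command))
```

The resulting list can be passed directly to `subprocess.run(command)`, which keeps the text argument intact without a shell.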
We highly recommend using this Python notebook if you are interested in experimenting with py_rule_based.
Please check py_heideltime if you are interested in extracting temporal expressions using the HeidelTime temporal tagger.
Please check Time-Matters if you are interested in detecting the relevance (score) of dates in a text.