py_rule_based

py_rule_based is a rule-based temporal expression extractor, built on regular expressions, that detects the following date patterns (where (./-) stands for any one of the separators ".", "/" or "-"):

  • yyyy(./-)mm(./-)dd
  • dd(./-)mm(./-)yyyy
  • yyyy(./-)yyyy
  • yyyys
  • yyyy
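
To make the pattern list concrete, here is a minimal sketch of how such patterns could be written as regular expressions. It is not the regex set that py_rule_based ships: the group names (ymd, dmy, year_range, decade, year), the helper find_dates and the use of word boundaries are all assumptions made for illustration.

```python
import re

# Illustrative only: these are NOT the library's own regexes.
SEP = r"[./-]"  # any one of the separators ".", "/" or "-"

# Order matters: more specific patterns come before the bare year.
DATE_RE = re.compile(
    "|".join([
        rf"(?P<ymd>\b\d{{4}}{SEP}\d{{1,2}}{SEP}\d{{1,2}}\b)",   # yyyy(./-)mm(./-)dd
        rf"(?P<dmy>\b\d{{1,2}}{SEP}\d{{1,2}}{SEP}\d{{4}}\b)",   # dd(./-)mm(./-)yyyy
        rf"(?P<year_range>\b\d{{4}}{SEP}\d{{4}}\b)",            # yyyy(./-)yyyy
        r"(?P<decade>\b\d{4}s\b)",                              # yyyys
        r"(?P<year>\b\d{4}\b)",                                 # yyyy
    ])
)


def find_dates(text):
    """Return (pattern_name, matched_text) pairs in order of appearance."""
    return [(m.lastgroup, m.group()) for m in DATE_RE.finditer(text)]


if __name__ == "__main__":
    sample = "On 19-09-1931, then 1914-1918, the 1980s, 2011/01/02 and 1939."
    for kind, value in find_dates(sample):
        print(kind, value)
```

Because the alternatives are tried from left to right, a full date such as 19-09-1931 is captured whole instead of being reduced to its year.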

It was developed by Jorge Mendes under the supervision of Professor Ricardo Campos as part of the Final Project of the Computer Science degree at the Polytechnic Institute of Tomar, Portugal.

Where can I find py_rule_based?

https://github.com/JMendes1995/py_rule_based

How to install py_rule_based

pip install git+https://github.com/JMendes1995/py_rule_based.git

How to use py_rule_based

from py_rule_based import py_rule_based

text = "The start of the war in Europe is generally held to be 1 September 1939,"\
        "beginning with the German invasion of Poland; the United Kingdom and France declared war on Germany two days later."\
        "The dates for the beginning of war in the Pacific include the start of the Second Sino-Japanese War on 7 July 1937,"\
        "or even the Japanese invasion of Manchuria on 19-09-1931."

With the default parameters

By default, date_granularity is "full" (the finest granularity detected is retrieved), begin_date is 0 and end_date is 2100, so every date within this range is retrieved. The following two calls are equivalent:

results = py_rule_based(text)
results = py_rule_based(text, date_granularity='full', begin_date=0, end_date=2100)

Output

The output is a list of three elements, or an empty list [] if no temporal expression is found in the text. The three elements are:

  • a list of two-element tuples (e.g., ('2011-01-02', '2011-01-02')). The first element is the temporal expression as normalized by py_rule_based; the second is the temporal expression as it appears in the text. The two may differ when date_granularity is not "full";
  • a normalized version of the text, where each temporal expression is wrapped in <d>...</d> tags;
  • the execution time of the algorithm, split into rule_based_processing (the time spent extracting temporal expressions) and rule_based_text_normalization (the time spent tagging the temporal expressions found in the text with <d>).
TempExpressions = results[0]
TempExpressions
[('1939', '1939'), ('1937', '1937'), ('19-09-1931', '19-09-1931')]
TextNormalized = results[1]
TextNormalized
'The start of the war in Europe is generally held to be 1 September <d>1939</d>,beginning with the German invasion of Poland; the United Kingdom and France declared war on Germany two days later.The dates for the beginning of war in the Pacific include the start of the Second Sino-Japanese War on 7 July <d>1937</d>,or even the Japanese invasion of Manchuria on <d>19-09-1931</d>.'
ExecutionTime = results[2]
ExecutionTime
{'rule_based_processing': 0.000993490219116211, 'rule_based_text_normalization': 0}
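
Since the result is either a three-element list or an empty list, calling code has to guard against the empty case before indexing. The sketch below shows one way to do that. The names RuleBasedResult and unpack are hypothetical helpers, not part of py_rule_based, and the sample value is hand-written in the documented shape, not real output.

```python
from typing import NamedTuple


class RuleBasedResult(NamedTuple):
    """Hypothetical wrapper giving names to the three documented elements."""
    expressions: list   # [(normalized, as_found_in_text), ...]
    tagged_text: str    # text with <d>...</d> around each expression
    timings: dict       # execution-time breakdown


def unpack(results):
    """Turn the raw return value into a RuleBasedResult.

    An empty list (no temporal expression found) becomes an empty result,
    which is a design choice of this sketch, not library behaviour.
    """
    if not results:
        return RuleBasedResult([], "", {})
    expressions, tagged_text, timings = results
    return RuleBasedResult(expressions, tagged_text, timings)


if __name__ == "__main__":
    # Hand-written sample in the documented shape.
    sample = [
        [("1931", "19-09-1931")],
        "the invasion of Manchuria on <d>1931</d>.",
        {"rule_based_processing": 0.0, "rule_based_text_normalization": 0.0},
    ]
    result = unpack(sample)
    for normalized, original in result.expressions:
        print(f"{original} -> {normalized}")
```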

Optional parameters

Besides running py_rule_based with the default parameters, users can also specify some advanced options. These are:

  • date_granularity: "full" (the finest granularity detected is retrieved); "year" (YYYY is retrieved); "month" (YYYY-MM is retrieved); "day" (YYYY-MM-DD is retrieved);
  • begin_date: an integer (default is 0) that defines the beginning of the time period to consider;
  • end_date: an integer (default is 2100) that defines the end of the time period to consider.
results = py_rule_based(text, date_granularity='year', begin_date=1930, end_date=1935)
Output

The output follows the same patterns as described above.

TempExpressions = results[0]
TempExpressions
[('1931', '19-09-1931')]
TextNormalized = results[1]
TextNormalized
'The start of the war in Europe is generally held to be 1 September 1939,beginning with the German invasion of Poland; the United Kingdom and France declared war on Germany two days later.The dates for the beginning of war in the Pacific include the start of the Second Sino-Japanese War on 7 July 1937,or even the Japanese invasion of Manchuria on <d>1931</d>.'
ExecutionTime = results[2]
ExecutionTime
{'rule_based_processing': 0.0, 'rule_based_text_normalization': 0.0}
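
The effect of date_granularity, begin_date and end_date can be reproduced with a small sketch. This is not the library's code: parse_date, apply_granularity and filter_and_normalize are hypothetical helpers, and the sketch assumes that both range bounds are inclusive and that a yyyy(./-)yyyy range is filtered by its first year.

```python
import re


def parse_date(expr):
    """Split a numeric date into (year, month, day); missing parts are None."""
    parts = re.split(r"[./-]", expr)
    if len(parts) == 3:
        if len(parts[0]) == 4:          # yyyy(./-)mm(./-)dd
            year, month, day = parts
        else:                           # dd(./-)mm(./-)yyyy
            day, month, year = parts
        return int(year), int(month), int(day)
    # yyyy, yyyys, or yyyy(./-)yyyy: keep the first year only (assumption).
    return int(parts[0][:4]), None, None


def apply_granularity(expr, granularity="full"):
    """Reduce an expression to the requested granularity."""
    if granularity == "full":
        return expr
    year, month, day = parse_date(expr)
    if granularity == "year" or month is None:
        return f"{year:04d}"
    if granularity == "month" or day is None:
        return f"{year:04d}-{month:02d}"
    return f"{year:04d}-{month:02d}-{day:02d}"


def filter_and_normalize(expressions, granularity="full",
                         begin_date=0, end_date=2100):
    """Keep expressions whose year lies in [begin_date, end_date]."""
    kept = []
    for expr in expressions:
        year = parse_date(expr)[0]
        if begin_date <= year <= end_date:
            kept.append((apply_granularity(expr, granularity), expr))
    return kept


if __name__ == "__main__":
    found = ["1939", "1937", "19-09-1931"]
    print(filter_and_normalize(found, "year", 1930, 1935))
```

With the three expressions from the example text, only 19-09-1931 falls inside 1930 to 1935, and at "year" granularity it is reduced to 1931, matching the tuple shown above.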

Python CLI

Help

py_rule_based --help

Usage Examples

Make sure that the input parameters are within quotes.

Default Parameters:

py_rule_based -t "2011 Haiti Earthquake Anniversary." 

All the Parameters:

py_rule_based -t "2011 Haiti Earthquake Anniversary." -dg "year" -bd "2000" -ed "2015"
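
When the command is launched from a Python script, passing the arguments as a list avoids shell quoting issues altogether. The sketch below builds such a list from the flags documented here (-t, -dg, -bd, -ed). build_command is a hypothetical helper, and the command is only run if a py_rule_based executable is found on PATH.

```python
import shutil
import subprocess


def build_command(text, date_granularity=None, begin_date=None, end_date=None):
    """Assemble the argument list for the py_rule_based command-line tool."""
    cmd = ["py_rule_based", "-t", text]
    if date_granularity is not None:
        cmd += ["-dg", date_granularity]
    if begin_date is not None:
        cmd += ["-bd", str(begin_date)]
    if end_date is not None:
        cmd += ["-ed", str(end_date)]
    return cmd


if __name__ == "__main__":
    cmd = build_command("2011 Haiti Earthquake Anniversary.", "year", 2000, 2015)
    if shutil.which(cmd[0]):
        # No shell is involved, so the text needs no extra quoting.
        done = subprocess.run(cmd, capture_output=True, text=True, check=False)
        print(done.stdout)
    else:
        print("py_rule_based is not on PATH; would run:", cmd)
```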

Options

  [required]: either specify a text or an input_file path.
  ----------------------------------------------------------------------------------------------------------------------------------
  -t, --text TEXT                       Input text.
                                        Example: “2011 Haiti Earthquake Anniversary.”.

  -i, --input_file TEXT                 Text path.
                                        Example: “C:\\text.txt”
  [not required]
  -----------------------------------------------------------------------------------------------------------------------------------
  -dg, --date_granularity TEXT          Date granularity
                                        Default: "full"
                                        Options:
                                                "full" - (means that all types of granularity will be retrieved, from the coarsest to the finest-granularity).
                                                "day" - (means that for the date YYYY-MM-DD-HH:MM:SS it will retrieve YYYY-MM-DD).
                                                "month" (means that for the date YYYY-MM-DD-HH:MM:SS only the YYYY-MM will be retrieved);
                                                "year" (means that for the date YYYY-MM-DD-HH:MM:SS only the YYYY will be retrieved);

  -bd, --begin_date TEXT                begin date.
                                        Options:
                                                any integer > 0
                                            
  -ed, --end_date TEXT                  end date.
                                        Options:
                                                any integer > 0
                                                
  --help                           - Show this message and exit.

Python Notebook

We highly recommend using this Python notebook if you want to experiment with py_rule_based.

Related Projects

Please check py_heideltime if you are interested in extracting temporal expressions using the HeidelTime temporal tagger.

Please check Time-Matters if you are interested in detecting the relevance (score) of dates in a text.
