
How to use Time Matters SingleDoc

JMendes1995 edited this page Oct 8, 2019 · 1 revision

Please do not change any wiki page without permission from Time-Matters developers.


In this wiki, we will explain:

How to use Time Matters SingleDoc

Time-Matters-SingleDoc aims to score temporal expressions found within a single text. Given an identified temporal expression it offers the user two scoring options:

  • ByDoc: retrieves a single score for each temporal expression found in the document, regardless of whether it occurs multiple times in different parts of the text. That is, multiple occurrences of a temporal expression in different sentences (e.g., 2019....... 2019) will always return the same score (e.g., 0.92);

  • BySentence: retrieves multiple (possibly different) scores, one for each occurrence of a temporal expression found in the document. That is, multiple occurrences of a temporal expression in different sentences (e.g., 2019....... 2019) will return multiple (possibly different) scores (e.g., 0.92 for the occurrence of 2019 in sentence 1, and 0.77 for the occurrence of 2019 in sentence 2).

The first option (ByDoc) evaluates the score of a given candidate date in the context of the whole text, with regard to all the relevant keywords it co-occurs with (regardless of whether they appear in sentence 1 or 2). The following example illustrates one such case, in which all the relevant keywords (w1, w2, w3) that co-occur with the temporal expression (d1) are considered in the computation of the temporal score given by the GTE equation, i.e., Median([IS(d1, w1); IS(d1, w2); IS(d1, w3)]):

The second option (BySentence) evaluates the score of a given candidate date with regard to the sentences where it occurs, thus taking into account only the relevant keywords of each sentence (within the defined search space). This means that if 2010 co-occurs with w1 in sentence 1, only this relevant keyword is considered when computing the temporal score of 2010 for that sentence. Likewise, if 2010 co-occurs with w2 and w3 in sentence 2, only these relevant keywords are considered when computing the temporal score of 2010 for that sentence. We would thus have a temporal score of 2010 for sentence 1 computed by the GTE equation as Median([IS(d1, w1)]), and a temporal score of 2010 for sentence 2 computed as Median([IS(d1, w2); IS(d1, w3)]).
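The two scoring modes can be illustrated with a minimal sketch. The IS (InfoSimba) similarity values below are hypothetical, chosen only to mirror the example above; in practice they are computed by Time-Matters itself:

```python
from statistics import median

# Hypothetical InfoSimba similarities between the date d1 (e.g., 2010)
# and the relevant keywords it co-occurs with (illustrative values only)
IS = {"w1": 0.92, "w2": 0.80, "w3": 0.74}

# ByDoc: every relevant keyword co-occurring with d1 anywhere in the text
by_doc_score = median([IS["w1"], IS["w2"], IS["w3"]])

# BySentence: only the relevant keywords of each sentence are considered
sentence1_score = median([IS["w1"]])            # d1 co-occurs with w1 in sentence 1
sentence2_score = median([IS["w2"], IS["w3"]])  # d1 co-occurs with w2, w3 in sentence 2
```

With these values, ByDoc yields a single score of 0.80, while BySentence yields 0.92 for sentence 1 and 0.77 for sentence 2.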

How to work with each option is explained next. But first, both the library and the text need to be imported.

from Time_Matters_SingleDoc import Time_Matters_SingleDoc

text= "2011 Haiti Earthquake Anniversary. As of 2010 (see 1500 photos here), the following major earthquakes "\
    "have been recorded in Haiti. The first great earthquake mentioned in histories of Haiti occurred in "\
    "1564 in what was still the Spanish colony. It destroyed Concepción de la Vega. On January 12, 2010, "\
    "a massive earthquake struck the nation of Haiti, causing catastrophic damage inside and around the "\
    "capital city of Port-au-Prince. On the first anniversary of the earthquake, 12 January 2011, "\
    "Haitian Prime Minister Jean-Max Bellerive said the death toll from the quake in 2010 was more "\
    "than 316,000, raising the figures in 2010 from previous estimates. I immediately flashed back to the afternoon "\
    "of February 11, 1975 when, on my car radio, I first heard the news. Yesterday..."

Score


The structure of the score depends on the type of extraction considered: ByDoc or BySentence.

ByDoc

Getting temporal scores by document is possible through the following code. This configuration assumes "py_heideltime" as the default temporal tagger (more about this here), "ByDoc" as the default score_type, and the default parameters of time_matters. In this configuration, a single score is retrieved for each temporal expression, regardless of whether it occurs in different sentences.

results = Time_Matters_SingleDoc(text)
#results = Time_Matters_SingleDoc(text, score_type="ByDoc")

The output is a dictionary where the key is the normalized temporal expression and the value is a list with two positions. The first is the score of the temporal expression. The second is a list of the instances of the temporal expression (as they were found in the text). Example: '2011-01-12': [0.5, ['2011-01-12', '12 January 2011']] means that the normalized temporal expression 2011-01-12 has a score of 0.5 and occurs twice in the text: the first time as 2011-01-12, and the second time as 12 January 2011.

#Score
results[0]

{'2011-01-12': [1.0, ['12 January 2011']],
 '2010': [0.983, ['2010', '2010', '2010']],
 '1564': [0.799, ['1564']],
 '2010-01-12': [0.743, ['January 12, 2010']],
 '2011': [0.568, ['2011']],
 '1975-02-11taf': [0, ['the afternoon of February 11, 1975']],
 '1975-02-10': [0, ['Yesterday']]}
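As a quick sketch of how this dictionary can be consumed, the snippet below ranks expressions by score. The data is a hard-coded subset of the output shown above; in practice the full dictionary comes from results[0]:

```python
# Subset of the ByDoc output shown above
scores = {
    '2011-01-12': [1.0, ['12 January 2011']],
    '2010': [0.983, ['2010', '2010', '2010']],
    '1564': [0.799, ['1564']],
}

# Rank normalized temporal expressions by their score (highest first)
ranked = sorted(scores.items(), key=lambda kv: kv[1][0], reverse=True)
top_date, (top_score, mentions) = ranked[0]
```

Here top_date is '2011-01-12', with a score of 1.0 and a single mention.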

BySentence

Getting temporal scores by sentence is possible through the following code. This configuration assumes "py_heideltime" as the default temporal tagger (more about this here), "BySentence" as the score_type, and the default parameters of time_matters. In this configuration, multiple occurrences of a temporal expression in different sentences (e.g., "As of 2010..."; "...the quake in 2010 was...") will return multiple (possibly different) scores (e.g., 0.2 for its occurrence in sentence 1, and 0.983 for its occurrence in the other sentence).

results = Time_Matters_SingleDoc(text, score_type='BySentence')

The output is a dictionary where the key is the normalized temporal expression and the value is another dictionary, in which the key is the sentenceID and the value is a list with two positions. The first is the score of the temporal expression in that particular sentence. The second is a list of the instances of the temporal expression (as they were found in that sentence). Example: {'2010': {1: [0.2, ['2010']], 5: [0.983, ['2010', '2010']]}} means that the normalized temporal expression 2010 has a score of 0.2 in the sentence with ID 1, and a score of 0.983 in the sentence with ID 5 (where it occurs twice).

results[0]

{'2011': {0: [0.831, ['2011']]},
 '2010': {1: [0.2, ['2010']], 5: [0.983, ['2010', '2010']]},
 '1564': {2: [0.828, ['1564']]},
 '2010-01-12': {4: [0.68, ['January 12, 2010']]},
 '2011-01-12': {5: [1.0, ['12 January 2011']]},
 '1975-02-11taf': {6: [0, ['the afternoon of February 11, 1975']]},
 '1975-02-10': {7: [0, ['Yesterday']]}}
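A small sketch of how the per-sentence dictionary can be consumed. The data is a hard-coded subset of the output shown above; in practice the full dictionary comes from results[0]:

```python
# Subset of the BySentence output shown above
scores = {
    '2010': {1: [0.2, ['2010']], 5: [0.983, ['2010', '2010']]},
    '1564': {2: [0.828, ['1564']]},
}

# For each expression, find the sentence where it scores highest
best = {
    date: max(per_sentence.items(), key=lambda kv: kv[1][0])
    for date, per_sentence in scores.items()
}
# e.g., for '2010' the best-scoring sentence is the one with ID 5
```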

Remaining Output


  • TempExpressions: A list of tuples, each with two positions. The first is the normalized temporal expression. The second is the temporal expression as it was found in the text. The order in which the elements appear in the list reflects the order of the temporal expressions in the text. Example: [('1975-02-11TAF', 'the afternoon of February 11, 1975'),..].

  • RelevantKWs: a dictionary of the relevant keywords (and their corresponding scores). In our algorithm, keywords (which may consist of one or more tokens, default = 1) are detected by YAKE!. If you want to know more about the role of YAKE! in Time-Matters, please refer to the following link. Example: {'haiti': 0.03, 'haiti earthquake': 0.07} means that haiti and haiti earthquake were identified as relevant keywords by the YAKE! keyword extractor, with scores of 0.03 and 0.07 (the lower the score, the more relevant the keyword).

  • TextNormalized: A normalized version of the text, a string, where temporal expressions are marked with the tag <d> and relevant keywords with the tag <kw>. Example: As of <d>2010</d> (see 1500 photos here), the following major earthquakes have been recorded in <kw>haiti</kw>.

  • TextTokens: A list of the text tokens. Tokens that are temporal expressions are marked with the tag <d>, whereas relevant keywords are marked with the tag <kw>. Example: ['As', 'of', '<d>2010</d>', 'see', '1500',...].

  • SentencesNormalized: A normalized version of the text by sentence, that is, a list of lists (position 0 of the list corresponds to sentence 0, etc). Temporal expressions found in the text are marked with the tag <d>, while relevant keywords are marked with the tag <kw>. Example: [..., 'As of <d>2010</d> (see 1500 photos here), the following major earthquakes have been recorded in <kw>haiti</kw>.',...].

  • SentencesTokens: A list of the text tokens by sentence, that is a list of lists (position 0 of the list gives the tokens of sentence 0, etc). Tokens that are temporal expressions are marked with the tag <d>, whereas relevant keywords are marked with the tag <kw>. Example: [[...,..], ['As', 'of', '<d>2010</d>', 'see', '1500',...], [...,..],].
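Assuming the positional order described in the Output section at the end of this wiki (score, temporal expressions, relevant keywords, normalized text, text tokens, normalized sentences, sentence tokens), the fields above can be given names with a small helper. The stub list below stands in for a real results object; real values come from Time_Matters_SingleDoc:

```python
def unpack_results(results):
    """Map the positional results list to named fields (order as documented)."""
    keys = ["Score", "TempExpressions", "RelevantKWs", "TextNormalized",
            "TextTokens", "SentencesNormalized", "SentencesTokens"]
    return dict(zip(keys, results))

# Stub standing in for Time_Matters_SingleDoc(text); real values will differ
stub = [
    {'2010': [0.983, ['2010']]},             # Score
    [('2010', '2010')],                      # TempExpressions
    {'haiti': 0.03},                         # RelevantKWs
    'recorded in <kw>haiti</kw>',            # TextNormalized
    ['recorded', 'in', '<kw>haiti</kw>'],    # TextTokens
    ['recorded in <kw>haiti</kw>'],          # SentencesNormalized
    [['recorded', 'in', '<kw>haiti</kw>']],  # SentencesTokens
]
fields = unpack_results(stub)
```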

Optional Parameters


Apart from the score_type (ByDoc and BySentence), there are also parameters regarding the temporal_tagger and time_matters.

Temporal Tagger

While 'py_heideltime' is the default temporal tagger, a 'rule_based' approach can be used instead. In the following, we assume the default parameters of the rule-based approach, that is: date_granularity is "full" (the highest possible granularity detected will be retrieved), begin_date is 0, and end_date is 2100, which means that all dates within this range will be retrieved.

results = Time_Matters_SingleDoc(text, temporal_tagger=['rule_based'])
results[0]

{'1975': [1.0, ['1975']],
 '2011': [0.97, ['2011', '2011']],
 '2010': [0.904, ['2010', '2010', '2010', '2010']],
 '1564': [0.856, ['1564']],
 '1500': [0.853, ['1500']]}

Instead, we can run the following code to set year as the granularity and to define a begin and an end date.

results = Time_Matters_SingleDoc(text, temporal_tagger=['rule_based', 'year', 2000, 2011])
results[0]

{'2011': [0.97, ['2011', '2011']],
 '2010': [0.904, ['2010', '2010', '2010', '2010']]}

In addition, a few other parameters are available to py_heideltime, namely:

  • language: English - default; Portuguese; Spanish; German; Dutch; Italian; and French. To know how to configure py_heideltime for other languages, please refer to this link;
  • date granularity: "full" - default (the highest possible granularity detected will be retrieved); "year" (YYYY will be retrieved); "month" (YYYY-MM will be retrieved); "day" (YYYY-MM-DD will be retrieved). Note that this parameter can also be used with the rule_based model;
  • document type: "news" - default (news-style documents); "narrative" (narrative-style documents, e.g., Wikipedia articles); "colloquial" (English colloquial, e.g., Tweets and SMS); "scientific" (scientific articles, e.g., clinical trials);
  • document creation time: in the format YYYY-MM-DD.

In the following, we consider English as the language, year as the date granularity (which means that dates will be reduced to years), news as the document type, and 2009-01-01 as the document creation time.

results = Time_Matters_SingleDoc(text, temporal_tagger=['py_heideltime', 'English', 'year', 'news', '2009-01-01'])

Naturally, the results will be different from those obtained when using the "full" option for the date granularity, as "year" will join different instances of the same year together (e.g., {'2011': [0.981, ['2011', '12 January 2011']],..}).

It is interesting to note, for instance, that the 2011 cluster consists of two dates, and that the date Yesterday has been mapped to 2008 (as 2009-01-01 was given as the document creation time).

results[0]

{'2011': ['2011', '12 January 2011'],
 '2010': ['2010', 'January 12, 2010'],
 '1564': ['1564'],
 '1975': ['the afternoon of February 11, 1975'],
 '2008': ['Yesterday']}

Time Matters

  • n-gram: maximum number of terms a keyword may have. Default value is 1 (but any value > 0 is accepted). For instance, n = 1 means that only single tokens such as "keyword" can be considered, whereas n = 2 means that both "keyword" and "keyword extractor" can be considered. More about this here and here;
  • num_of_keywords: number of YAKE! keywords to extract from the text. Default value is 10 (but any value > 0 is accepted), meaning that the system will extract 10 relevant keywords from the text. More about this here and here;
  • n_contextual_window: defines the n-contextual window distance. Default value is "full_sentence" (but an n-window where n > 0 can be used as an alternative), that is, the system will look for co-occurrences between terms that occur within the search space of a sentence. More about this here;
  • N: size of the context vectors X and Y in InfoSimba. Default value is 'max' (but any value > 0 is accepted), meaning that the context vector will have the maximum number of n-terms co-occurring with X (likewise with Y). More about this here;
  • TH: minimum threshold value from which terms are eligible for the context vectors X and Y in InfoSimba. Default value is 0.05 (but any value > 0 is accepted), meaning that any terms co-occurring with each other with a DICE similarity value > 0.05 are eligible for the n-size vector. More about this here.

The following code assumes the default parameters of the temporal_tagger and explicitly specifies the five time_matters parameters (set here to their default values).

results = Time_Matters_SingleDoc(text, time_matters=[1, 10, 'full_sentence', 'max', 0.05])

More interesting is to consider a different n-gram for the keywords. In the following, we consider n = 3.

results = Time_Matters_SingleDoc(text, time_matters=[3, 10, 'full_sentence', 'max', 0.05])

Debug Mode


We also offer a debug mode in which users can access a more detailed version of the results. Thus, in addition to the fields already explained above, we also make available the InvertedIndex, the DiceMatrix, and the ExecutionTime.

results = Time_Matters_SingleDoc(text, debug_mode=True)

  • InvertedIndex: An inverted index of the document, most notably of its relevant keywords and temporal expressions. Like other inverted indexes, it follows this dictionary structure: {'term': [SF, TotFreq, {SentenceID: [Freq, [Offsets]]}]}, where SF is the Sentence Frequency, TotFreq is the total frequency of the term, SentenceID is the ID of the sentence (IDs start at 0), Freq is the frequency of the term in that sentence, and [Offsets] is a list of the position(s) where the term appears in the text. For instance, a term with the structure '2010': [2, 3, {1: [1, [6]], 5: [2, [90, 100]]}] has 3 occurrences in 2 different sentences: in the sentence with ID 1 it occurs once, at position 6; in the sentence with ID 5 it occurs twice, at positions 90 and 100. Please note that positions are numbered sequentially from the first to the last token of the text.
  • DiceMatrix: Retrieves (in pandas format) the DICE matrix between each pair of terms, according to the n-contextual window distance defined. For instance, a DICE similarity of 1 between prime and minister means that, whenever either of these terms occurs, they always occur together. If you want to know more about the role of DICE in our algorithm, please refer to this link.
  • ExecutionTime: Retrieves information about the processing times of our algorithm, in particular the TotalTime required to execute it, but also that of each of its most important components, namely: heideltime_processing, py_heideltime_text_normalization, keyword_text_normalization, YAKE, InvertedIndex, DICEMatrix and GTE. As can be observed from the example, most of the time is consumed by the py_heideltime component (which entails the heideltime_processing and the text normalization process, that is, the tagging of the text).
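To make the inverted-index shape and the DICE notion concrete, here is a small self-contained sketch. The entry is the example used above; the dice helper is an illustrative reimplementation of the standard DICE coefficient, not the library's own code:

```python
# Inverted-index entry shape: {'term': [SF, TotFreq, {SentenceID: [Freq, [Offsets]]}]}
entry = {'2010': [2, 3, {1: [1, [6]], 5: [2, [90, 100]]}]}

sf, tot_freq, postings = entry['2010']
assert sf == len(postings)                                     # occurs in 2 sentences
assert tot_freq == sum(freq for freq, _ in postings.values())  # 1 + 2 == 3

# DICE similarity between two terms x and y within the contextual window
def dice(co_xy, f_x, f_y):
    return 2 * co_xy / (f_x + f_y)

# 'prime' and 'minister' always occur together -> similarity of 1
assert dice(co_xy=2, f_x=2, f_y=2) == 1.0
```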

CLI


Help

$ Time_Matters_SingleDoc --help

Usage Examples

Usage examples (make sure that the input parameters are within quotes):

Default Parameters: This configuration assumes "py_heideltime" as the default temporal tagger, "ByDoc" as the default score_type, and the default parameters of time_matters.

Time_Matters_SingleDoc -i "['text', 'August 31st']"

All the Parameters:

Time_Matters_SingleDoc -i "['text', '2019-12-31']" -tt "['py_heideltime','English', 'full', 'news', '2019-05-05']" -tm "[1, 10,'full_sentence', 'max', 0.05]" -st ByDoc -dm False
Options
  [required]: either specify a text or an input_file path.
  ----------------------------------------------------------------------------------------------------------------------------------
  -i, --input               A list that specifies the type of input: a text or a file path
  
                            Example:
                                    -i "['text', 'August 31st']"
                                    -i "['path', 'c:\text.txt']"
 [not required]
 ----------------------------------------------------------------------------------------------------------------------------------
  -tt, --temporal_tagger   Specifies the temporal tagger and the corresponding parameters.
                           Default: "py_heideltime"
			   Options:
			   	    "py_heideltime"
				    "rule_based"
				 
			   py_heideltime (parameters):
			   ____________________________
			   - temporal_tagger_name
			     Options:
				     "py_heideltime"

			   - language
			     Default: "English"
			     Options:
			   	      "English";
				      "Portuguese";
				      "Spanish";
				      "German";
				      "Dutch";
				      "Italian";
				      "French".

		          - date_granularity
			    Default: "full"
			    Options:
			           "full": means that all types of granularity will be retrieved, from the coarsest to the 
					   finest-granularity.
			           "day": means that for the date YYYY-MM-DD-HH:MM:SS it will retrieve YYYY-MM-DD;
				   "month": means that for the date YYYY-MM-DD-HH:MM:SS only the YYYY-MM will be retrieved;
				   "year": means that for the date YYYY-MM-DD-HH:MM:SS only the YYYY will be retrieved;

			  - document_type
			    Default: "News"
			    Options:
			  	    "News": for news-style documents - default param;
				    "Narrative": for narrative-style documents (e.g., Wikipedia articles);
				    "Colloquial": for English colloquial (e.g., Tweets and SMS);
				    "Scientific": for scientific articles (e.g., clinical trials).

			  - document_creation_time
			    Document creation date in the format YYYY-MM-DD. Taken into account when "News" or "Colloquial" texts
		            are specified.
		            Example: "2019-05-30".

			  - Example: 
			  	    -tt "['py_heideltime','English', 'full', 'news', '2019-05-05']"	 

		          
			  Rule_Based (parameters):
		          ____________________________
			  - temporal_tagger_name
			    Options:
			  	    "rule_based"

			  - date_granularity
			    Default: "full"
			    Options:
			           "full": means that all types of granularity will be retrieved, from the coarsest to the 
					   finest-granularity.
			           "day": means that for the date YYYY-MM-DD-HH:MM:SS it will retrieve YYYY-MM-DD;
				   "month": means that for the date YYYY-MM-DD-HH:MM:SS only the YYYY-MM will be retrieved;
				   "year": means that for the date YYYY-MM-DD-HH:MM:SS only the YYYY will be retrieved;

			  - begin_date
			    Default: 0
                            Options: any number > 0

			  - end_date
			    Default: 2100
                            Options: any number > 0

			  - Example: 
			  	    -tt "['rule_based','full','2000','2100']"
 [not required]
 ----------------------------------------------------------------------------------------------------------------------------------
  -tm, --time_matters     Specifies information about Time-Matters, namely:
			  - n-gram: maximum number of terms a keyword might have. 
			    Default: 1
			    Options:
				    any integer > 0

			  - num_of_keywords: number of YAKE! keywords to extract from the text
			    Default: 10
			    Options:
				    any integer > 0

		          - n_contextual_window: defines the search space where co-occurrences between terms may be counted.
			    Default: "full_sentence"
			    Options:
                                    "full_sentence": the system will look for co-occurrences between terms that occur within the search 
				                    space of a sentence;
			            n: where n is any value > 0, that is, the system will look for co-occurrences between terms that 
				       occur within a window of n terms;
				       
		          - N: N-size context vector for InfoSimba vectors
			    Default: "max"
			    Options: 
			            "max": where "max" is given by the maximum number of terms eligible to be part of the vector
				    any integer > 0
				    
			  - TH: all the terms with a DICE similarity > TH threshold are eligible to the context vector of InfoSimba
			    Default: 0.05
			    Options: 
				    any float > 0


			  - Example: 
			  	    -tm "[1, 10, 'full_sentence', 'max', 0.05]"
 [not required]
 ----------------------------------------------------------------------------------------------------------------------------------
  -st, --score_type       Specifies the type of score for the temporal expression found in the text
  			  Default: "ByDoc"
                          Options:
                                  "ByDoc": returns a single score, regardless of whether the temporal expression occurs in different sentences;
                                  "BySentence": returns multiple scores (one for each sentence where it occurs)
				  
			  - Example: 
			  	    -st ByDoc
 [not required]
 ----------------------------------------------------------------------------------------------------------------------------------
  -dm, --debug_mode      Returns detailed information about the results
  	                 Default: False
			 Options:
			          False: when set to False debug mode is not activated
				  True: activates debug mode. In that case it returns 
                                        "Text";
					"TextNormalized"
					"Score"
					"CandidateDates"
					"NormalizedCandidateDates"
					"RelevantKWs"
					"InvertedIndex"
					"Dice_Matrix"
					"ExecutionTime"
					
			  - Example: 
			  	    -dm True
				    
  --help                 Show this message and exit.
Output

The output is a JSON list with the following information: score, temporal expressions, relevant keywords, normalized text, text tokens, normalized sentences and sentence tokens.