
Temporal Similarity Measure


Please do not change any wiki page without permission from Time-Matters developers.


To model this temporal relevance, we define a Generic Temporal Similarity measure (GTE) that makes use of co-occurrences of keywords and temporal expressions as a means to identify relevant dates dj within a text ti.

GTE ranges between 0 and 1, and is defined as follows:
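
Written out from the description below, the measure corresponds to:

$$GTE(d_j, t_i) = \operatorname{median}_{W_{l,j} \in (W_j^*)} \; IS(W_{l,j}, d_j)$$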

In the equation, median is the median function, IS is the InfoSimba similarity measure, and Wl,j represents one of the several terms of (Wj*) that co-occur with the candidate date dj within a text ti. A more detailed description of the median function and of the InfoSimba similarity measure is given next.

InfoSimba


In this section, we describe the rationale behind InfoSimba.

In this work, we apply the InfoSimba (IS) second-order similarity measure, a measure supported by corpus-based token correlations, proposed by Dias et al. (2007). While first-order association measures (e.g., DICE) evaluate the relatedness between two tokens as they co-occur in a given context (e.g., a sentence, a paragraph, a corpus), second-order measures are based on the principle that two words are similar if their corresponding context vectors are also similar.

The intuition behind second-order similarity measures is that two terms with many co-occurring words often carry the same sense, in such a way that the information content of both words is likely to share similar terms. For instance, the similarity between the terms ‘‘professor’’ and ‘‘teacher’’ is expected to rest on a number of common co-occurring words such as student, school, etc. Likewise, the similarity between the terms '2010' and 'haiti' is expected to be related to terms about the earthquake. Adopting such a solution makes it possible to overcome the problem of data sparseness in cases where two terms, despite being similar, do not co-occur frequently in a corpus.

InfoSimba is defined as follows:
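
With both context vectors of size N, its simplified form averages the DICE similarities over all pairs of context terms:

$$IS(X, Y) = \frac{1}{N \times N} \sum_{i=1}^{N} \sum_{j=1}^{N} DICE(X_i, Y_j)$$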

IS computes the correlation between all pairs of terms of two context vectors X and Y, where X is the context vector representation of Wl,j, Y is the context vector representation of dj, and DICE is the DICE similarity measure.

Context Vectors

To define the context vectors, we consider N terms (where N > 0, or the value 'max' if, instead, we want to consider all the terms) with a DICE similarity > TH (where TH is a threshold > 0 to be defined by the user). For instance, to determine the context vector of a candidate date dj, only those keywords (w1, w2, ..., wk) and candidate dates (d1, d2, ..., dt) having a DICE similarity > TH with dj are eligible for the N-size context vector.
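
As a minimal sketch of this selection (here dice is assumed to be a nested dict holding the pairwise DICE similarities, and terms the pool of keywords and candidate dates; both names are illustrative, not the Time-Matters API):

```python
def context_vector(term, terms, dice, th, n="max"):
    """Select the context vector of `term`: every other term whose DICE
    similarity with `term` exceeds the threshold TH, strongest first,
    truncated to the N most similar ones (or kept whole if n == "max")."""
    eligible = [t for t in terms if t != term and dice[term][t] > th]
    eligible.sort(key=lambda t: dice[term][t], reverse=True)
    return eligible if n == "max" else eligible[:n]
```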

A representation of the context vectors is given in the following figure. Again, for the sake of understanding, we consider dj to be "2010" and wl,j to be "haiti":

By looking at the picture, we can observe that the terms wl,j and dj are each represented by a vector of N terms (keywords such as w1 and candidate dates such as d1) with a DICE similarity value > TH.

Obtaining Statistical Information

In order to compute the similarity between terms, we need to extract statistical information from the corpus. In our case, a corpus can be a single document or multiple documents.

Extracting this statistical information is grounded on three factors: (1) the type of Corpus (single document or multiple documents); (2) the Search Space that should be taken into account when accounting for individual term occurrences; and (3) the n-contextual window that should be taken into account when accounting for the co-occurrences between the terms.

Single Documents

In the case of a single document, individual occurrences of terms (keywords and candidate dates) are counted over the document's sentences (which are thus the search space). Co-occurrences of terms, in turn, are counted within the two following n_contextual_window settings:

  • sentence (n_contextual_window = "full_sentence"), in which co-occurrence between terms is counted at the sentence level.
  • window of n terms (n_contextual_window = n, where n > 0), in which co-occurrence between terms is counted within a window of n terms inside the search space of a sentence.

Such contextual windows require each term to be stored in an Inverted Index with the following structure:

{key: [SF, TotFreq, {SentID: [Freq, [OffSets]], ...}]}

where key is the term (a relevant keyword or a candidate date found in the document), SF is the Sentence Frequency (i.e., the number of document sentences where the term occurs), TotFreq is the total frequency of the term in the document, SentID is the ID of a sentence where the term appears, Freq is the frequency of the term in that particular sentence, and OffSets is the list of offsets where the term appears in that particular sentence. For instance, a term with the following structure:

{'2010': [2, 3, {1: [1, [6]], 5: [2, [87, 96]]}]}

means that it has 3 occurrences in 2 different sentences. In the sentence with ID 1, it occurs once, at position 6. In the sentence with ID 5, it occurs twice, at positions 87 and 96.
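
As a minimal sketch of how such a structure could be filled (whitespace tokenization and 0-based sentence IDs are simplifying assumptions, not necessarily what Time-Matters does):

```python
def build_inverted_index(sentences):
    """Build {term: [SF, TotFreq, {SentID: [Freq, [OffSets]]}]} for a single
    document given as a list of sentence strings."""
    index = {}
    for sent_id, sentence in enumerate(sentences):
        for offset, term in enumerate(sentence.split()):
            entry = index.setdefault(term, [0, 0, {}])
            sent_entry = entry[2].setdefault(sent_id, [0, []])
            if sent_entry[0] == 0:        # first occurrence in this sentence
                entry[0] += 1             # SF: one more sentence contains the term
            entry[1] += 1                 # TotFreq: total count in the document
            sent_entry[0] += 1            # Freq: count within this sentence
            sent_entry[1].append(offset)  # OffSets: positions within the sentence
    return index
```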

Once this structure is defined, we can then compute the similarities between terms. In order to better understand this process, we consider the following figure:

By looking at the picture, we can observe a document consisting of three sentences. In the picture, x and y represent two different terms, and n represents the n-contextual window distance between them.


In our work, similarities between terms are computed using the DICE coefficient as follows:
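
$$DICE(x, y) = \frac{2 \times |x \cap y|}{|x| + |y|}$$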

where |x| counts the number of distinct sentences where x appears, |y| counts the number of distinct sentences where y occurs, and |x ∩ y| counts the number of distinct sentences where both terms co-occur within the defined context window.

In this first example, sentences are the search space considered for counting the occurrences of terms, while the full sentence and a window of n terms are the n_contextual_window settings considered to register the co-occurrences between terms.

For the first case (the full sentence), we count co-occurrences between terms within the full sentence, meaning that the distance between terms is simply not taken into account: co-occurrences between terms are counted regardless of the distance between them, as long as they appear in the same sentence. Thus, we will have |x| = 3 (as x occurs in 3 distinct sentences), |y| = 2 (as y occurs in 2 distinct sentences), and |x ∩ y| = 2 (as both terms occur together, within the sentence search space, in 2 distinct sentences). This would result in the following DICE similarity value:
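
$$DICE(x, y) = \frac{2 \times 2}{3 + 2} = 0.8$$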

For the second case (a window of n tokens), we count co-occurrences within a window of n tokens; that is, co-occurrences between terms are counted as long as they appear together in the same sentence, within a window of n tokens. For the purposes of this example, we consider a window where n = 10. Thus, we will have |x| = 3 (as x occurs in 3 distinct sentences), |y| = 2 (as y occurs in 2 distinct sentences), and |x ∩ y| = 1 (as both terms occur together, within the sentence search space and a window of 10 tokens, only in the second sentence; indeed, if we look carefully at sentence 1, we observe that x and y are 12 tokens apart, which is greater than 10). This would result in the following DICE similarity value:
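
$$DICE(x, y) = \frac{2 \times 1}{3 + 2} = 0.4$$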

Multiple Documents

In the case of multiple documents, individual occurrences of terms (keywords and candidate dates) can be counted within two different search spaces: documents and sentences. When documents are the search space, co-occurrences of terms are counted within the full document (n_contextual_window = "full_document"). When sentences are the search space, the n_contextual_window can be either the full sentence (n_contextual_window = "full_sentence") or a window of n terms.

In order to compute the similarities between terms under these kinds of contextual windows, we consider an Inverted Index with the following structure:

{key: [DF, TotFreq, {DocID: [DocFreq, [DocOffSets], {SentID: [SentFreq, [SentOffsets]], ...}], ...}]}

where key is the term (a relevant keyword or a candidate date found within the corpus of documents), DF is the Document Frequency (i.e., the number of documents where the term occurs), TotFreq is the total frequency of the term in the set of documents, DocID is the ID of a document where the term appears, DocFreq is the frequency of the term in that particular document, DocOffSets is the list of offsets where the term appears in that particular document, SentID is the ID of a sentence where the term appears in that particular document, SentFreq is the frequency of the term in that particular sentence of the document, and SentOffSets is the list of offsets where the term appears in that particular sentence of the document.

For instance, a term with the following structure:

{'2010': [2, 4, {1: [1, [6], {1: [1, [6]]}], 5: [3, [87, 96, 106], {8: [1, [87]], 10: [2, [96, 106]]}]}]}

means that it has 4 occurrences in 2 different documents. In the document with ID 1, it occurs once, at position 6; seen at the sentence level, it occurs once in the sentence with ID 1, at position 6. In the document with ID 5, it occurs 3 times, at positions 87, 96 and 106; seen at the sentence level, it occurs once in the sentence with ID 8, at position 87, and twice in the sentence with ID 10, at positions 96 and 106.
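
The same entry serves both search spaces. As a short sketch (the helper names are illustrative), |x| can be read off it at either level:

```python
# Entry for '2010' in the example above: [DF, TotFreq, {DocID: [...]}].
entry = [2, 4, {1: [1, [6], {1: [1, [6]]}],
                5: [3, [87, 96, 106], {8: [1, [87]], 10: [2, [96, 106]]}]}]

def documents_containing(entry):
    """|x| when the search space is documents: distinct documents with the term."""
    return len(entry[2])

def sentences_containing(entry):
    """|x| when the search space is sentences: distinct (document, sentence) pairs."""
    return sum(len(sents) for _, _, sents in entry[2].values())

print(documents_containing(entry))  # 2
print(sentences_containing(entry))  # 3
```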

Once this structure is defined, we can then compute the similarities between terms. In order to better understand this process, we consider the following figure:

By looking at the picture, we can observe a corpus consisting of three documents. In the picture, x and y represent two different terms, and n represents the n-contextual window distance between them.


In our work, similarities between terms are computed using the DICE coefficient as follows:
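
$$DICE(x, y) = \frac{2 \times |x \cap y|}{|x| + |y|}$$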

where |x| counts the number of distinct documents or sentences (depending on the search space considered) where x appears, |y| counts the number of distinct documents or document sentences (depending on the search space considered) where y appears, and |x ∩ y| counts the number of distinct documents or document sentences (depending on the search space considered) where both terms co-occur within the defined context window.

In this example, we begin by considering documents to be the search space where individual occurrences are counted, and the full document to be the n_contextual_window.

In this case (search space: documents; window: full document), we count co-occurrences between terms within the full document, meaning that the n-contextual window distance is simply not taken into account: co-occurrences between terms are counted regardless of the distance between them, as long as they appear in the same document. Thus, we will have |x| = 3 (as x occurs in 3 distinct documents), |y| = 2 (as y occurs in 2 distinct documents), and |x ∩ y| = 2 (as both terms occur together, within the document search space, in two distinct documents). This would result in the following DICE similarity value:
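
$$DICE(x, y) = \frac{2 \times 2}{3 + 2} = 0.8$$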

Next, we consider sentences to be the search space where individual occurrences are counted, and the two following n_contextual_window settings where co-occurrences between terms are counted: the full sentence and a window of n terms.

For the first case (search space: sentences; window: full sentence), we count co-occurrences within the full sentence, meaning that the n-contextual window distance is simply not taken into account: co-occurrences between terms are counted regardless of the distance between them, as long as they appear in the same document and sentence. Thus, we will have |x| = 5 (as x occurs in 5 distinct document sentences: 3 in document one, 1 in document two, and 1 in document three), |y| = 4 (as y occurs in 4 distinct document sentences: 2 in document one and 2 in document two), and |x ∩ y| = 2 (as both terms occur together, within the sentences search space, in two distinct sentences). This would result in the following DICE similarity value:
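
$$DICE(x, y) = \frac{2 \times 2}{5 + 4} = \frac{4}{9} \approx 0.444$$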

For the second case (search space: sentences; window: a window of n terms), we count co-occurrences within a window of n tokens; that is, co-occurrences between terms are counted as long as they appear together in the same document and sentence, within a window of n tokens. For the purposes of this example, we consider a window where n = 10. Thus, we will have |x| = 5 (as x occurs in 5 distinct sentences), |y| = 4 (as y occurs in 4 distinct sentences), and |x ∩ y| = 1 (as both terms occur together, within the sentences search space and a window of 10 tokens, in just one sentence of document 1; indeed, the terms also occur in sentence 1 of document 1, yet with a distance of 12 tokens between them, which is greater than 10). This would result in the following DICE similarity value:
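
$$DICE(x, y) = \frac{2 \times 1}{5 + 4} = \frac{2}{9} \approx 0.222$$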

DICE Matrix

The calculated DICE similarities will then be stored in a matrix that keeps all the similarities between all the terms, i.e., keywords (w1, w2, ..., wk) and candidate dates (d1, d2, ..., dt), under the search space defined (see the figure below for illustrative purposes).

Compute GTE Scores for each Candidate Date

By looking at the similarities stored in the matrix, we can then compute the final IS score for each candidate date. For instance, for d1 = 2010, this means that we will have to compute the similarities between (d1,w1), (d1,w2) and (d1,w3), as, according to our example, d1 co-occurs with each of these relevant keywords in a given search space.


In this example, we will only consider the calculation between d1 = 2010 and w1 = haiti. To construct each vector, we consider all the terms (thus N = the maximum number) that co-occur with the candidate date (and likewise with the keyword) and that have a DICE similarity > 0. This means that the vector representation of w1 would consist of 9 elements (all terms but w1 itself), and the vector representation of d1 would consist of 5 elements (that is, all the terms with a DICE similarity > 0, except d1 itself).


Given that the vectors must have the same size N, we need to reduce the size of the w1 vector so that it ends up with the same size as the d1 vector. IS can now be computed as the corresponding similarity between each pair of terms present in the N-size context vectors, as depicted in the following figure. Specifically, it will compute the level of relatedness between w3 from the context vector of w1 and the context terms of d1, i.e., w2, d4, d2, d3 and w3, then the similarity between d2 from the context vector of w1 and the other context terms of d1, and so on and so forth.


Instead, if we consider a vector size of N = 2 (for the sake of simplicity), we would have the following vector representation:


The final score of (d1,w1) can now be calculated by applying the IS equation as follows:
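
With N = 2, and writing X = (X1, X2) for the context vector of d1 and Y = (Y1, Y2) for the context vector of w1, the computation expands to:

$$IS(d_1, w_1) = \frac{1}{2 \times 2}\left[DICE(X_1, Y_1) + DICE(X_1, Y_2) + DICE(X_2, Y_1) + DICE(X_2, Y_2)\right]$$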


and results in the following score:
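
IS(d1,w1) = 0.439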


Similarly, we process the similarities between d1 and the remaining words of (Wj*), i.e., w2 and w3. The final score of each computation is given as follows:

  • IS(d1,w2) = 0.830
  • IS(d1,w3) = 0.439

Next, we describe the F aggregation function, which is used to combine the several similarity values sim(wl,j, dj) computed by IS.

Median Function


All these similarity values are then combined through the median measure (a measure of central tendency). In our running example, this would represent a final score of median[0.439, 0.830, 0.439] = 0.439.
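
Putting the pieces together, a minimal end-to-end sketch (function and variable names are illustrative, not the Time-Matters API; dice is the nested-dict similarity matrix and context maps each term to its context vector):

```python
from itertools import product
from statistics import median

def infosimba(x_vec, y_vec, dice):
    """Average the DICE similarities over all N x N pairs of context terms."""
    n = len(x_vec)  # both context vectors share the same size N
    return sum(dice[a][b] for a, b in product(x_vec, y_vec)) / (n * n)

def gte(date, keywords, context, dice):
    """GTE score of a candidate date: the median of its IS similarities with
    the keywords it co-occurs with."""
    return median(infosimba(context[w], context[date], dice) for w in keywords)

# With the IS scores of the running example:
print(median([0.439, 0.830, 0.439]))  # 0.439
```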