Text Representation

Please do not change any wiki page without permission from Time-Matters developers.

Unlike previous approaches based on metadata or query logs, we follow a content-based approach in which the information is extracted from the document's contents.

In this wiki, we will explore:

Text Representation, which will detail the features used to represent a text;
Relevant Keywords, which will explain the software used to extract relevant keywords;
Temporal Expressions, which will explain the software used to extract candidate temporal expressions.

Text Representation

Each text T_i, for i = 1,...,n, is represented by a number of relevant keywords and a number of candidate temporal expressions. In what follows, we assume that each text T_i is composed of two sets, denoted W_{T_i} and D_{T_i}:

T_i = (W_{T_i}, D_{T_i})

where W_{T_i} = {w_{1,i}, w_{2,i}, ..., w_{k,i}} is the set of the k most relevant terms associated with a text T_i and D_{T_i} = {d_{1,i}, d_{2,i}, ..., d_{t,i}} is the set of the t candidate temporal expressions associated with a text T_i. Moreover,

W_T = W_{T_1} ∪ W_{T_2} ∪ ... ∪ W_{T_n}

is the set of distinct relevant keywords extracted from a text or a set of texts T, i.e., the relevant vocabulary. Similarly,

D_T = D_{T_1} ∪ D_{T_2} ∪ ... ∪ D_{T_n}

is defined as the set of distinct candidate temporal expressions extracted from a text or a set of texts T.
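As a minimal sketch of this representation, the snippet below stores each text as a pair of lists and collects the distinct keywords and candidate dates across all texts. The class name `Text`, the helper names and the sample values are hypothetical and chosen only for illustration; they are not part of Time-Matters.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Text:
    """One text T_i = (W_{T_i}, D_{T_i}). Hypothetical container, not a Time-Matters class."""
    keywords: List[str]  # W_{T_i}: relevant keywords of this text
    dates: List[str]     # D_{T_i}: candidate temporal expressions of this text


def distinct_in_order(items: Iterable[str]) -> List[str]:
    """Drop duplicates while keeping the order of first appearance."""
    return list(dict.fromkeys(items))


def vocabularies(texts: List[Text]) -> Tuple[List[str], List[str]]:
    """Return (W_T, D_T): distinct keywords and distinct candidate dates over all texts."""
    w_t = distinct_in_order(w for t in texts for w in t.keywords)
    d_t = distinct_in_order(d for t in texts for d in t.dates)
    return w_t, d_t


# Made-up sample data for two texts.
texts = [
    Text(keywords=["haiti", "earthquake"], dates=["2010"]),
    Text(keywords=["earthquake", "port-au-prince"], dates=["2010", "2011"]),
]
W_T, D_T = vocabularies(texts)
```

With a single text, W_T and D_T simply coincide with that text's own keyword and date sets.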

To illustrate our algorithm, we present the following running example. Let W_T = {w₁; w₂; w₃; w₄; w₅; w₆} be the set of distinct relevant keywords and D_T = {d₁; d₂; d₃; d₄} the set of candidate dates. For each of the four candidate dates d_j in D_T, let W_j^* denote the set of relevant keywords from W_T that co-occur with d_j (within a given search space, to be defined) in the text (or texts, when dealing with multiple documents).

The following picture shows the six keywords of W_T that co-occur with the four candidate dates of D_T. In each column, an "X" marks the keywords belonging to the corresponding W_j^*. For the sake of understanding, we consider d₁ to be "2010" and w₁ to be "Haiti". The picture shows that the candidate date 2010 occurs (within a given search space, for instance a sentence) with the relevant keywords w₁, w₂ and w₃. For instance, it could (hypothetically) have occurred with w₁ in sentence one, and with w₂ and w₃ in sentence two.
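The co-occurrence sets W_j^* can be sketched as follows, assuming the search space is a single sentence and that keywords and dates are single lowercase tokens matched after whitespace splitting. Both are simplifying assumptions; the function name `cooccurring_keywords` and the sample sentences are hypothetical.

```python
from typing import Dict, List, Set


def cooccurring_keywords(
    sentences: List[str], keywords: List[str], dates: List[str]
) -> Dict[str, Set[str]]:
    """Map each candidate date d_j to W_j^*, the keywords sharing a sentence with it."""
    w_star: Dict[str, Set[str]] = {d: set() for d in dates}
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        present = {w for w in keywords if w in tokens}
        for d in dates:
            if d in tokens:
                # Every keyword in this sentence co-occurs with date d.
                w_star[d] |= present
    return w_star


# Made-up sentences mirroring the running example: the date "2010" appears
# with one keyword in sentence one and with two more in sentence two.
sentences = [
    "haiti was struck in 2010",
    "the earthquake and its aftershocks came in 2010",
    "recovery continued through 2011",
]
keywords = ["haiti", "earthquake", "aftershocks", "recovery"]
dates = ["2010", "2011"]
w_star = cooccurring_keywords(sentences, keywords, dates)
```

Here `w_star["2010"]` gathers keywords from two different sentences, which is the situation described for d₁ above.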

Relevant Keywords

Relevant keywords in Time-Matters can be identified through YAKE!, a keyword extraction system (ECIR'18 Best Short Paper), which is available as a demo, as a Python package and as an app on Google Play. In this work, the number of relevant keywords (num_of_keywords) is set to n, where n is any integer greater than 0.

If you are interested in knowing more about YAKE! please refer to the Publications section where you can find a few papers about it.

Temporal Expressions

Temporal expressions in Time-Matters can be identified through:

Heideltime Temporal Tagger by means of a Python wrapper package;
Rule-based approach: a self-defined set of rules implemented as regular expressions.

The first (temporal_tagger = "py_heideltime") uses a Python wrapper of the Heideltime Temporal Tagger (state of the art in this kind of task). It is able to detect a wide range of temporal expression types; however, depending on the size of the text, it may require a considerable amount of time to execute. In this work, we set py_heideltime to its default parameters (that is, Language='English' and document_type='news'). If you are interested in knowing more about Heideltime, please refer to the Publications section, where you can find a few papers about it.

The second (temporal_tagger = "rule_based") makes use of a self-defined rule-based approach developed in regex which is able to detect the following patterns:

yyyy(./-)mm(./-)dd
dd(./-)mm(./-)yyyy
yyyy(./-)yyyy
yyyys
yyyy

While not as effective as Heideltime, it can be used when efficiency (time performance) is a requirement.
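The five patterns listed above can be approximated with a single regular expression. This is only a sketch and not the expression actually shipped with Time-Matters: it assumes two-digit months and days, accepts any of the separators ".", "/" and "-", and does not check that the digits form a valid date. The names `DATE_PATTERN` and `find_candidate_dates` are hypothetical.

```python
import re

# Alternatives are ordered from most to least specific so that a full date
# is not cut short by the bare "yyyy" pattern.
DATE_PATTERN = re.compile(
    r"""
    \b(?:
        \d{4}[./-]\d{2}[./-]\d{2}   # yyyy(./-)mm(./-)dd
      | \d{2}[./-]\d{2}[./-]\d{4}   # dd(./-)mm(./-)yyyy
      | \d{4}[./-]\d{4}             # yyyy(./-)yyyy
      | \d{4}s                      # yyyys
      | \d{4}                       # yyyy
    )\b
    """,
    re.VERBOSE,
)


def find_candidate_dates(text: str) -> list:
    """Return every substring matching one of the five patterns, in order of appearance."""
    return DATE_PATTERN.findall(text)


sample = "Between 2010-01-12 and 12/01/2011, the 1990s and 2005-2008 were compared with 2015."
found = find_candidate_dates(sample)
```

Because the whole tagger is one compiled pattern scanned once over the text, it is cheap to run, which matches the efficiency trade-off described above.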
