
Text Representation

JMendes1995 edited this page Oct 8, 2019 · 1 revision

Please do not change any wiki page without permission from Time-Matters developers.


Unlike previous approaches based on metadata or query logs, we follow a content-based approach, where the information is extracted from the document's contents.

In this wiki, we will explore:

Text Representation

Each text Ti, for i = 1,...,n, is represented by a number of relevant keywords and a number of candidate temporal expressions. In what follows, we assume that each text Ti is composed of two different sets, denoted WTi and DTi:

Ti = (WTi, DTi)

where WTi = {w1,i, w2,i, ..., wk,i} is the set of the k most relevant terms associated with a text Ti, and DTi = {d1,i, d2,i, ..., dt,i} is the set of the t candidate temporal expressions associated with Ti. Moreover,

WT = WT1 ∪ WT2 ∪ ... ∪ WTn

is the set of distinct relevant keywords extracted from a text or a set of texts T, i.e., the relevant vocabulary. Similarly,

DT = DT1 ∪ DT2 ∪ ... ∪ DTn

is defined as the set of distinct candidate temporal expressions extracted from a text or a set of texts T.
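
The representation above can be sketched in a few lines of Python. This is only an illustration of the notation, not the Time-Matters implementation: the class name `TextRepresentation`, the helper `distinct` and the sample keywords and dates are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class TextRepresentation:
    """One text Ti = (WTi, DTi). Hypothetical container, for illustration only."""
    keywords: List[str]  # WTi: the k most relevant terms of the text
    dates: List[str]     # DTi: the t candidate temporal expressions of the text


def distinct(items_per_text: Iterable[List[str]]) -> List[str]:
    """Union of the per-text lists, keeping first-seen order."""
    seen = {}
    for items in items_per_text:
        for item in items:
            seen.setdefault(item, None)
    return list(seen)


# Two made-up texts T1 and T2.
texts = [
    TextRepresentation(keywords=["haiti", "earthquake"], dates=["2010"]),
    TextRepresentation(keywords=["earthquake", "relief"], dates=["2010", "2011"]),
]

W_T = distinct(t.keywords for t in texts)  # distinct relevant keywords
D_T = distinct(t.dates for t in texts)     # distinct candidate dates
print(W_T, D_T)
```

A dictionary is used instead of a set so that the vocabulary keeps a stable order, which makes the output easier to compare across runs.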

To illustrate our algorithm, we present the following running example. Let WT = {w1; w2; w3; w4; w5; w6} be the set of distinct relevant keywords, DT = {d1; d2; d3; d4} the set of candidate dates, and Wj* the set of relevant words from WT that co-occur (within a given search space, to be defined) with the candidate date dj, for each of the four candidate dates in DT found in the text (or texts, when dealing with multiple documents).

The following picture shows which of the six keywords in WT co-occur with each of the four candidate dates in DT. In each column, an "X" marks the keywords belonging to the corresponding Wj*. For the sake of illustration, we take d1 to be "2010" and w1 to be "Haiti". The picture shows that the candidate date 2010 occurs (within a given search space, for instance a sentence) with the relevant keywords w1, w2 and w3. For instance, it could have occurred (hypothetically) with w1 in sentence one, and with w2 and w3 in sentence two.
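
As a rough sketch of how the sets Wj* could be built when the search space is a sentence, the snippet below collects, for each candidate date, the keywords that share a sentence with it. The function name `cooccurring_keywords`, the naive sentence splitter, the restriction to single-word lowercase keywords and the sample text are assumptions made for illustration. They are not taken from the Time-Matters code.

```python
import re


def cooccurring_keywords(text, keywords, dates):
    """Map each candidate date dj to the set Wj* of keywords sharing a sentence with it."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.lower())
    result = {d: set() for d in dates}
    for sentence in sentences:
        tokens = set(re.findall(r"\w+", sentence))
        for d in dates:
            if d in tokens:
                # Keep only the relevant keywords present in this sentence.
                result[d].update(k for k in keywords if k in tokens)
    return result


# Made-up text: "2010" meets "haiti" in sentence one,
# and "earthquake" and "homeless" in sentence two.
text = (
    "Haiti was struck in 2010. "
    "The 2010 earthquake left many people homeless. "
    "Reconstruction continued through 2011."
)
keywords = ["haiti", "earthquake", "homeless", "reconstruction"]
dates = ["2010", "2011"]

w_star = cooccurring_keywords(text, keywords, dates)
print(w_star)
```

Here the date "2010" plays the role of d1 and ends up associated with three keywords, mirroring the w1, w2, w3 case described above.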

Relevant Keywords


Relevant keywords in Time-Matters can be identified through YAKE!, a keyword extraction system (ECIR'18 Best Short Paper) which is available as a demo, as a Python package and as an app on Google Play. In this work, the number of relevant keywords (num_of_keywords) is set to n, where n is any number > 0.

If you are interested in knowing more about YAKE! please refer to the Publications section where you can find a few papers about it.

Temporal Expressions


Temporal expressions in Time-Matters can be identified through one of two temporal taggers:

The first (temporal_tagger = "py_heideltime") uses a Python wrapper of the Heideltime temporal tagger (state of the art for this kind of task). It is able to detect a wide range of temporal expression types; however, depending on the size of the text, it may take a considerable amount of time to execute. In this work, we set py_heideltime to its default parameters (that is, Language='English' and document_type='news'). If you are interested in knowing more about Heideltime please refer to the Publications section where you can find a few papers about it.

The second (temporal_tagger = "rule_based") uses a self-defined rule-based approach, implemented with regular expressions, which is able to detect the following patterns:

  • yyyy(./-)mm(./-)dd
  • dd(./-)mm(./-)yyyy
  • yyyy(./-)yyyy
  • yyyys
  • yyyy

While not as effective as Heideltime, it can be used when efficiency (time performance) is a requirement.
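
The five patterns listed above can be approximated with a single regular expression, as in the sketch below. This is not the rule set shipped with Time-Matters. The names `DATE_PATTERN` and `find_dates`, the two-digit day and month fields, and the sample sentence are assumptions, and no check is made that a match is a valid calendar date.

```python
import re

# Alternatives are ordered from most to least specific, so that a full date
# is not reported as a bare year.
DATE_PATTERN = re.compile(
    r"\b(?:"
    r"\d{4}[./-]\d{2}[./-]\d{2}"   # yyyy(./-)mm(./-)dd
    r"|\d{2}[./-]\d{2}[./-]\d{4}"  # dd(./-)mm(./-)yyyy
    r"|\d{4}[./-]\d{4}"            # yyyy(./-)yyyy
    r"|\d{4}s"                     # yyyys
    r"|\d{4}"                      # yyyy
    r")\b"
)


def find_dates(text):
    """Return every candidate temporal expression, in order of appearance."""
    return [match.group(0) for match in DATE_PATTERN.finditer(text)]


sample = (
    "Between 2010-2012, and again on 12/01/2010 and 2011.03.04, "
    "as in the 1990s and 2015."
)
found = find_dates(sample)
print(found)
```

Because the alternatives are tried left to right, the order matters: placing the bare yyyy pattern first would split "2010-2012" into two separate years.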
