
CRL Rules (REMOVE ME)

Alexey O. Shigarov edited this page Dec 17, 2019 · 1 revision

CRL rules are intended for table analysis and interpretation. They map explicit features (layout, style, and text of cells) of an arbitrary table into its implicit semantic data items (entries, labels, and categories).

Each rule is expressed as a production of the following form:

rule #n
  when 
    condition1
    condition2
    ...
  then
    action1
    action2
    ...
end

The number #n that follows the keyword rule determines the order in which this rule is executed. The left-hand side (when) of a rule consists of one or more conditions that query the available facts: the cells, entries, labels, and categories of a table. Every condition listed in the left-hand side must be true for the right-hand side (then) to execute. The right-hand side contains actions that modify existing facts or generate new facts about the table.


Conditions

We use two kinds of conditions. The first requires that there exists at least one fact of a specified data type, which satisfies a set of constraints:

cell variable: constraints, assignments
entry variable: constraints, assignments
label variable: constraints, assignments
category variable: constraints, assignments

The condition consists of three parts, in order of occurrence. The first is a keyword that denotes one of the fact types: cell (CCell), entry (CEntry), label (CLabel), or category (CCategory) (see the table object model). The second is a variable of the specified fact type. The third, optional part begins with a colon and defines constraints that restrict the requested facts and assignments that bind additional variables to values. A constraint is a boolean expression in Java. The comma that separates constraints acts as their logical conjunction. An assignment (variable: value) binds a value (a Java expression) to a variable. A condition without constraints queries all facts of the specified type.

The second kind of condition requires that no fact of a specified type satisfies a set of constraints:

no cells: constraints
no entries: constraints
no labels: constraints
no categories: constraints

The first part of these conditions is a keyword that specifies the fact type. The second part contains the constraints on the facts.
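The semantics of the two kinds of conditions can be illustrated in plain Java. The sketch below is only a model of the matching logic, not the engine's implementation: the classes ConditionSketch and Cell and the methods query and none are hypothetical names introduced here, and Cell is a simplified stand-in for CCell.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class ConditionSketch {
    // Hypothetical stand-in for a cell fact; not the actual CCell class.
    static final class Cell {
        final int cl, rt, cr, rb; // left column, top row, right column, bottom row
        final String text;

        Cell(int cl, int rt, int cr, int rb, String text) {
            this.cl = cl; this.rt = rt; this.cr = cr; this.rb = rb;
            this.text = text;
        }

        boolean blank() {
            return text == null || text.trim().isEmpty();
        }
    }

    // Models "cell c: k1, k2": a fact matches only if every constraint holds,
    // because the comma acts as logical AND. No constraints means "all facts".
    @SafeVarargs
    static List<Cell> query(List<Cell> facts, Predicate<Cell>... constraints) {
        List<Cell> matched = new ArrayList<>();
        for (Cell fact : facts) {
            boolean ok = true;
            for (Predicate<Cell> constraint : constraints) {
                ok = ok && constraint.test(fact);
            }
            if (ok) {
                matched.add(fact);
            }
        }
        return matched;
    }

    // Models "no cells: k1, k2": holds only when no fact satisfies all constraints.
    @SafeVarargs
    static boolean none(List<Cell> facts, Predicate<Cell>... constraints) {
        return query(facts, constraints).isEmpty();
    }
}
```

For instance, `ConditionSketch.query(facts, c -> c.cl == 1, c -> c.rt == 1, c -> c.blank())` corresponds to the condition `cell cc: cl == 1, rt == 1, blank`.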

Actions

Cell cleansing

Hand-coded tables often have an inaccurate layout (e.g. improperly split or merged cells) and messy text content (e.g. typos, homoglyphs, extra spaces, or errors in indents). Several actions address these cell cleansing issues and can be used as a preprocessing stage.

Cell merging

Two cells can be merged when they share a border. The action combines two adjacent cells cell1 and cell2 into one merged cell:

merge cell1 with cell2

As a result, the target cell2 becomes a merged cell whose new coordinates span both cells.
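As a rough illustration of what "coordinates that span both cells" means, the Java sketch below computes the bounding box of two cells given as {cl, rt, cr, rb} arrays. The class MergeSketch and both methods are hypothetical. The adjacency test is an assumption about how "share a border" could be checked (the cells touch and have the same extent along the shared side); the actual check may differ.

```java
final class MergeSketch {
    // Each cell is given as {cl, rt, cr, rb}:
    // left column, top row, right column, bottom row.

    // Bounding box that spans both cells.
    static int[] mergedBounds(int[] a, int[] b) {
        return new int[] {
            Math.min(a[0], b[0]), Math.min(a[1], b[1]),
            Math.max(a[2], b[2]), Math.max(a[3], b[3])
        };
    }

    // Assumed adjacency test: the cells touch and the shared side has equal extent.
    static boolean shareBorder(int[] a, int[] b) {
        boolean sideBySide = a[1] == b[1] && a[3] == b[3]
                && (a[2] + 1 == b[0] || b[2] + 1 == a[0]);
        boolean stacked = a[0] == b[0] && a[2] == b[2]
                && (a[3] + 1 == b[1] || b[3] + 1 == a[1]);
        return sideBySide || stacked;
    }
}
```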

Cell splitting

This action divides a merged cell cell that spans n tiles into n cells:

split cell

Each of the n cells copies the content and style of the merged cell and takes its coordinates from the corresponding tile.

Cell splitting is mostly needed to divide merged cells that contain entries. An entry can be associated with only one label in each category, so a cell that consists of n tiles and contains m entries should usually be treated as a container for n x m repeating entries. In these cases, cell splitting avoids the complexity of associating entries from merged cells with labels from one category.
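The effect of splitting can be sketched in Java as follows. This is a simplified model under stated assumptions: SplitSketch and its nested Cell class are hypothetical, and only the text is copied here (style is omitted for brevity).

```java
import java.util.ArrayList;
import java.util.List;

final class SplitSketch {
    // Hypothetical stand-in for a cell fact; not the actual CCell class.
    static final class Cell {
        final int cl, rt, cr, rb; // left column, top row, right column, bottom row
        final String text;

        Cell(int cl, int rt, int cr, int rb, String text) {
            this.cl = cl; this.rt = rt; this.cr = cr; this.rb = rb;
            this.text = text;
        }
    }

    // Produces one single-tile cell per tile of the merged cell, row by row.
    // Every new cell repeats the text of the merged cell.
    static List<Cell> split(Cell merged) {
        List<Cell> tiles = new ArrayList<>();
        for (int row = merged.rt; row <= merged.rb; row++) {
            for (int col = merged.cl; col <= merged.cr; col++) {
                tiles.add(new Cell(col, row, col, row, merged.text));
            }
        }
        return tiles;
    }
}
```

A cell that spans 2 columns and 2 rows therefore yields 4 cells with identical text.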

Example 1

The table shown in the picture below (a) contains the merged cells (1, 4, and 5).

Sample Table

We can split them, using the following rule:

rule #n
  when
    cell cc: cl == 1, rt == 1, blank
    cell c: cl > cc.cr, rt > cc.rb
  then
    split c
end

As a result, the table (a) is transformed into the table (b).

Cell content modification

Two actions modify cell content. The first sets the text of a cell cell to a new value string_value:

set text string_value to cell

String processing implemented as Java methods (e.g. regular expressions and string matching algorithms) can be used in the action.

The second sets the indent of a cell cell to integer_value:
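A text-cleaning helper of this kind might look like the Java sketch below. The class TextCleanSketch and the method normalize are hypothetical and not part of the rule language; the sketch only shows the sort of method whose result could be passed as string_value.

```java
final class TextCleanSketch {
    // Trims the text and collapses every run of whitespace into a single space.
    // Note: "\\s" covers ASCII whitespace only, not the non-breaking space.
    static String normalize(String text) {
        if (text == null) {
            return "";
        }
        return text.trim().replaceAll("\\s+", " ");
    }
}
```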

set indent integer_value to cell

Role analysis

This stage aims to recover entries and labels as functional data items presented in tables. We also enable associating cells with user-defined tags that can assist in both role and structural analysis.

Cell annotating

The action annotates a cell cell with a tag (a word or phrase) string_value:

set tag string_value to cell

The assigned tag can replace the corresponding constraints in subsequent rules, which makes their conditions more concise. The typical practice is to set the same tag on all cells that play the same role or are located in the same functional region of a table. Subsequent rules can then use these tags instead of repeating constraints on cell location.

Example 2

Looking at the pivot tables shown in the picture below, we can assume that each of them has an empty cell (stub head region) located in the top-left corner. It can be considered as a critical cell (Nagy, 2012) which determines three functional regions: body, head, and stub.

The following rule, based on this assumption, adds the tag body to each cell c located in the body:

rule #n1
  when
    cell cc: cl == 1, rt == 1, blank
    cell c: cl > cc.cr, rt > cc.rb
  then
    set tag "body" to c
end

Similarly, we can write rules for tagging the corresponding cells with the words: head and stub.

rule #n2
  when
    cell cc: cl == 1, rt == 1, blank
    cell c: cl > cc.cr, rb <= cc.rb
  then
    set tag "head" to c
end

rule #n3
  when
    cell cc: cl == 1, rt == 1, blank
    cell c: cr <= cc.cr, rt > cc.rb
  then
    set tag "stub" to c
end

As a result of matching these rules against the cells of the table shown in the picture below,

Sample Table

we recover the following facts:

 c1=(cl=3, rt=3, cr=3, rb=3, value="1", tag="body"),...,
c12=(cl=6, rt=5, cr=6, rb=5, value="5", tag="body"),
c13=(cl=3, rt=1, cr=4, rb=1, value="a", tag="head"),...,
c18=(cl=6, rt=2, cr=6, rb=2, value="d", tag="head"),
c19=(cl=1, rt=3, cr=1, rb=4, value="e", tag="stub"),...,
c23=(cl=2, rt=5, cr=2, rb=5, value="g", tag="stub")
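The constraints of the three tagging rules can be read as one classification function. The Java sketch below restates them; RegionSketch and region are hypothetical names, and the parameters ccCr and ccRb stand for the right column and bottom row of the critical cell. For the facts listed above, a critical cell ending at column 2 and row 2 is assumed, which is inferred from the coordinates and not stated in the text.

```java
final class RegionSketch {
    // cl/cr = left/right column, rt/rb = top/bottom row of the tested cell;
    // ccCr/ccRb = right column and bottom row of the critical cell.
    static String region(int cl, int rt, int cr, int rb, int ccCr, int ccRb) {
        if (cl > ccCr && rt > ccRb) {
            return "body"; // right of and below the critical cell
        }
        if (cl > ccCr && rb <= ccRb) {
            return "head"; // right of the critical cell, within its rows
        }
        if (cr <= ccCr && rt > ccRb) {
            return "stub"; // below the critical cell, within its columns
        }
        return null; // the critical cell itself, or a cell straddling regions
    }
}
```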

Entry and label generating

The two actions below generate an entry and a label in a cell cell from the string expressions entry_value and label_value, which are usually obtained by processing the cell's text:

new entry cell as entry_value
new label cell as label_value

The following short form creates an entry and a label from the cell text:

new entry cell
new label cell

Example 3

The bilingual table that is shown in the picture below duplicates labels in two languages (Greek and Latin symbols).

Sample Table

Assuming that the first label (word) in a cell is written in one language and the second in the other, we can use the rule below to generate two labels from each cell located in the leftmost column or the topmost row:

rule #n
  when
    cell c: cl==1 || rt==1, !blank
  then
    new label c as token(c, 0)
    new label c as token(c, 1)
end

In this example, we assume that the function token is implemented as a Java method and imported into the rules. It returns the token (word) at a given index in the text of a cell.

For this table, the rule generates 8 labels:

l1=(value="α"), l2=(value="a"),..., l7=(value="δ"), l8=(value="d")
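The source does not show the body of token, so the following Java sketch is only one plausible implementation. It takes the cell text as a String, whereas the rule above passes the cell itself. Splitting on whitespace and returning null for a missing token are assumptions.

```java
final class TokenSketch {
    // Returns the whitespace-separated word at the given index,
    // or null when the text has no such word (an assumed convention).
    static String token(String text, int index) {
        if (text == null) {
            return null;
        }
        String[] words = text.trim().split("\\s+");
        if (index < 0 || index >= words.length || words[index].isEmpty()) {
            return null;
        }
        return words[index];
    }
}
```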

Example 4

For tables similar to the one shown in the picture below,

Sample Table

where any cell under the topmost row contains text of the form key=value, the following rule creates a label from the key part and an entry from the value part of the text:

rule #n
  when
    cell c: rt > 1
  then
    new label c as left(c, '=')
    new entry c as right(c, '=')
end

The functions left and right are implemented as Java methods. In this case, they extract the substrings before and after the character = respectively.

For this table, the rule generates 9 entries and 9 labels:

e1=(value="1"),..., e9=(value="9"), 
l1=(value="a"),..., l9=(value="i")
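The bodies of left and right are not given in the source. A minimal Java sketch, assuming they operate on the cell text as a String and split at the first occurrence of the separator, could be written as follows. The behavior when the separator is absent is an arbitrary choice made here.

```java
final class KeyValueSketch {
    // Substring before the first separator; the whole text if there is none.
    static String left(String text, char separator) {
        int i = text.indexOf(separator);
        return (i < 0 ? text : text.substring(0, i)).trim();
    }

    // Substring after the first separator; empty if there is none.
    static String right(String text, char separator) {
        int i = text.indexOf(separator);
        return i < 0 ? "" : text.substring(i + 1).trim();
    }
}
```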

Structural analysis

This stage recovers pairs of two kinds: entry-label and label-label.

Entry-label associating

The action below binds an entry entry with a label label:

add label label to entry

There are two additional ways to create an entry-label pair. The first associates an entry entry with a label specified by its value (label_value) from a category indicated by its name (category_name):

add label label_value of category_name to entry

The second does the same but takes an existing category category instead of a category name:

add label label_value of category to entry

In both cases, the label is looked up in the specified category and created if it does not exist.
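The find-or-create behavior can be modeled with nested maps. The Java sketch below is an illustration only: CategorySketch, Label, and findOrCreate are hypothetical names, and the actual classes (CLabel, CCategory) are not used.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CategorySketch {
    // Hypothetical stand-in for a label fact.
    static final class Label {
        final String value;
        final String category;

        Label(String value, String category) {
            this.value = value;
            this.category = category;
        }
    }

    // category name -> (label value -> label)
    private final Map<String, Map<String, Label>> categories = new LinkedHashMap<>();

    // Returns the existing label with this value in the named category,
    // creating the category and the label first if either is missing.
    Label findOrCreate(String categoryName, String labelValue) {
        return categories
                .computeIfAbsent(categoryName, name -> new LinkedHashMap<>())
                .computeIfAbsent(labelValue, value -> new Label(value, categoryName));
    }
}
```

Calling findOrCreate twice with the same arguments returns the same Label object, so repeated rule firings would not duplicate labels in this model.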

Example 5

The table shown in the picture below depicts the use of a cell background color (Color.GRAY in the cell B2) as a reference to the footnote (e).

Sample Table

We can recover this relationship, using the style features as follows:

rule #n
  when
    entry e: cell.style.bgColor == Color.GRAY
  then
    add label "e" of "footnotes" to e
end

We assume that the following facts exist before executing the rule:

c1=(style.bgColor=Color.GRAY, entries={e1}),
e1=(value="1", cell=c1)

Executing the rule generates the following new facts:

e1=(value="1", cell=c1, labels={l1}), l1=(value="e", category=d1),
d1=(name="footnotes", labels={l1})

Label-label associating

This action connects two labels label1 as a parent and label2 as its child:

set parent label1 to label2

Example 6

A header located in the stub (leftmost column) often begins with an indent presented as a series of spaces, dots, or other padding characters. Usually, the indents denote hierarchical label-label pairs. For example, when each level in a label hierarchy augments the indents with two additional dots as shown in the picture below,

Sample Table

we can recover label-label pairs as follows:

rule #n
  when
    cell c1: cl == 1
    cell c2: cl == 1, rt > c1.rt, indent == c1.indent + 2
    no cells: cl == 1, rt > c1.rt, rt < c2.rt, indent == c1.indent
  then
    set parent c1.label to c2.label
end

As a result, we recover the following label-label pairs:

(c,c1), (c1,c11), (c1,c12), (c,c2), (c2,c21), (d,d1), (d1,d11) 
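The matching performed by this rule amounts to pairing each stub cell with the nearest cell above it whose indent is exactly one step smaller. The Java sketch below restates that logic over plain arrays. IndentSketch and parentChildPairs are hypothetical names, and the indent values used in the usage note are assumed, since the table itself is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

final class IndentSketch {
    // labels[i] and indents[i] describe the i-th stub cell, top to bottom.
    // step is the indent added per hierarchy level (2 in the rule above).
    static List<String> parentChildPairs(String[] labels, int[] indents, int step) {
        List<String> pairs = new ArrayList<>();
        for (int child = 0; child < labels.length; child++) {
            // Walk upwards to the nearest cell whose indent is one step smaller.
            for (int parent = child - 1; parent >= 0; parent--) {
                if (indents[parent] == indents[child] - step) {
                    pairs.add("(" + labels[parent] + "," + labels[child] + ")");
                    break;
                }
            }
        }
        return pairs;
    }
}
```

With the labels c, c1, c11, c12, c2, c21, d, d1, d11 and the assumed indents 0, 2, 4, 4, 2, 4, 0, 2, 4, the method returns the seven pairs listed above in the same order.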

Interpretation

This stage includes actions for recovering label-category pairs.

Label categorizing

Label categorizing associates a label label with a category category:

set category category to label

Furthermore, a string expression category_name presenting the name of a category can also be used as an argument:

set category category_name to label

In the latter case, the category with this name is looked up and created if it does not exist.

Example 7

Some tables contain category names among their headings. The stub head of the table shown in the picture below contains two category names: A is for the category of the column labels (a1, a2, and a3) and B is for the category of the row labels (b1 and b2).

Sample Table

The names can be used to create corresponding categories and to categorize labels. In the case of tables similar to this one, we can assume that a cell in the top-left corner (stub head) contains two category names: the first one describes column labels and the second one addresses row labels.

The rule below creates a category from the first token (word) contained in the top-left corner cell and uses it to categorize column labels:

rule #n
  when
    cell cc: cl == 1, rt == 1
    label l: cell.rt == 1
  then
    set category token(cc, 0) to l
end

In the case of the shown table, we generate the following facts:

l1=(value="a1", category=d1), l2=(value="a2", category=d1), 
l3=(value="a3", category=d1), d1=(name="A", labels={l1, l2, l3})

Label grouping

Arbitrary tables often place all labels of one category in the same row or column. Consequently, we can suppose that such labels belong to one category without defining its name. In such cases, grouping two or more labels means that they all belong to the same undefined category. The action below places two labels label1 and label2 in one group:

group label1 with label2

All labels of a group can be associated with only one category.

Example 8

The stub of the pivot table shown in the picture below consists of two columns.

Sample Table

We can suppose that all stub labels originating from one column belong to the same undefined category. The rule below arranges the stub labels (see also Example 2) into two groups ({h, i} and {j, k, l}):

rule #n
  when
    label l1: cell.tag == "stub"
    label l2: cell.tag == "stub", cell.cl == l1.cell.cl
  then
    group l1 with l2
end
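Grouping stub labels by their column, as the text describes, can be modeled in Java by bucketing labels on the left column of their cells. GroupSketch and groupByColumn are hypothetical names, and the column numbers in the usage note are assumed because the picture is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class GroupSketch {
    // columns[i] is the left column (cl) of the cell that holds labels[i].
    // Labels that share a column end up in the same group.
    static List<List<String>> groupByColumn(String[] labels, int[] columns) {
        Map<Integer, List<String>> groups = new TreeMap<>();
        for (int i = 0; i < labels.length; i++) {
            groups.computeIfAbsent(columns[i], col -> new ArrayList<>()).add(labels[i]);
        }
        return new ArrayList<>(groups.values());
    }
}
```

With h and i placed in column 1 and j, k, l in column 2, the result is the two groups {h, i} and {j, k, l}.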