CRL Rules (REMOVE ME)
CRL rules are intended for table analysis and interpretation. They map explicit features (layout, style, and text of cells) of an arbitrary table into its implicit semantic data items (entries, labels, and categories).
Each rule is expressed as a production of the following form:
rule #n
when
condition1
condition2
...
then
action1
action2
...
end
The number #n that follows the keyword rule determines the order in which the rule is executed.
The left-hand side (when) of a rule consists of one or more conditions that query the available facts: cells, entries, labels, and categories of a table.
Every condition listed in the left-hand side must be true for the right-hand side (then) to execute. The right-hand side contains actions that modify existing facts or generate new facts about the table.
We use two kinds of conditions. The first requires that there exists at least one fact of a specified data type, which satisfies a set of constraints:
cell variable: constraints, assignments
entry variable: constraints, assignments
label variable: constraints, assignments
category variable: constraints, assignments
The condition consists of three parts. In order of occurrence, the first is a keyword that denotes one of the following fact types: cell (CCell), entry (CEntry), label (CLabel), or category (CCategory) (see the table object model).
The second is variable, a variable of the specified fact type.
The third, optional part begins with the colon character.
It defines constraints that restrict the requested facts and assignments that bind additional variables to values.
A constraint is a boolean expression in Java.
The comma that separates constraints acts as their logical conjunction.
An assignment (variable: value) binds the value of a Java expression to a variable.
A condition without constraints queries all facts of the specified type.
The second kind of condition requires that no fact of a specified type satisfies a set of constraints:
no cells: constraints
no entries: constraints
no labels: constraints
no categories: constraints
The first part of these conditions is a keyword that specifies the type of facts. The second part contains constraints on the facts.
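The two kinds of conditions can be pictured in plain Java as follows. This is only a minimal sketch and not the CRL engine: the class ConditionSketch, its nested Cell type, and the methods query and none are hypothetical names, and the fields cl, rt, cr, rb are assumed to mean left column, top row, right column, and bottom row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class ConditionSketch {
    /** Hypothetical stand-in for a cell fact (not the real CCell class). */
    static final class Cell {
        final int cl, rt, cr, rb; // assumed: left column, top row, right column, bottom row
        final String text;
        Cell(int cl, int rt, int cr, int rb, String text) {
            this.cl = cl; this.rt = rt; this.cr = cr; this.rb = rb; this.text = text;
        }
    }

    /** First kind: the facts that satisfy every constraint (comma = logical AND). */
    @SafeVarargs
    static List<Cell> query(List<Cell> facts, Predicate<Cell>... constraints) {
        List<Cell> matched = new ArrayList<>();
        for (Cell fact : facts) {
            boolean all = true;
            for (Predicate<Cell> constraint : constraints) {
                all &= constraint.test(fact);
            }
            if (all) matched.add(fact);
        }
        return matched;
    }

    /** Second kind ("no cells: ..."): true when no fact satisfies the constraints. */
    @SafeVarargs
    static boolean none(List<Cell> facts, Predicate<Cell>... constraints) {
        return query(facts, constraints).isEmpty();
    }

    public static void main(String[] args) {
        List<Cell> facts = List.of(new Cell(1, 1, 1, 1, ""), new Cell(2, 1, 2, 1, "a"));
        // Mirrors "cell c: cl == 1, rt == 1"
        System.out.println(query(facts, c -> c.cl == 1, c -> c.rt == 1).size());
        // Mirrors "no cells: cl > 5"
        System.out.println(none(facts, c -> c.cl > 5));
    }
}
```

Calling query without constraints returns every fact, which corresponds to a condition without constraints.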
Hand-coded tables often have an inaccurate layout (e.g. improperly split or merged cells) and messy text content (e.g. typos, homoglyphs, extra spaces, or errors in indents). To address these issues, we provide several cell cleansing actions that can be used at the preprocessing stage.
Two cells can be merged when they share one border.
The action combines two adjacent cells cell1 and cell2 into one merged cell:
merge cell1 with cell2
As a result, the addressee cell2 becomes a merged cell with new coordinates that span both cells.
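The coordinate arithmetic behind merging can be sketched in Java as shown below. The class MergeSketch and its methods are hypothetical, cells are reduced to arrays {cl, rt, cr, rb}, and the adjacency test (the cells touch and have equal extent along the shared side) is an assumption about what "share one border" means, not the documented behavior of the action.

```java
import java.util.Arrays;

final class MergeSketch {
    // Coordinates are {cl, rt, cr, rb}: left column, top row, right column, bottom row.

    /** Assumed adjacency test: the cells touch and their touching sides have equal extent. */
    static boolean shareBorder(int[] a, int[] b) {
        boolean sameRows = a[1] == b[1] && a[3] == b[3];
        boolean sameCols = a[0] == b[0] && a[2] == b[2];
        boolean sideBySide = a[2] + 1 == b[0] || b[2] + 1 == a[0];
        boolean stacked = a[3] + 1 == b[1] || b[3] + 1 == a[1];
        return (sameRows && sideBySide) || (sameCols && stacked);
    }

    /** Coordinates of the merged cell: the bounding box that spans both inputs. */
    static int[] merge(int[] a, int[] b) {
        if (!shareBorder(a, b)) {
            throw new IllegalArgumentException("cells do not share a border");
        }
        return new int[] {
            Math.min(a[0], b[0]), Math.min(a[1], b[1]),
            Math.max(a[2], b[2]), Math.max(a[3], b[3])
        };
    }

    public static void main(String[] args) {
        int[] merged = merge(new int[] {1, 1, 1, 1}, new int[] {2, 1, 2, 1});
        System.out.println(Arrays.toString(merged)); // prints [1, 1, 2, 1]
    }
}
```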
This action divides a merged cell cell that spans n tiles into n cells:
split cell
Each of the n cells copies the content and style of the merged cell and takes its coordinates from the corresponding tile.
Cell splitting is mostly needed to divide merged cells that contain entries. Since an entry can be associated with only one label in each category, a cell that consists of n tiles and contains m entries should usually be considered a container for n x m repeating entries. In these cases, cell splitting avoids the complexity of associating entries from merged cells with labels from one category.
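As an illustration of how a merged cell maps to its tiles, the following Java sketch enumerates one 1x1 tile per grid position that the cell covers. The class SplitSketch and the method split are hypothetical. Only coordinates are modelled here, so the copying of content and style is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SplitSketch {
    /** Returns one {cl, rt, cr, rb} tile for every grid position covered by the merged cell. */
    static List<int[]> split(int cl, int rt, int cr, int rb) {
        List<int[]> tiles = new ArrayList<>();
        for (int row = rt; row <= rb; row++) {
            for (int col = cl; col <= cr; col++) {
                tiles.add(new int[] {col, row, col, row});
            }
        }
        return tiles;
    }

    public static void main(String[] args) {
        // A cell that spans rows 3..4 of column 1 yields two tiles.
        for (int[] tile : split(1, 3, 1, 4)) {
            System.out.println(Arrays.toString(tile));
        }
    }
}
```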
The table shown in the picture below (a) contains the merged cells (1, 4, and 5).
We can split them using the following rule:
rule #n
when
cell cc: cl == 1, rt == 1, blank
cell c: cl > cc.cr, rt > cc.rb
then
split c
end
As a result, table (a) is transformed into table (b).
There are two actions that modify cell content.
The first sets a new value string_value to a cell cell:
set text string_value to cell
String processing implemented as Java methods (e.g. regular expressions and string matching algorithms) can be used in the action.
The second one modifies the indent value integer_value of a cell cell:
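One example of such string processing is a Java method that removes extra spaces from cell text. The sketch below is only an assumption about what a helper of this kind might look like: the class TextCleanup and the method normalizeSpaces are hypothetical names and are not part of CRL.

```java
final class TextCleanup {
    /** Trims the text and collapses every run of whitespace into a single space. */
    static String normalizeSpaces(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(normalizeSpaces("  total   amount ")); // prints total amount
    }
}
```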
set indent integer_value to cell
This stage aims to recover entries and labels as functional data items presented in tables. Cells can also be associated with user-defined tags that assist in both role and structural analysis.
The action annotates a cell cell with a tag (a word or phrase) string_value:
set tag string_value to cell
An assigned tag can stand in for the corresponding constraints in subsequent rules, which makes their conditions more concise. The typical practice is to assign one tag to all cells that play the same role or are located in the same functional region of a table. Subsequent rules can then use the tag instead of repeating the constraints on cell location.
Looking at the pivot tables shown in the picture below, we can assume that each of them has an empty cell (stub head region) located in the top-left corner. It can be considered as a critical cell (Nagy, 2012) which determines three functional regions: body, head, and stub.
The rule based on this assumption adds the tag body to each cell c located in the body.
rule #n1
when
cell cc: cl == 1, rt == 1, blank
cell c: cl > cc.cr, rt > cc.rb
then
set tag "body" to c
end
Similarly, we can write rules that tag the corresponding cells with the words head and stub.
rule #n2
when
cell cc: cl == 1, rt == 1, blank
cell c: cl > cc.cr, rb <= cc.rb
then
set tag "head" to c
end
rule #n3
when
cell cc: cl == 1, rt == 1, blank
cell c: cr <= cc.cr, rt > cc.rb
then
set tag "stub" to c
end
As a result of matching these rules against the cells of the table shown in the picture below, we recover the following facts:
c1=(cl=3, rt=3, cr=3, rb=3, value="1", tag="body"),...,
c12=(cl=6, rt=5, cr=6, rb=5, value="5", tag="body"),
c13=(cl=3, rt=1, cr=4, rb=1, value="a", tag="head"),...,
c18=(cl=6, rt=2, cr=6, rb=2, value="d", tag="head"),
c19=(cl=1, rt=3, cr=1, rb=4, value="e", tag="stub"),...,
c23=(cl=2, rt=5, cr=2, rb=5, value="g", tag="stub")
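The constraints of the three tagging rules can be restated as a single Java function, which may help when checking them against the facts listed above. This is a sketch only: the class RegionSketch and the method region are hypothetical, cells are arrays {cl, rt, cr, rb}, and the critical cell coordinates (1, 1, 2, 2) used in main are inferred from those facts rather than stated in the text.

```java
final class RegionSketch {
    /** Classifies a cell c = {cl, rt, cr, rb} relative to the critical cell cc. */
    static String region(int[] c, int[] cc) {
        if (c[0] > cc[2] && c[1] > cc[3]) return "body";  // cl > cc.cr, rt > cc.rb
        if (c[0] > cc[2] && c[3] <= cc[3]) return "head"; // cl > cc.cr, rb <= cc.rb
        if (c[2] <= cc[2] && c[1] > cc[3]) return "stub"; // cr <= cc.cr, rt > cc.rb
        return null; // the critical cell itself, or a cell that matches no rule
    }

    public static void main(String[] args) {
        int[] cc = {1, 1, 2, 2}; // assumed critical cell (stub head)
        System.out.println(region(new int[] {3, 3, 3, 3}, cc)); // prints body
        System.out.println(region(new int[] {3, 1, 4, 1}, cc)); // prints head
        System.out.println(region(new int[] {1, 3, 1, 4}, cc)); // prints stub
    }
}
```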
The two actions presented below generate entries and labels in a cell cell, using string expressions entry_value and label_value that are usually obtained by processing its textual content:
new entry cell as entry_value
new label cell as label_value
The following short form creates an entry and a label from the cell text:
new entry cell
new label cell
The bilingual table that is shown in the picture below duplicates labels in two languages (Greek and Latin symbols).
Assuming that the first label (word) in a cell is written in one language and the second in the other, we can use the rule below to generate two labels from each cell located in the leftmost column or the topmost row:
rule #n
when
cell c: cl==1 || rt==1, !blank
then
new label c as token(c, 0)
new label c as token(c, 1)
end
In this example, we expect that the function token is implemented as a Java method and imported into the rules.
It returns the token (word) at the specified index in the text of a cell.
For the table this rule generates 8 labels:
l1=(value="α"), l2=(value="a"),..., l7=(value="𝛿"), l8=(value="d")
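The text does not show how token is implemented. A minimal sketch of one possible implementation follows, assuming that tokens are separated by whitespace and that the method receives the cell text as a String (the rule above passes the cell itself). The class name TokenSketch and the choice to return an empty string for a missing token are also assumptions.

```java
final class TokenSketch {
    /** Returns the whitespace-separated token at the given index, or "" if there is none. */
    static String token(String text, int index) {
        String[] tokens = text.trim().split("\\s+");
        return index >= 0 && index < tokens.length ? tokens[index] : "";
    }

    public static void main(String[] args) {
        System.out.println(token("alpha a", 0)); // prints alpha
        System.out.println(token("alpha a", 1)); // prints a
    }
}
```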
For tables similar to the one shown in the picture below, where any cell under the topmost row contains text of the form key=value, the following rule creates a label from the key part and an entry from the value part of the text:
rule #n
when
cell c: rt > 1
then
new label c as left(c, '=')
new entry c as right(c, '=')
end
The functions left and right are implemented as Java methods.
In the presented case, they extract the substrings before and after the character (=) respectively.
For the table this rule generates 9 entries and 9 labels:
e1=(value="1"),..., e9=(value="9"),
l1=(value="a"),..., l9=(value="i")
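The functions left and right could be written along the following lines. This is a hedged sketch rather than the actual implementation: the class name KeyValueSketch, the String parameter (the rule passes the cell itself), and the behavior when the separator is absent are all assumptions.

```java
final class KeyValueSketch {
    /** Substring before the first separator; the whole text if the separator is absent. */
    static String left(String text, char separator) {
        int i = text.indexOf(separator);
        return i < 0 ? text : text.substring(0, i);
    }

    /** Substring after the first separator; "" if the separator is absent. */
    static String right(String text, char separator) {
        int i = text.indexOf(separator);
        return i < 0 ? "" : text.substring(i + 1);
    }

    public static void main(String[] args) {
        System.out.println(left("a=1", '='));  // prints a
        System.out.println(right("a=1", '=')); // prints 1
    }
}
```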
This stage recovers pairs of two kinds: entry-label and label-label.
The action below binds an entry entry to a label label:
add label label to entry
There are two additional ways to create an entry-label pair.
The first associates an entry entry with a label specified by its value (label_value) from a category indicated by its name (category_name):
add label label_value of category_name to entry
The second creates the pair in the same way, but takes an already defined category category:
add label label_value of category to entry
In both cases, we try to find or create the label in the specified category.
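The find-or-create behavior can be sketched with nested maps in Java. The class LabelRegistry, its Label type, and the method findOrCreate are hypothetical names chosen for illustration. How the real engine stores categories and labels is not described in the text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LabelRegistry {
    /** Hypothetical label fact: a value that belongs to a named category. */
    static final class Label {
        final String value;
        final String category;
        Label(String value, String category) {
            this.value = value;
            this.category = category;
        }
    }

    // category name -> (label value -> label)
    private final Map<String, Map<String, Label>> categories = new LinkedHashMap<>();

    /** Returns the existing label of the category, creating category and label on demand. */
    Label findOrCreate(String categoryName, String labelValue) {
        return categories
            .computeIfAbsent(categoryName, name -> new LinkedHashMap<>())
            .computeIfAbsent(labelValue, value -> new Label(value, categoryName));
    }

    public static void main(String[] args) {
        LabelRegistry registry = new LabelRegistry();
        Label first = registry.findOrCreate("footnotes", "e");
        Label second = registry.findOrCreate("footnotes", "e");
        System.out.println(first == second); // prints true: the label is created only once
    }
}
```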
The table shown in the picture below uses a cell background color (Color.GRAY in the cell B2) as a reference to the footnote (e).
We can recover this relationship using the style features as follows:
rule #n
when
entry e: cell.style.bgColor == Color.GRAY
then
add label "e" of "footnotes" to e
end
We assume that the following facts exist before executing the rule:
c1=(style.bgColor=Color.GRAY, entries={e1}),
e1=(value="1", cell=c1)
As a result, we generate the new facts after its execution:
e1=(value="1", cell=c1, labels={l1}), l1=(value="e", category=d1),
d1=(name="footnotes", labels={l1})
This action connects two labels, label1 as a parent and label2 as its child:
set parent label1 to label2
A header located in the stub (leftmost column) often begins with an indent presented as a series of spaces, dots, or other padding characters. Usually, the indents denote hierarchical label-label pairs. For example, when each level in a label hierarchy augments the indent with two additional dots as shown in the picture below, we can recover label-label pairs as follows:
rule #n
when
cell c1: cl == 1
cell c2: cl == 1, rt > c1.rt, indent == c1.indent + 2
no cells: cl == 1, rt > c1.rt, rt < c2.rt, indent == c1.indent
then
set parent c1.label to c2.label
end
As a result, we recover the following label-label pairs:
(c,c1), (c1,c11), (c1,c12), (c,c2), (c2,c21), (d,d1), (d1,d11)
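The effect of this rule is that each label gets as its parent the nearest label above it whose indent is smaller by exactly one step. The Java sketch below restates that logic. The class IndentHierarchy and the method parents are hypothetical, labels are assumed to be unique, and the row order and indent values used in main are assumptions reconstructed from the pairs listed above, since the picture is not reproduced here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class IndentHierarchy {
    /** Maps each child label to the nearest label above it with indent == child indent - step. */
    static Map<String, String> parents(String[] labels, int[] indents, int step) {
        Map<String, String> parentOf = new LinkedHashMap<>();
        for (int child = 0; child < labels.length; child++) {
            for (int above = child - 1; above >= 0; above--) {
                if (indents[above] == indents[child] - step) {
                    parentOf.put(labels[child], labels[above]);
                    break; // the nearest candidate wins, as the "no cells" condition requires
                }
            }
        }
        return parentOf;
    }

    public static void main(String[] args) {
        String[] labels = {"c", "c1", "c11", "c12", "c2", "c21", "d", "d1", "d11"};
        int[] indents = {0, 2, 4, 4, 2, 4, 0, 2, 4}; // assumed indents of the stub column
        parents(labels, indents, 2)
            .forEach((child, parent) -> System.out.println("(" + parent + "," + child + ")"));
    }
}
```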
This stage includes actions for recovering label-category pairs.
The label categorizing action associates a label label with a category category:
set category category to label
Furthermore, a string expression category_name that gives the name of a category can also be used as an argument:
set category category_name to label
In the latter case, we try to find or create the category with this name.
Some tables contain category names among their headings.
The stub head of the table shown in the picture below contains two category names: A is for the category of the column labels (a1, a2, and a3) and B is for the category of the row labels (b1 and b2).
The names can be used to create the corresponding categories and to categorize labels. For tables similar to this one, we can assume that the cell in the top-left corner (stub head) contains two category names: the first one describes column labels and the second one describes row labels.
The rule below creates a category from the first token (word) contained in the top-left corner cell and uses it to categorize column labels:
rule #n
when
cell cc: cl == 1, rt == 1
label l: cell.rt == 1
then
set category token(cc, 0) to l
end
In the case of the shown table, we generate the following facts:
l1=(value="a1", category=d1), l2=(value="a2", category=d1),
l3=(value="a3", category=d1), d1=(name="A", labels={l1, l2, l3})
Arbitrary tables often place all labels of one category in the same row or column.
Consequently, we can suppose that such labels belong to one category without defining its name.
In such cases, grouping two or more labels means that they all belong to the same undefined category.
The action places two labels label1 and label2 in one group:
group label1 with label2
All labels of a group can be associated with only one category.
The stub of the pivot table shown in the picture below consists of two columns.
We can suppose that all stub labels originating from one column belong to the same undefined category.
The rule below arranges its stub labels (see also Example 2) into two groups ({h, i} and {j, k, l}):
rule #n
when
label l1: cell.tag == "stub"
label l2: cell.tag == "stub", cell.cl == l1.cell.cl
then
group l1 with l2
end
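Pairwise grouping can be pictured as building disjoint sets of labels. The Java sketch below uses a simple union-find structure for this. It assumes that grouping is transitive (if h is grouped with i and i with another label, all three end up in one group), which the text does not state explicitly. The class LabelGroups and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

final class LabelGroups {
    // Each label points to a representative of its group; a root points to itself.
    private final Map<String, String> parent = new HashMap<>();

    /** Returns the representative label of the group that contains the given label. */
    String find(String label) {
        parent.putIfAbsent(label, label);
        String root = label;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        parent.put(label, root); // shorten the path for later lookups
        return root;
    }

    /** Mirrors "group label1 with label2": both labels end up in one group. */
    void group(String label1, String label2) {
        parent.put(find(label1), find(label2));
    }

    boolean sameGroup(String label1, String label2) {
        return find(label1).equals(find(label2));
    }

    public static void main(String[] args) {
        LabelGroups groups = new LabelGroups();
        groups.group("h", "i");
        groups.group("j", "k");
        groups.group("k", "l");
        System.out.println(groups.sameGroup("j", "l")); // prints true
        System.out.println(groups.sameGroup("h", "j")); // prints false
    }
}
```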