-
Notifications
You must be signed in to change notification settings - Fork 6
Extraction
All the excel examples are located within the workbook called basic_examples.xlsx.
All the definitions are located within the extraction folder in a form of integration tests
We will start with the most simple case located in simple spreadsheet
— the way Excel spreadsheets should be actually used.
There is only a single table here that needs to be parsed, so let's do that!
The whole workbook definition looks like that
val definition = WorkbookParserDefinition(
spreadsheetParserDefinitions = listOf(
SheetParserDefinition(
sheetNameFilter = { it == "simple spreadsheet" },
tableDefinitions = listOf(
TableParserDefinition(
requiredColumns = setOf(
StringHeaderCell("team"),
StringHeaderCell("plays"),
StringHeaderCell("wins"),
StringHeaderCell("goal diff"),
StringHeaderCell("points"),
StringHeaderCell("date")
)
)
)
)
)
)
Let's tackle this big chunk of code object-by-object now!
WorkbookParserDefinition
is the entrypoint of all defintions. The most important thing to know for now is that it contains a list of spreadsheet definition. And each spreadsheet parser definition contains a list of table parser defitions.
As we have multiple spreadsheets within a workbook, we applied sheetNameFilter
to only parse a spreadsheet we needed. This is a lambda function that is simply checked against the name of the spreadsheet.
The most interesting part for us here is TableDefinition
TableParserDefinition(
requiredColumns = setOf(
StringHeaderCell("team"),
StringHeaderCell("plays"),
StringHeaderCell("wins"),
StringHeaderCell("goal diff"),
StringHeaderCell("points"),
StringHeaderCell("date")
)
)
This way you define the header row so that refinery
can locate it and further resolve the table location (start row, header row, end row).
Then we will use supply the WorkbookParser
with definition and workbook and invoke parse
method to get the list of parsed records back.
val workbook = WorkbookFactory.create(file)
val parsedRecords = WorkbookParser(definition, workbook).parse()
By default, all parsed records will be of type GenericRecord
which has a Map<String, Any>
data holder inside
In our case we got 4 records back, and in the assertion block you can see how they look like
Note, that apart from the data from the table another useful metadata is extracted such as spreadsheet name and row number.
You will notice that the Definition support optional collumns as well. These can be used to define the set of columns that are not always required to be there in order to identify a header row. However when they exist you still want to use them, for example in your parser (more on that later).
Let's look at a bit more complex example now in spreadsheet multiple tables
.
This spreadsheet contains 2 tables: one with the stats of the teams in group A, another one with all matches played within a group and fixtures.
To extract the data from both of the tables we need to define 2 TableParserDefinition
s and add both of them into the list of all table defintions
SheetParserDefinition(
tableDefinitions = listOf(teamStatsTableDefinition, teamPlaysTableDefinition),
...
)
As you can see we got 12 records back in parsedRecords
. 4 of them are related to team stat rows, another 8 are related to team plays.
Now as we know how to parse multiple tables, let's explore the data with the anchors.
Here we have the same headers but the defined twice related team stats in different groups (Group A
and Group B
).
To handle that case we need to add extra param anchor
in the table definition:
SheetParserDefinition(
tableDefinitions = listOf(
// 1st table
TableParserDefinition(
requiredColumns = headerColumns,
anchor = "Group A"
),
// 2nd table
TableParserDefinition(
requiredColumns = headerColumns,
anchor = "Group B"
)
),
...
)
Note that once the data is parsed the anchor information is appeared in each GenericRecord
as an extracted metadata.
Another way to represent multiple groups (weird though) is to have group as a divider within a single table, and refinery
is allowing to extract this data as well!
Simply add parameter hasDivider = true
and voila!
SheetParserDefinition(
tableDefinitions = listOf(
TableParserDefinition(
requiredColumns = headerColumns,
hasDivider = true
)
),
...
)
Now groups information is extracted as a value of divider
key
The hasDivider
option will use any single-cell row within a table as a divider. Sometimes we wish to limit which of these of we actually want to recognise. For this we can add the allowedDividers
argument, which uses AbstractHeaderCell
to do matching. Of all possible dividers found, only those that match one of these possible allowedDividers
will be used - anything else will be skipped.
SheetParserDefinition(
tableDefinitions = listOf(
TableParserDefinition(
requiredColumns = headerColumns,
hasDivider = true,
allowedDividers = setOf(StringHeaderCell("Group A"), StringHeaderCell("Group B"))
)
),
...
)
Merged cells are handled by refinery as well, so you will see plays
attribute in each GenericRecord
being equal to 3
.
Merged cells are a bit more tricky, as for a 1 merged cells header you might have N columns with different data in cells. To handle that refinery
adds the postfix with the column number. E.g. in our case score
is located in column 3 and 4, so the data is extracted into score_3
and score_4
keys.