Skip to content

Transformation

Stefan Mihaylov edited this page Nov 26, 2021 · 11 revisions

All the excel examples are located within the workbook called basic_examples.xlsx.

Transformation features

Now that we learned about extraction of the data from excel spreadsheets, let's look into how we can further consume and manipulate with this data.

In extraction examples all the parsed records were the instances of GenericRecord. It holds the data in the denormalized way as a Map<String, Any>. But what we need is a strict model with checks along the way.

E.g. let's look into into broken stats spreadsheet within basic_examples file. You can see that Club Brugge misses the data about the total points.

However, this is a critical information for us and we want to check non-nullability of the data along the extraction of the data (with capturing what rows do not comply with our model validation).

Apart from this, we want to store the data in a normalised way as a data class with statically typed fields. E.g. we know that points could only be integers, if it's something else — it should not pass the validation checks!

To support both of the features refinery allows you to define the data class with custom row parser to map the values to the fields of the data class.

By default refinery uses a GenericRowParser that will simply convert all entries to a Map. Although useful defining your own parser will allow you to fit the data into an actual data class with typing information.

First, let's define the data class where we want to store the data:

data class TeamStat(
   val team: String,
   val group: String,
   val plays: Int,
   val wins: Int?,
   val goalDiff: Int?,
   val points: Int,
   val date: LocalDate
 : ParsedRecord()

Please note, that class should be inherited from ParsedRecord.

Next, instead of putting all the HeaderCell instances directly into the definition, we need to extract them as separate variables as we will need to refer to them when we define a custom row parser:

val team = StringHeaderCell("team")
val plays = StringHeaderCell("plays")
val wins = StringHeaderCell("wins")
val goalDiff = StringHeaderCell("goal diff")
val points = StringHeaderCell("points")
val date = StringHeaderCell("date")

Now, we need to define the custom parser. This custom parser should be inherited from RowParser with RowParserData and ExceptionManager passed as params

class StatParser(rowParserData: RowParserData, exceptionManager: ExceptionManager) :
    RowParser(rowParserData, exceptionManager) {
    override fun toRecord(row: Row): TeamStat {
        return TeamStat(
            team = parseRequiredFieldAsString(row, team),
            group = rowParserData.metadata.getAnchor(),
            plays = parseOptionalFieldAsInteger(row, plays)!!,
            wins = parseOptionalFieldAsInteger(row, wins),
            goalDiff = parseOptionalFieldAsInteger(row, goalDiff),
            points = parseOptionalFieldAsInteger(row, points)!!, // todo: this should have a dedicated method called `parseRequiredFieldAsInteger`
            date = parseOptionalFieldAsDateTime(row, date)!!.toLocalDate()
        )
    }
}

You can see that now we are explicitly saying what extracted data should be passed to which fields. On top of this we have validation of the extracted data (e.g. parseRequiredFieldAsString will always return non-nullable String and will fail to parse the row if the value is absent).

Finally, we need to combine of all that pieces of logic to the adjusted schema definition:

val headerColumns = setOf(team, plays, wins, goalDiff, points, date)
val definition = WorkbookParserDefinition(
    spreadsheetParserDefinitions = listOf(
        SheetParserDefinition(
            sheetNameFilter = { it == "broken stats" },
            tableDefinitions = listOf(
                // 1st table
                TableParserDefinition(
                    requiredColumns = headerColumns,
                    anchor = "Group A",
                    rowParserFactory = ::StatParser // referencing a class that applies a custom logic
                ),
                // 2nd table
                TableParserDefinition(
                    requiredColumns = headerColumns,
                    anchor = "Group B",
                    rowParserFactory = ::StatParser // referencing a class that applies a custom logic
                )
            )
        )
    )
)

Note, that we have a separate parameter in TableParserDefinition called rowParserFactory that accepts a lambda (reference to our custom row parser).

HeaderCell types

Refinery supports various different HeaderCell types to help parse more complex headers. So far we've seen the StringHeaderCell which uses a String or Set of those and works on case incensetive substrings. There are also:

  • SimpleHeaderCell
  • RegexHeaderCell
  • MergedHeaderCell
  • OrderedHeaderCell

RowParser helper methods

You will find a set of methods to assist parsing the sheets into the correct datatypes e.g. parseOptionalFieldAsInteger. In case of a required field you can use the required counterpart of those methods e.g. parseRequiredFieldAsInteger, which in case of failiure or missing data will register a CellParserException and continue with the next row

Exception Management

You will notice that you can pass an ExceptionManager to the WorkbookParser. Any errors that occure during processing will be registered there, so as to not halt the progress of the parsing. In most cases you will find the Sheet and Row referenced in the exception as well. Details on what kind of exceptions are supported can be found here

Clone this wiki locally