Skip to content

Example 1. Basic UDTF function

jameskrobinson edited this page Aug 10, 2021 · 5 revisions

This page outlines the most basic use of the XSL UDTF by means of an example. All of the XSL templates used in this and subsequent examples, and the sample input data, is available in the repository.

Basic XSL template, demonstrating repeating groups

Assume we have XML in the format below in a table in Hive, and we can retrieve it with a select statement.

Swap with events

In order to pull back the "Event" records in tabular format, we've created an XSL template that loops through all event nodes and outputs the required data, one row per node. The template below will do this, and it also pulls back some additional data from two levels up in the node hierarchy in order to be able to identify where the events came from:

Swap events XSL

Take note of the following:

  • The header row defines the output column names and data type which will be returned to Hive;
  • The column and row separator values default to TAB and NEWLINE. These can be overridden if required (more on this later);
  • The level at which the template is going to operate is EVENT - this tells the template what node set to find and iterate over. It will find all occurrences of EVENT in the source XML;

Before using the function in anger, ensure it's been added to Hive, and look at its parameters:

Function describe

  • The first parameter, XML, will either be a column that contains XML, retrieved from a table, or supplied as a string
  • The second parameter is the path to the XSL file you want to use to tabulate the XML.
  • The third parameter is optional and allow us to supply arguments to Saxon to use when transforming the XML. More on this later;

Below is an example of the UDTF in action, using the above Trade XML and the above Trade XML -> Event XSL:

select xsltable(xml_text,'/home/james_k_robinson/xsl/EVENT_to_csv.xsl') from summit.trade_xml t where t.trade_id = '123456LQ' limit 10

Basic event query

Notice that two extra columns have been added - "udtfRowStatus" and "udtfRowNarrative". These are used to record any issues the UDFT has when populating the columns for a given row. If the value of udtfRowStatus is zero, no problems were encountered.

The first thing you'll want to try is filtering the results. And, you'll be immediately disappointed:

where clause error

This is because Hive cannot possibly know about the column called asset_pors, as this column is being dynamically created by the UDTF. Thankfully, the LATERAL VIEW instruction is designed to cater for this, and modifying the query slightly allows us to filter on any field returned by the UDTF:

select x.* from summit.trade_xml t LATERAL VIEW xsltable(xml_text,'/home/james_k_robinson/xsl/EVENT_to_csv.xsl') x where t.trade_id = '123456LQ' and x.asset_pors = 'S';

lateral view

You can also use aggregate functions (like count), combine the columns with those of the underlying table and select subsets of columns, as shown in the examples below.

Count: select t.trade_id, count(*) from summit.trade_xml t LATERAL VIEW xsltable(xml_text,'/home/james_k_robinson/xsl/EVENT_to_csv.xsl') x where t.trade_id = '123456LQ' group by x.Tradeid, t.trade_id;

Combining with underlying table: select t.trade_id, x.* from summit.trade_xml t LATERAL VIEW xsltable(xml_text,'/home/james_k_robinson/xsl/EVENT_to_csv.xsl') x where t.trade_id = '123456LQ' and x.asset_pors = 'S';

Filtering columns: select t.trade_id, x.asset_pors, x.adate, x.type, x.ccy, x.amount from summit.trade_xml t LATERAL VIEW xsltable(xml_text,'/home/james_k_robinson/xsl/EVENT_to_csv.xsl') x where t.trade_id = '123456LQ' and x.asset_pors = 'S';

For the most part, the LATERAL VIEW allows the UDFT output to be used in exacty the same ways as a table.

misc basic examples

This concludes the overview of the basic UDFT functionality. The next example shows how the UDTF can be used for more interesting data analysis tasks.

Next section: Example 2