-
Notifications
You must be signed in to change notification settings - Fork 0
Example 1. Basic UDTF function
This page outlines the most basic use of the XSL UDTF by means of an example. All of the XSL templates used in this and subsequent examples, and the sample input data, is available in the repository.
Assume we have XML in the format below in a table in Hive, and we can retrieve it with a select statement.
In order to pull back the "Event" records in tabular format, we've created an XSL template that loops through all event nodes and outputs the required data, one row per node. The template below will do this, and it also pulls back some additional data from two levels up in the node hierarchy in order to be able to identify where the events came from:
Take note of the following:
- The header row defines the output column names and data type which will be returned to Hive;
- The column and row separator values default to TAB and NEWLINE. These can be overridden if required (more on this later);
- The level at which the template is going to operate is EVENT - this tells the template what node set to find and iterate over. It will find all occurrences of EVENT in the source XML;
Before using the function in anger, ensure it's been added to Hive, and look at its parameters:
- The first parameter, XML, will either be a column that contains XML, retrieved from a table, or supplied as a string
- The second parameter is the path to the XSL file you want to use to tabulate the XML.
- The third parameter is optional and allow us to supply arguments to Saxon to use when transforming the XML. More on this later;
Below is an example of the UDTF in action, using the above Trade XML and the above Trade XML -> Event XSL:
select xsltable(xml_text,'/home/james_k_robinson/xsl/EVENT_to_csv.xsl') from summit.trade_xml t where t.trade_id = '123456LQ' limit 10
Notice that two extra columns have been added - "udtfRowStatus" and "udtfRowNarrative". These are used to record any issues the UDFT has when populating the columns for a given row. If the value of udtfRowStatus is zero, no problems were encountered.
The first thing you'll want to try is filtering the results. And, you'll be immediately disappointed:
This is because Hive cannot possibly know about the column called asset_pors, as this column is being dynamically created by the UDTF. Thankfully, the LATERAL VIEW instruction is designed to cater for this, and modifying the query slightly allows us to filter on any field returned by the UDTF:
select x.* from summit.trade_xml t LATERAL VIEW xsltable(xml_text,'/home/james_k_robinson/xsl/EVENT_to_csv.xsl') x where t.trade_id = '123456LQ' and x.asset_pors = 'S';
You can also use aggregate functions (like count), combine the columns with those of the underlying table and select subsets of columns, as shown in the examples below.
Count:
select t.trade_id, count(*) from summit.trade_xml t LATERAL VIEW xsltable(xml_text,'/home/james_k_robinson/xsl/EVENT_to_csv.xsl') x where t.trade_id = '123456LQ' group by x.Tradeid, t.trade_id;
Combining with underlying table:
select t.trade_id, x.* from summit.trade_xml t LATERAL VIEW xsltable(xml_text,'/home/james_k_robinson/xsl/EVENT_to_csv.xsl') x where t.trade_id = '123456LQ' and x.asset_pors = 'S';
Filtering columns:
select t.trade_id, x.asset_pors, x.adate, x.type, x.ccy, x.amount from summit.trade_xml t LATERAL VIEW xsltable(xml_text,'/home/james_k_robinson/xsl/EVENT_to_csv.xsl') x where t.trade_id = '123456LQ' and x.asset_pors = 'S';
For the most part, the LATERAL VIEW allows the UDFT output to be used in exacty the same ways as a table.
This concludes the overview of the basic UDFT functionality. The next example shows how the UDTF can be used for more interesting data analysis tasks.
Next section: Example 2