Skip to content

Example 2. Useful side effects: Name value pairs

jameskrobinson edited this page Aug 8, 2021 · 3 revisions

Name-value pairs

Converting XML to name-value pairs (NVP's) is something I regularly do, in order to simplify tasks such as data reconciliation (difficult with complex XML, easy with NVP's) and determining frequency of field usage.

In the past, this has been a cumbersome, ad-hoc process along the lines of:

  • Query Hive to get the source XML I am interested in (for example, trade data for today and yesterday);
  • Dump the data out to a file share;
  • Parse each XML record using an XSL template (using the Saxon command line, or a batch tool) to get a NVP representation;
  • Load that data BACK into Hive;
  • Perform further analysis (such as determining what fields have changes in the two populations, or counts of "name" fields or "value" fields;

Being able to slot a generic XML to NVP template into Hive collapses this whole process down to a single step, as I will illustrate below...

XML to NVP template

The snippet of the template above shows that it will output two columns - nameField and valueField. Here it is in action, using the same Summit XML from the first example:

select xsltable(xml_text,'/home/james_k_robinson/xsl/xml_to_nvp.xsl') from summit.trade_xml t where t.trade_id = '123456LQ';

NVP query

The next example illustrates using the NVP's to reconcile complex XML. In this example, the FPML for a swap trade is used - one original version and one I have made minor modifications to.

This is the original FPML swap as NVP's:

select xsltable(xml_text,'/home/james_k_robinson/xsl/xml_to_nvp.xsl') from fpml.trade_xml t where t.trade_id = 'TW9235';

FPML NVPs

Notice that this is using the same template as used previously for transforming Summit XML - the NVP XSL is completely generic and can be used with any XML data.

There are two types of reconciliation that can be undertaken - field populations and field values. The below shows the field population rec process, where it reveals that I deleted one of the notional fields from the modified XML:

select a.namefield from (select xsltable(xml_text,'/home/james_k_robinson/xsl/xml_to_nvp.xsl') from fpml.trade_xml t where t.trade_id = 'TW9235') a where a.namefield not in (select b.namefield from (select xsltable(xml_text,'/home/james_k_robinson/xsl/xml_to_nvp.xsl') from fpml.trade_xml t where t.trade_id = 'TW9235_amended') b);

FPML pop rec

For a field value rec, the SQL below does the trick:

select a.namefield, a.valuefield, b.valuefield from (select xsltable(xml_text,'/home/james_k_robinson/xsl/xml_to_nvp.xsl') from fpml.trade_xml t where t.trade_id = 'TW9235') a inner join (select summit.xsltable(xml_text,'/home/james_k_robinson/xsl/xml_to_nvp.xsl') from fpml.trade_xml t where t.trade_id = 'TW9235_amended') b on a.namefield = b.namefield where a.valuefield <> b.valuefield;

FPML value rec

This shows that I changed the value of the TradeId and one of the cashflow start dates. You can further filter these results (for example if you're only looking for changes is a particular subset of the nodes) by adding items to the where clause, eg:

...on a.namefield = b.namefield where a.valuefield <> b.valuefield and a.namefield not like '%tradeId%';

Next section: Example 3