-
Notifications
You must be signed in to change notification settings - Fork 0
Example 2. Useful side effects: Name value pairs
Converting XML to name-value pairs (NVP's) is something I regularly do, in order to simplify tasks such as data reconciliation (difficult with complex XML, easy with NVP's) and determining frequency of field usage.
In the past, this has been a cumbersome, ad-hoc process along the lines of:
- Query Hive to get the source XML I am interested in (for example, trade data for today and yesterday);
- Dump the data out to a file share;
- Parse each XML record using an XSL template (using the Saxon command line, or a batch tool) to get a NVP representation;
- Load that data BACK into Hive;
- Perform further analysis (such as determining what fields have changes in the two populations, or counts of "name" fields or "value" fields;
Being able to slot a generic XML to NVP template into Hive collapses this whole process down to a single step, as I will illustrate below...
The snippet of the template above shows that it will output two columns - nameField and valueField. Here it is in action, using the same Summit XML from the first example:
select xsltable(xml_text,'/home/james_k_robinson/xsl/xml_to_nvp.xsl') from summit.trade_xml t where t.trade_id = '123456LQ';
The next example illustrates using the NVP's to reconcile complex XML. In this example, the FPML for a swap trade is used - one original version and one I have made minor modifications to.
This is the original FPML swap as NVP's:
select xsltable(xml_text,'/home/james_k_robinson/xsl/xml_to_nvp.xsl') from fpml.trade_xml t where t.trade_id = 'TW9235';
Notice that this is using the same template as used previously for transforming Summit XML - the NVP XSL is completely generic and can be used with any XML data.
There are two types of reconciliation that can be undertaken - field populations and field values. The below shows the field population rec process, where it reveals that I deleted one of the notional fields from the modified XML:
select a.namefield from
(select xsltable(xml_text,'/home/james_k_robinson/xsl/xml_to_nvp.xsl') from fpml.trade_xml t where t.trade_id = 'TW9235') a
where a.namefield not in
(select b.namefield from (select xsltable(xml_text,'/home/james_k_robinson/xsl/xml_to_nvp.xsl') from fpml.trade_xml t where t.trade_id = 'TW9235_amended') b);
For a field value rec, the SQL below does the trick:
select a.namefield, a.valuefield, b.valuefield from
(select xsltable(xml_text,'/home/james_k_robinson/xsl/xml_to_nvp.xsl') from fpml.trade_xml t where t.trade_id = 'TW9235') a
inner join
(select summit.xsltable(xml_text,'/home/james_k_robinson/xsl/xml_to_nvp.xsl') from fpml.trade_xml t where t.trade_id = 'TW9235_amended') b
on a.namefield = b.namefield where a.valuefield <> b.valuefield;
This shows that I changed the value of the TradeId and one of the cashflow start dates. You can further filter these results (for example if you're only looking for changes is a particular subset of the nodes) by adding items to the where clause, eg:
...on a.namefield = b.namefield where a.valuefield <> b.valuefield and a.namefield not like '%tradeId%';
Next section: Example 3