Data & Analytics
Spark
October 31, 2019

Cost-effective XML storage and querying at scale

by Aleck Kulabukhov, Managing Consultant Sydney

Introduction

XML parsing is a compute-expensive exercise. XML is quite a popular way to store unstructured data so we often face a “Big XML Dilemma” – we can either:

pre-parse XML, extract the required fields and store them in a most efficient and fast-to-query way (such as Data Warehouse for example) or
store XMLs in a NoSQL database and process them on-the-fly while running a query.

There are several major drawbacks with each approach along with the obvious benefits.

Method	Pros	Cons
Pre-processing	Fast and efficient querying.	Only a limited set of XML attributes is usually extracted.
XML parsing is a part of non-time critical ETL process that can be optimized/scaled.	Bringing new attributes is a massive re-development effort.
	Changing XML schema can break the ETL.
Processing-on-fly	XML is captured as a whole.	Querying is either much slower or more expensive.
No complex ETL to extract attributes from the XML required.	Limited support by the current DBMS offerings for XML column types.
	Efficient indexing is problematic, slowing down the querying even further.

A proposed approach

The approach pioneered in Altis combines benefits of both paradigms with little, if any, added drawbacks.

Let’s take an extremely simple XML document as an example (we will omit XML header information for the simplicity and focus on the document contents instead):

<Note Importance=”high” Classification=”non-sensitive” Type=”Private”>
<To Bcc=”NSA”>Mike michaelj@company1.com.au</To>
<From>Jani janik@company2.com.au</From>
<Heading>Reminder</Heading>
<Body>Fishing this weekend!</Body>
<Meta>
<Sent Time=”20191018104902″/>
<Received/>
</Meta>
</Note>

If we take the pre-processing approach, then flattening this extremely simple tiny document is not a trivial task and adding support for additional fields (marked in yellow) that can be added to this kind of document later is full of grief and pain.

The solution discussed dissects XML as a parent-child tree structure with attributes being leaves.

For example, the mentioned above XML can be presented in a tabular form:

Document ID	ID	Parent ID	Item Type	Label	Value
0	0	0	0	Note
0	1	0	1	Type	Private
0	2	0	1	Classification	non-sensitive
0	3	0	1	Importance	high
0	4	0	0	To	Mike michaelj@company1.com.au
0	5	4	1	Bcc	NSA
0	6	0	0	From	Jani janik@company2.com.au
0	7	0	0	Heading	Reminder
0	8	0	0	Body	Fishing this weekend!
0	9	0	0	Meta
0	10	9	0	Sent
0	11	10	1	Time	20191018104902
0	12	9	0	Received

Item type is 0 for the nodes, and 1 for the attributes.

The Document ID allows us to use multiple XML documents in the same table. The root element is determined by both Parent ID and ID being equal 0.

Analysis

If we add the proposed approach to the analysis table presented above the benefits are quite striking:

Method	Pros	Cons
Pre-processing	Fast and efficient querying.	Only a limited set of XML attributes is usually extracted.
XML parsing is a part of non-time critical ETL process that can be optimized/scaled.	Bringing new attributes is a massive re-development effort.
	Changing XML schema can break the ETL.
Processing-on-fly	XML is captured as a whole.	Querying is either much slower or more expensive.
No complex ETL to extract attributes from the XML required.	Limited support by the current DBMS offerings for XML column types.
	Efficient indexing is problematic, slowing down the querying even further.
XML-flattening	XML pre-processing is done at ETL stage.	Querying within a hierarchy requires more complex queries with no massive impact on efficiency.
XML is captured fully, no extracts.
Indices can and should be added for a fast querying.
Any XML can be stored in the same DBMS, even in the same table if needed.
Any modern RDBMS can store the data, same applies to the Data Warehouse.
Data Lake ready.

Outcome

The resulting flattened data structured can be easily stored in RDBMS (or in a Data Warehouse), with full indexing for efficient querying or in a Data Lake in of the efficient storage formats (parquet, avro, orc, etc.).

Querying such structure is quite simple.

Retrieving full list of messages sent by Mike would be as simple as:

SELECT DISTINCT document_id FROM messages WHERE label=’From’ AND value=’ Mike michaelj@company1.com.au’

All messages matching ‘John’ in the From field and a word ‘fishing’ in the body:

SELECT DISTINCT document_id
FROM messages m1
JOIN messages m2 ON m1.document_id=m2.document_id
WHERE m1.label=’From’ AND POSITION(‘Mike’ IN m1.value)>0
AND m2.label=’Body’ AND POSITION(‘fishing’ IN m2.value)>0

If required, for even more efficient querying a set of materialised views can be created and easily maintained by most organisations.

Impact

Being able to store a potentially unlimited number of XMLs in a columnar storage system (potentially indexed if we consider RDBMS) with an ability to query it without incurring massive CPU-costs (compared to XML-processing on fly) and having access to all XML attributes at once makes a clear option for organisations acquiring/storing their as XMLs.

If you would like to know more about XML Storage or if you are interesting in using our services – contact us here.

Similar blog posts:

https://altisconsulting.com/nz/putting-together-a-roadmap-to-guide-your-data-analytics-journey/

https://altisconsulting.com/nz/greater-control-of-your-power-bi-datasets-with-the-power-bi-rest-api/

One Response

Pingback: Data on the Balance Sheet - Altis

Hyperscaler native vs best of breed services – what should you choose for your cloud data & analytics solution?

Blog Posts

Enhanced dbt Data Quality Observability at Speed

Blog Posts

Why you may need a dedicated stream to tackle source system onboarding as part of your data transformation program

Blog Posts

Modernising your legacy time-series solution using Cloud Data ecosystems

Connect with us

If you’d like to be kept in the loop on courses, events and other related topics, simply complete your details and we’ll add you to our list.

Consulting Services

Frameworks

Training

Cost-effective XML storage and querying at scale

Share

One Response

Leave a Reply Cancel reply

Recent

Hyperscaler native vs best of breed services – what should you choose for your cloud data & analytics solution?

Enhanced dbt Data Quality Observability at Speed

Why you may need a dedicated stream to tackle source system onboarding as part of your data transformation program

Modernising your legacy time-series solution using Cloud Data ecosystems

Connect with us

We are Altis

Quick Links

Contact us

Connect with us