by Aleck Kulabukhov, Managing Consultant Sydney
Introduction
XML parsing is a compute-expensive exercise. XML is quite a popular way to store unstructured data so we often face a “Big XML Dilemma” – we can either:
- pre-parse XML, extract the required fields and store them in a most efficient and fast-to-query way (such as Data Warehouse for example) or
- store XMLs in a NoSQL database and process them on-the-fly while running a query.
There are several major drawbacks with each approach along with the obvious benefits.
Method | Pros | Cons |
Pre-processing | Fast and efficient querying. | Only a limited set of XML attributes is usually extracted. |
XML parsing is a part of non-time critical ETL process that can be optimized/scaled. | Bringing new attributes is a massive re-development effort. | |
Changing XML schema can break the ETL. | ||
Processing-on-fly | XML is captured as a whole. | Querying is either much slower or more expensive. |
No complex ETL to extract attributes from the XML required. | Limited support by the current DBMS offerings for XML column types. | |
Efficient indexing is problematic, slowing down the querying even further. |
A proposed approach
The approach pioneered in Altis combines benefits of both paradigms with little, if any, added drawbacks.
Let’s take an extremely simple XML document as an example (we will omit XML header information for the simplicity and focus on the document contents instead):
<Note Importance=”high” Classification=”non-sensitive” Type=”Private”>
<To Bcc=”NSA”>Mike michaelj@company1.com.au</To>
<From>Jani janik@company2.com.au</From>
<Heading>Reminder</Heading>
<Body>Fishing this weekend!</Body>
<Meta>
<Sent Time=”20191018104902″/>
<Received/>
</Meta>
</Note>
If we take the pre-processing approach, then flattening this extremely simple tiny document is not a trivial task and adding support for additional fields (marked in yellow) that can be added to this kind of document later is full of grief and pain.
The solution discussed dissects XML as a parent-child tree structure with attributes being leaves.
For example, the mentioned above XML can be presented in a tabular form:
Document ID | ID | Parent ID | Item Type | Label | Value |
0 | 0 | 0 | 0 | Note | |
0 | 1 | 0 | 1 | Type | Private |
0 | 2 | 0 | 1 | Classification | non-sensitive |
0 | 3 | 0 | 1 | Importance | high |
0 | 4 | 0 | 0 | To | Mike michaelj@company1.com.au |
0 | 5 | 4 | 1 | Bcc | NSA |
0 | 6 | 0 | 0 | From | Jani janik@company2.com.au |
0 | 7 | 0 | 0 | Heading | Reminder |
0 | 8 | 0 | 0 | Body | Fishing this weekend! |
0 | 9 | 0 | 0 | Meta | |
0 | 10 | 9 | 0 | Sent | |
0 | 11 | 10 | 1 | Time | 20191018104902 |
0 | 12 | 9 | 0 | Received |
Item type is 0 for the nodes, and 1 for the attributes.
The Document ID allows us to use multiple XML documents in the same table. The root element is determined by both Parent ID and ID being equal 0.
Analysis
If we add the proposed approach to the analysis table presented above the benefits are quite striking:
Method | Pros | Cons |
Pre-processing | Fast and efficient querying. | Only a limited set of XML attributes is usually extracted. |
XML parsing is a part of non-time critical ETL process that can be optimized/scaled. | Bringing new attributes is a massive re-development effort. | |
Changing XML schema can break the ETL. | ||
Processing-on-fly | XML is captured as a whole. | Querying is either much slower or more expensive. |
No complex ETL to extract attributes from the XML required. | Limited support by the current DBMS offerings for XML column types. | |
Efficient indexing is problematic, slowing down the querying even further. | ||
XML-flattening | XML pre-processing is done at ETL stage. | Querying within a hierarchy requires more complex queries with no massive impact on efficiency. |
XML is captured fully, no extracts. | ||
Indices can and should be added for a fast querying. | ||
Any XML can be stored in the same DBMS, even in the same table if needed. | ||
Any modern RDBMS can store the data, same applies to the Data Warehouse. | ||
Data Lake ready. |
Outcome
The resulting flattened data structured can be easily stored in RDBMS (or in a Data Warehouse), with full indexing for efficient querying or in a Data Lake in of the efficient storage formats (parquet, avro, orc, etc.).
Querying such structure is quite simple.
Retrieving full list of messages sent by Mike would be as simple as:
SELECT DISTINCT document_id FROM messages WHERE label=’From’ AND value=’ Mike michaelj@company1.com.au’
All messages matching ‘John’ in the From field and a word ‘fishing’ in the body:
SELECT DISTINCT document_id
FROM messages m1
JOIN messages m2 ON m1.document_id=m2.document_id
WHERE m1.label=’From’ AND POSITION(‘Mike’ IN m1.value)>0
AND m2.label=’Body’ AND POSITION(‘fishing’ IN m2.value)>0
If required, for even more efficient querying a set of materialised views can be created and easily maintained by most organisations.
Impact
Being able to store a potentially unlimited number of XMLs in a columnar storage system (potentially indexed if we consider RDBMS) with an ability to query it without incurring massive CPU-costs (compared to XML-processing on fly) and having access to all XML attributes at once makes a clear option for organisations acquiring/storing their as XMLs.
If you would like to know more about XML Storage or if you are interesting in using our services – contact us here.
Similar blog posts:
One Response