by Garry Farrell, Managing Consultant - Altis Sydney
Data lakes are very popular, and I have been helping to extend the framework we use with our clients. The framework is an important starting point for client discussions, and it doubles as a best-practice design that helps our consultants get the build done efficiently while avoiding traps we have encountered in the past.
The specific part of the framework that I will discuss here, and which will make a data lake more robust, is three variations of a technique for validating the structure of files using Azure Data Factory (ADF).
Why would we need to validate files? There are several reasons: source systems change their extract formats over time, files can arrive truncated or malformed, and a file with an unexpected schema can break downstream loads or quietly land bad data in the lake.
Now that we have established that it is a great idea to validate the files, let's look at how you can achieve this using ADF. I will discuss the method I typically use, along with two other variations I tested that may also interest you.
Method 1: Validate using a control file
My initial approach was to compare two files directly. File 1 is the newly ingested file and File 2 is our controlled schema file. This method simply validates new files by comparing them to a known-good file.
- Retrieve the metadata from the newly ingested file using the ADF Get Metadata activity. We will use the Structure attribute, which returns a list of column names and column types in JSON format (see the example output after these steps). The files need to be stored in an Azure storage account.
- Compare to a known schema that we are expecting. In my example, I have used a file already in storage as the controlled schema to compare to the incoming files.
- Use the ADF If Condition activity to compare the two metadata structures. This is the logic to use in the If Condition.
@equals(activity('Get Metadata1').output.structure, activity('Get Metadata2').output.structure)
The equals function compares the two structure objects as a whole, rather than individual values, so the column names, types and their order all have to match.
- Depending upon the result of the If Condition, use the ADF Copy Data activity to copy the file either to the RAW zone (accepting the file) or to the Quarantine zone (rejecting the file).
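For reference, this is the shape of the Structure attribute returned for a delimited text file; the column names below are illustrative, and both sides of the comparison have to match on names, types and order.
[
{"name": "Column1", "type": "String"},
{"name": "Column2", "type": "String"}
]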
This is the resulting pipeline.
For a series of files, you would need to use the ADF ForEach activity to loop through the list of files.
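As a sketch of how that loop could be wired up (the activity name 'Get Metadata folder' is illustrative): a Get Metadata activity pointed at the folder, with the Child items argument selected, returns the list of files, and the ForEach iterates over that list, passing the current file name to a parameterised dataset.
Items property of the ForEach:
@activity('Get Metadata folder').output.childItems
File name passed to the dataset parameter inside the loop:
@item().name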
Method 2: Validate using code
An alternative solution I discovered while researching this is to store the controlled schema as text in the pipeline itself, using the ADF json and createArray functions to build the correct structure. The example code only shows two columns; this can be extended as required.
This is the code used in the If Condition.
@equals(activity('Get Metadata').output.structure,
createArray(
json('{"name": "Column1", "type": "String"}'),
json('{"name": "Column N", "type": "String"}')
)
)
Method 3: Validate using metadata stored in a database
My variation on Method 2 is to use a database to store the controlled file structure. We load the known schema text from the database using the ADF Lookup activity and compare that to the new file. The string stored in the database is a JSON array.
This is the string stored in the database.
[{"name":"Column1","type":"String"},{"name":"Column2","type":"String"},{"name":"Column3","type":"String"},{"name":"Column4","type":"String"},{"name":"Column5","type":"String"}]
This is what my table looks like.
ID | object_name | schema_text
1 | Test.csv | [{"name":"Column1","type":"String"}, …
I wrote a stored procedure to retrieve the value. The stored procedure takes one parameter, object_name, which identifies the schema we need to test. The source could also be a table or a view if that is more convenient for your situation. The returned value is a string.
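As a rough sketch, the Lookup activity that calls the stored procedure could look like the JSON fragment below; the stored procedure, dataset and file names are placeholders for illustration, and firstRowOnly stays true because the If Condition reads output.firstRow.
{
    "name": "Lookup1",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderStoredProcedureName": "[dbo].[usp_GetSchemaText]",
            "storedProcedureParameters": {
                "object_name": { "value": "Test.csv", "type": "String" }
            }
        },
        "dataset": { "referenceName": "SchemaControl", "type": "DatasetReference" },
        "firstRowOnly": true
    }
}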
This is the code used in the If Condition.
@equals(activity('Get Metadata1').output.structure, json(activity('Lookup1').output.firstRow.schema_text))
Note in the code that schema_text is the name of the table column.
Summing Up
My preference is to make it easy to change the controlled schema, either by changing the control file (Method 1) or by updating the corresponding record in the database (Method 3). This avoids changing the pipeline code, which would otherwise trigger testing and re-deployment under the business's change-management processes.
File Types
In preparation for this blog I tested two file types, both of which work well with the Get Metadata activity.
- Parquet
- Delimited text (CSV): without a header, the default ADF column names (Prop_0, Prop_1, etc.) are used, as in the example after this list. When the file has a header and the dataset has the First row as header attribute ticked, the column names are extracted correctly from the file.
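For example, for a headerless delimited file the structure comes back with the default names, so the controlled schema would need to use them too (two columns shown for illustration):
[{"name":"Prop_0","type":"String"},{"name":"Prop_1","type":"String"}]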
Other information
The ADF Get Metadata activity can also be used to inspect the structure of relational database tables.
For more information regarding the supported databases and compatible file storage systems, see the page ADF Get Metadata Documentation.