by Garry Farrell, Managing Consultant – Altis Sydney
Do you have files that you would like to combine and develop into reports?
I once worked with a client who was transforming and combining files, by hand, of different formats and structures. Her mission was to merge 7 files from 7 different financial institutions.
So, it wasn’t a simple case of specifying a structure and requesting those changes from a bunch of heavy-weight financial corporations. Really, I think my client was happy that she could get the data at all. She was taking half a day each month copying and pasting between files and re-organising columns and column names and then cleaning the data ready to load to the corporate CRM system.
I’m all for automating things for my clients, however, they didn’t want an expensive solution that would require my expertise to carry out any maintenance each time the financial institutions decided to change the file format or the content of the file.
So, the solution needed to be simple enough for my client to manage any changes that occurred, and so I want to share this utilised a little-known feature of Power BI Desktop.
I was already engaged by the client to provide reports, so I offered to help my client by showing her how to automate blending these files into one file. She would save more than 3.5 hours a month, but better still a mind-numbing copy-paste task would be gone thus allowing a highly-skilled financial professional get on with some higher value analysis work.
My solution was to set up a Power BI Desktop file that would use a folder on the computer as a source to import and process each file type separately and then combine them into one dataset. The resulting data set could then be exported ready for the CRM to import. Power BI Desktop cannot write data to the CRM since it is a reporting program that can only read data for reporting purposes.
Each month there would be a few quick steps to get things refreshed – add the new files to the input folder, open Power BI Desktop, press refresh and then export the data to a CSV file.
Power BI Desktop is a great tool for creating your own reports but in this blog, I wanted to highlight something else you can do which has been made easy for general business users. You can use this method to combine files or operate on a single file if you need to transform the data.
Free Download Power BI Desktop
The Power BI competitors do not have this functionality and Power BI has been maturing this technology since mid-2015. It’s simple and Excel-like, which will give you a head start.
I have a downloadable Power BI Desktop file available showing the step by step method including sample data.
I created the following 4 files which contain similar data and have different files types. The method I have used will work if you have many files to process that have the same structure and file type. You may have files from last month that you also want to be processed together with the current month files. So, as you add more files to the source folder, the more data you will import.
For this scenario, imagine these are files you have received from colleagues in your company. The sample data files are from Alaska, California and Florida.
Ok, let’s get to work. I will work through the CA.XLSX file and the steps to process the file. To process the other files, you can use a similar approach or look at my sample Power BI file for the actual steps.
To create the files I started with a sample dataset, I copied the data into different formats and then changed some of the data so we would have to fix some and manipulate other data until the data from each file has the same columns and data.
I have set up this example so new files can be added to the input folder and all files will be processed. I created a copy of AK.csv and renamed it AKMay.csv to simulate what would happen when we received a new monthly file.
I have used many different features of Power Query which you will learn as you follow the steps. The other nice thing is, there is no coding at all.
Download my file with the sample data here
Follow the steps below.
Open Power BI Desktop and click on ‘Get Data’ and then choose ‘All’ and then ‘Folder’.
This will be the folder where you have the source files. (Download my sample files if you would like to work through my solution) |
|
Browse the source folder. |
|
Now press the Transform Data button. This will take you into the Power Query Editor page where we will define each processing step. The screen shows all of the files in the folder and the file extensions. There is no need to select any of the files. |
|
Now filter based on the Extension to get only the files that are the same structure and type. Here I will be filtering on xlsx. (I am processing the CA.XLSX file) |
If you had multiple files of this type, then you could process them together. E.g. AK.csv and AKMay.csv |
As you complete each transformation you will see the “Applied Steps” build up on the right-hand side of the screen. | At any time, you can click through the steps and modify or change them. Power BI will show the results at each step so you can see the outcome of each step. Click on the cog icon to modify the step. Mouse over the step to see the delete icon.
You can also right-click on a step and rename it so you can remember what that step accomplished. |
Now click on the Extension file dropdown and filter for .xlsx only. This is easy, just like Excel!
Now click OK. Keep an eye on the applied steps as they build up. |
|
Right-click Filtered Rows and rename it to “Filter XLSX only”. |
|
We can also expand the Attributes of the file and filter the hidden files. Click on the double arrow icon on the Attributes column and select hidden and OK. |
|
Filter the hidden files by selecting ‘FALSE’ to show the files that are not hidden. |
|
Now we can load the data. Click on the double down arrow on the Content column. The Combine Files window will appear. Click on CA which is the name of the Tab in the Excel file. Click OK. |
|
Rename the Name column to “Source.Name”. Right-click on the orange header. We will keep filename in the dataset so we can filter on the individual files when all files have been processed. |
|
Click on the Name column and then Control+click the last column ”Transform File from PBI”. Right-Click one of the headers and choose ‘Remove Other Columns’.
|
Now we have two columns, the filename and the data. |
Now we have two columns, the filename and the data. Click on the double arrow to expand the table. |
This will expand the table and automatically change the data types. |
Rename the column “Country/Region” to “Country”. This is done to match the column name to the other files. |
|
Fixing a data quality issue.
We need to change “Forster” to “Foster” using replace values. Right click on the Town column and choose Replace Values. Type in the values and press OK. |
|
Now an example to split by delimiter. One postcode has dashes in it so we need to remove them.
(An alternative approach we could have used the ‘Replace Values’ function.) |
|
Now the postcode has been split into the 5 columns.
Note that the postcodes without dashes are still in Postcode.1 and the other columns are null. Notice the dashes have been removed by the split? |
|
Now we will merge the columns. Select all the 5 postcode columns and right-click and choose Merge Columns. The Merge Columns screen appears. We don’t want a separator in this case, and we will rename the new Merged column afterwards. |
|
Now we can remove the redundant postcode 1-5 columns. Select the columns shown, right-click and select remove columns. |
|
Drag the Merged column from the far-right hand end back to the right of the Town column where postcode was previously. Rename to PostCode. |
|
Locate the Total Revenue column and change the data type to Decimal Number. This will allow the revenue to be treated as a measure. |
|
Notice the colours under the Total Revenue? Mouse over this and it shows there are errors when converting from text to decimal. |
|
Right-click and select Remove Errors. Note that the entire row will be removed. An alternative would be to use Replace Errors with a value of zero. |
|
Now we will work on the Contact Phone column. There is a data quality issue. ‘+’ and ‘()’ characters should be removed to make the data consistent. Change the type to text so we can do text manipulations. |
|
Remove the ‘+1’, ‘(‘, ‘)’ characters and the spaces between the numbers to be left with a single string of numbers.
|
|
Now replace ‘GSC’ with ‘General Services Corporation’. |
|
Rename the PostCode to ‘Postal Code’ | |
We have two columns with the same information, Town/City. Remove the Town column. |
|
Power BI has automatically set the Customer column to a number. However, we need this as text so it will not be summed up. Change the Customer type to Text. |
|
Remove the Quarter column. Power BI can generate the quarters from the date. |
|
‘Shift + click’ each of the columns headers in order to create a date field. Day, Month, Year. Then right-click and merge the columns. Select a custom separator of ‘-‘ and set the new column name to Date. |
|
Change the Date type to date. | |
Now remove the ‘Mr’ and the ‘Mr.’ using the Replace Values. |
|
Now we will correct the names. Note that some are Surname, First Name. |
|
Right-click the Contact name and split the column. Use the comma to split on each occurrence of the delimiter. |
|
Now select Contact name.2 and shift and select Contact name.1 Right click and merge the columns to create Contact name. |
|
Replace ‘B.’ with ‘Brian’ in the Contact name column. | |
Replace the nulls in the Contact Phone with null. Right-click the column header and choose to Replace Values. Type in ‘null’ in the find and zero in the replace with box. |
|
Now the 3 files have been processed we can append them into one dataset. On the Home menu drop down the list and choose Append Queries as New. |
|
Select the CA query on the left-hand side of the screen. |
|
Select Three or more tables. Click on AK and add, then click on FL and add and press OK. |
|
Now you will see the final dataset that includes the data from all 3 files. |
|
On the Home menu click Close & Apply. This will close the Power Query editor and load the data. |
|
The Combined dataset is now available for reporting. The other datasets (AK, CA, FL) can be hidden as they are not required. To hide, right-click and select Hide. |
|