Data profiling views in Power query were first introduced in Power BI and has since been rolled out to the Power query editor in Excel 365. It offers a quick way of exploring data to identify bad data. Data profiling comes in useful when you are working with large sets of data and you want to quickly visualize the composition of that data. And even add quick fixes and filters before you load the data for further use!!
When you load data into Power Query, you will note at the top of the column a turquoise bar. By hovering over this bar, you will see a summary of the quality of the data contained in that column. You can see how much of the data is valid and how many errors there are, if any. In addition to this you will also see details of empty cells contained within the column.
By selecting the 3 dots at the bottom of the profile information you can quickly make changes to your data, like removing errors or duplicates.
If the column contains errors, this turquoise line will show broken with the error % in red.
You can get further Column quality details from the View tab on the ribbons and selecting Column quality. This will give you an expanded view of the column quality allowing you to quickly spot potential problems in the column.
When you are working with a large set of data, the benefit of this is that it allows you quickly find errors, empty or duplicates. These would be difficult to spot by scroll though the data. You can make the fixes to your data before you load it into your model and encounter problems.
In Power query, the load preview by default is 1000 row. By default, the column quality also only looks at the first 1000 row. You can verify this by the status bar at the bottom of the Power query window.
To change the profiling so it analyses the entire column of data, select the profiling status in the status bar. Then select Column profiling based on the entire data set.
Column Distribution can be found on the View tab of the ribbons.
It is used to add an additional layer of information to the column quality. Column Distribution shows you the number of distinct and unique values in a column. You are also given a visualization showing the distribution of these values.
This Name column contains 5 distinct values, each value is represented by a bar in the chart. From this chart we can see 4 of the distinct values have approximately the same frequency whereas the final value appears less often.
If a column contains an error, the column distribution will not show. You must first correct any errors before the profiling will show up.
Column Distribution use cases
Although this provides a nice user experience, unless you know what data you are expecting and the expected distribution, it is hard to see what actions can be taken from this feature. However, if you are expecting to see 5 unique values and you only have 4, then you know you are missing some data.
Suppose each month the area sales managers add a file to a shared folder. Its your role to consolidate and report on this data and you are expecting 100 files each month. As long as each file has a unique identifier when merged in Power query, you can quickly check to ensure you have the correct number of unique values. Without the need of visiting the folder.
Another use case would be when merging tables and you want to ensure you do not end up with duplicates in your final data set. Column distribution and profiling can really help keep track of this at a quick glance and you can even remove duplicates from within the profiling.
The Column Profile returns the most detailed information on data within a column. It can be found on the View tab of the ribbons.
Once selected, a new pane will open giving actionable details on a selected column.
On the left are column statistics. Values such as Errors, Empty Distinct and Unique we were able to get from the column distribution so there is not much exciting here. Except you can now also see the Min and the Max values in a column. For me, I have found this useful when I am working with a Date column and I quickly want to establish the earliest date.
On the right we have an interactive chart for Value distribution. The default is to show by value, however you can change how the values of the distribution chart are grouped.
Depending on the nature of the column, the grouping options will differ. The distribution chart above for dates looks considerably different when grouped by year.
In addition to just viewing different grouping distribution, by hovering over a column of the chart you can add a full filter to your data set.
By selecting Equals, all data that does not equal the value of that column will be filtered from your data. By selecting Does Not Equal, all data that does not equal the values of the column will be filtered from your data set.
These filters are added as an applied step in the transformations pane and can be removed by deleting the step.
Unfortunately, there is a limitation with the Value Distribution. You can not select more than one column at a time from the chart which means you can are limited on what you are filtering. However good use of the grouping can overcome some of this.
Conclusion – Data profiling views in Power Query
Data profiling views are for sure a great user experience. The ability to quickly identify columns with errors and blanks and fix these before you import the data to a model for further analysis is a great benefit and increases efficiency when working with larger data sets.
For a bird’s eye view of the data, the distribution information is useful, but its not very actionable unless you know what results you are expecting. If you do know what results you are expecting, then the distribution comes in very handy.
The idea of Data profiling is to ensure data loaded to a model is as clean as possible. It helps you identify and resolved the issues from bad data before they get into your model and become a problem. These new data profiling views in Power Query do their bit to make sure this process is as efficient as possible.
Become a Power Pivot Hero
GET and TRANSFORM DATA like a PRO
Power Query Excel 365
Learn DAX for Power Pivot and Power BI
Best Value Excel and Excel Power Tool Learning. Access All Areas, Unlimited Learning Subscription
SIGN UP FOR OUR NEWSLETTER TODAY – GET EXCEL TIPS TRICKS AND LEARN AND EARN ACTIVITIES TO YOUR INBOX
So tell me, what do you think of the data profiling views in Power Query? Have you tried them yet? Maybe you have some feedback on the article you would like to share?- please do drop a comment below.
Do you want to start collecting rewards quickly for learning Excel? Then you should try: