Why you need Data Transformation and/or Aggregation
You might want to transform your data for a number of reasons. Raw data from source systems are not always meant for Analytics reporting. Generally, businesses want to transform data to make it compatible with other data, move it to another system, join it with other data, or aggregate information in the data.
For example, consider the following scenario from ‘Xero’: The below is a screenshot of the invoices endpoint which has the following field – TrackingCategory ID, FromDate, ToDate, Periods and other attributes. However, the name of the Tracking category is available in another endpoint “TrackingCategory” . Given this scenario, it is best to create an enriched table after joining the “TrackingCategory” endpoint with the “Profit & Loss” Endpoint so that the Name of the tracking category is available in a single Table instead of several tables.
The above is a very simple scenario of where one may need such joins. In a practical situation, there could be several tables that need to be joined to get the required enriched column. Hence it becomes necessary to create a separate Datamart for reporting and Analytics purposes instead of working on the raw table coming out from the data sources.
Let us consider another situation: Let’s say there has been a merger between 2 organizations, and you need to combine information for the Human Resources departments. The purchased company uses a different database than the parent company, so you’ll need to do some work to ensure that these records match. Each of the new employees has been issued an employee ID, so this can serve as a key. But you’ll need to change the formatting for the dates, you’ll need to remove any duplicate rows, and you’ll have to ensure that there are no null values for the Employee ID field so that all employees are accounted for. All these critical functions are performed in a staging area before you load the data to the final target.
The tool used for Data Transformation
We can perform Data aggregation and Transformation in both Azure Data Factory and Azure SQL Database depending on the requirement. If there is any need to do aggregation directly at a source level, it is best performed in Azure Data factory. However, in case of any enrichment of data attribute needed, SQL Database can be helpful in creating separate views / stored procedures. Furthermore, Power BI (the visualization tool) also provides some flexibility to do data transformation at a reporting level using expression languages - M language query and Data Analysis Expression (DAX) which are mentioned below.
M and DAX are expression languages and are two building blocks of BI in the Microsoft Technology stack particularly Power BI, SSAS. They both are functional languages but they differ considerably and are used for very different purposes. These two languages are typically used at a reporting level before visualization so that the data is in a shape and form which can be visualized.
(a) M Language: Microsoft Power Query (or “M Language”) provides a powerful data import experience that encompasses many features. Power Query works with Analysis Services, Excel, and Power BI workbooks. A core capability of Power Query is to filter and combine, that is, to mash-up data from one or more of a rich collection of supported data sources. Any such data mashup is expressed using the Power Query M Formula Language.
In most cases, one can use the drag and drop feature of Power query to create steps for Data mashups. One can refer this link to know more on this https://docs.microsoft.com/en-us/powerquery-m/
While M language is a powerful query language to filter and combine different data sources, it is a recommended approach to do all combining and mashing-up of data in SQL Database before consuming the same in Power BI / SSAS. However, in some cases, due to complexity of reporting, sometimes, it is needed to create some additional mash-up with data at a report level and it gives additional flexibility to ta data analyst to do some quick