Data Load Guidelines
1. Abstract
While developers are focused on data model creation and data integration, they need to remember that in addition to the data model structure, they have to fill this structure with data. Entities and Cubes have to be filled with meaningful data to be seen by the users.
This document provides a standardized method for loading a customer’s dataset—master-data and transactional data, into a Board solution that is relevant, depending on the volume and quality of the dataset and the business requirements.
2. Context
The approach discussed in this document usually happens at the beginning of a Board implementation, when developers start loading data from different sources. It may seem obvious, but uploading the right amount of information is essential. Underestimating this importance can cause significant problems on the performance side, such as when Entities contain a large number of items.
3. Content
An Entity generally is a set of homogeneous items by which data can be viewed and summarized. A single occurrence of an Entity is called an "Entity Member,” "item" or simply "Member.” Entities in Board are typically used as dimensions of Cubes, but they can also be added as data blocks or in the Axes of a Layout.
To populate your Entity Members, you can use different ways:
- Data Reader
- Entity Editor
- Manual Entry
There are multiple ways to add Entities, including the Entity editor or manually entering a single Entity. However, to upload a large number of Entities at once, using the data reader—in the append mode is the recommended process for efficiency.
While it’s possible to upload the Entity Members from the ERP master data directly, it is not a best practice. The data typically contained in the registry master-data includes all the members created in the ERP but also includes information you don’t want to include, such as inactive, technical, or mistake elements. Uploading unrefined data into Board leads to a high total volume of data (primarily unnecessary) that negatively impacts the performance.
A best practice is to use the transactional data to guide the Entity Members uploaded in Board, so the only data referenced is active, used, and actual. This will reduce the Entity Members to the minimum number and guarantees the newest and correct data, without any overflow Entities or old, incorrect Members. Also, the information contained in the Cubes will be the most updated because you’ve filtered any old data.
When creating data reader protocols, it’s recommended to create different protocols for loading master data and transactional data. This is a natural organization since master data contains information that feeds into Entities and Relationships. For example, the customer’s master file contains customer names, city, state, sales representative, area manager, etc. Similarly, transactional data usually contain information that feeds into Cubes, such as sales amount, quantity, price, discount, etc.
The master files (feeding into Entities and Relationships) should be imported before importing the transaction files. It’s necessary first to populate Entities and hierarchies and then the Cubes so that aggregated versions (if present) can be correctly populated. For example, consider a “sales amount” Cube with two versions, one organized by Customer-Product-Month (the primary version) and a secondary version organized by City-Product-Month. The transactions file contains the following fields: date, customer code, product code, and sales amount. When reading this file, the secondary version can only be fed if the Customer-to-City relationship is already populated; it’s, therefore, necessary that the Customer master file that contains the Customer-to-City relationship is read first.
It’s recommended that you only import relevant Entity Members. Let’s consider, for example, a Board database where you need to load three years of historical data. Only customers and products with transactions during that date range are relevant. There’s no reason to include customers and products still in the master files, yet they don’t have any transactions during those three years.
Keeping all the above considerations in mind, here are the big steps you’ll take to import only the relevant data:
1. Pre-reading: create data reader protocols to import all the Entities from the transaction file or tables. This protocol should include all Entity codes fields from the transaction file, and the append option should be enabled for all fields. These types of protocols are called ”pre-reading” protocols. Run the pre-reading protocols first to append the codes of Entities in the transaction files—usually, these Entities are either standalone or at the base of a hierarchy. The time dimension in the transaction tables should be used as a filter to avoid loading Entity Members that have been used only in years and months outside the time range of the database.
2. Master files: import the master files to create the hierarchies. The data reader protocols to import hierarchies should have the append option enabled for all Entities except for the one at the base of the hierarchy (which has been appended in the pre-reading protocols). When these protocols are executed, records that were not present in transaction files are discarded; therefore, the hierarchy is built only on top of the elements loaded during the pre-reading. Usually, it is best practice to have one protocol for each hierarchy tree in Board so that all Relationships (direct and indirect) are automatically set.
3. Transactions: load into Cubes from the transaction files. These data readers will have no append since all relevant elements of the Entities have already been loaded.
3.1 Alternative approach – load the complete Master-data first
The approach described above is the best practice and is the most widely used method. It tries to limit, when possible, the quantity of Entity Members loaded. At the same time, it makes sure that the transaction data specified in the time range are all included.
An alternative approach is to load the entire master-data file and the hierarchy directly from the master-data table from your source system. Then you’d load the key figures matching this master data from the transaction tables.
This approach is useful when:
- Master-data tables are provided clean and contain only relevant Members—or at least there’s a file that identifies the relevant elements, for example, Active Y/N). This way, a key identifies what should and should not be loaded into Board.
- Transactions are not the primary driver of what is required to be loaded into the application, so they can contain information that does not match the master data.
In practice, this approach means there’s no pre-reading from the transaction tables of the Entities’ leaf levels, but the Entity Members are read directly from the master-data tables with append on all. Subsequently, transaction data are read as described above and discarded only if they do not match the Entity Members or the time range.
This approach has the following pros and cons:
PRO: the master data is always consistent; since we load it entirely and not only if there are transactions against a Member, we will easily maintain a clear set of master data with all parent-child relationships populated correctly
PRO: we have other data that are not used in transactions but could be readily available for future use (e.g., the budget of new products for which there aren’t actuals yet, but the product exists)
CON: the number of Members loaded into Entities could be more than necessary and reduce performance
CON: more difficult to reconcile since we are potentially not loading 100% of transactions
3.2 Further considerations
Regardless of which approach is used, it’s suggested that:
- The tables used with Board and imported through a data reader should always be consistent. A process to ensure that these tables contain consistent master data should already be implemented in the source system (ERP or Data warehouse).
- Whether it is a flag on the master-data tables or guided by the transactions from other sources, the mechanism to decide which data must be loaded into Board should be identified and shared amongst all relevant stakeholders.
- The transaction and master-data tables used with Board should contain only the fields relevant to the new application and (in particular, the data) be filtered by the time periods in the time range.
Comments
-
Thanks @Abhilash Sachidananda for the insights on this topic!
2