Dataflow common mistakes and best practices

Center of Excellence
edited May 2023 in Best Practices

1. Abstract

The dataflow is the main action for performing calculations between cubes and it is used in virtually all the processes of an application; it is therefore very important to know how to set up a dataflow in the best possible way, in order to obtain correct results with the best performance.

2. Context

In this article we will first focus on the most common best practices to keep in mind when defining dataflows; then, in the appendix, we will take a deeper dive into each option with some practical use cases.

3. Content

3.1 Extend Domain setup

The Calculation Domain allows you to explicitly define the range of Cube cells that will be processed, thus simplifying the definition of the Dataflow regardless of the structure of the Cubes and, at the same time, optimizing performance by considering only the desired portion of the Cubes.

The Dataflow calculations are performed by default only on the non-null cells of the associated Layout. The Dataflow will apply the formula to each combination of the members of Entities that are in the structure of the Target Cube: not to every potential combination, though, but only to those which have a non-zero value on at least one of the Blocks of the Layout.

Each of those combinations is called a tuple. Through the Calculation Domain settings you can extend or reduce the range of tuples processed. The extend domain option, in particular, allows developers to run calculations outside the standard domain of the dataflow.

The extend domain option should be clearly understood and activated only when necessary. Misusing it is probably the most common mistake in dataflow setup and it can lead to severe performance issues, because extending the domain implicitly asks Board to perform calculations on a much wider range of cells than necessary.

“That said, how should the extend domain option be managed?”

The key word here is “tuple”. A tuple is identified by the combination of the dimensions of the target cube only. With the default domain option, the formula is therefore executed only on the tuples of the target cube and, among those, only on the tuples which have a non-zero value on at least one of the Blocks of the Layout.

In other words, by default the calculation runs on all the tuples of the target cube which are non-zero on at least one of the source cubes, without considering the alignment of the source cube structures. This is the default behavior unless we activate the “Extend Domain” options.

This concept allows us to formulate a set of best practices, based on whether each dimension of the target cube is shared with the source cubes and whether we rely on the non-zero data of those source cubes:

  1. Set the extend option only on dimensions of the target cube which are not shared by any of the source cubes.
  2. If the dimension is shared by one or more source cubes, ask: do we need to write the calculation results only on the “non-zero” values of these cubes?
    1. YES - the dimension does NOT require the extend option
    2. NO - we also want to include the empty elements of this dimension in the calculation, therefore the extend option IS required on this dimension
  3. Repeat the above steps for all the dimensions of the target cube

The best practice and concepts just explained make it clear that, most of the time, very few dimensions need to be extended.
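
To make these rules more concrete, consider a purely illustrative example (cube names and structures are hypothetical). Suppose we run the dataflow c=a*b where:

  • Block a is a "Quantity" cube dimensioned by Month, Product and Customer;
  • Block b is a "Unit Price" cube dimensioned by Month and Product;
  • Target Block c is a "Revenue" cube dimensioned by Month, Product and Customer.

Month, Product and Customer are each shared with at least one source cube and we only want results where a quantity exists, so none of them requires the extend option. If the target cube also contained a dimension that no source cube has (for example a Version entity), that dimension would be the only candidate for the extend option, following rule 1 above.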

The need to extend dimensions increases mainly because of two factors, which should be carefully evaluated:

  • Usage of constants in calculations to generate new tuples:
    • it is technically true that if we want to use a constant to generate new tuples (a common example is a dataflow such as a=1 or b=a+1) we need to extend all dimensions to be able to write into the target cube, since there are no source cubes. However, if the cube we are populating with a constant contains many dimensions, or large ones, we should always check whether a cube already exists in the system that can generate the right tuples for us instead of writing a constant; in this way we can avoid setting up the extend options (e.g. b=a/a, where block a is the quantity cube and b is the cube that should contain all the used product/customer combinations);
    • if the constant is instead used to adjust an existing value rather than generate a new tuple (for example b=a/12), then the extend option is actually not necessary;
  • Cubes involved in the calculation with very different structures: this is one of the most common causes of misunderstanding and incorrect setup in dataflows. If we structure our calculations around cubes with very non-homogeneous dimensions, we greatly increase the complexity of the calculation domain and are likely forced to extend more dimensions than needed. We explore this case in more detail in the following paragraph.

Additional note: be aware that there is also a defined limit to the domain extension. Excessive use of the extend domain option increases the likelihood of hitting this limit; when the limit is reached, the dataflow will fail and return the error “Extend limits exceeded (2147483647 tuples)”.
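
A quick, purely hypothetical order-of-magnitude check shows how easily this can happen: extending all dimensions of a cube with 10,000 Customers, 5,000 Products and 48 Months would generate 10,000 × 5,000 × 48 = 2,400,000,000 potential tuples, which already exceeds the 2,147,483,647 limit before any further dimension is added.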

3.2 Cube structuring

Cube structuring is also very important to improve the performance and stability of dataflow calculations.

Cube structures are used by Board to decide the best calculation algorithm to use:

  • HBMP: the fastest algorithm to copy data from one cube into another; cubes with similar structures will therefore lead to more dataflows of this type;
  • Join: the fastest algorithm to add a dimension, i.e. to enrich a dataset with an extra level of detail; it performs very well especially with many members in the selection; Board automatically uses this algorithm when the dataflow enriches the source cube with one more dimension through a driver/mapping cube;
  • Tuples: the fastest for general purposes; this is the algorithm applied when the conditions of the previous ones are not met; depending on how the cubes are structured, and under certain conditions, it can be faster than the previous two;

The algorithm used by a specific dataflow is recorded in the database log. Be aware that the three categories above are a simplification: in reality there are many more sub-categories of algorithms, but they are not relevant for the purposes of this article.
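
As a purely indicative illustration of how the structures drive the choice (the exact algorithm picked depends on several internal conditions, and the cube names below are hypothetical):

  • b=a, where source a and target b share the same Month/Product/Customer structure, is a plain copy and will typically fall into the HBMP case;
  • c=a*b, where a is dimensioned by Month/Product, b is a Product/Customer driver cube and the target c is dimensioned by Month/Product/Customer, enriches the source with one extra dimension through a mapping and is a typical candidate for the Join algorithm;
  • a calculation mixing several cubes with heterogeneous structures will generally fall back to the Tuples algorithm.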

Knowing the definitions above, a developer can decide on the setup that yields the best dataflow performance.

The best practice is to structure a dataflow in such a way that it uses the algorithm best fitted for the purpose of that specific calculation. This means that, if our purpose is to copy, enrich or map data, we should aim for the HBMP/Join algorithms, which will give better performance in most cases.

Of course, this best practice should be adopted only when necessary, to avoid over-engineering otherwise simple calculations.

This means that we should apply it when we are facing:

  • a complex dataflow
  • many source cubes involved
  • poor performance (usually as a result of many extend domain options being flagged, see the previous paragraph)

In those cases, the suggestion is to break down the calculation into several simple steps instead of a single, complex one. This reduces the need for the extend domain option and the amount of structure alignment that Board has to perform before executing the calculations. Moreover, we will have a few simple dataflows falling into the HBMP/Join category rather than one single complex dataflow using a less performant algorithm.

To do this, we can also leverage procedure “Temporary cubes”, which allow us to store partial results across the new dataflow steps.
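
A sketch of this approach, with hypothetical cube names: instead of a single complex step such as d=a*b*c, where the three source cubes have very different structures, we could:

  1. create a temporary cube T with the same structure as the target cube of block d;
  2. run a first simple dataflow T=a*b (ideally falling into the HBMP/Join case);
  3. run a second simple dataflow d=T*c.

Each step now aligns fewer structures and typically requires fewer (or no) extended dimensions, and the temporary cube is discarded automatically at the end of the procedure.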

3.3 Selection definition

Another quite common mistake in dataflow definition is not restricting the selection enough. This means that dataflows are executed on a selection range wider than necessary even though, through that screen/dataview/part of the application, only a slice of the dataset has been changed and should be recalculated.

It is indeed best practice to always define the selection range consistently and avoid recalculating unnecessary portions of the cubes. This has the effect of improving:

  • Performance - this is true in general but much more evident if we are placing a selection on the “dense dimensions” of the involved cubes;
  • Reliability - if the entire dataset is always recalculated, we unnecessarily expose our application’s data to potential changes or unavailability;
  • Concurrency - with many concurrent users recalculating the entire dataset, it is highly likely to incur exceptions, missing data, and unexpected behavior of the calculations;

A typical example is to make sure that the calculation process is executed within the Security perimeter of the user and, if more users share the same perimeter, have mechanisms in place to enable proper collaboration.

The opposite approach would be to set up the calculations in such a way that the full dataset is always recalculated; however, this is the wrong approach for the reasons explained above.
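
A simple hypothetical example: if a planner changes the Budget of a single Cost Center for the current Year on a data entry screen, the recalculation procedure should first apply a selection such as Scenario = Budget, Year = current year and Cost Center = the user's cost center (for instance through a "select based on" step or the user's security profile), and only then execute the dataflows. The calculation then touches only the slice that was actually changed instead of the entire cubes.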

3.4 “Limit Calculation to Tuples of the cube”

Another option which is often overlooked is “Limit Calculation to Tuples of the cube”. As its name suggests, this option allows you to limit calculations to only the tuples contained in a specific cube.

This option was introduced to allow developers to restrict calculations, within a specific selection, to specific combinations of data. The selection is a great method to restrict calculations; however, if you place a selection on multiple entities, Board will work on all the possible combinations of the selected members. With this option you can further restrict the domain of the dataflow at cell level, which is otherwise not possible with a selection.

The best practice is to use this feature to facilitate cell-specific calculations and therefore limit the recalculation of data to only the relevant portion of the dataset.

Another very relevant use case is when the objective is to perform “delta calculations” or, in general, to calculate only a specific set of tuples, leaving the other pre-existing tuples unchanged.

This last type of use case requires additional configuration beyond the “limit” option because, by default, this option, even if activated, does not affect the implicit clear of the target cube that the dataflow executes at the start of the step. The macro steps of a dataflow action are, in order: the implicit clear of the target cube within the current selection, the calculation of the domain, and the writing of the results.

Given these macro steps, by default the “Limit to” option will achieve the following results:

  • Clear the entire selection: this is what we would like to prevent in certain use cases;
  • Improve the performance of the domain calculation, because the domain is derived directly from the “Limit to” cube;
  • Improve write performance and flexibility, because the dataflow step will write only on a set of tuples;

To address the use case of “writing only on a set of tuples, without affecting other tuples already existing in the target cube”, we would need to define the dataflow as in the following example:

  • the IF statement prevents the clearing of the tuples which should remain unchanged
  • the Calculation Domain option "Limit calculation to Tuples" boosts performance because the domain is inherited from the limit cube and not calculated during virtualizations
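
A minimal sketch of such a step, with hypothetical block letters and cube names (the exact formula syntax may vary depending on the Board version):

  • Target Block c: "Final Price" cube
  • Block a: "New Price" cube, containing only the tuples to be updated
  • Formula: c=if(a<>0, a, c)
  • Calculation Domain: "Limit calculation to Tuples of the cube" set to the "New Price" cube

The tuples where a is non-zero receive the new value, while the else branch writes the current value of c back, so the pre-existing tuples of the target cube are not wiped out by the implicit clear.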

With this technique you can run all your calculations in delta mode, within the "select based on" perimeter, and further boost any low-performing dataflow.

Note that, in order to calculate the domain, the cube set in the “Limit to” option is automatically extended to match the target cube. Therefore, if the two cubes are too different in terms of dimensions, the excess of extension might result in poor performance, so this should be carefully evaluated.

One final note: from version Summer 23, the “Limit to” cube can also be a temporary procedure cube. This allows much more flexibility, as the delta/subset of tuples can be stored in a temporary cube which is automatically cleared at the end of the procedure.

3.5 Dataflows in high concurrency environments

Finally, another important discussion point related to common dataflow mistakes and best practices is how to set up dataflows in high concurrency environments.

This topic is closely linked to the previous paragraph about “Selection definition”, which is the main driver to avoid or limit issues due to users working concurrently. However, to further improve the concurrency of our dataflow processes, we should adopt “Temporary cubes” as much as possible.

One common mistake is not to take this into account in applications with many users, especially when many users work within the same security perimeter. In these cases, the best practice is to execute the calculations on temporary cubes and, only at the end of the process, flow the final dataset into the physical cubes.

In this way we can:

  • avoid any kind of overlapping during calculations and simplify concurrency management, which is now confined to the final cubes storing the necessary information;
  • reduce the likelihood of dataflows where a cube is both the source and the target of several calculations, which negatively affects performance due to read and write lock mechanisms;
  • simplify the data model and the debugging process, because we have fewer physical cubes in the application and cubes are dedicated to a procedure rather than shared across the data model

Of course, if part of the calculations must be shown to end users or is used in other calculation processes, we are forced to use physical cubes and this practice is not applicable.
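
A typical sequence for a concurrent planning procedure could therefore look like this (a sketch, to be adapted to the specific application):

  1. create temporary cubes with the same structures as the final physical cubes;
  2. run all the intermediate dataflows on the temporary cubes, within the user's selection;
  3. as the last step, flow the final results from the temporary cubes into the physical cubes.

Since temporary cubes exist only for the duration of each procedure execution, concurrent users can run the same calculations without overlapping; only the final write into the shared physical cubes needs to be managed.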

Comments

  • Center of Excellence

    Thanks @Andrea C. for the insights on this topic!

  • Neha Hasija

    This was really helpful.

  • Julien CARDON

    Hello

    Could you please add some technical info about the usage of block reference and its impact on dataflows?

    • If I'm doing a block reference, is there any impact on the extend entity I should select / what are the rules?
    • In the case of a basic dataflow b=a where a & b share the same entities but a has one more entity (e.g. scenario), we have 2 options:
      • Option 1: add a selection in the procedure (e.g. scenario = actual) before the dataflow
      • Option 2: do not make any selection in the procedure but add a block reference scenario = actual directly on block a of the dataflow
      • ⇒ Both options produce the expected results. However, does the second option have a negative impact on dataflow performance / is it not recommended?

    regards

    julien

  • Tommaso Riva
    edited August 2023

    Hi Julien,

    In the configuration of the dataflow step, it is clear that the "Refer to" function alters the aggregation or detail level of the cube involved.

    Moreover, remember that the "Refer to" overrides the screen selection (Select) and the axes settings. Be aware that, with "Refer to" enabled, any existing security selection applied on the Entity chosen in the "Refer to" function configuration will be ignored.

    As you might know, the execution time (performance) of a single dataflow step depends on different factors: the dataflow configuration, the structure of the cubes, their size and the selection applied to the dimensions of the Cubes.

    The dataflow step may use different types of algorithms for its calculations.

    One of the most relevant calculation algorithms is HBMP (Hybrid Bitwise Memory Patterns), which grants considerably higher performance.

    Among the several conditions to be met in order to use this algorithm, there is one in particular: do not use the "Refer to" function in the layout of the dataflow.

    In conclusion, since in your example both options produce the expected result, it is always preferable to choose option 1 with the selection step in the procedure.

    Regards,

    Tommaso