Dataflow common mistakes and best practices Pt. 2
1. Abstract
The dataflow is the main action for performing calculations between cubes. It is used in all the processes of the application, so it is very important to know how to set up a dataflow in the best possible way, in order to obtain correct results with the best performance.
2. Context
In this article, we will first focus on the most common best practices to keep in mind when defining dataflows; then, in the appendix, we will take a deep dive into each option with some practical use cases.
3. Content
3.1 Replacement Mode
When trying to grasp how dataflow execution works, it's crucial to understand that dataflows typically operate in replacement mode. This means that, before the dataflow calculations are executed, the target cube (based on the specified selection) is cleared, unless the target Cube is also referenced in the formula (i.e. it is a "factor Cube").
As a result, any prior action to clear the cube is generally redundant, except in a few specific cases that can be considered exceptions.
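To make this behavior concrete, here is a minimal Python sketch of the semantics (Board dataflows are configured in the UI, so this is not Board syntax; cubes are modeled as dicts mapping entity tuples to values, and all names and figures are invented):

```python
# Minimal sketch of replacement-mode semantics: the selected region of
# the target cube is cleared before the new values are written.

def dataflow_replace(target, source, selection):
    """Simulate the dataflow b = a in replacement mode."""
    # 1. Clear the target over the current selection (implicit in Board).
    for key in [k for k in target if k in selection]:
        del target[key]
    # 2. Write the non-zero source tuples.
    for key, value in source.items():
        if key in selection and value != 0:
            target[key] = value
    return target

a = {("P1", "Jan"): 10.0, ("P2", "Jan"): 5.0}
b = {("P1", "Jan"): 99.0, ("P3", "Jan"): 7.0}   # stale values in the target
selection = {("P1", "Jan"), ("P2", "Jan"), ("P3", "Jan")}

print(dataflow_replace(b, a, selection))
# {('P1', 'Jan'): 10.0, ('P2', 'Jan'): 5.0} -- ('P3', 'Jan') was cleared
# implicitly, so a preliminary "clear cube" step would have been redundant.
```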
3.2 Calculation Types
There are three basic transformation types within dataflows:
- Aggregation, i.e. removing entities (-) and/or writing values to an aggregated level (^).
- Adding new entities (+, +n)
- Lift Down, i.e. writing from an aggregated level to a detailed level, e.g. from Year to Month (v)
In addition, there are also some other operations:
- Apply rule, i.e. calculated values will be written to the cube.
- Aggregation or Complex Time Functions, e.g. YTD, YMT, Last Value
- Simple Time Functions, e.g. Previous Year, Previous Period
Tip: Do NOT combine these calculation types. Each transformation should be done in a separate dataflow, especially with bigger data models. Lift Downs need to be done entity by entity, whereas e.g. for an aggregation dataflow all entities can be handled at once. Following this rule usually gives good performance.
In general, the suggestion is to break down calculations into many simple steps instead of a single complex one. This reduces the need for the extend domain option and the amount of structure alignment that Board must perform before executing the calculations. Moreover, we end up with a few simple dataflows falling into the JOIN category rather than one single complex dataflow using a less performant algorithm.
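As an illustration, the three basic transformation types can be pictured as follows in a minimal Python sketch (again not Board syntax: cubes are dicts keyed by entity tuples, and the entity names, the even lift-down split, and all figures are invented):

```python
# The three basic transformation types, each as a separate simple step.
from collections import defaultdict

sales = {("P1", "2024-01"): 10.0, ("P1", "2024-02"): 5.0,
         ("P2", "2024-01"): 3.0}
month_to_year = {"2024-01": "2024", "2024-02": "2024"}

# Aggregation (^ / -): write Month-level data to Year level.
by_year = defaultdict(float)
for (prod, month), v in sales.items():
    by_year[(prod, month_to_year[month])] += v

# Adding a new entity (+n): replicate Year totals onto a Scenario axis.
with_scenario = {(p, y, "Budget"): v for (p, y), v in by_year.items()}

# Lift Down (v): push Year values back down to each Month (here: evenly).
months_in_year = defaultdict(list)
for m, y in month_to_year.items():
    months_in_year[y].append(m)
lifted = {(p, m, s): v / len(months_in_year[y])
          for (p, y, s), v in with_scenario.items()
          for m in months_in_year[y]}

print(dict(by_year), with_scenario, lifted, sep="\n")
```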
3.3 Dataflow Modes
To identify dataflows which might perform better, see the HBMP log: identify long-running dataflows and look at their dataflow modes. The following modes exist:
- Straight (one source cube)
- E.g. a 1:1 dataflow (target cube and source cube contain the same entities, b=a) should run in Straight mode. Check whether the same sparsity settings are applied and, if not, change them. For a 1:1 dataflow involving virtual cubes it can help to have different sparsity settings: in this case Board runs the faster aggregation calculation instead of the Tuples calculation.
- +n dataflows (adding entities on the target cube) also use the Straight mode when the Extend is NOT set. An extend domain configured on any dimension prevents Board from using the Straight mode and kills the faster mode.
- Using Block References on the source cube is possible. (Keep in mind that a fixed selection on an entity should only be used if the order of the elements will never change. Also, a block reference is not visible in the overview, so comments on the procedure action must be used.)
- JOIN (special calculation with two source cubes)
- The join only works when the requirements regarding the source cubes are met. E.g. if one source cube has an additional dimension which is not relevant for the calculation, remove the dimension in a previous step.
- An extend domain configured on any dimension prevents Board from using the JOIN mode and kills the faster mode.
- Tuples/Intersect (e.g. multiplications (a*b) or when the extend is set).
- Tuples/Union (e.g. additions): Union is faster when source and target cube share the same sparsity.
- HBMP(+): see the manual for further explanations. In some cases this mode consumes a lot of RAM, and it cannot be used with virtual target cubes. Usually, Tuples/Intersect is faster; therefore, most of the time, an extend on an entity should be set to prevent HBMP from running.
- UnableToPerform: while analyzing Database logs, Dataflow operations sometimes generate “UnableToPerform” log traces. Although this message can be slightly misleading, it is not an execution error: it only indicates that the Dataflow operation had no populated cells to evaluate, or that the target of the operation is null (0 records/tuples written by the Dataflow itself). This is typically due to a missing extend, an empty source cube, an empty Select operation, or simply no data to write. If the dataflow outcome was expected to be different, this message is a warning to reconsider the dataflow's target extension.
- Constant: only the target cube exists in the layout.
Recommendation: if possible, try to get the procedures to run in Straight or Join mode (which is, of course, not always possible). The second choice is Tuples/Intersect, which is usually faster than Tuples/Union.
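The difference between the two Tuples modes comes down to the set of tuples that must be evaluated. A minimal Python sketch (not Board syntax; cube contents invented) of the idea:

```python
# Tuples/Intersect (e.g. c = a * b) vs. Tuples/Union (e.g. c = a + b).
a = {("P1", "C1"): 2.0, ("P1", "C2"): 3.0}
b = {("P1", "C1"): 4.0, ("P2", "C1"): 5.0}

# A product is non-zero only where both factors are non-zero, so only
# the intersection of the tuple sets needs to be evaluated.
intersect = {k: a[k] * b[k] for k in a.keys() & b.keys()}

# A sum can be non-zero wherever either addend is non-zero, so the
# (larger) union of the tuple sets must be evaluated.
union = {k: a.get(k, 0.0) + b.get(k, 0.0) for k in a.keys() | b.keys()}

print(intersect)  # {('P1', 'C1'): 8.0}
print(union)      # three tuples -- more work than the intersection
```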
3.3.1 Dataflow Expression
In previous Board versions, the expression had a strong impact on performance. Starting with Board 12 this behavior changed, and the impact appears to have decreased.
There is no general recommendation on whether to use if or if-like expressions, e.g. if(a<>0,a,b) vs. a+(1-a/a)*b. When only one cube is used in the if-condition, the Straight mode might also run.
The only recommendation is to use a/a instead of the constant 1 in expressions like the following, since a/a keeps the result tied to the tuples of a source cube:
Preferred: c= if(b<>0,a/a,0) instead of: c= if(b<>0,1,0).
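For reference, the if-like rewrite above relies on Board's convention that a division by zero (such as 0/0) evaluates to 0. A minimal Python sketch (not Board syntax) checking the equivalence under that assumption:

```python
# Verify that a + (1 - a/a)*b matches if(a<>0, a, b), assuming 0/0 -> 0.
def safe_div(x, y):
    return x / y if y != 0 else 0.0   # Board-style division by zero

for a, b in [(5.0, 3.0), (0.0, 3.0), (0.0, 0.0)]:
    if_version = a if a != 0 else b
    if_like = a + (1 - safe_div(a, a)) * b
    assert if_version == if_like
    print(a, b, "->", if_like)
```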
3.4 Extend Domain setup
The Calculation Domain allows you to explicitly define the range of Cube cells that will be processed, thus simplifying the definition of a Dataflow regardless of the structure of the Cubes and, at the same time, optimizing performance by considering only the desired portion of the Cubes.
The Dataflow calculations are performed by default only on the non-null cells of the associated Layout. The Dataflow will apply the formula to each combination of the members of Entities that are in the structure of the Target Cube: not to every potential combination, though, but only to those which have a non-zero value on at least one of the Blocks of the Layout.
Each of those combinations is called a tuple. Through the Calculation Domain settings, you can extend or reduce the range of tuples processed. The extend domain option in particular allows developers to run calculations outside the standard domain of the dataflow.
The extend domain option should be clearly understood and not activated when not necessary. This is probably the most common mistake in dataflow setup, and it can lead to severe performance issues if not managed properly. It affects performance because we are implicitly asking Board to perform calculations on a much wider range of cells than necessary, or activating a slower dataflow mode.
That said, how should the extend option be managed?
The key word here is “tuple”. A tuple is identified by the combination of the specific entities of the target cube only. With the default domain option, the formula is executed only on the tuples of the target cube, and among those, only the tuples which have a non-zero value on at least one of the Blocks of the Layout are processed.
In other words, by default the calculations run on all the tuples of the target cube which are non-zero on at least one of the source cubes, without considering the alignment of the source cube structures. This is the default behavior unless we activate the “Extend Domain” options.
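A minimal Python sketch of this default domain (not Board syntax; a dict stands in for a cube, structure alignment is simulated by projecting the source onto the target's entities, and all names are invented):

```python
# Default calculation domain: combinations of the *target* entities that
# are non-zero on at least one (aligned) source block.
from collections import defaultdict

# Target cube is dimensioned by (Product, Month); source block a has an
# extra Customer dimension that is aggregated away during alignment.
a = {("P1", "C1", "Jan"): 2.0, ("P1", "C2", "Jan"): 3.0,
     ("P2", "C1", "Feb"): 4.0}

aligned = defaultdict(float)
for (prod, cust, month), v in a.items():
    aligned[(prod, month)] += v          # project onto target entities

domain = {t for t, v in aligned.items() if v != 0}
print(sorted(domain))  # [('P1', 'Jan'), ('P2', 'Feb')] -- no extend needed
```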
3.4.1 Recommendation of use
This concept allows us to formulate a set of best practices, based on whether the entities of the target cube are shared with the source cubes and whether or not we rely on the non-zero data of the source cubes:
- Set the extend option only on entities of the target cube which are not shared by any of the source cubes, i.e. new entities or more detailed entities within a tree that exist only in the target cube. In certain cases (see below) Board automatically detects the necessity of creating new tuples, and then the extend should NOT be set.
- Join: Do NOT set it.
- Simple +n: Do NOT set it. +n means having one source cube which doesn't contain all the entities of the target cube. With multiple source cubes, the extend doesn't seem to affect performance.
- Aggregation (-, ^): Do NOT set extend for an entity when the data transformation for this entity is an aggregation.
- If an entity is shared by one or more source cubes, ask: do we need to write our calculation's results only on the “non-zero” values of these cubes?
- YES → the entity does NOT require the extend option
- NO → we also want to include the empty elements of this entity in the calculation, and therefore the extend IS required on this dimension
- Repeat the above steps for all the entities of the target cube.
The best practices and concepts just explained make it clear that, most of the time, very few entities need to be extended.
The need to extend entities increases mainly because of two factors, which should be carefully evaluated:
- Usage of constants in calculations to generate new tuples (see the sketch after this list):
- It is technically true that if we want to use a constant to generate new tuples (a common example: the dataflow a=1, or b=a+1 where 1 should always be written, independently of cube a), we need to extend all entities to be able to write into the target cube, since there are no source cubes to rely on. However, if the cube we are populating with a constant contains many dimensions or large entities, we should always evaluate whether the system already contains a cube that can help us generate the right tuples instead of writing a constant; in this way we can avoid setting up the extend options (e.g. b=a/a, where block a is the quantity cube and b is the cube that should contain all the used product/customer combinations).
- If the constant is instead used to adjust an existing value (not to generate a new tuple, for example b=a/12), then the extend is not necessary.
- Structures of the cubes involved in the calculation are very different: this is one of the most common causes of misunderstanding and incorrect setup in dataflows. If we structure our calculations around cubes with very non-homogeneous entities, we greatly increase the complexity of the calculation domain and are likely forced to extend more entities than needed.
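Here is a minimal Python sketch of the b=a/a idea mentioned above (not Board syntax; the quantity figures are invented): deriving the tuples from an existing cube instead of writing a constant avoids extending every entity of the target.

```python
# b = a/a: a "flag" cube with 1 exactly on the combinations used in a.
quantity = {("P1", "C1"): 12.0, ("P2", "C3"): 4.0}
flags = {k: v / v for k, v in quantity.items() if v != 0}
print(flags)  # {('P1', 'C1'): 1.0, ('P2', 'C3'): 1.0}

# b = a/12: the constant only rescales existing tuples, so it creates no
# new tuples and needs no extend either.
monthly = {k: v / 12 for k, v in quantity.items()}
print(monthly)
```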
Additional note: be aware that there is also a hard limit to the domain extension. Excessive use of the extend domain option increases the likelihood of hitting this limit. When the limit is reached, the dataflow fails and returns an error: “Extend limits exceeded (2147483647 tuples)”.
3.5 Sparsity
Running dataflows involving cubes sharing the same sparsity might accelerate the calculation, except in the case of calculations involving virtual cubes. In that case the sparsity is a local object dedicated to the virtual cube, not the same shared object as the standard sparsity.
Keep in mind that virtual cubes use virtual sparsities, which exist only during the runtime of a procedure. Usually, the first creation/filling of a virtual/physical sparsity consumes more time than filling one that already exists; in certain cases, it might therefore be better to work with physical cubes for calculations, but this is a very specific case.
Extremely large sparsities containing a lot of entries slow down performance. For that reason, a virtual cube for calculation might outperform physical cubes most of the time.
Attention: Board parallelizes calculations over 8 maps, which means a dataflow is split into 8 processes to accelerate the calculation. The sparse entity with the highest physical name is used to drive this splitting. Therefore, it can make sense to set entities to dense when they are in the last position of the entity list and contain only a few elements.
128-bit sparsity: try to avoid it if possible; it is slower than 64-bit sparsity.
3.6 Temporary Cubes
Virtual Cubes (as “temporary” could be misleading, the term “virtual” is used within this document) are created and used as part of the Procedure in which they are defined: they can only be used within the Procedure in which they are created and are not visible in the Cubes section of the Data model, where only physical cubes are visible. Virtual Cubes are created when the execution of the Procedure starts and are deleted when the Procedure ends. They are isolated within the procedure execution, i.e. multiple parallel executions do not interfere with each other; therefore, usually, no User entity is required within a Virtual Cube. By their nature, Virtual Cubes only occupy RAM during the execution of the Procedure and are never physically stored on disk.
Virtual cubes use their own virtual sparsity within the procedure execution.
Virtual cubes can reduce:
- the database size
- the sparsity size
- the number of physical cubes
- the need for user entities
- the risk of problems during parallel executions of a procedure
4. Special Calculations
Certain calculations run well with small datasets but are too slow with large ones. In such cases it might be necessary to apply another approach, as described in the following sub-chapters.
Of course, these best practices must be adopted only when necessary, to avoid overengineering otherwise simple calculations.
This means we should apply them when facing:
- a complex dataflow
- many source cubes involved
- poor performance (usually because many extend domain options are flagged; see the previous paragraph)
4.1 Recursive Calculations
Try to avoid recursive calculations when they consume too much time. There are cases where a Yearly Moving Total function (with a cycle setting of e.g. 9999, i.e. a total over many years) can achieve the same results without iterating month by month.
E.g. for calculating the opening and the closing stock/inventory, only the starting stock is needed: with the infinite yearly moving total of the stock changes it is possible to calculate all the opening and closing stocks without any iteration.
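A minimal Python sketch of that stock example (not Board syntax; the figures are invented): a running total of the changes replaces the month-by-month recursion.

```python
# Opening/closing stock from a starting stock plus a running (moving)
# total of stock changes -- no iteration over months required.
from itertools import accumulate

starting_stock = 100.0
changes = [5.0, -12.0, 3.0, 7.0]          # one entry per month

closing = [starting_stock + c for c in accumulate(changes)]
opening = [starting_stock] + closing[:-1]

print(opening)   # [100.0, 105.0, 93.0, 96.0]
print(closing)   # [105.0, 93.0, 96.0, 103.0]
```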
4.2 Rule Calculations
By applying the rule on the source cube, calculated values can be written to the cube. For small cubes this is fast enough. When it consumes a lot of time, it might be necessary to create a mapping calculation, if the rules allow it: e.g. if-conditions, offsets, and data pickers might make it impossible, whereas simple sums (+/-) are possible.
Mapping Calculation Prerequisites:
- Create a pseudo replicated KPI entity.
- Create a mapping cube which contains the 1:1 mapping between the KPI entity and the pseudo replicated entity – and fill it.
- Create a mapping cube which contains the Sum mapping between the pseudo replicated KPI entity and the KPI entity – and fill it.
Calculation pattern:
- Copy values from KPI to the pseudo replicated KPI by using the first mapping cube. This is a join and therefore, no extend should be set.
- Remove the KPI entity.
- Add the KPI entity by using the Sum mapping. This is a join and therefore, no extend should be set.
- Remove the pseudo replicated KPI entity.
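A minimal Python sketch of this pattern (not Board syntax; the KPI names, the "*" suffix standing in for the pseudo replicated entity, and the figures are all invented) for a rule made of simple sums, e.g. Margin = Revenue - Costs:

```python
# Mapping calculation for a sum rule via a pseudo replicated KPI entity.
from collections import defaultdict

data = {("Revenue", "P1"): 100.0, ("Costs", "P1"): 60.0}

# Sum mapping cube: (pseudo KPI, target KPI) -> sign. The 1:1 copy onto
# the pseudo entity is modeled here by the "*" suffix.
sum_mapping = {("Revenue*", "Margin"): 1.0, ("Costs*", "Margin"): -1.0,
               ("Revenue*", "Revenue"): 1.0, ("Costs*", "Costs"): 1.0}

# Join-style step: add the KPI entity through the mapping; the pseudo
# entity is removed by the aggregation itself.
result = defaultdict(float)
for (kpi, prod), v in data.items():
    pseudo = kpi + "*"                     # 1:1 pseudo replicated KPI
    for (p, target_kpi), sign in sum_mapping.items():
        if p == pseudo:
            result[(target_kpi, prod)] += v * sign

print(result[("Margin", "P1")])   # 40.0
```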
4.3 (Multiple) Lift Down
A lift-down calculation is used when a value is written from one or more aggregated entities to a detailed level. For this calculation, it is required to set the extend of the detailed entity.
Depending on the size of the dataset, this calculation could consume a lot of time. In this case, a mapping calculation is required.
Mapping Calculation Prerequisites:
- Create a pseudo replicated entity of the detailed entity when it is a multiple lift down. Otherwise create a pseudo replicated entity of the aggregated entity.
- Create a mapping cube which contains the mapping between the aggregated entity/entities and the pseudo replicated entity – and fill it.
- Create a mapping cube which contains the mapping between the pseudo replicated entity and the detailed entity – and fill it.
Calculation pattern:
- Copy values from aggregated entity/entities to the pseudo replicated entity by using the first mapping cube. This is a join and therefore, no extend should be set.
- Remove the aggregated entity/entities.
- Add the detailed entity by using the second mapping. This is a join and therefore, no extend should be set.
- Remove the pseudo replicated entity.
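A minimal Python sketch of a lift down through a mapping cube (not Board syntax; the Year-to-Month mapping and the figures are invented):

```python
# Lift down via mapping: yearly values joined onto a Year->Month mapping.
from collections import defaultdict

yearly = {("P1", "2024"): 120.0}
year_to_month = {("2024", m): 1.0                      # mapping cube
                 for m in ("2024-01", "2024-02", "2024-03")}

lifted = defaultdict(float)
for (prod, year), v in yearly.items():
    for (y, month), flag in year_to_month.items():
        if y == year:
            lifted[(prod, month)] += v * flag          # join, no extend

print(dict(lifted))
# {('P1', '2024-01'): 120.0, ('P1', '2024-02'): 120.0, ('P1', '2024-03'): 120.0}
```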
4.4 Aggregation Time Functions
On big datasets, the aggregation time functions Yearly Cumulated Value and Yearly Moving Total can perform slower than a mapping calculation. Therefore, it is recommended to use a mapping calculation on big datasets.
Mapping Calculation Prerequisites:
- Create a pseudo replicated entity of the aggregated time entity.
- Create a mapping cube which contains the 1:1 mapping between the aggregated time entity and the pseudo replicated entity – and fill it.
- Create a mapping cube which contains the mapping between the pseudo replicated time entity and the detailed time entity – and fill it.
Calculation pattern:
- Copy values from aggregated entity/entities to the pseudo replicated entity by using the first mapping cube. This is a join and therefore, no extend should be set.
- Remove the aggregated entity/entities.
- Add the detailed entity by using the second mapping. This is a join and therefore, no extend should be set.
- Remove the pseudo replicated entity.
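A minimal Python sketch of the idea behind replacing a Yearly Cumulated Value with a mapping cube (not Board syntax; months are numbered 1-12 and the figures are invented): month m contributes to every month from m onwards.

```python
# Yearly Cumulated Value (YTD) built from a cumulative mapping cube.
from collections import defaultdict

monthly = {("P1", 1): 10.0, ("P1", 2): 5.0, ("P1", 3): 7.0}
ytd_mapping = {(src, tgt): 1.0                 # src month -> all tgt >= src
               for src in range(1, 13) for tgt in range(src, 13)}

ytd = defaultdict(float)
for (prod, src), v in monthly.items():
    for (s, tgt), flag in ytd_mapping.items():
        if s == src:
            ytd[(prod, tgt)] += v * flag

print({k: v for k, v in ytd.items() if k[1] <= 3})
# {('P1', 1): 10.0, ('P1', 2): 15.0, ('P1', 3): 22.0}
```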
4.5 Join
A Join dataflow consists of two source cubes:
- One source cube has one entity less than the target cube.
- The other source cube also has fewer entities than the target cube, but contains the entity which is missing from the first source cube.
- No other entities are allowed.
- Extend MUST NOT be set.
A join is typically used in a mapping calculation. It is widely used for fast calculations and needs to be understood by the developer in order to maintain or improve the performance of the calculations.
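A minimal Python sketch of the structural requirement (not Board syntax; entity names and figures invented): for a target c(Product, Customer, Month), one source lacks only Month and the other supplies it.

```python
# Join dataflow c = a * b: each source misses entities the other provides.
a = {("P1", "C1"): 2.0}        # (Product, Customer) -- Month missing
b = {("P1", "Jan"): 10.0}      # (Product, Month) -- supplies Month

c = {(prod, cust, month): va * vb
     for (prod, cust), va in a.items()
     for (p2, month), vb in b.items() if p2 == prod}

print(c)   # {('P1', 'C1', 'Jan'): 20.0}
```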
4.6 Summation Dataflows
A summation dataflow (e.g. d=a+b-c+d) consists of two or more cubes, preferably with the same cube structure. Board uses the Tuples/Union mode for this dataflow. An alternative (for big data models) is the use of a summation entity: instead of one dataflow, the summation is divided into several dataflows which, in total, are faster than the single one.
Steps:
- Create a summation entity with as many elements as there are summation components, e.g. 2, 3 or 4 elements named 1, 2, 3, 4, ...
- Create a summation cube, which additionally contains the summation entity.
- Each summation component is written to a slice of the summation entity in the summation cube. Use selection steps beforehand, or the block reference on the dataflow itself. The dataflow mode should be Straight.
- The summation cube is written to the target cube. The dataflow mode should be Straight.
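A minimal Python sketch of these steps for a simplified d=a+b-c (not Board syntax; the figures are invented, and the sign is folded into the slice write): each component lands in its own slice, and a final aggregation over the summation entity produces the target.

```python
# Summation entity pattern: one slice per component, then one aggregation.
from collections import defaultdict

a = {("P1",): 10.0}
b = {("P1",): 4.0}
c = {("P1",): 3.0}

# Summation cube: the original entities plus the summation entity (1, 2, 3).
summation = {}
for slice_no, (cube, sign) in enumerate([(a, 1), (b, 1), (c, -1)], start=1):
    for key, v in cube.items():
        summation[key + (slice_no,)] = sign * v   # straight dataflow

# Final straight dataflow: aggregate over the summation entity.
d = defaultdict(float)
for (*key, _slice), v in summation.items():
    d[tuple(key)] += v

print(dict(d))   # {('P1',): 11.0}
```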
4.7 Execute calculation on existing sparse combination of target cube
The recommendation is not to use this mode: it was created for retro-compatibility purposes and to make database migrations to B12/B14 easier. New dataflows shouldn't rely on this option and should be built with the other dataflow options.
The mode Execute calculation on existing sparse combination of target cube reproduces the way Board 10 calculated dataflows. It required a deep understanding of this mode, and sparsity/density settings were highly relevant to the success of a dataflow. It also caused additional steps in which sparsities were changed into other sparsities.
4.8 Time Offsets
Time Offsets like Previous Period Value, Previous Year Value or the Offset setting don't slow down performance. It is nevertheless advised to separate them from other calculations.
4.9 Total by, Block Reference, Entity Block vs virtual cubes
Another common mistake, inherited from earlier versions of Board, is to rely on functions like “Total by”, “ReferTo” or “Entity Block” instead of using virtual cubes.
“Total by” is used to aggregate a source cube to a subset of its entities when running calculations, “ReferTo” is used to focus a source cube on a specific element of an entity, and “Entity Block” is used to display entity codes/descriptions or entity relationships inside a dataflow. The data exposed through these features is used for the Dataflow calculations.
In all cases we can build an ad hoc (procedure) virtual cube and store the data there before executing the final calculations. This is the suggested best practice.
This is suggested as best practice because the approach usually guarantees better performance, for the following reasons:
- Lower dataflow complexity: as explained in the previous paragraphs, we are breaking down one complex dataflow into simpler pieces, and the overall performance will improve. Technically, Board will run these less complex dataflows with much more performant calculation algorithms, so even with a few more steps the overall execution time is lower.
- Domain expansion: since these features act during the alignment of the cubes' entities, they do not generate real tuples; for this reason, they make the evaluation of the domain more complex and in general increase the need to extend more entities.
- Debugging: The results of source blocks/cubes can be better understood in debugging situations.
Regarding the Block Reference: this setting is not visible in the procedure overview, and it is a fixed entity selection. Therefore, it should be used only when really necessary, e.g. to improve performance by saving dataflows; in all other cases virtual cubes should be used instead.
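A minimal Python sketch of the staging idea (not Board syntax; names and figures invented): a “Total by”-style aggregation is first materialized in an intermediate (virtual) cube, and the final dataflow then works against that simpler structure.

```python
# Stage a "total by Product" into a virtual cube, then compute shares.
from collections import defaultdict

sales = {("P1", "C1"): 10.0, ("P1", "C2"): 6.0, ("P2", "C1"): 4.0}

# Step 1: virtual cube holding the pre-aggregated totals.
total_by_product = defaultdict(float)
for (prod, _cust), v in sales.items():
    total_by_product[(prod,)] += v

# Step 2: simple share-of-total dataflow against the staged cube,
# instead of an inline "Total by" inside one complex dataflow.
share = {k: v / total_by_product[(k[0],)] for k, v in sales.items()}

print(share)  # {('P1', 'C1'): 0.625, ('P1', 'C2'): 0.375, ('P2', 'C1'): 1.0}
```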
Comments
- Thanks @Jan Kaufmann for the insights on this topic!
- Great insights, @Jan Kaufmann, really appreciated!
- Thank you @Jan Kaufmann!
- What great insights, @Jan Kaufmann!
- Great, we definitely need more and more tech stuff like that.
- Thanks @Jan Kaufmann, an excellent article filled with valuable insights and thoughtful perspectives.
- Thanks @Jan Kaufmann - great work!
- Very interesting, thank you!
- Very helpful, well done!
- Thanks Jan. Very useful.
- Really useful! I am wondering whether the JOIN algorithm is still available; in version 10 I used it extensively to improve dataflows. Is it available in versions 12/14? Is it necessary to define JOIN on the dataflow?
- Hi, the Join mode is still available and is also described in the article. The difference from B10 is that Board automatically applies the mode when all requirements are met. You do not need to write "join(a*b)".
- Hi @g.vialetto, additionally, keep in mind that to make Board use the join, don't extend on new tuples, because that would kill the join. Best, Helmut