Sparsity

Center of Excellence
edited June 2023 in Best Practices

1. Abstract

When creating a new cube, the usual question is whether its dimensions should be dense or sparse. A cube can store a value for any combination of its dimensions' members: for example, a three-dimensional cube (e.g. month, product and customer) has a cell for every triplet. If every customer purchased every product every month, the data would be 100% dense.

In a real situation, however, only a small fraction of the cube's cells actually contains data. The ratio between the number of cells containing data and the total number of cells in the cube (obtained by multiplying the number of members of each dimension) is the cube's density.
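To make the ratio concrete, here is a minimal sketch in plain Python (not Board functionality; the figures are invented for illustration):

    months, products, customers = 12, 1_000, 50_000

    total_cells = months * products * customers   # 600,000,000 addressable cells
    populated_cells = 4_200_000                   # cells that actually hold a value

    density = populated_cells / total_cells
    print(f"Density: {density:.2%}")              # Density: 0.70%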

2. Context

Sparsity plays a central role in cube sizing and in the execution speed of dataflows and layouts, so it is particularly relevant for wide Data Models involving many dimensions and high data volumes.

3. Content

A sparse structure is a combination of entities (not hierarchically related) for which the number of distinct combinations of values is small compared to the total number of potential combinations. When a sparse structure is defined, disk space is allocated only for the sparse combinations created while loading, inputting, or calculating, so the disk space overhead is minimal.

A sparsity is conceptually very similar to an index: a pointer to data in a cube that improves the speed of data retrieval operations on a Board cube at the cost of additional writes and storage space to maintain the index data structure. It is important to evaluate the trade-off between index size and data retrieval optimization.

3.1 Sparsity Definition

When creating a cube (or a cube version), Board suggests a dense/sparse structure: it puts as many entities as possible in sparse mode while trying to keep the sparse structure pointer at 64 bits, based on the automatic or manual max item number set for each entity. In other words, Board puts all entities in sparse mode as long as the product of their max item numbers stays below the 64-bit volume limit (2^64, ~1.8*10^19). When max item numbers are set to automatic, the affected entities are treated as having a max item number greater than their current one (while keeping the product below the 64-bit limit), so limit the use of automatic max item numbers in order to optimize the sparsity sizing.
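As a rough sketch of that limit check (plain Python with invented max item numbers, not Board's internal logic):

    from math import prod

    LIMIT_64_BIT = 2**64                 # ~1.8*10^19 addressable combinations

    max_items = {"Customer": 2_000_000, "Product": 50_000, "Month": 480}

    volume = prod(max_items.values())    # 4.8*10^13 potential combinations
    print(volume < LIMIT_64_BIT)         # True: all three entities can stay sparse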

The automatic definition of the cube structure (sparse and dense dimensions) makes it easy to create an optimal cube structure, reducing the effort spent on data model definition. The automatic configuration always tries to fit the sparse structure into a 64-bit array. If 64 bits are not enough to address all the selected entities in the sparse structure, the system starts switching the smallest entities to dense. The automatic configuration never creates a 128-bit sparsity because a 64-bit one is generally better: with a 128-bit sparsity, the trade-off between index size and data retrieval optimization is no longer balanced.
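The switching behaviour described above can be sketched as follows (hypothetical Python pseudologic, not Board's actual implementation; the max item numbers are invented):

    from math import prod

    LIMIT_64_BIT = 2**64

    def suggest_structure(max_items):
        # Keep entities sparse while their combined volume fits in 64 bits;
        # otherwise switch the smallest entity to dense and retry.
        sparse, dense = dict(max_items), {}
        while sparse and prod(sparse.values()) >= LIMIT_64_BIT:
            smallest = min(sparse, key=sparse.get)
            dense[smallest] = sparse.pop(smallest)
        return sparse, dense

    sparse, dense = suggest_structure(
        {"Customer": 1_000_000_000, "Product": 100_000_000, "Month": 480, "Version": 4}
    )
    print(dense)   # {'Version': 4, 'Month': 480} -- smallest entities switched to dense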

The automatic optimal structure is defined based on the cube itself. An experienced Board developer should deepen the analysis by considering the structures of other cubes involved in dataflows or dataviews, and manually review the sparsity to align it with similar cubes that are part of the same module and/or calculations. Overall performance is better when calculations and reports are executed with cubes that share similar sparsities.

When you create a cube version, you can manually set entities as sparse/dense by clicking the "EDIT SPARSITY" button. Use the following guidelines to optimize a sparse structure:

  • Dimensions that are large and only sporadically combined with one another should be kept sparse. Define a dimension as sparse whenever it has a considerable number of members compared to the other entities in the cube. Note that it is better to define a sparse structure when it is not needed than the opposite; therefore, when in doubt, set the entities as sparse.
  • Dimensions that are small and always populated (Version, Scenario or Currency are typical examples) should be set as dense. These dimensions are usually filtered on screen, and procedures select a single element at a time to show or work on.

When the size (number of members) of an entity does not truly represent the real case (for example, you have loaded only a partial subset of the data), we suggest using a manual definition to instruct the system.

The actual size of a sparse structure is the number of distinct combinations of its entities. This number is visible from the cube's impact analysis. Keep it monitored and be ready to review the sparse structure when it reaches many millions of combinations.
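For example, counting distinct combinations in loaded data boils down to this kind of check (illustrative Python only; in Board the number is read from the impact analysis):

    loaded_rows = [
        ("CUST001", "PROD-A"), ("CUST001", "PROD-B"),
        ("CUST002", "PROD-A"), ("CUST001", "PROD-A"),   # duplicate combination
    ]

    sparse_size = len(set(loaded_rows))   # 3 distinct (Customer, Product) pairs
    print(sparse_size)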

3.2 Attention Points

Since a sparsity is a shared object, you should keep your sparsities as similar as possible and reconcile all cubes to the same sparsity structure: this optimizes dataflow/dataview execution by involving fewer sparsities and reducing the creation of new ones.

Density is a measure of how effective a sparsity is: a low value indicates that, thanks to the sparsity, the number of allocated cells is much lower than the total number of cells a non-sparse structure would require. A density close to 100% indicates that the sparsity is useless and may even degrade database performance.

The sparse structure does not have any impact on the calculation domain of dataflows: dataflows work only on the existing combinations of the calculation domain and do not consider the existing combinations in the sparsity.

The main purpose of a sparse structure is to influence performance (overall execution time); the structure definition no longer has an impact on the result of dataflow executions.

Comments

  • Samir Jones

    @Center of Excellence

    Hello,

    I don't understand the last part of the Attention Points.
    The meaning of "main purpose" and "structure definition" is not clear to me.

    Could you please explain it in other words? :-)

    Thank you

  • Leone Scaburri

    Dear @Samir Jones,

    this attention point is related to the new paradigm of dataflow execution on the domain that is currently in place, and it needs to be considered by everyone used to Board 10.

     Dataflows in Board 10.x:

    1. Detect target dimensions
    2. Perform calculations on all possible dense combinations, combined with the existing sparse combinations in the data model

     

    Dataflows in Board 12.x:

    1. Detect target dimensions
    2. Perform calculations only on the existing combinations of the cubes involved in the calculation, whether dense or sparse

    In order to write new combinations, the domain options have to be used (the sketch below contrasts the two behaviours):

    • Extend: Writes on all combinations of the entities marked as Extend
    • Limit: Uses the existing combinations of a given cube to perform the calculation
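
    A minimal sketch of the difference (plain Python for illustration, not Board syntax; the combinations are invented):

        months = ["Jan", "Feb"]                                  # dense dimension
        sparse_combos = {("C1", "P1"), ("C2", "P1")}             # existing sparse combinations
        cube_cells = {("Jan", "C1", "P1"), ("Feb", "C2", "P1")}  # cells actually holding data

        # Board 10.x: every dense member crossed with every existing sparse combination.
        domain_v10 = {(m, c, p) for m in months for (c, p) in sparse_combos}

        # Board 12.x: only combinations already present in the cubes involved.
        domain_v12 = cube_cells

        print(len(domain_v10), len(domain_v12))   # 4 vs 2 combinations processed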

  • Fethi ZERARA

    What a thought-provoking post and discussion! I'm interested in understanding the best approach for situations in which the sparsity of a cube transitions from 64 bits to 128 bits.

    This typically occurs when we increase the maximum number of items within the entities that make up the cube.

    One notable observation when the cube expands to 128 bits is a significant increase in processing time. How can we mitigate this issue effectively? Thanks in advance for your response! 😉

  • Samir Jones

    @Leone Scaburri, Thank you! :-)

  • Leone Scaburri

    Dear @Fethi ZERARA,

    we should keep in mind that switching to a 128-bit sparsity is like moving from an index to an index of indexes, as described in the article: "The automatic configuration never creates a 128-bit sparsity because a 64-bit one is generally better: with a 128-bit sparsity, the trade-off between index size and data retrieval optimization is no longer balanced."

    If expanding the max item numbers results in a 128-bit sparsity, you could consider switching a dimension to dense to reduce the sparsity size, while considering the side effect on the physical size of the cube.

    The same trade-off has been discussed in this thread:

  • Fethi ZERARA

    Dear @Leone Scaburri,

    Thank you for these insights!

    Things are clearer now!