Best Practices for Performance Tuning on Big Data Models

Document created by ggallo on Jul 26, 2017

BIG DATA MODELS in BOARD

In order to define a BOARD solution as based on a big data model, the following conditions must be true:

  • The most granular level entities contain a large number of elements, and there are several of them:
    • > 5 entities with > 50.000 elements each
    • data stored at daily level
    • sparsities over 5-10 million elements
    • ...
  • The most used infocubes are sized around 1-2 GB each (physical file size)
  • Hierarchies are very complex
    • with > 20 entities in each tree
    • with different branches and different/multiple levels

 

The following guidelines represent best practices for optimizing and correctly maintaining BOARD solutions that rely on such big data and metadata, or any BOARD solution that only partially matches the conditions above.

 

 

1. ENTITY MAX ITEM NUMBER

Be sure to define a max item number for every entity that accounts for its regular, periodic growth in elements, without overestimating it. It should match the real case as closely as possible.

This parameter is important and has an effect throughout the solution, especially when entities are used in infocubes as dense dimensions: a correct maximum item number greatly speeds up cube indexing and avoids a useless increase in cube size that might lead to poor performance (in both rendering and calculations).

 

 

Example:

Imagine having an infocube structured by:

  • Day (3-year time range): [DENSE by default] approx. 1.100 elements
  • Entity 1: [DENSE] 50.000 elements
  • Entity 2: [SPARSE] 150.000 elements
  • Entity 3: [SPARSE] 50.000 elements
  • Entity 4: [DENSE] 3.000 elements

Defining a max item number of 100.000 for Entity 4 will cause a severe loss of performance in your cube, because the cube optimization over this dense entity is based on the entity's max item number rather than on its actual number of elements.
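To see the scaling effect, consider a rough back-of-the-envelope calculation (this is an illustration only, not BOARD's exact storage formula): the space reserved along the dense dimensions grows with their max item numbers, not with the elements actually in use.

    # Rough illustration (not BOARD's exact storage formula): the space reserved
    # along DENSE dimensions scales with their max item numbers, not with the
    # number of elements actually used.

    def dense_cells(*max_items):
        """Product of the max item numbers of the dense dimensions."""
        cells = 1
        for n in max_items:
            cells *= n
        return cells

    day, entity_1 = 1_100, 50_000                    # dense dimensions from the example

    correct = dense_cells(day, entity_1, 3_000)      # Entity 4 sized to its real 3.000 elements
    inflated = dense_cells(day, entity_1, 100_000)   # Entity 4 overestimated at 100.000

    print(f"correctly sized: {correct:,} dense cells")
    print(f"overestimated:   {inflated:,} dense cells ({inflated / correct:.0f}x larger)")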

For sparse dimensions, the max item number has a lower impact on cube optimization, but they should be sized correctly as well to achieve the best design and solution performance.

You can also refer to How to change the Max item Number: suggested approach for more information about how to maintain and change the max item number.

 

Suggested approach:

  • Set the max item number based on an estimate of the entity's size at the end of the year, then schedule a yearly maintenance window to check and increase the max item number, and possibly extend the time range as well (e.g. by 1 year).
  • Set up proper maintenance screens to verify entity usage and extract-reload procedures to help you quickly change the entity max item number if the limit is reached.
  • Do not leave the maximum item number set to "0" for any entity, even for entities not used in infocubes, as you might later want to add versioning to your infocubes.

Without a max item number, representation and calculation performance will decrease.

 

 

2. ENTITY CODE WIDTH

Setting an improper code width for an entity that is used as an infocube dimension leads to unused reserved space and therefore a bigger file size. As a consequence, the time needed to retrieve data when the infocube is used (in reports or calculations) will be higher.

 

Suggested approach:

  • Set the code width to match the current coding of the entity's items.

  • Alternatively, you can keep the code width at 0 and let BOARD automatically update it based on the longest element read during data loading.

The automatic code width defined by BOARD is always rounded up to the next multiple of 5 characters (a code of 6 characters makes BOARD define a width of 10, a code of 11 characters a width of 15, and so on).
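As a quick illustration of this rounding (a simple "round up to the next multiple of 5" that matches the examples above; the exact internal behaviour may differ):

    import math

    # Illustrative sketch of the rounding described above: round the code length
    # up to the next multiple of 5, matching the two examples (6 -> 10, 11 -> 15).
    def auto_code_width(longest_code: str, step: int = 5) -> int:
        return math.ceil(len(longest_code) / step) * step

    print(auto_code_width("ART001"))       # 6 characters  -> width 10
    print(auto_code_width("ART00000001"))  # 11 characters -> width 15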

 

 

3. INFOCUBE VERSIONING

Appropriate infocube versioning provides faster and optimized access to information. In complex solutions, versioning allows you to keep "copies" of the same information at more aggregate levels than the one at which it is loaded, optimizing performance.

You can also refer to Version Sub-Optimized warning message to check the risks related to infocube versioning.

 

Suggested approach:

  • Do not create a dedicated version for every report you have: try to aggregate information at a meaningful, "common" logical level that reflects the users' most frequent navigation and selection mechanisms.

  • Execute data loading procedures in a logical order so that your cube versions are always aligned and up to date (a minimal sketch of this sequence follows below):

    • always load the new keys into the database first,

    • then populate all the hierarchies correctly, ensuring there are no missing elements,

    • only at this point load the infocubes incrementally and align them, so that the information flows to the correct members of the aggregated entities.

  • Perform periodic checks on infocube version alignment.

If any problem is found with infocube version alignment, it is most likely a consequence of a hierarchy problem or of a wrong loading and alignment sequence.
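A minimal, self-contained sketch of this sequence, using plain Python dictionaries to stand in for entities, hierarchies and cube versions (illustrative only; none of this is BOARD syntax or API):

    # 1. Load the new leaf-level members first.
    new_keys = ["SKU-10", "SKU-11"]

    # 2. Then populate the hierarchy (here a simple SKU -> Customer relationship).
    customer_of_sku = {"SKU-10": "CUST-A", "SKU-11": "CUST-B"}

    # Guard: every leaf must have a father before the cubes are loaded, otherwise
    # the aggregated version would silently miss those values.
    orphans = [sku for sku in new_keys if sku not in customer_of_sku]
    if orphans:
        raise RuntimeError(f"populate the hierarchy first, orphan elements: {orphans}")

    # 3. Only now load the leaf-level cube and refresh the aggregated "version".
    sales_by_sku = {"SKU-10": 120.0, "SKU-11": 80.0}
    sales_by_customer = {}
    for sku, value in sales_by_sku.items():
        father = customer_of_sku[sku]
        sales_by_customer[father] = sales_by_customer.get(father, 0.0) + value

    print(sales_by_customer)   # {'CUST-A': 120.0, 'CUST-B': 80.0}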

 

 

4. PAGE SIZE 

When you have large amounts of data and you want to reduce the infocube sizes to speed up queries, you can try changing the page size of your infocubes.

Changing the infocube file page size requires a "clear all cubes" to be performed.

The page size represents the size of the "file page" in which the infocube information is physically written and with which the information is organized when loaded into RAM.

Selecting a "LARGE" page size rewrites and organizes the information in large pages; in big data models this helps BOARD retrieve information more quickly and reduces the infocube file size.

This is because BOARD has to cycle and search for the desired information through a smaller number of pages, and will therefore perform better.

The paging logic works on the index generated on the Day dimension (so it does not provide any performance improvement when you do not use Day as the time dimension).
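As a rough, order-of-magnitude illustration (the days-per-page figures below are invented; the actual page layout is internal to BOARD), fewer and larger pages simply mean fewer pages to cycle through for the same day range:

    import math

    # Hypothetical figures, for illustration only: the point is that a larger
    # page holds more days, so the same 3-year range spans fewer pages.
    total_days = 3 * 365

    for label, days_per_page in [("small pages", 8), ("large pages", 64)]:
        pages = math.ceil(total_days / days_per_page)
        print(f"{label}: ~{pages} pages to scan for the full range")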

 

Suggested approach:

  • Run some tests to find out the correct page size for your applications.
  • Periodically check performance and verify whether a different page size improves it.
  • Paging depends on the cube structure, so setting a large page size on your cubes is not guaranteed to automatically improve performance.

 

 

 

5. AVOID ORPHANS / FATHERS WITHOUT CHILDREN IN THE HIERARCHIES. 

In BOARD it is a best practice to always AVOID orphan elements in your hierarchies.

For example, if you have an order (lowest-level entity) that is not connected to any customer (where customer is defined in the relationships as the father of order), you risk running into loops and data problems, and even performance losses. BOARD relationships should ALWAYS be normalized, so there should be no missing link among the elements of a hierarchy.
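A minimal sketch of such a normalization check on exported code lists (plain Python over hypothetical extracted data, not a BOARD feature):

    # Hypothetical extracted code lists, for illustration only.
    orders = ["ORD-1", "ORD-2", "ORD-3"]                        # lowest-level entity
    customer_of_order = {"ORD-1": "CUST-A", "ORD-2": "CUST-B"}  # order -> customer (father)
    customers = {"CUST-A", "CUST-B", "CUST-C"}

    # Orphans: orders with no father, or pointing to a non-existing customer.
    orphans = [o for o in orders if customer_of_order.get(o) not in customers]
    print("orphan orders:", orphans)                  # ['ORD-3']

    # Fathers without children: customers that no order rolls up to.
    childless = customers - set(customer_of_order.values())
    print("fathers without children:", childless)     # {'CUST-C'}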

For more information and details about how to avoid these risks, you can read How to repair and prevent missing relationships in hierarchies.

 

 

 

6. DATA READER TUNING

Data readers should always run on indexed tables and contain an ORDER BY statement on the date field (where date is the field you are loading as Day in BOARD).

This is because every data reader protocol creates infocube pages: each page contains a certain number of days (as described above in point 4, pages are indexed by day). If the data is already sorted by day when it is read, BOARD can keep working on the same page across subsequent records, minimizing page changes when writing data into the infocube.
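A small, self-contained illustration of the effect (the one-page-per-week grouping is invented; only the relative difference between sorted and unsorted input matters):

    import random
    from datetime import date, timedelta

    random.seed(0)

    def page_switches(days, days_per_page=7):
        """Count how often consecutive records fall on a different (hypothetical) page."""
        switches, last_page = 0, None
        for d in days:
            page = d.toordinal() // days_per_page
            if page != last_page:
                switches += 1
                last_page = page
        return switches

    # 2.000 records spread over a 3-year day range, in random (unsorted) order.
    days = [date(2015, 1, 1) + timedelta(days=random.randrange(3 * 365)) for _ in range(2_000)]

    print("unsorted:", page_switches(days))          # nearly one page change per record
    print("sorted:  ", page_switches(sorted(days)))  # roughly one change per week of the range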

More details and information about optimizing data loading can be found in How to: Optimize Datareader performance when loading large files.

 

 

 

7. DATA FLOW TUNING

BOARD dataflows may use one of 5 different algorithms when performing calculations (ordered from best to worst performance):

  1. HBMP

  2. Pattern

  3. LargeMatrix

  4. SmallMatrix

  5. Cellbased

But how does BOARD decide which algorithm to use?

The following rules must be followed to obtain HBMP:

  1. The target cube must be the last block of the dataflow layout and must be at least at block b (i.e. c=b*a can run in HBMP, while a=b*c won't run in HBMP).

  2. Cube structures must be as close as possible: if the target cube structure is a subset of one of the source cubes' structures, the dataflow will run in HBMP; if it is only a subset of the superset of all the source cubes' structures, you will get the Pattern algorithm.

  3. Sums and differences must not appear in the formula: c=a+b is not HBMP, c=a-b is not HBMP, while c=a*b may be HBMP depending on the cube structures (basically, only the * and / operators can be used to obtain HBMP).

  4. Refer to, Total by and Rules should not be used if you want to obtain HBMP.

  5. The dataflow layout must contain only cubes (no intermediate blocks with algorithms).

  6. The cubes must all be of numerical data types (no text, no dates).

If you break one of these rules you will fall back to the Cellbased algorithm; if you are using a ROLAP cube, you will get at most the LargeMatrix algorithm.
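As an informal summary of these rules (illustrative only, not BOARD's actual algorithm-selection logic), a checklist might look like this:

    # Informal checklist condensing the six rules above; purely illustrative,
    # not BOARD's real decision logic.
    def can_run_hbmp(formula, target_is_last_block, target_block,
                     target_structure, source_structures,
                     uses_refer_to_total_by_or_rules,
                     only_cubes_in_layout, all_numeric):
        if not (target_is_last_block and target_block >= "b"):           # rule 1
            return False
        if not any(target_structure <= s for s in source_structures):    # rule 2
            return False
        if "+" in formula or "-" in formula:                              # rule 3
            return False
        if uses_refer_to_total_by_or_rules:                               # rule 4
            return False
        return only_cubes_in_layout and all_numeric                       # rules 5 and 6

    # c=a*b, where the target structure is a subset of source a's structure -> HBMP
    print(can_run_hbmp("c=a*b", True, "c",
                       {"Month", "Product"},
                       [{"Month", "Product", "Customer"}, {"Month"}],
                       uses_refer_to_total_by_or_rules=False,
                       only_cubes_in_layout=True, all_numeric=True))      # True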

Always check the database log to understand which algorithm you are using, then optimize your dataflows following the previously described rules.

For more information and additional examples on how to obtain the HBMP calculation method, please read Dataflow execution methods: high performance in depth.
