[TECH] Dataflow concurrency in DB with a lot of users
Hello,
I have a budgeting datamodel with 300 users (facility manager) where we are sometime facing DB crash / data loss. Since it is a budgeting model, we have a lof of dataflows launched by user to perform calculations.
- For an unknown reason sometime a BOARD is not able to complete a simple / basic dataflow, so the dataflow is stuck in the running process screen and all subsequent users are stuck and will no longer be able to launch dataflows
- So we are forced to kill board engine thus we are loosing data / cubes being corrupted
I'm trying to optimize the performance the data model and reduce cube tech lock / user congestion when several users are launching dataflows on the same cube at the same time.
Here an example :
Current situation : In all our users procedure, we have a dataflow where we are feeding one final consolidated P&L cube (approx 50mo) :
- Month : 100 members - Dense
- Facility : 300 members - Sparse
- P&L Item / Accounts : 300 members - Sparse
- Version : 10 members - Sparse
Tomorrow : We are thinking to split / allocates users to specific cube. So instead of having 1 conso cube, I will have 3 working cubes to allocate users + 1 conso cube.
The 3 cubes will have the exact same entities and same sparsity as the original cube
Before :
- 300 users launching dataflows to feed the unique P&L cube
After :
- 100 users launching dataflows to feed cube P&L 1
- 100 users launching dataflows to feed cube P&L 2
- 100 users launching dataflows to feed cube P&L 3
Then we will have an admin procedure (not launched by users but maybe only 5 times per day to feed the original P&L conso cube) :
- Conso P&L cube = P&L cube 1 + P&L cube 2 + P&L cube 3
Question 1
⇒ I would like to know if this architecture will help BOARD to support more user dataflow concurrency or not. If yes, could you explain why ?
⇒ I have the feeling that the data stored in the sparsity file and that it this sparsity file is shared among all cubes that have the same entities & sparse configuration.
⇒ Does it means that spreading users to specific cube will not help BOARD DB to support more dataflow concurrency since those cubes are sharing the same sparsity file ?
Question 2
Entity "Facility" is SPARSE while most of the time 1 user is dedicated to the budget of 1 facility member so in the end , each 300 users will launch a lot of dataflows but each user working only on 1 dedicated facility member.
Do you think it useful that switching that entity to DENSE instead of SPARSE may help dataflow concurrency issue ?
thanks,
julien
Answers
-
Dear @Julien CARDON,
Be aware since the introduction of B12 concurrent actions on a data model have been greatly improved in terms of performance and overall stability, and with the most recent release temporary cubes are now supported not just in Data Flow but also in Data readers reducing possible concurrency issues, for example when multiple users update the same cube via multiple instances of the same procedure.
Board can handle concurrent actions, could be that your dataflow gets stuck in a moment of high concurrency due to resource saturation, the server CPU consumption is the most critical parameter in terms of application responsiveness and performance in a concurrent scenario. Data loss or any inconsistency at cube/sparsity level could be caused by a service restart while a dataflow is still running.
The solution you propose has just a mitigation effect and you are right, a sparsity is a shared object among cubes, so even assigning your users to different cubes the sparsity would remain the same one accessed by all.
As you understood the choice of dense vs sparse has a great impact especially if your users work on a specific slice of the cube, different per each of them. Switching the facility dimension to dense would help to reduce the size of the sparsity but on the other side could increase the overall cube size, so you need to find the right balance.
Another dimension you can test to switch to dense would be the Version since we can assume your users work on a single version while running the data flow.
1