1. What is your idea?
Board has multiple steps to extract data from cubes:
- Extract Cube
- Export Data View to file
- Export Dataset
- Extract all cubes
All of these steps save data as text or CSV files. We could save disk space and improve performance by using Parquet files instead.
Head-to-head comparison
| CSV | Parquet |
|---|---|
| Row-based storage format. | A hybrid of row-based and column-based storage formats. |
| Consumes a lot of space because no default compression option is available. For example, a 1 TB file still occupies 1 TB when stored on Amazon S3 or any other cloud storage. | Compresses data while storing it, thus consuming less space. A 1 TB file stored in Parquet format will take up only about 130 GB. |
| Query run time is slow because of the row-based search. To read a single column, every row of data has to be retrieved. | Query time is about 34 times faster because of the column-based storage and the presence of metadata. |
| More data has to be scanned per query. | About 99% less data is scanned to execute a query, which improves performance. |
| Most storage services charge based on storage space, so the CSV format means high storage cost. | Lower storage cost, as data is stored in a compressed, encoded format. |
| File schema has to be either inferred (leading to errors) or supplied (tedious). | File schema is stored in the metadata. |
| Suitable for simple data types. | Suitable even for complex types such as nested schemas, arrays, and dictionaries. |
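
To illustrate two of the points in the comparison (schema handling and how much data a single-column query touches), here is a minimal sketch using only the Python standard library. It does not write real Parquet files and does not use any Board API; the sample rows and all variable names are made up for illustration, and the column layout is only a simplified stand-in for what a columnar format does.

```python
import csv
import io

# Hypothetical sample rows standing in for data extracted from a cube.
rows = [
    {"month": "2024-01", "product": "A", "sales": 120.5},
    {"month": "2024-01", "product": "B", "sales": 80.0},
    {"month": "2024-02", "product": "A", "sales": 95.25},
]

# Row-based (CSV): write the rows out and read them back.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["month", "product", "sales"])
writer.writeheader()
writer.writerows(rows)
buf.seek(0)
csv_rows = list(csv.DictReader(buf))

# CSV carries no schema: every value comes back as a string,
# so the reader has to infer or be told that "sales" is numeric.
csv_sales_type = type(csv_rows[0]["sales"]).__name__

# Summing one column still means parsing every field of every row.
fields_parsed_row_based = sum(len(r) for r in csv_rows)
total_from_csv = sum(float(r["sales"]) for r in csv_rows)

# Column-based (simplified): one list per column, plus a stored schema.
columns = {name: [r[name] for r in rows] for name in rows[0]}
schema = {name: type(values[0]).__name__ for name, values in columns.items()}

# Summing one column only touches that column's values.
values_read_column_based = len(columns["sales"])
total_from_columns = sum(columns["sales"])

print(csv_sales_type, schema["sales"])
print(fields_parsed_row_based, values_read_column_based)
```

With three columns, the row-based read parses three times as many values as the column-based read to compute the same total; the gap grows with the number of columns in the extract.
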
Please add an option to extract data as a Parquet file.
More information:
https://geekflare.com/parquet-csv-data-storage/
2. What specific problem are you trying to find a solution to, or what new scenario would this idea respond to?
Having a new, more compact output format for data extracted from Board, as an alternative to text and CSV files.
3. What workaround have you found and used so far (if any)?
No workaround available.