1. What is your idea?
Board supports many data sources, of which CSV files are probably the most used. However, with ever-increasing amounts of data, working with CSV files becomes a pain. Parquet files can be an alternative.
Head-to-head comparison
CSV | Parquet |
---|---|
Row-based storage format. | A hybrid of row-based and column-based storage formats. |
Consumes a lot of space because no default compression option is available. For example, a 1 TB file occupies the full 1 TB when stored on Amazon S3 or any other cloud storage. | Compresses data while storing, thus consuming less space. A 1 TB file stored in Parquet format takes up only about 130 GB. |
Queries are slow because of the row-based layout: to read a single column, every row of data has to be retrieved. | Queries are about 34 times faster because of the column-based storage and the presence of metadata. |
More data has to be scanned per query. | About 99% less data is scanned for the execution of the query, thus optimizing performance. |
Most storage services charge based on the space used, so the CSV format means higher storage costs. | Lower storage costs because data is stored in a compressed, encoded format. |
File schema has to be either inferred (leading to errors) or supplied (tedious). | File schema is stored in the metadata. |
Suitable for simple data types. | Suitable even for complex types such as nested schemas, arrays, and dictionaries. |
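
To make the table a bit more concrete, here is a minimal Python sketch of the row-based vs. column-based idea. It is only a toy model built on the standard library, not the real Parquet format: the sample data, the `SCHEMA` dictionary, and all function names are made up for illustration. It shows three points from the comparison: the schema travels with the data, each column is compressed separately, and a query on one column only has to read that column.

```python
import csv
import io
import json
import zlib

# Hypothetical sample data: 1,000 rows with three columns.
ROWS = [
    {"id": i, "country": "DE" if i % 2 else "IT", "amount": i * 10}
    for i in range(1000)
]
# Hypothetical schema description, stored alongside the columnar data.
SCHEMA = {"id": "int", "country": "str", "amount": "int"}


def to_csv_bytes(rows):
    """Row-based layout: one text line per record, no schema, no compression."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(SCHEMA))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")


def to_columnar(rows):
    """Toy column-based layout: each column is compressed on its own,
    and the schema is kept as metadata next to the column chunks."""
    chunks = {
        name: zlib.compress(json.dumps([r[name] for r in rows]).encode("utf-8"))
        for name in SCHEMA
    }
    return {"schema": SCHEMA, "columns": chunks}


def sum_amount_csv(data):
    """Every row is parsed in full, and values arrive as untyped strings."""
    reader = csv.DictReader(io.StringIO(data.decode("utf-8")))
    return sum(int(row["amount"]) for row in reader)


def sum_amount_columnar(store):
    """Only the 'amount' chunk is decompressed; other columns are not read."""
    values = json.loads(zlib.decompress(store["columns"]["amount"]))
    return sum(values)


csv_bytes = to_csv_bytes(ROWS)
store = to_columnar(ROWS)
columnar_size = sum(len(chunk) for chunk in store["columns"].values())
bytes_scanned = len(store["columns"]["amount"])

print("CSV size in bytes:", len(csv_bytes))
print("Columnar size in bytes:", columnar_size)
print("Bytes read for sum(amount):", len(csv_bytes), "vs", bytes_scanned)
print("Same result:", sum_amount_csv(csv_bytes) == sum_amount_columnar(store))
```

The exact sizes depend on the data, so the numbers printed here say nothing about the 130 GB or 99% figures in the table; they only show the direction of the effect.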
Please add a new DataReader for Parquet files.
More information:
https://geekflare.com/parquet-csv-data-storage/
2. What specific problem are you trying to find a solution to, or what new scenario would this idea respond to?
Having Parquet files as a new data source for Board.
3. What workaround have you found and used so far (if any)?
No workaround available.