Parquet file generation - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Parquet file generation (/thread-43655.html)
Parquet file generation - woliveiras - Dec-07-2024

I hope this message finds you well. I'm encountering an intriguing issue with our data processing pipeline and would greatly appreciate your insights.

Our current process reads CSV files and converts them to Parquet format. When I load these CSV files into DataFrames, they are nearly identical in size. However, after conversion to Parquet, one of the resulting files is approximately three times larger than the other. For context, these files contain monthly snapshot data, and there isn't substantial variance between them, so the size difference is puzzling given the similarity of the source data.

Key points:
- The CSV files are of similar size when loaded into DataFrames (same number of columns, almost the same number of rows, same datatypes)
- After conversion to Parquet, one file is roughly 3x larger
- The data represents monthly snapshots with minimal variance

I'm keen to understand the underlying cause of this size disparity and would welcome any suggestions or insights you might have. Thank you in advance for your assistance.

RE: Parquet file generation - deanhystad - Dec-07-2024

Why are you asking? Do you think you are losing information during compression? You can extract the information and check. Or are you just wondering why some files compress down smaller than others? Compressibility is highly dependent on content, and a 3x difference doesn't surprise me at all.