Python Forum
Parquet file generation
#1
I hope this message finds you well. I'm encountering an intriguing issue with our data processing pipeline and would greatly appreciate your insights.
Our current process involves reading CSV files and converting them to Parquet format. When I load these CSV files into DataFrames, they appear to have nearly identical sizes. However, upon conversion to Parquet, I've noticed a significant discrepancy: one of the resulting Parquet files is approximately three times larger than the other.
For context, these files contain monthly snapshot data, and there isn't substantial variance between them. This size difference is puzzling, given the similarity of the source data.
Key points:
  • The CSV files are of similar size when loaded into DataFrames (same number of columns, nearly the same number of rows, same datatypes)
  • After conversion to Parquet, one file is roughly 3x larger than the other
  • The data represents monthly snapshots with minimal variance
I'm keen to understand the underlying cause of this size disparity and would welcome any suggestions or insights you might have.
Thank you in advance for your assistance.
#2
Why are you asking? Do you think you are losing information during compression? You can read the Parquet files back and check. Or are you just wondering why some files compress down smaller than others? Compressibility is highly dependent on content, and a 3x difference doesn’t surprise me at all.
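To make the point about content concrete, here is a minimal sketch using only the standard library. It is not Parquet itself: zlib stands in for whatever codec your Parquet writer uses, and the two "columns" (`low_cardinality`, `high_cardinality`) are made-up data, not anything from your files. Both have the same number of rows and the same type (8-character strings), yet they should compress to very different sizes.

```python
import random
import zlib


def compressed_size(values):
    """Return the zlib-compressed size in bytes of the newline-joined values."""
    raw = "\n".join(values).encode("utf-8")
    return len(zlib.compress(raw))


rng = random.Random(0)  # fixed seed so the run is repeatable
n_rows = 100_000

# Hypothetical columns: same row count, same type, very different content.
# One repeats three status codes, the other holds random hex identifiers.
low_cardinality = [rng.choice(["ACTIVE__", "CLOSED__", "PENDING_"]) for _ in range(n_rows)]
high_cardinality = ["%08x" % rng.getrandbits(32) for _ in range(n_rows)]

small = compressed_size(low_cardinality)
large = compressed_size(high_cardinality)

print(f"low-cardinality column:  {small} bytes")
print(f"high-cardinality column: {large} bytes")
print(f"ratio: {large / small:.1f}x")
```

If one month's snapshot has a column that suddenly holds many more distinct values (new IDs, timestamps with finer resolution, free-text notes), that alone could account for the gap. For the real files, pyarrow exposes per-column sizes in the Parquet footer (`pyarrow.parquet.ParquetFile(path).metadata`, then `row_group(i).column(j).total_compressed_size`), so comparing those between the two files should show which column is responsible.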