Python Forum
Write sql data or CSV Data into parquet file
#1
Hi Team,

How do I write SQL table data to a Parquet file with gzip compression? The table is about 60 GB.

Alternatively, could I write the SQL data to CSV first and then convert the CSV to Parquet?

I think pandas is too slow, or can't handle 60 GB of data. Below is my attempted code.

# --------- Attempt to write SQL table to parquet ---------
for chunk in pd.read_sql_table('employee', connection, chunksize=1000):
    mylist.append(chunk)
df = pd.concat(mylist)
df.to_parquet('{path}.gzip', compression='gzip', index=False)


def to_parquet(data, path):
    df = pd.DataFrame(data)
    df.to_parquet('{path}.gzip', compression='gzip', index=False)
Reply
#2
60 GB is a big file! I can't imagine a 60 GB CSV file!

When you export the data from MySQL, SELECT about 10,000 rows at a time and export each batch as its own CSV; then you have nice, manageable file sizes.
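A rough sketch of that idea using pandas' chunksize to stream the query and write one CSV file per batch. The in-memory SQLite table here is only a stand-in for your real MySQL connection, and the table/file names are made up for the example:

```python
import sqlite3

import pandas as pd

# Demo setup: an in-memory SQLite table standing in for the MySQL database.
connection = sqlite3.connect(":memory:")
pd.DataFrame({"id": range(30), "name": ["x"] * 30}).to_sql(
    "employee", connection, index=False)

# Stream the query result and write each batch to its own CSV file,
# so only one chunk is ever held in memory at a time.
for i, chunk in enumerate(
        pd.read_sql_query("SELECT * FROM employee", connection, chunksize=10)):
    chunk.to_csv(f"employee_part{i:04d}.csv", index=False)
```

With a 60 GB table you would use a much larger chunksize (e.g. 100,000 rows), but the pattern is the same.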
Reply
#3
I don't understand everything in the code you showed us. I see you are using variables that are not defined, so I guess the code produces an error message. You should always include the full error message.

Although I don't understand everything, I will point out the lines that I think cause problems.
mylist.append(chunk)
df = pd.concat(mylist)

  1. You are appending chunk to mylist, but you did not define mylist.
  2. By saving all the chunks in mylist, you keep all the data in memory. As you know, you don't have enough RAM to hold 60 GB. You should write each chunk immediately to the destination instead of saving it.
  3. You then concatenate mylist into df, so the memory use accumulates again. This step is also not needed when you process each chunk immediately.

df.to_parquet('{path}.gzip', compression='gzip', index=False)

  1. You did not define path. If you do, you also need to make the string an f-string.
  2. I think you can immediately write each chunk to parquet, like this:
    chunk.to_parquet(f'{path}.gzip', compression='gzip', index=False)
    Note that each chunk then needs its own file name (or a writer that appends); otherwise every chunk overwrites the previous one.





def to_parquet(data, path):

  1. Here you define your own function named to_parquet(). It does not actually overwrite pandas' DataFrame.to_parquet() method, but reusing the name is confusing and seems dangerous to me; I would pick a different name.

[\python]
  1. You used a backslash instead of a slash in the closing tag; it should be [/python]. That is why your code does not display correctly.

Please let us know if this helps you.
Reply

