Python Forum

Full Version: Write sql data or CSV Data into parquet file
Hi Team,

How can I write SQL table data to a Parquet file and compress it with gzip?
The SQL table is 60 GB.

Or should I write the SQL data to CSV first and then convert that to Parquet?

I think pandas is too slow or can't handle 60 GB of data.
Below is my attempted code.

---------Attempt to parquet sql table-----------------
for chunk in pd.read_sql_table('employee',connection,chunksize=1000):
	mylist.append(chunk)
df = pd.concat(mylist)
df.to_parquet('{path}.gzip', compression='gzip', index=False)


def to_parquet(data, path):
    df = pd.DataFrame(data)
    df.to_parquet('{path}.gzip', compression='gzip', index=False)
60GB is a big file! I can't imagine a 60GB csv file!

When you export the data from MySQL, SELECT about 10,000 rows at a time and export each batch as CSV. That gives you files of a nice, manageable size.
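As a rough sketch of that batch-wise export, the function below pulls rows through a DB-API cursor in batches and appends them to a CSV file, so only one batch is held in Python at a time. It uses sqlite3 from the standard library as a stand-in for the real database; the function name export_in_batches and its parameters are made up for illustration, and the table name is assumed to come from trusted code because it is pasted into the query.

```python
import csv
import sqlite3


def export_in_batches(connection, table, csv_path, batch_size=10_000):
    """Write every row of `table` to `csv_path`, fetching `batch_size` rows at a time."""
    cursor = connection.cursor()
    # Assumption: `table` is a trusted name, not user input.
    cursor.execute(f"SELECT * FROM {table}")
    total = 0
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        # Header row taken from the cursor metadata.
        writer.writerow([column[0] for column in cursor.description])
        while True:
            rows = cursor.fetchmany(batch_size)
            if not rows:
                break  # no more rows to fetch
            writer.writerows(rows)
            total += len(rows)
    return total
```

A call would look like export_in_batches(sqlite3.connect("company.db"), "employee", "employee.csv"), where the database file name is again only an example.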
I don't understand everything in the code you showed us, but I see you are using variables that are not defined, so I guess the code produces an error message. You should always include the full error message.

Although I don't understand everything, I will show you the lines that I think cause problems.
mylist.append(chunk)
df = pd.concat(mylist)

  1. You append each chunk to mylist, but you never defined mylist.
  2. By collecting every chunk in mylist, you keep all the data in memory. As you know, you don't have enough RAM to hold 60GB. You should write each chunk to the destination immediately instead of keeping it.
  3. pd.concat(mylist) then copies everything from mylist into df, so memory use grows even further. This step is not needed either when you process each chunk immediately.

df.to_parquet('{path}.gzip', compression='gzip', index=False)

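  1. You did not define path. Once you do, the string also has to be an f-string; otherwise the file is literally named '{path}.gzip'.
  2. I think you can write each chunk to parquet immediately, like this:
    chunk.to_parquet(f'{path}.gzip', compression='gzip', index=False)
     Note that every chunk then needs its own file name, because writing to the same path again overwrites the previous chunk.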





def to_parquet(data, path):

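  1. Here you define your own function with the same name as pandas' to_parquet() method. This does not replace the DataFrame method, but it is confusing and seems risky to me. A different name would be safer.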

[\python]
  1. You closed the code tag with a backslash instead of a slash; it should be [/python]. That is why your code is not formatted properly.

Please let us know if this helps you.