Python Forum
Write sql data or CSV Data into parquet file
#1
Hi Team,

How do I write SQL table data to a parquet file and compress it with gzip? The SQL table is about 60 GB.

Alternatively, could I write the SQL data to CSV first and then convert the CSV to parquet?

I think pandas is too slow, or can't handle 60 GB of data at once. Below is my attempted code.

--------- Attempt to write SQL table to parquet ---------
for chunk in pd.read_sql_table('employee',connection,chunksize=1000):
	mylist.append(chunk)
df = pd.concat(mylist)
df.to_parquet('{path}.gzip', compression='gzip', index=False)


def to_parquet(data, path):
    df = pd.DataFrame(data)
    df.to_parquet('{path}.gzip', compression='gzip', index=False)
#2
60 GB is big! I can't imagine a 60 GB CSV file!

When you export the data from MySQL, SELECT 10,000 rows or so at a time and export each batch as CSV; then you have nice, manageable file sizes.
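For example, something like this: a sketch of chunked export to gzipped CSV using only the standard library, with an in-memory SQLite table standing in for your real database (the table and column names are made up):

```python
import csv
import gzip
import sqlite3

# Demo table in an in-memory SQLite DB (stands in for the real 60 GB table)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO employee VALUES (?, ?)",
                [(i, f"emp{i}") for i in range(25)])
conn.commit()

CHUNK = 10  # rows per fetch; use something much larger for real data
cur.execute("SELECT id, name FROM employee")
with gzip.open("employee.csv.gz", "wt", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])  # header row
    while True:
        rows = cur.fetchmany(CHUNK)  # only CHUNK rows in memory at a time
        if not rows:
            break
        writer.writerows(rows)
conn.close()
```

Because fetchmany() pulls a bounded number of rows per call, memory use stays flat no matter how big the table is.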
#3
I don't understand everything in the code you show us. I see you are using variables that are not defined, so I guess the code produces an error message. You should always include the full error message.

Although I don't understand everything, I will show you the lines I think cause problems.
mylist.append(chunk)
df = pd.concat(mylist)

  1. You are appending chunk to mylist, but you did not define mylist.
  2. By saving all the data in mylist, you keep all the data in memory. As you know, you don't have enough RAM to hold 60 GB. You should write each chunk to the destination immediately instead of saving it.
  3. You then concatenate mylist into df, so the memory use accumulates again. This step is also not needed when you process each chunk immediately.

df.to_parquet('{path}.gzip', compression='gzip', index=False)

  1. You did not define path. If you define it, you also need to make the string an f-string, otherwise the {path} placeholder is not filled in.
  2. I think you can write each chunk to parquet immediately, like this:
    chunk.to_parquet(f'{path}.gzip', compression='gzip', index=False)





def to_parquet(data, path):

  1. Here you define your own function named to_parquet(), shadowing the name of pandas' to_parquet() method. That seems confusing and error-prone to me.

[\python]
  1. You used a backslash instead of a forward slash in the closing tag. That is why your code does not display properly.

Please let us know if this helps you.

