Python Forum
Write sql data or CSV Data into parquet file
#1
Hi Team,

How do I write SQL table data to a Parquet file with gzip compression? The table is about 60 GB.

Alternatively, could I write the SQL data to CSV first and then convert the CSV to Parquet?

I think pandas is too slow, or can't handle 60 GB of data. Below is my attempted code.

# --------- Attempt to write SQL table to parquet ---------
for chunk in pd.read_sql_table('employee', connection, chunksize=1000):
    mylist.append(chunk)
df = pd.concat(mylist)
df.to_parquet('{path}.gzip', compression='gzip', index=False)


def to_parquet(data, path):
    df = pd.DataFrame(data)
    df.to_parquet('{path}.gzip', compression='gzip', index=False)
Reply
#2
60 GB is a big file! I can't imagine a 60 GB CSV file!

When you export the data from MySQL, SELECT about 10,000 rows at a time and export each batch as its own CSV; then you have nice, manageable file sizes.
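A rough sketch of that idea using pandas' chunksize to stream the query and write one CSV file per batch. The in-memory SQLite table here is only a stand-in for your real MySQL connection, and the table/file names are made up for the example:

```python
import sqlite3

import pandas as pd

# Demo setup: an in-memory SQLite table standing in for the MySQL database.
connection = sqlite3.connect(":memory:")
pd.DataFrame({"id": range(30), "name": ["x"] * 30}).to_sql(
    "employee", connection, index=False)

# Stream the query result and write each batch to its own CSV file,
# so only one chunk is ever held in memory at a time.
for i, chunk in enumerate(
        pd.read_sql_query("SELECT * FROM employee", connection, chunksize=10)):
    chunk.to_csv(f"employee_part{i:04d}.csv", index=False)
```

With a 60 GB table you would use a much larger chunksize (e.g. 100,000 rows), but the pattern is the same.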
Reply
#3
I don't understand everything in the code you showed us. I see you are using variables that are not defined, so I guess the code produces an error message. You should always include the full error message.

Although I don't understand everything, I will point out the lines that I think cause problems.
mylist.append(chunk)
df = pd.concat(mylist)

  1. You are appending chunk to mylist, but you did not define mylist.
  2. By saving all the chunks in mylist, you keep all the data in memory. As you know, you don't have enough RAM to hold 60 GB. You should write each chunk immediately to the destination instead of saving it.
  3. You then concatenate mylist into df, so the memory use accumulates again. This step is also not needed when you process each chunk immediately.

df.to_parquet('{path}.gzip', compression='gzip', index=False)

  1. You did not define path. If you do, you also need to make the string an f-string.
  2. I think you can immediately write each chunk to parquet, like this:
    chunk.to_parquet(f'{path}.gzip', compression='gzip', index=False)
    Note that each chunk then needs its own file name (or a writer that appends); otherwise every chunk overwrites the previous one.





def to_parquet(data, path):

  1. Here you define your own function named to_parquet(). It does not actually overwrite pandas' DataFrame.to_parquet() method, but reusing the name is confusing and seems dangerous to me; I would pick a different name.

[\python]
  1. You used a backslash instead of a slash in the closing tag; it should be [/python]. That is why your code does not display correctly.

Please let us know if this helps you.
Reply

