Working with large volume of data (RAM is not enough)
#1
Hi everyone,
Can you please point me to materials / tutorials on how to work with data that does not fit into memory?
What are the best practices for this? I'm planning a project where I will face huge volumes of strings (gigabytes' worth) for which a one-hot representation will be needed, meaning huge arrays of millions x millions.
Before I even start I'd like to prepare for the task.

Thank you in advance
Regards
Evo
#2
Hello and Welcome to the forum!

Can you split the data into smaller pieces and work with them?

Take a look at Dask.

You will also need to learn about iterators and generators in Python.
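If it helps, here is a rough, untested sketch of that idea: a generator that yields a large text file in batches of lines, so only one batch is ever in memory (the file name and batch size are just placeholders):
def read_in_batches(path, batch_size=10000):
    """Yield lists of lines, batch_size lines at a time."""
    batch = []
    with open(path) as fp:
        for line in fp:
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:                      # whatever is left over at the end
        yield batch

for batch in read_in_batches('huge_corpus.txt'):
    ...  # process one batch; only batch_size lines are in memory at once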
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#3
That is one of the reasons for databases, e.g. SQLite in Python. For large amounts of data you are only limited by the amount of disk space. It may take a while to insert and index, so you will want to become familiar with 'write many'.
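Something along these lines, as a rough untested sketch (the database, table and example data are made up); the point is that the data sits on disk and the cursor hands you one row at a time:
import sqlite3

conn = sqlite3.connect('corpus.db')            # the data lives on disk, not in RAM
conn.execute('CREATE TABLE IF NOT EXISTS sentences (id INTEGER PRIMARY KEY, text TEXT)')
conn.execute('INSERT INTO sentences (text) VALUES (?)', ('one example sentence',))
conn.commit()

# iterating over the cursor fetches rows lazily, one at a time
for (sentence,) in conn.execute('SELECT text FROM sentences'):
    ...  # do stuff with sentence

conn.close()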
#4
Thank you both for your valuable advice.
@woooee - so what you are suggesting is to use a database, load the text e.g. one sentence per row, and do the operations row by row. I'll definitely explore this. What do you mean by 'write many'? Unfortunately uncle Google doesn't offer much help, as 'python write many' is too popular a search term, turning up a whole bunch of unrelated questions.

Thanks again
Evo
#5
Reading a file line by line (if text) will only use enough memory for the actual record (line):
Note: none of this code has been tested.
Instead of:
with open('Myfile.txt') as fp:
    buffer = fp.readlines()
for line in buffer:
    ...  # do stuff with each line
which reads the entire file into memory,

use this to read record by record:
with open('Myfile.txt') as fp:
    for line in fp:
        ...  # do stuff with each line
This keeps only one record in memory at a time.

This doesn't help if it's a binary file, however. In that case, open the file as 'rb' and read it chunk by chunk (keep in mind the last chunk can be any size up to chunksize, including 0):
fp.read(chunksize)
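For instance, a rough untested sketch of such a chunked read loop (the file name and chunk size are arbitrary):
with open('Myfile.bin', 'rb') as fp:
    chunksize = 1024 * 1024                    # 1 MiB; pick whatever fits your RAM budget
    while True:
        chunk = fp.read(chunksize)
        if not chunk:                          # empty bytes object means end of file
            break
        ...  # do stuff with chunk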
#6
An SQLite tutorial: http://zetcode.com/db/sqlitepythontutorial/ (search for executemany). Generally, executemany followed by a commit takes less time than inserting records one at a time.
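A rough, untested sketch of what that batching might look like (database, table and file names are made up); executemany inserts a whole batch per call and the commit happens once per batch rather than once per row:
import sqlite3

conn = sqlite3.connect('corpus.db')
conn.execute('CREATE TABLE IF NOT EXISTS sentences (id INTEGER PRIMARY KEY, text TEXT)')

with open('huge_corpus.txt') as fp:
    batch = []
    for line in fp:
        batch.append((line.strip(),))          # executemany expects a sequence of tuples
        if len(batch) == 50000:                # 50,000 rows per round trip, adjust to taste
            conn.executemany('INSERT INTO sentences (text) VALUES (?)', batch)
            conn.commit()                      # one commit per batch, not per row
            batch = []
    if batch:                                  # insert whatever is left over
        conn.executemany('INSERT INTO sentences (text) VALUES (?)', batch)
        conn.commit()

conn.close()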
#7
woooee.

It is true that databases will ease access after creation. The time to load, say, 20 billion records is, however, usually prohibitive, especially if the data is volatile. My background is in telecommunications, where huge files are coming in and out steadily, never stopping. The only way to process data of this type is serially, and it requires special handling. Typically these files are captured at set intervals, sort of like putting a pitcher under a waterfall: fill the pitcher, process the data, get another pitcher, in a never-ending chain.

So a relational database sounds like a great idea, but it only works where the volume is small enough to be processed in a set amount of time.
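Just to illustrate that pattern, a rough untested sketch of such a capture-and-process loop (the directory, file pattern and interval are made up):
import glob
import os
import time

while True:                                    # a never-ending chain of "pitchers"
    for path in glob.glob('/incoming/*.dat'):  # whatever files have arrived so far
        with open(path, 'rb') as fp:
            for chunk in iter(lambda: fp.read(1024 * 1024), b''):
                ...  # process the chunk serially
        os.remove(path)                        # done with this pitcher
    time.sleep(60)                             # wait for the next capture interval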

