Python Forum

Working with large volume of data (RAM is not enough)
Hi everyone,
Can you please point me to materials / tutorials on how to work with data that does not fit into memory?
What are the best practices for this? I'm planning a project where I will face huge volumes of strings (GBs) that need a one-hot representation, which means huge arrays on the order of millions x millions.
Before I even start I'd like to prepare for the task.

Thank you in advance
Regards
Evo
Hello and Welcome to the forum!

Can you split the data into smaller pieces and work with them?

Take a look at Dask.
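As a rough illustration (assuming the strings are spread over CSV files named data-*.csv with a text column - both names are made up for the example), Dask builds a lazy task graph and only streams the data chunk by chunk when you call compute():

import dask.dataframe as dd

df = dd.read_csv('data-*.csv')                  # lazy - nothing is loaded yet
print(df['text'].str.len().mean().compute())    # the work runs chunk by chunk here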

You will also have to learn about iterators and generators in Python.
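An untested sketch of that idea - a generator that yields one line at a time, so only the current record sits in memory (the file name and the split() step are just placeholders):

def lines(path):
    with open(path) as fp:
        for line in fp:
            yield line.rstrip('\n')    # hand back one record, keep nothing else

for sentence in lines('Myfile.txt'):
    tokens = sentence.split()          # ... do stuff with one sentence at a time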
That is one of the reasons for databases, e.g. SQLite in Python. For large amounts of data you are only limited by the amount of disk. It may take a while to insert and index, so you want to become familiar with write many.
Thank you both for your valuable advice.
@woooee - so what you are suggesting is to use a database, load the text (e.g. one sentence per row) and do the operations row by row. I'll definitely explore this. What do you mean by 'write many'? Unfortunately uncle Google doesn't offer much help, as 'python write many' is too popular a search term and brings up a whole bunch of other questions.

Thanks again
Evo
Reading a file line by line (if text) will only use enough memory for the actual record (line):
Note: none of this code has been tested.
Instead of:
with open('Myfile.txt') as fp:
    buffer = fp.readlines()    # pulls every line into a list
for line in buffer:
    pass                       # ... do stuff
which reads the entire file into memory,

use this to read record by record:
with open('Myfile.txt') as fp:
    for line in fp:
        pass                   # ... do stuff
which keeps only one record in memory at a time.

This however doesn't help if it's a binary file. In that case, open the file as 'rb' and read it in chunks (keep in mind the last chunk can be any size up to chunksize, including 0):
fp.read(chunksize)
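Something along these lines (untested; the file name and chunk size are arbitrary):

chunksize = 1024 * 1024                  # 1 MB per read
with open('Myfile.bin', 'rb') as fp:
    while True:
        chunk = fp.read(chunksize)
        if not chunk:                    # empty bytes object means end of file
            break
        # ... do stuff with chunk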
An SQLite tutorial: http://zetcode.com/db/sqlitepythontutorial/ (search for executemany). Generally, executemany followed by a commit takes less time than inserting records one at a time.
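For example (untested sketch; the database, table and column names are invented for illustration), you can feed executemany a generator so the whole file never sits in memory:

import sqlite3

def rows(path):
    with open(path) as fp:
        for line in fp:
            yield (line.rstrip('\n'),)   # one-element tuple per row

conn = sqlite3.connect('corpus.db')
conn.execute('CREATE TABLE IF NOT EXISTS sentences (id INTEGER PRIMARY KEY, text TEXT)')
conn.executemany('INSERT INTO sentences (text) VALUES (?)', rows('Myfile.txt'))
conn.commit()                            # single commit after the bulk insert
conn.close()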
woooee.

It is true that databases will ease access after creation. The time needed to load, say, 20 billion records is, however, usually prohibitive, especially if the data is volatile. My background is telecommunications, where huge files come in and out steadily, never stopping. The only way to process data of this type is serially, and it requires special handling. Typically these files are captured at set intervals, sort of like putting a pitcher under a waterfall: fill the pitcher, process the data, get another pitcher, in a never-ending chain.

So a relational database sounds like a great idea, but it only works where the volume is small enough to be processed in a set amount of time.