Hi all.
I have a loop that populates a list via a DB query of keys (short text). The raw data size is 1,580,990,412 bytes, i.e. ~1.5GB:
import sys

mydocsarray = []
for eachrow in cbs.my_query(q):
    mydocsarray.append(eachrow)

print(len(mydocsarray))
print(sys.getsizeof(mydocsarray))
When I run this, Task Manager on my Windows PC shows the process ballooning to over 14GB of memory.
The printed list length and sys.getsizeof result are:
49381034
424076264
Is this a coding issue or a PyCharm issue?
Quote: Is this a coding issue or a PyCharm issue?
If you use sys.getsizeof on a list, you'll see only how much the list object itself consumes in memory.
Each element in the list is just a reference to an object that lives somewhere else in memory.
Each of those objects has its own size, but sys.getsizeof is not recursive: you get the memory
consumption of the list itself, not of the objects it references.
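To see the gap this creates, compare the shallow size of a list with the shallow size plus the sum over its elements. A minimal sketch (exact byte counts vary by CPython version):

```python
import sys

# A list of 1000 short strings: getsizeof sees only the list's
# reference array, not the string objects themselves.
strings = ["x" * 100 for _ in range(1000)]

shallow = sys.getsizeof(strings)
deep = shallow + sum(sys.getsizeof(s) for s in strings)

print(shallow)  # a few KB: ~8 bytes per reference plus list overhead
print(deep)     # far larger: each str adds its own header on top of its text
```

Note this is still only one level deep; for nested structures you would need a recursive walk.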
(Jan-22-2020, 07:48 PM)DeaD_EyE Wrote: [ -> ]If you use sys.getsizeof on a list, you'll see only how much the list itself consumes in memory. [...]
Thanks.
But if the base dataset is ~1.5GB, how do I keep the list's footprint somewhere close to that, rather than ~10x it?
How do I avoid this overhead?
Anyone with any ideas/suggestions?
thanks
For what it's worth: using tuples (a list of tuples) instead of lists should decrease memory consumption by ~15%.
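For a concrete sense of that figure, compare one row stored both ways (sizes are CPython-specific; the field values below are just illustrative):

```python
import sys

# Same five fields as a list and as a tuple. The tuple has no
# over-allocation slack and a slightly smaller header.
row_list = ["addressbook", "us", "q", "0000000002", "steviejobs4"]
row_tuple = tuple(row_list)

print(sys.getsizeof(row_list), sys.getsizeof(row_tuple))
# Multiplied by ~50 million rows, the per-row saving adds up.
```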
(Jan-22-2020, 04:31 PM)fakka Wrote: [ -> ]Is this a coding issue or a pycharm issue ?
Is there a difference when you run it from the command line rather than from PyCharm?
What libraries do you use? What DB?
There is a built-in module, tracemalloc, which can be used to trace memory allocation.
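A quick sketch of using it (the 100,000-item list here just stands in for the real query results):

```python
import tracemalloc

tracemalloc.start()

rows = [str(i) for i in range(100_000)]  # stand-in for the DB rows

# (current, peak) bytes allocated since start()
current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")

# Top allocation sites, grouped by source line
for stat in tracemalloc.take_snapshot().statistics("lineno")[:3]:
    print(stat)

tracemalloc.stop()
```

The statistics output points at the exact lines doing the allocating, which is more useful than a single process-wide number from Task Manager.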
(Jan-23-2020, 05:14 PM)perfringo Wrote: [ -> ]There is built-in tracemalloc which could be used to trace memory allocation.
Thanks, I will google it.
Is there a suggestion for doing this a different way? These are just raw text fields, and I assume people hit this every day when pulling larger datasets (~1.5GB) from a DB.
I did confirm it's not PyCharm; the same thing happens with plain Python on Linux.
I dumped some data into a text file - roughly 50 million lines:
$ cat /tmp/datadump.txt | wc -l
49568121
$ du -sh /tmp/datadump.txt
2.2G /tmp/datadump.txt
{"metaid": "2018-09-19T18:22:12.577::2c6dba80-31b0-48ab-b40e-560822c46321"}
{"metaid": "addressbook:us:q:0000000002:steviejobs4"}
{"metaid": "addressbook:us:q:0000000002:steviejobs5"}
{"metaid": "addressbook:us:q:0000000007:itcinf1"}
{"metaid": "addressbook:us:q:0000000022:jj"}
{"metaid": "addressbook:us:q:0000000123:test4"}
{"metaid": "addressbook:us:q:0000000183:snake"}
{"metaid": "addressbook:us:q:0000000200:godofthunder3"}
{"metaid": "addressbook:us:q:0000000200:load test2"}
{"metaid": "addressbook:us:q:0000000430:delivery"}
I then ran this ....
if __name__ == '__main__':
    myarray = []
    with open("/tmp/datadump.txt", "r") as f:
        for eachrow in f:
            myarray.append(eachrow)
And monitored memory use at the OS level in another session:
MemFree: 6231624 kB
MemFree: 6231164 kB
MemFree: 6197600 kB
MemFree: 5962440 kB
MemFree: 5623356 kB
MemFree: 5265816 kB
MemFree: 4894304 kB
MemFree: 4530768 kB
MemFree: 4200192 kB
MemFree: 3876088 kB
MemFree: 3544328 kB
MemFree: 3169744 kB
MemFree: 2871492 kB
MemFree: 2595768 kB
MemFree: 2276404 kB
MemFree: 1986584 kB
MemFree: 1705416 kB
MemFree: 1471016 kB
So free memory went from 6.2GB down to 1.5GB, i.e. ~4.7GB consumed for a 2.2GB dataset.
Any ideas here?
Not sure how to take this any further.
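Much of the gap is fixed per-object cost: in CPython every str carries tens of bytes of header on top of its characters, and the list adds an 8-byte reference per element, so ~50 million short rows cost several GB beyond the raw text. Two ways around that, sketched below under the assumption the rows come from the dump file: stream them with a generator when one pass is enough, or pack them into a single bytes blob with an offsets array when random access is needed.

```python
from array import array

def iter_rows(path):
    # One pass, one line in memory at a time; footprint stays flat.
    with open(path, "rb") as f:
        for line in f:
            yield line.rstrip(b"\n")

class RowStore:
    """All rows packed into one bytes blob, indexed by an offsets array.

    Per-row cost is one 8-byte offset instead of a full str object
    plus a list reference.
    """
    def __init__(self, lines):
        parts = []
        offsets = array("q", [0])  # machine ints, no per-element objects
        pos = 0
        for line in lines:
            parts.append(line)
            pos += len(line)
            offsets.append(pos)
        self.blob = b"".join(parts)
        self.offsets = offsets

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        return self.blob[self.offsets[i]:self.offsets[i + 1]]

store = RowStore([b"row-a", b"row-bb", b"row-ccc"])
print(len(store), store[1])  # 3 b'row-bb'
```

Building the store as RowStore(iter_rows("/tmp/datadump.txt")) keeps peak memory near the raw file size, since no per-row str objects are ever materialized.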