Hi all.
I have a loop that populates a list via a DB query of keys (short text). The raw data size is 1,580,990,412 bytes, i.e. ~1.5GB:
import sys

mydocsarray = []
for eachrow in cbs.my_query(q):
    mydocsarray.append(eachrow)

print(len(mydocsarray))
print(sys.getsizeof(mydocsarray))
When I run this, Task Manager on my Windows PC shows the process ballooning to over 14GB of memory.
The printed list length and sys.getsizeof result are:
49381034
424076264
Is this a coding issue or a PyCharm issue?
Quote: Is this a coding issue or a PyCharm issue?
If you use sys.getsizeof on a list, you'll see only how much the list object itself consumes in memory.
Each element in the list is just a reference to an object that lives somewhere else in memory.
Each of those objects has its own size, but sys.getsizeof is not recursive: you get the memory
consumption of the list itself, not of the objects it references.
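To see the gap this creates, compare the shallow size of a list with the shallow size plus the sum over its elements. A minimal sketch (exact byte counts vary by CPython version):

```python
import sys

# A list of 1000 short strings: getsizeof sees only the list's
# reference array, not the string objects themselves.
strings = ["x" * 100 for _ in range(1000)]

shallow = sys.getsizeof(strings)
deep = shallow + sum(sys.getsizeof(s) for s in strings)

print(shallow)  # a few KB: ~8 bytes per reference plus list overhead
print(deep)     # far larger: each str adds its own header on top of its text
```

Note this is still only one level deep; for nested structures you would need a recursive walk.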
(Jan-22-2020, 07:48 PM)DeaD_EyE Wrote: [ -> ]If you use sys.getsizeof on a list, you'll see only how much the list itself consumes in memory. [...]
Thanks.
But if the base dataset is ~1.5GB, how do I keep the list's footprint somewhere close to that, rather than ~10x it?
How do I avoid this overhead?
Anyone with any ideas/suggestions?
thanks
For what it's worth: using tuples (a list of tuples) instead of lists should decrease memory consumption by ~15%.
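For a concrete sense of that figure, compare one row stored both ways (sizes are CPython-specific; the field values below are just illustrative):

```python
import sys

# Same five fields as a list and as a tuple. The tuple has no
# over-allocation slack and a slightly smaller header.
row_list = ["addressbook", "us", "q", "0000000002", "steviejobs4"]
row_tuple = tuple(row_list)

print(sys.getsizeof(row_list), sys.getsizeof(row_tuple))
# Multiplied by ~50 million rows, the per-row saving adds up.
```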
(Jan-22-2020, 04:31 PM)fakka Wrote: [ -> ]Is this a coding issue or a pycharm issue ?
Is there a difference when you run it from the command line rather than from PyCharm?
What libraries do you use? What DB?
There is a built-in module, tracemalloc, which can be used to trace memory allocation.
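A quick sketch of using it (the 100,000-item list here just stands in for the real query results):

```python
import tracemalloc

tracemalloc.start()

rows = [str(i) for i in range(100_000)]  # stand-in for the DB rows

# (current, peak) bytes allocated since start()
current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")

# Top allocation sites, grouped by source line
for stat in tracemalloc.take_snapshot().statistics("lineno")[:3]:
    print(stat)

tracemalloc.stop()
```

The statistics output points at the exact lines doing the allocating, which is more useful than a single process-wide number from Task Manager.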
(Jan-23-2020, 05:14 PM)perfringo Wrote: [ -> ]There is built-in tracemalloc which could be used to trace memory allocation.
Thanks, I will google it.
Is there a suggestion for doing this a different way? These are just raw text fields, and I assume people hit this every day when pulling larger datasets (~1.5GB) from a DB.
I did confirm it's not PyCharm; the same thing happens with plain Python on Linux.
I dumped some data into a text file - roughly 50 million lines:
$ cat /tmp/datadump.txt | wc -l
49568121
$ du -sh /tmp/datadump.txt
2.2G /tmp/datadump.txt
{"metaid": "2018-09-19T18:22:12.577::2c6dba80-31b0-48ab-b40e-560822c46321"}
{"metaid": "addressbook:us:q:0000000002:steviejobs4"}
{"metaid": "addressbook:us:q:0000000002:steviejobs5"}
{"metaid": "addressbook:us:q:0000000007:itcinf1"}
{"metaid": "addressbook:us:q:0000000022:jj"}
{"metaid": "addressbook:us:q:0000000123:test4"}
{"metaid": "addressbook:us:q:0000000183:snake"}
{"metaid": "addressbook:us:q:0000000200:godofthunder3"}
{"metaid": "addressbook:us:q:0000000200:load test2"}
{"metaid": "addressbook:us:q:0000000430:delivery"}
I then ran this ....
if __name__ == '__main__':
    myarray = []
    with open("/tmp/datadump.txt", "r") as f:
        for eachrow in f:
            myarray.append(eachrow)
And monitored memory use at the OS level in another session:
MemFree: 6231624 kB
MemFree: 6231164 kB
MemFree: 6197600 kB
MemFree: 5962440 kB
MemFree: 5623356 kB
MemFree: 5265816 kB
MemFree: 4894304 kB
MemFree: 4530768 kB
MemFree: 4200192 kB
MemFree: 3876088 kB
MemFree: 3544328 kB
MemFree: 3169744 kB
MemFree: 2871492 kB
MemFree: 2595768 kB
MemFree: 2276404 kB
MemFree: 1986584 kB
MemFree: 1705416 kB
MemFree: 1471016 kB
So free memory went from 6.2GB down to 1.5GB, i.e. ~4.7GB consumed for a 2.2GB dataset.
Any ideas here?
Not sure how to take this any further.
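Much of the gap is fixed per-object cost: in CPython every str carries tens of bytes of header on top of its characters, and the list adds an 8-byte reference per element, so ~50 million short rows cost several GB beyond the raw text. Two ways around that, sketched below under the assumption the rows come from the dump file: stream them with a generator when one pass is enough, or pack them into a single bytes blob with an offsets array when random access is needed.

```python
from array import array

def iter_rows(path):
    # One pass, one line in memory at a time; footprint stays flat.
    with open(path, "rb") as f:
        for line in f:
            yield line.rstrip(b"\n")

class RowStore:
    """All rows packed into one bytes blob, indexed by an offsets array.

    Per-row cost is one 8-byte offset instead of a full str object
    plus a list reference.
    """
    def __init__(self, lines):
        parts = []
        offsets = array("q", [0])  # machine ints, no per-element objects
        pos = 0
        for line in lines:
            parts.append(line)
            pos += len(line)
            offsets.append(pos)
        self.blob = b"".join(parts)
        self.offsets = offsets

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        return self.blob[self.offsets[i]:self.offsets[i + 1]]

store = RowStore([b"row-a", b"row-bb", b"row-ccc"])
print(len(store), store[1])  # 3 b'row-bb'
```

Building the store as RowStore(iter_rows("/tmp/datadump.txt")) keeps peak memory near the raw file size, since no per-row str objects are ever materialized.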