Python Forum
huge list of whole numbers - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: huge list of whole numbers (/thread-18817.html)



huge list of whole numbers - Skaperen - Jun-02-2019

i have a massively huge list of numbers and i am trying to think of efficient ways to process these with Python. there are about 360 billion numbers in this list. i cannot store them all in virtual memory (obviously). fortunately the numbers are already sorted. there are known to be many ranges where the numbers leave no gaps, or few gaps, and many more ranges where there are no numbers, or very few. i am wanting to scan these numbers sequentially to determine if i could encode the ranges somehow to compact them enough to keep them in memory for other processing that accesses them in random order to test if a given number is in the list or not. are there any existing object classes that could do this? i don't know how long these ranges could span but i do know that the numbers go up to around 2**48 in value. they are currently stored compressed on an external 2 TB USB hard drive taking up about 1 TB. uncompressed it is one number per each 8 bytes in little-endian binary and is more than 2 TB.


RE: huge list of whole numbers - ichabod801 - Jun-02-2019

https://stackoverflow.com/questions/283299/best-compression-algorithm-for-a-sequence-of-integers


RE: huge list of whole numbers - perfringo - Jun-02-2019

(Jun-02-2019, 06:56 AM)Skaperen Wrote: to keep them in memory for other processing that accesses them in random order to test if a given number is in the list or not

I have no experience with such huge lists but my simplistic approach would be to convert list of integers into list of ranges (as list is sorted) and then using built-in any() to find whether number in list:

>>> lst = [range(1, 1000000000), range(2000000000, 18000000000)]
>>> any(True for r in lst if 12345 in r)
True
>>> any(True for r in lst if 1100000000 in r)
False



RE: huge list of whole numbers - Skaperen - Jun-02-2019

yes, i was thinking to convert a cluster into a range and subranges or a bitmap for exceptions. but whether range of on followed by range of off followed by range of on is better vs. big range of on and a subrange of exceptions for off. the latter case may be more efficient if it does not go deep. then to figure out if subranges are so small that a bitmap would be better. i may well do this in Python as a prototype then redo it in C.