 Which is Faster, List or SQL
#1
I have a list of over 150,000 URLs.
Each time the code runs, it selects the next URL in the list and then generates a few URLs from it.

It then has to compare these generated URLs against every other URL in the list to make sure they are not duplicates.

Simply put: which would be faster, a list in Python or an SQL database? Or is there an alternative that is much faster?

There may be a few million items in the list eventually, and it will keep growing every time the code runs.
#2
If there isn't a special ordering, a set (instead of a list) should work very well.
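For example, a minimal sketch of the set-based membership check (the URLs and the `add_if_new` helper are illustrative, not from the original code):

```python
# Membership tests on a set are O(1) on average; on a list they are O(n),
# which is what makes repeatedly scanning a 150k+ item list slow.
known = {"http://example.com/a", "http://example.com/b"}

def add_if_new(url, pool):
    """Add url to pool; return True only if it was not seen before."""
    if url in pool:
        return False
    pool.add(url)
    return True

print(add_if_new("http://example.com/c", known))  # True: new URL
print(add_if_new("http://example.com/a", known))  # False: already present
```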
Feel like you're not getting the answers you want? Check out the help/rules for things like what to include/not include in a post, how to use code tags, how to ask smart questions, and more.

Pro-tip - there's an inverse correlation between the number of lines of code posted and my enthusiasm for helping with a question :)
#3
This is an interesting question. A list and an RDBMS are two different animals.

The data must already be in a format that you can load into a list, which would be a lot less work than creating a database, and would be faster, although I don't think that matters, as you will only use one URL at a time, doing some work with it.

Fetching from a list (or table) will be a trivial part of the whole operation.

A database table has advantages if the data has to be modified, because the data you originally load stays in place until it is modified, added to, or deleted through SQL.
#4
(Mar-16-2017, 06:33 PM)micseydel Wrote: If there isn't a special ordering, a set (instead of a list) should work very well.

Isn't there a hard limit on Python lists or sets?
I will be using this on ARM CPUs with about 1GB-2GB of RAM, and the list can grow from 150k to over 1 million in less than an hour. It needs to run for at least 8 hours non-stop (over 100 million items, easily).
#5
I ran the experiment on my machine without issues:
Output:
>>> count = 100000000
>>> print "{:,}".format(count)
100,000,000
>>> myset = set(xrange(count))
>>> myset = set(xrange(count * 10))
>>>
If you're concerned, you should try running the experiment as well (though be mindful of the memory, swap, and thrashing issues).
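On Python 3 the same experiment reads slightly differently (`xrange` is gone and `print` is a function); it is scaled down here so it finishes quickly, so raise `count` to probe your own machine's limits:

```python
# Python 3 rewrite of the Python 2 session above; increase count to stress
# memory the way the original 100,000,000-element run does.
count = 1_000_000
print("{:,}".format(count))   # 1,000,000
myset = set(range(count))
print(len(myset))             # 1000000
```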

Another alternative: https://github.com/Ezibenroc/PyRoaringBitMap

Lists will not be efficient though, and a database seems unnecessary unless you *really* run out of memory (in which case a list doesn't work anyway).
#6
Numpy/SciPy has very efficient arrays with a low footprint, but I can't tell you more because I barely know it.
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Quote
#7
An array doesn't support the kind of lookup desired here though.
#8
With 100 million URLs you can forget about using a list/set on a 1-2GB ARM board. A string takes roughly 40-plus bytes plus its length, and a list/set needs space per item too, so you are probably looking at 100+ bytes per URL, roughly 10GB+ in total. Even micseydel's set of 100,000,000 integers took 7.1GB of RAM on my PC, so I wonder what monster he uses to fit a set of 1,000,000,000 integers into memory/swap.
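The per-URL estimate above can be sanity-checked empirically; a rough Python 3 sketch (exact numbers vary by interpreter build and platform, so treat the output as an estimate only):

```python
import sys

# Approximate memory cost of holding URLs in a set: the set's own slot
# overhead plus each string object it points to.
urls = {"http://example.com/page/{}".format(i) for i in range(100_000)}
set_overhead = sys.getsizeof(urls) / len(urls)                # bytes per set slot
avg_string = sum(sys.getsizeof(u) for u in urls) / len(urls)  # bytes per string
print("~{:.0f} B set overhead + ~{:.0f} B per string".format(set_overhead, avg_string))
```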

Maybe you can "partition" your list somehow and process it in smaller batches?
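A hypothetical sketch of that partitioning idea: route each URL to one of N buckets by a stable hash, so each bucket can be deduplicated on its own and the whole pool never has to sit in memory at once (in practice each bucket would live in its own file):

```python
import hashlib

N_BUCKETS = 16

def bucket_of(url):
    # md5 is stable across runs, unlike built-in hash(), which Python
    # salts per process; only the first digest byte is needed here.
    return hashlib.md5(url.encode("utf-8")).digest()[0] % N_BUCKETS

buckets = [set() for _ in range(N_BUCKETS)]

def add_if_new(url):
    """Record url in its bucket; return True if it was not seen before."""
    b = buckets[bucket_of(url)]
    if url in b:
        return False
    b.add(url)
    return True

print(add_if_new("http://example.com/x"))  # True
print(add_if_new("http://example.com/x"))  # False
```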
#9
(Mar-16-2017, 09:13 PM)zivoni Wrote: Maybe you can "partition" your list somehow and process it in smaller batches?

map/reduce with external computers?  *giddy excitement*

If you're doing this more than once, then you're probably storing the data somewhere.  Like a file maybe?  If you're reading it every time, adding new things, and repeating, every single day, with whatever you generate needing to be available for the future... I think you should go with a db.  Even something small like sqlite.  Then the db can handle picking a few at random for you to use to generate new urls, you can quickly check if those generated urls already exist, add them to the existing tables, and move on.  

This sounds a lot like the sort of problem a database is designed to solve.
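A minimal SQLite sketch of that approach (illustrative schema, assuming a single `urls` table with the URL as primary key):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a persistent db
conn.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)")

def add_if_new(conn, url):
    """INSERT OR IGNORE collapses 'does it exist?' and 'store it' into one step."""
    cur = conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
    return cur.rowcount == 1  # 1 row inserted -> new; 0 -> duplicate ignored

print(add_if_new(conn, "http://example.com/a"))  # True: stored
print(add_if_new(conn, "http://example.com/a"))  # False: duplicate

# Picking a few stored URLs at random to generate new ones from:
sample = conn.execute("SELECT url FROM urls ORDER BY RANDOM() LIMIT 5").fetchall()
```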
#10
(Mar-16-2017, 09:26 PM)nilamo Wrote: If you're doing this more than once, then you're probably storing the data somewhere. [...] This sounds a lot like the sort of problem a database is designed to solve.

I do store the results in a CSV file; however, I only read it once and simply append any non-matched URLs (though this doesn't always work).

It's OK for the first 1000 URLs, but after that it starts missing a few and duplicating a few. I run a second script that uses sets to clean it up afterwards, and the list dropped from 154k to 150k when I last ran it.

I will look into creating a db, as it will be easier to access multiple times than a file.

To give you more of the scope: this is for a demo of network usage on a cluster computer. I am going to use 10-20 Raspberry Pis and generate as much traffic from different sources and protocols as possible.