Python Forum

Full Version: Fastest Method for Querying SQL Server with Python Pandas
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
I am querying a sql table with about 3.5 MM rows and 400 columns - I am estimating it is less than 5 GB in size.

If I use pd.read_sql to select * from this table, it takes many, many hours to execute. (Over 10 hours sometimes....)

I would like to read in the entire table and not cut it into pieces....I know folks will suggest otherwise, I get that. But I am still looking for a better solution that can handle reading in the entire table. We are not talking about terabytes here...

Also, even if I use chunks, the sum of the time the individual chunks take to process is similar amount of time to execute the entire table query.

Is there any alternative approach that would work faster?

Thanks,

Brian
Have you tried to read the table through the DB-API instead of pandas for comparison?
(Jul-31-2020, 08:27 AM)Gribouillis Wrote: [ -> ]Have you tried to read the table through the DB-API instead of pandas for comparison?

I haven't. I will see what I can find based on that suggestion. Are you expecting it to be faster?

Thanks!
Depends.
What SQL server are you using? Is it on the same machine or across the network?

By offloading some of the work and storage to the server, you can see a performance boost.
BuJayBelvin Wrote:Are you expecting it to be faster?
Yes. In fact I'm a complete begginner in pandas but I read that pandas uses sqlalchemy and I have some bad memories of sqlalchemy being much slower than the DB-API in certain situations. It could really change something.
(Jul-31-2020, 06:00 PM)Gribouillis Wrote: [ -> ]
BuJayBelvin Wrote:Are you expecting it to be faster?
Yes. In fact I'm a complete begginner in pandas but I read that pandas uses sqlalchemy and I have some bad memories of sqlalchemy being much slower than the DB-API in certain situations. It could really change something.

Thanks! That's good to know. I am limited to SQL Server 2005 (which is insane) and I am querying the remove server from home. I assume the connection causes a portion of the delay but I think there may be more efficient solutions other than read_sql on pandas....hence, the inquiry. I need to look into the DB-API recommendation....

Thanks!
You could perhaps use something such as pyodbc, there is an example here, even with a pandas dataframe. Again, I'm only guessing, but pyodbc implements the db-api.
Different database than what you are using, but HEREis a performance comparison. Pandas is convenient but slow for such things.