Python Forum
Python Based ETL Using Pandas
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Python Based ETL Using Pandas
#1
Hello all,

I am currently using a Python based ETL which uses Pandas.

I won't post all my code here because most of it isn't relevant, but I will briefly explain my process.

Step 1: Getting the source data
query = pd.read_sql_query(
'''
    SELECT * FROM SourceTable
''', se)
source = pd.DataFrame(query)
Step 2: Get the IDs from the target data
query = pd.read_sql_query(
'''
    SELECT Id FROM TargetTable
''', te)
target = pd.DataFrame(query)
Step 3: Check for new rows
new = source.merge(target.drop_duplicates(), on=['Id'], how='left', indicator=True)
new = new[new['_merge']=='left_only']
Step 4: Insert new rows into TargetTable
new.to_sql(
    'TargetTable',
    te,
    schema = 'dbo',
    if_exists = 'append',
    chunksize = 1000,
    index = False
)
The process goes on to include updates and deletes.

It's worth pointing out here that there is no "rowversion" or "datetime" columns on the source tables, so it's not possible to identify new/changed rows at source. As far as I know it's necessary to bring all the data into memory to calculate what's new, changed, or deleted. This all works fine on small tables. However, it's all starting to become a problem because when the source tables has over 2 million rows, the Python script completely fails, as shown below.

Quote:ConnectException: Connection refused (Connection refused)
Error while obtaining a new communication channel

ConnectException error: This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.

After some investigation work, this error is not caused by the connection being refused because it works fine on 1.5 million rows. 1 million rows of data is roughly 160 MB in RAM, and even 10 times this wouldn't exceed the server's memory capacity. As some of the tables have tens of millions of rows, I need a way to move this data quickly and successfully.

Any suggestions?
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Pandas Dataframe Filtering based on rows mvdlm 0 1,430 Apr-02-2022, 06:39 PM
Last Post: mvdlm
  update values in one dataframe based on another dataframe - Pandas iliasb 2 9,247 Aug-14-2021, 12:38 PM
Last Post: jefsummers
  Pandas Data frame column condition check based on length of the value aditi06 1 2,689 Jul-28-2021, 11:08 AM
Last Post: jefsummers
  Pandas - Dynamic column aggregation based on another column theroadbacktonature 0 3,055 Apr-17-2020, 04:54 PM
Last Post: theroadbacktonature
  Pandas fillna based on conditions amyd 1 15,886 May-03-2019, 11:27 AM
Last Post: scidam
  Splitting values in column in a pandas dataframe based on a condition hey_arnold 1 4,168 Jul-24-2018, 02:18 PM
Last Post: hey_arnold

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020