Python Forum

Full Version: Two dataframes merged
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Pages: 1 2
Situation:
Learning Pandas, using Nederlands OV CHipkaart data (mine) for a month to test pandas and practice
Currently can read the file into a dataframe.
Quote:Date, Check-in, Departure, Check-out, Destination, Amount, Transaction, Class, Product, Comments, Name, Card number
I am splitting the data into two dataframes, one for checking in, one for checking out.
Quote:Date, Time, Departure, Type
Date, Time, Destination, Cost, Type
My code...
import os
import re
import datetime as dt
import numpy
import pandas as pd

data = None
try:
    df = pd.read_csv('privatefiles/t.csv',';')
    #df = pd.read_csv('...','\t')
    #xlsx = Excel_File('t.xlsx')
    #df = pd.read_excel(xlsx,0)
    #print(df.head(3))
    #print(len(df))
except Exception as err:
    print('Problem with reading the file.<br/> '+err)

checkin = df[df['Transaction']=='Check-in'].rename(columns={'Check-in':'Time'})
checkin = checkin[['Date','Time','Departure']].reset_index()

checkout = df[df['Transaction']=='Check-out']
checkout = checkout[['Date','Check-out','Destination','Amount']].rename(columns={'Check-out':'Time','Amount':'Cost'}).reset_index()
#checkout['Type','Departure','Duration']=['Check-out','',0.]

cleaned = pd.DataFrame(columns=['Type','Date','Time','Departure','Destination','Cost','Duration','IsWork'])

cleaned += checkin
Complexity:
I'm still learning so working through this step by step.
Just as I get to understand I lose or forget how.

Problem:
I cannot seem to get the two dataframes to merge.
Can you suggest a way, or several ways to do this better?
or point to a reference to do this better?
At first glance it seems as though you can't get the dataframes to merge because you don't have a line in your script that calls for them to be merged. Read this article and see if it helps you out. https://chrisalbon.com/python/data_wrang...dataframe/
updated python coding... still doesnt work.... :)
import os
import re
import datetime as dt
import numpy
import pandas as pd

data = None
try:
    df = pd.read_csv('privatefiles/t.csv',';')
    #df = pd.read_csv('...','\t')
    #xlsx = Excel_File('t.xlsx')
    #df = pd.read_excel(xlsx,0)
    print(df.head(3))
    print(len(df))
except Exception as err:
    print('Problem with reading the file.<br/> '+err)


checkin = df[df['Transaction']=='Check-in']
checkin = checkin.rename(columns={'Check-in':'time','Date':'date','Departure':'departure'}).reset_index()
checkin = checkin[['date','time','departure']]
checkin['type']='checkin'
#checkin.head(5)

checkout = df[df['Transaction']=='Check-out']
checkout = checkout.rename(columns={'Check-out':'time','Amount':'cost','Destination':'arrival','Date':'date'}).reset_index()
checkout = checkout[['date','time','arrival','cost']]
checkout['type']='checkout'
#checkout.head(5)

cleaned = pd.DataFrame(columns=['type','date','time','departure','arrival','cost','duration','iswork'])
#cleaned

cleaned = pd.concat( [cleaned,checkin] , ignore_index=False, sort=False)
#cleaned.head(5)

cleaned.sort_values(by=['date','time'],ascending=[True,False],inplace=True)
checkout.sort_values(by=['date','time'],ascending=[True,True],inplace=True)
for i,r in checkout.iterrows():
    print(i)
    print(r)
    print('r - date - '+r['date'])
    print(cleaned.loc(['date']==r['date'] and ['time']<r['time']))
    break
#cleaned.head(4)
#cleaned.loc(['date']==r['date'])
Is your goal to combine the data in the "checkin" and "checkout" frames? If that is the case, first off use the "merge" function - not "concat". Also, when using "merge" you need to specify which column you want to merge on (identify the column with identical values in the "checkin" and "checkout" frame).

If you attach the file you're working with I'll write out the script for you but it would be more helpful long term if you understand the concepts yourself.
Ahh almost got it... but not quite...
import os
import re
import datetime as dt
import numpy
import pandas as pd

data = None
try:
    df = pd.read_csv('privatefiles/t.csv',';')
    #df = pd.read_csv('...','\t')
    #xlsx = Excel_File('t.xlsx')
    #df = pd.read_excel(xlsx,0)
    print(df.head(3))
    print(len(df))
except Exception as err:
    print('Problem with reading the file.<br/> '+err)


checkin = df[df['Transaction']=='Check-in']
checkin = checkin.rename(columns={'Check-in':'dtime','Date':'date','Departure':'departure'}).reset_index()
checkin = checkin[['date','dtime','departure']]
#checkin.head(5)

checkout = df[df['Transaction']=='Check-out']
checkout = checkout.rename(columns={'Check-out':'atime','Amount':'cost','Destination':'arrival','Date':'date'}).reset_index()
checkout = checkout[['date','atime','arrival','cost']]
#checkout.head(5)

cleaned = pd.DataFrame(columns=['date','dtime','departure','atime','arrival','cost','duration','iswork'])
#cleaned

cleaned = pd.concat( [cleaned,checkin] , ignore_index=False, sort=False)
#cleaned.head(5)

cleaned.sort_values(by=['date','dtime'],ascending=[True,True],inplace=True)
checkout.sort_values(by=['date','atime'],ascending=[True,True],inplace=True)
#print(cleaned.head(5))
#print(checkout.head(3))

for i,r in checkout.iterrows():
    print(cleaned['date'][i])
    print(r['date'])
    print(cleaned['dtime'][i])
    print(r['atime'])
    f = cleaned[(cleaned['date']==r['date']) & (cleaned['dtime']<cleaned['atime'])]
    print(f)
    break
#cleaned.head(4)
Quote:04-05-2019
04-05-2019
09:53
10:02
Empty DataFrame
Columns: [date, dtime, departure, atime, arrival, cost, duration, iswork]
Index: []
I am obviously doing something wrong in the filter on the pandas. but the result is an empty dataframe... :/
Oh, and if it does find one (with no data already in) would it be updatable ?
Or do I need to update and send to another output dataframe ?

(Jun-06-2019, 07:02 PM)chisox721 Wrote: [ -> ]Is your goal to combine the data in the "checkin" and "checkout" frames? If that is the case, first off use the "merge" function - not "concat". Also, when using "merge" you need to specify which column you want to merge on (identify the column with identical values in the "checkin" and "checkout" frame).

If you attach the file you're working with I'll write out the script for you but it would be more helpful long term if you understand the concepts yourself.
Hi

I cannot use merged as you require a unique field. I do not have one.
I am trying to get the check out for each check in to produce a trip/journey, duration, cost etc...
But I need the check out that is immediately after the checkin (assuming I have checked in and out properly). There is a small chance I checked in and forgot to check out... Unlikely but possible.

Any ideas on why its not bringing back at least one record when filtered on date and time ?
Weirdly if I use one or the other to filter the dataframe, it runs. But both together and it errors.
for i,r in checkout.iterrows():
    ind = -1
    f = cleaned[cleaned['atime'].isnull()]
    print('1',len(f))
    #print(f.head(4))
    if f is not None:
        f = f[f['date']==r['date']]
    print('2',len(f))
    #print(f.head(4))
    if f is not None:
        f = f[f['dtime']<r['atime']]
    print('3',len(f))
    #print(f.head(4))
    if f is not None:
        ind = iloc(f)
    print('4',f.head(4))
    if ind>=0:
        print(ind)
        cleaned[ind]['atime','arrival','cost']=r['atime','arrival','cost']
    break
Still testing - this almost worked but errors as it cannot find the index of the cleaned to update the values...
Ok. Managed to get the index back and find the original row via the index. However it is still not doing it (still erroring) so I must still be doing it wrong.... any ideas (I have a headache now)
Final loop is now :
for i,r in checkout.iterrows():
    ind = -1
    f = cleaned[cleaned['atime'].isnull()]
    if f is not None:
        f = f[f['date']==r['date']]
    if f is not None:
        f = f[f['dtime']<r['atime']]
    if f is not None:
        cleaned.iloc[f.index[0]]['atime','arrival','cost']=r['atime','arrival','cost']
for i,r in checkout.iterrows():
    ind = -1
    f = cleaned[cleaned['atime'].isnull()]
    if f is not None:
        f = f[f['date']==r['date']]
    if f is not None:
        f = f[f['dtime']<r['atime']]
    if (f is not None) & (len(f)>0):
        ind = f.index[0]
    if ind is not None:
        cleaned.iloc[ind][['atime','arrival','cost']] = r[['atime','arrival','cost']]
        #lower three lines combined into one above
        #cleaned.iloc[ind]['atime']   = r['atime']
        #cleaned.iloc[ind]['arrival'] = r['arrival']
        #cleaned.iloc[ind]['cost']    = r['cost']
Updated the dataframe - although with iterations...
(Jun-06-2019, 08:52 PM)Ecniv Wrote: [ -> ]Hi

I cannot use merged as you require a unique field. I do not have one.
I am trying to get the check out for each check in to produce a trip/journey, duration, cost etc...
But I need the check out that is immediately after the checkin (assuming I have checked in and out properly). There is a small chance I checked in and forgot to check out... Unlikely but possible.

Any ideas on why its not bringing back at least one record when filtered on date and time ?

I'm still not really understanding what you're looking to accomplish and your codes are very confusing. One question I have is how do you not have a unique column to merge on? I would think that if you are checking in/checking out of a hotel that would mean you have a corresponding transaction number or something similar that could be used to properly identify the trip. Even a separate column that includes a reference to the specific trip like "SpainVacation" would work. That is unless I'm completely wrong in my understanding of what trying to accomplish.

Like I said before it would be much easier for me to help you if I could see the contents of the file you're working with - or at least a file with a similar structure.
Pages: 1 2