Posts: 107
Threads: 42
Joined: Oct 2016
May-04-2017, 02:01 AM
(This post was last modified: May-04-2017, 06:21 AM by klllmmm.)
I have a dataset from which I need to select specific rows.
1. Select:
1. If DrAccount or CrAccount contains 9000
e.g.
- If a row has DrAccount 1500 & CrAccount 9000, check whether there is another row with DrAccount 9000 & CrAccount 1500 (like a reversal entry).
- If so, select both rows.
- If there are many rows with DrAccount 9000 & CrAccount 1500, select the one with the same amount whose datetime is closest to the main row under consideration.
- If there is no reversal entry, ignore the main row.
- Then go to the next row, but skip it if it was already selected as a reversal entry.
2. If neither DrAccount nor CrAccount contains 9000
e.g.
- If a row has DrAccount 5000 & CrAccount 2000, check whether there is another row with DrAccount 9000 & CrAccount 5000 (like a reversal entry from 9000).
- If so, select both rows.
- If there are many rows with DrAccount 9000 & CrAccount 5000, select the one with the same amount whose datetime is closest to the main row under consideration.
- If there is no reversal entry, ignore the main row.
- Then go to the next row, but skip it if it was already selected as a reversal entry.
Sort - keep the selected transaction pairs in the order they were selected.
I'm not sure how to create a function to do this. I hope someone can help.
My dataset (a CSV file is also attached):
DateTime        Amount  Ref      DrCode  CrCode
01-01-17 16:30   1000   Deduct    1500    9000
01-01-17 16:31   2000   Deduct    1400    9000
02-01-17 08:45   1000   Add       9000    1500
02-01-17 08:45   1000   Add       9000    1400
02-01-17 10:40  50000   Add       9000    3000
02-01-17 16:40   4000   Tranfer   1600    2000
02-01-17 16:41   5000   Tranfer   1700    2000
02-01-17 16:42   9000   Deduct    2000    9000
03-01-17 08:45   4000   Add       9000    1600
03-01-17 08:45   5000   Add       9000    1700
03-01-17 10:48  20000   Deduct    4000    9000
The result I'm expecting is a table similar to this:
DateTime        Amount  Ref      DrCode  CrCode
01-01-17 16:30   1000   Deduct    1500    9000
02-01-17 08:45   1000   Add       9000    1500
01-01-17 16:31   2000   Deduct    1400    9000
02-01-17 08:45   1000   Add       9000    1400
02-01-17 16:40   4000   Tranfer   1600    2000
02-01-17 16:41   5000   Tranfer   1700    2000
03-01-17 08:45   4000   Add       9000    1600
03-01-17 08:45   5000   Add       9000    1700
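The selection rules above could be sketched with a simple loop over the rows (a minimal sketch of my own, not a tested general solution: it corrects the 'Tranfer' typo to 'Transfer', and leaves out the "similar amount" tie-break since the expected output pairs rows with different amounts):

```python
import pandas as pd

# Sample data from the post ('Tranfer' corrected to 'Transfer')
table = pd.DataFrame({
    'DateTime': pd.to_datetime(
        ['01-01-17 16:30', '01-01-17 16:31', '02-01-17 08:45',
         '02-01-17 08:45', '02-01-17 10:40', '02-01-17 16:40',
         '02-01-17 16:41', '02-01-17 16:42', '03-01-17 08:45',
         '03-01-17 08:45', '03-01-17 10:48'], dayfirst=True),
    'Amount': [1000, 2000, 1000, 1000, 50000, 4000, 5000, 9000, 4000, 5000, 20000],
    'Ref': ['Deduct', 'Deduct', 'Add', 'Add', 'Add', 'Transfer',
            'Transfer', 'Deduct', 'Add', 'Add', 'Deduct'],
    'DrCode': [1500, 1400, 9000, 9000, 9000, 1600, 1700, 2000, 9000, 9000, 4000],
    'CrCode': [9000, 9000, 1500, 1400, 3000, 2000, 2000, 9000, 1600, 1700, 9000],
})

def pair_reversals(df):
    """Walk rows in order; pair each main row with the unused reversal row
    (DrCode 9000, CrCode equal to the main row's DrCode) closest in time.
    Main rows without any reversal are dropped."""
    used = set()
    selected = []
    for i, row in df.iterrows():
        if i in used or row['DrCode'] == 9000:
            continue  # skip the 9000 side and rows already taken as reversals
        cand = df[(df['DrCode'] == 9000) &
                  (df['CrCode'] == row['DrCode']) &
                  (~df.index.isin(used))]
        if cand.empty:
            continue  # no reversal entry -> ignore the main row
        j = (cand['DateTime'] - row['DateTime']).abs().idxmin()
        used.update({i, j})
        selected.extend([i, j])  # keep the pair adjacent, in selection order
    return df.loc[selected]

result = pair_reversals(table)
```

On the sample data this reproduces the expected table, with each pair kept adjacent in selection order.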
Attached Files
Dataset1.csv (Size: 442 bytes / Downloads: 69)
Posts: 12,030
Threads: 485
Joined: Sep 2016
Show the code that you have so far, and where you think the problem lies.
Posts: 107
Threads: 42
Joined: Oct 2016
May-07-2017, 07:58 AM
(This post was last modified: May-07-2017, 08:03 AM by klllmmm.)
# dataframe
import pandas as pd
table = pd.DataFrame(data = {'DateTime':['01-01-17 16:30','01-01-17 16:31','02-01-17 08:45','02-01-17 08:45','02-01-17 10:40','02-01-17 16:40','02-01-17 16:41','02-01-17 16:42','03-01-17 08:45','03-01-17 08:45','03-01-17 10:48'],
'Amount':[1000,2000,1000,1000,50000,4000,5000,9000,4000,5000,20000],
'Ref':['Deduct','Deduct','Add','Add','Add','Transfer','Transfer','Deduct','Add','Add','Deduct'],
'DrCode':[1500,1400,9000,9000,9000,1600,1700,2000,9000,9000,4000],
'CrCode':[9000,9000,1500,1400,3000,2000,2000,9000,1600,1700,9000],})
# convert the DateTime field from a string/object to datetime
table['DateTime'] = pd.to_datetime(table['DateTime'])
I'm nowhere near testing the criteria such as DateTime & Amount through a loop.
The code below just gives me rows whose values appear in both columns:
table[table.CrCode.isin(table.DrCode) & table.DrCode.isin(table.CrCode)]
Out[210]:
Amount CrCode DateTime DrCode Ref
0 1000 9000 2017-01-01 16:30:00 1500 Deduct
1 2000 9000 2017-01-01 16:31:00 1400 Deduct
2 1000 1500 2017-02-01 08:45:00 9000 Add
3 1000 1400 2017-02-01 08:45:00 9000 Add
5 4000 2000 2017-02-01 16:40:00 1600 Transfer
6 5000 2000 2017-02-01 16:41:00 1700 Transfer
7 9000 9000 2017-02-01 16:42:00 2000 Deduct
8 4000 1600 2017-03-01 08:45:00 9000 Add
9     5000    1700 2017-03-01 08:45:00    9000      Add
I hope there is a way to do this.
Thanks for your interest in this regard.
Posts: 7,317
Threads: 123
Joined: Sep 2016
May-07-2017, 01:01 PM
(This post was last modified: May-07-2017, 01:01 PM by snippsat.)
Set columns.
df = pd.DataFrame(table, columns=['DateTime', 'Amount', 'Ref', 'DrCode', 'CrCode'])
E.g. convert to pandas datetime64:
In [3]: df['DateTime'] = pd.to_datetime(df['DateTime'])
In [4]: df.dtypes
Out[4]:
DateTime datetime64[ns]
Amount int64
Ref object
DrCode int64
CrCode int64
dtype: object
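One caveat with that conversion (my addition, not from the original posts): the sample dates are day-first ('02-01-17' meaning 2 January), but pd.to_datetime parses these strings month-first by default, which is why the output earlier in the thread shows 2017-02-01 for those rows. Passing dayfirst=True (or an explicit format) avoids the silent swap:

```python
import pandas as pd

s = pd.Series(['01-01-17 16:30', '02-01-17 08:45'])

default = pd.to_datetime(s)                  # month-first: '02-01-17' -> 1 February
dayfirst = pd.to_datetime(s, dayfirst=True)  # day-first:   '02-01-17' -> 2 January

print(default[1])   # 2017-02-01 08:45:00
print(dayfirst[1])  # 2017-01-02 08:45:00
```

An explicit pd.to_datetime(s, format='%d-%m-%y %H:%M') would be even safer, since it fails loudly on anything unexpected.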
Posts: 107
Threads: 42
Joined: Oct 2016
It would be great if someone could help me reproduce the T-SQL join below using a pandas DataFrame join.
SELECT *
FROM [TEST].[dbo].[table] t1
join
[TEST].[dbo].[table] t2
ON (t2.CrCode = t1.DrCode and t2.Amount <= t1.Amount)
WHERE t1.Ref = 'Deduct' or
t1.Ref = 'Transfer' and
t2.Ref = 'ADD' and
t1.Ref != 'Transfer'
Thanks for your interest.
Posts: 331
Threads: 2
Joined: Feb 2017
Pandas does not support a "conditional join". It can be done by using an ordinary (or even Cartesian) join first and then applying your conditions.
In [2]: df = pd.merge(table, table, left_on='DrCode', right_on='CrCode')
In [3]: df.query("Amount_y <= Amount_x and (Ref_x == 'Deduct' or Ref_x=='Transfer' and Ref_y=='Add')")
DataFrame.query is a handy construct for applying a boolean condition to a dataframe. I have ignored the last condition of your SQL query, as
t1.Ref = 'Transfer' and t2.Ref = 'ADD' and t1.Ref != 'Transfer' is always false; likely there was a typo?
BTW, this is much simplified compared to your original task. With original vague conditions like "If there are many rows with CrAccount 9000 & DrAccount 1500, select the similar amount" or "go to next row, but ignore if the row already selected as reversal entry" it would likely need to iterate over your original dataframe, and searching/removing "candidate" row in copy of dataframe while checking all conditions.
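To make that concrete, here is a minimal runnable version of the merge-then-query approach on a four-row slice of the sample data (the slice and variable names are mine):

```python
import pandas as pd

# Four-row slice of the thread's sample data
table = pd.DataFrame({
    'Amount': [1000, 2000, 1000, 1000],
    'Ref': ['Deduct', 'Deduct', 'Add', 'Add'],
    'DrCode': [1500, 1400, 9000, 9000],
    'CrCode': [9000, 9000, 1500, 1400],
})

# Ordinary join on DrCode == CrCode (the equality part of the SQL ON clause)...
df = pd.merge(table, table, left_on='DrCode', right_on='CrCode',
              suffixes=('_x', '_y'))

# ...then the inequality and Ref conditions as a boolean query
df = df.query("Amount_y <= Amount_x and "
              "(Ref_x == 'Deduct' or Ref_x == 'Transfer') and "
              "Ref_y == 'Add'")
```

Each surviving row carries the main entry in the _x columns and its reversal in the _y columns.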
Posts: 107
Threads: 42
Joined: Oct 2016
May-11-2017, 01:46 PM
(This post was last modified: May-11-2017, 01:51 PM by klllmmm.)
Thank you very much for the solution & the points noted.
1. "and t1.Ref != 'Transfer'" was a mistake. It should be (t2.Ref = 'ADD' or t2.Ref = 'Transfer') and (t1.Ref = 'Deduct' or t1.Ref = 'Transfer').
2. I admit that the conditions in my initial question were actually vague.
3. A follow-up / additional question:
In my main dataset (not the 10 test items above), I have to remove some records by keeping only the rows with the lowest datetime difference between DateTime_x & DateTime_y, perhaps by ranking the date differences for each record matched in the left table.
The resulting df below illustrates the scenario.
import pandas as pd
import datetime as dt
table = pd.DataFrame(data = {'ID':[1,2,3,4,5,6,7,8,9,10,11,12],
'DateTime':['01-01-17 16:30','01-01-17 16:31','02-01-17 08:45','03-01-17 08:45','02-01-17 08:45','02-01-17 10:40','02-01-17 16:40','02-01-17 16:41','02-01-17 16:42','03-01-17 08:45','03-01-17 08:45','03-01-17 10:48'],
'Amount':[1000,2000,1000,500,1000,50000,4000,5000,9000,4000,5000,20000],
'Ref':['Deduct','Deduct','Add','Add','Add','Add','Transfer','Transfer','Deduct','Add','Add','Deduct'],
'DrCode':[1500,1400,9000,9000,9000,9000,1600,1700,2000,9000,9000,4000],
'CrCode':[9000,9000,1500,1500,1400,3000,2000,2000,9000,1600,1700,9000],})
table['DateTime'] = table['DateTime'].apply(lambda x:
dt.datetime.strptime(x,'%d-%m-%y %H:%M'))
df = pd.DataFrame(table, columns=['ID','DateTime', 'Amount', 'Ref', 'DrCode', 'CrCode'])
df = pd.merge(df, df, left_on='DrCode', right_on='CrCode', how='inner')
df = df.query("Amount_y <= Amount_x and (Ref_x == 'Deduct' or Ref_x=='Transfer') and (Ref_y=='Add' or Ref_y=='Transfer')")
df['DateDiff'] = df['DateTime_y'] - df['DateTime_x']
I want to keep only the record with the lowest absolute date difference for each 'ID_x', plus all records where 'DrCode_y' is not equal to 9000.
I'm expecting to do something like the SQL below (using MS-SQL 2014):
SELECT *
FROM
(select *
,row_number() over (partition by(ID_x) order by ABS(datediff)) 'DateIDRank'
from df ) as t1
WHERE t1.DateIDRank = 1 or
t1.DrCode_y != 9000
Help on how to do this with pandas is highly appreciated.
Posts: 331
Threads: 2
Joined: Feb 2017
Selecting rows with row_number = 1 over a partition can be emulated with .groupby() followed by .first():
df['absdatediff'] = abs(df.DateDiff)
first_ones = df.reset_index().sort_values('absdatediff').groupby('ID_x').first().set_index('index')
reset_index() and set_index() are used to preserve the original index for joining with the rows where df.DrCode_y == 9000.
To get a dataframe containing all "first_ones" rows and all rows with df.DrCode_y == 9000 without duplicates, pd.Index.union() can be used:
df.loc[pd.Index.union(df[df.DrCode_y == 9000].index, first_ones.index)]
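A small self-contained illustration of the two steps (the toy data and its values are mine; the column names follow the thread):

```python
import pandas as pd

# Toy joined frame: two candidate matches each for ID_x == 1 and ID_x == 2,
# with one row on the DrCode_y == 9000 side
df = pd.DataFrame({
    'ID_x': [1, 1, 2, 2],
    'DrCode_y': [1500, 9000, 1600, 1700],
    'DateDiff': pd.to_timedelta(['1 days', '-2 days', '3 days', '5 days']),
})

# Emulate row_number() over (partition by ID_x order by abs(DateDiff)) = 1:
# sort by the absolute difference, take the first row per ID_x, and
# restore the original index via reset_index()/set_index()
df['absdatediff'] = df['DateDiff'].abs()
first_ones = (df.reset_index()
                .sort_values('absdatediff')
                .groupby('ID_x')
                .first()
                .set_index('index'))

# Union of the per-ID winners with every DrCode_y == 9000 row, no duplicates
keep = df.loc[pd.Index.union(df[df.DrCode_y == 9000].index, first_ones.index)]
```

Here the closest match per ID plus the 9000-side row survive, while the farther non-9000 match for ID_x == 2 is dropped.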
Posts: 107
Threads: 42
Joined: Oct 2016
May-11-2017, 06:58 PM
(This post was last modified: May-12-2017, 01:36 AM by klllmmm.)
Thanks again for the effort & guidance.
I only changed "==" to "!="; the rest of the solution is perfect.
(May-11-2017, 06:10 PM)zivoni Wrote: df.DrCode_y == 9000