Posts: 107
Threads: 42
Joined: Oct 2016
May-04-2017, 02:01 AM
(This post was last modified: May-04-2017, 06:21 AM by klllmmm.)
I have a dataset from which I need to select specific rows.
1. Select:
1. If DrAccount or CrAccount contains 9000
e.g.
- If a row has DrAccount 1500 & CrAccount 9000, check whether there is another row with DrAccount 9000 & CrAccount 1500 (like a reversal entry).
- If so, select both rows.
- If there are many rows with DrAccount 9000 & CrAccount 1500, select the one with the same amount whose datetime is closest to the main row under consideration.
- If there is no reversal entry, ignore the main row.
- Then go to the next row, but skip it if it was already selected as a reversal entry.
2. If neither DrAccount nor CrAccount contains 9000
e.g.
- If a row has DrAccount 5000 & CrAccount 2000, check whether there is another row with DrAccount 9000 & CrAccount 5000 (like a reversal entry from 9000).
- If so, select both rows.
- If there are many rows with DrAccount 9000 & CrAccount 5000, select the one with the same amount whose datetime is closest to the main row under consideration.
- If there is no reversal entry, ignore the main row.
- Then go to the next row, but skip it if it was already selected as a reversal entry.
Sort - keep the selected transaction pairs in the order they were selected.
I'm not sure how to create a function to do this. I hope someone can help.
My dataset (a CSV file is also attached):
DateTime        Amount  Ref      DrCode  CrCode
01-01-17 16:30   1000   Deduct    1500    9000
01-01-17 16:31   2000   Deduct    1400    9000
02-01-17 08:45   1000   Add       9000    1500
02-01-17 08:45   1000   Add       9000    1400
02-01-17 10:40  50000   Add       9000    3000
02-01-17 16:40   4000   Tranfer   1600    2000
02-01-17 16:41   5000   Tranfer   1700    2000
02-01-17 16:42   9000   Deduct    2000    9000
03-01-17 08:45   4000   Add       9000    1600
03-01-17 08:45   5000   Add       9000    1700
03-01-17 10:48  20000   Deduct    4000    9000
The result I'm expecting is a table similar to this:
DateTime        Amount  Ref      DrCode  CrCode
01-01-17 16:30   1000   Deduct    1500    9000
02-01-17 08:45   1000   Add       9000    1500
01-01-17 16:31   2000   Deduct    1400    9000
02-01-17 08:45   1000   Add       9000    1400
02-01-17 16:40   4000   Tranfer   1600    2000
02-01-17 16:41   5000   Tranfer   1700    2000
03-01-17 08:45   4000   Add       9000    1600
03-01-17 08:45   5000   Add       9000    1700
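The selection rules above could be sketched with a simple loop over the rows (a minimal sketch of my own, not a tested general solution: it corrects the 'Tranfer' typo to 'Transfer', and leaves out the "similar amount" tie-break since the expected output pairs rows with different amounts):

```python
import pandas as pd

# Sample data from the post ('Tranfer' corrected to 'Transfer')
table = pd.DataFrame({
    'DateTime': pd.to_datetime(
        ['01-01-17 16:30', '01-01-17 16:31', '02-01-17 08:45',
         '02-01-17 08:45', '02-01-17 10:40', '02-01-17 16:40',
         '02-01-17 16:41', '02-01-17 16:42', '03-01-17 08:45',
         '03-01-17 08:45', '03-01-17 10:48'], dayfirst=True),
    'Amount': [1000, 2000, 1000, 1000, 50000, 4000, 5000, 9000, 4000, 5000, 20000],
    'Ref': ['Deduct', 'Deduct', 'Add', 'Add', 'Add', 'Transfer',
            'Transfer', 'Deduct', 'Add', 'Add', 'Deduct'],
    'DrCode': [1500, 1400, 9000, 9000, 9000, 1600, 1700, 2000, 9000, 9000, 4000],
    'CrCode': [9000, 9000, 1500, 1400, 3000, 2000, 2000, 9000, 1600, 1700, 9000],
})

def pair_reversals(df):
    """Walk rows in order; pair each main row with the unused reversal row
    (DrCode 9000, CrCode equal to the main row's DrCode) closest in time.
    Main rows without any reversal are dropped."""
    used = set()
    selected = []
    for i, row in df.iterrows():
        if i in used or row['DrCode'] == 9000:
            continue  # skip the 9000 side and rows already taken as reversals
        cand = df[(df['DrCode'] == 9000) &
                  (df['CrCode'] == row['DrCode']) &
                  (~df.index.isin(used))]
        if cand.empty:
            continue  # no reversal entry -> ignore the main row
        j = (cand['DateTime'] - row['DateTime']).abs().idxmin()
        used.update({i, j})
        selected.extend([i, j])  # keep the pair adjacent, in selection order
    return df.loc[selected]

result = pair_reversals(table)
```

On the sample data this reproduces the expected table, with each pair kept adjacent in selection order.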
Attached Files
Dataset1.csv (Size: 442 bytes / Downloads: 69)
Posts: 12,030
Threads: 485
Joined: Sep 2016
Show the code that you have so far, and where you think the problem lies.
Posts: 107
Threads: 42
Joined: Oct 2016
May-07-2017, 07:58 AM
(This post was last modified: May-07-2017, 08:03 AM by klllmmm.)
# dataframe
import pandas as pd
table = pd.DataFrame(data = {'DateTime':['01-01-17 16:30','01-01-17 16:31','02-01-17 08:45','02-01-17 08:45','02-01-17 10:40','02-01-17 16:40','02-01-17 16:41','02-01-17 16:42','03-01-17 08:45','03-01-17 08:45','03-01-17 10:48'],
'Amount':[1000,2000,1000,1000,50000,4000,5000,9000,4000,5000,20000],
'Ref':['Deduct','Deduct','Add','Add','Add','Transfer','Transfer','Deduct','Add','Add','Deduct'],
'DrCode':[1500,1400,9000,9000,9000,1600,1700,2000,9000,9000,4000],
'CrCode':[9000,9000,1500,1400,3000,2000,2000,9000,1600,1700,9000],})
# convert the DateTime field from a string/object to datetime
table['DateTime'] = pd.to_datetime(table['DateTime'])
I'm nowhere near testing the criteria such as DateTime & Amount through a loop.
The code below just gives me rows whose values appear in both columns:
table[table.CrCode.isin(table.DrCode) & table.DrCode.isin(table.CrCode)]
Out[210]:
Amount CrCode DateTime DrCode Ref
0 1000 9000 2017-01-01 16:30:00 1500 Deduct
1 2000 9000 2017-01-01 16:31:00 1400 Deduct
2 1000 1500 2017-02-01 08:45:00 9000 Add
3 1000 1400 2017-02-01 08:45:00 9000 Add
5 4000 2000 2017-02-01 16:40:00 1600 Transfer
6 5000 2000 2017-02-01 16:41:00 1700 Transfer
7 9000 9000 2017-02-01 16:42:00 2000 Deduct
8 4000 1600 2017-03-01 08:45:00 9000 Add
9     5000    1700 2017-03-01 08:45:00    9000      Add
I hope there is a way to do this.
Thanks for your interest in this regard.
Posts: 7,317
Threads: 123
Joined: Sep 2016
May-07-2017, 01:01 PM
(This post was last modified: May-07-2017, 01:01 PM by snippsat.)
Set columns.
df = pd.DataFrame(table, columns=['DateTime', 'Amount', 'Ref', 'DrCode', 'CrCode'])
E.g. convert to pandas datetime64:
In [3]: df['DateTime'] = pd.to_datetime(df['DateTime'])
In [4]: df.dtypes
Out[4]:
DateTime datetime64[ns]
Amount int64
Ref object
DrCode int64
CrCode int64
dtype: object
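One caveat with that conversion (my addition, not from the original posts): the sample dates are day-first ('02-01-17' meaning 2 January), but pd.to_datetime parses these strings month-first by default, which is why the output earlier in the thread shows 2017-02-01 for those rows. Passing dayfirst=True (or an explicit format) avoids the silent swap:

```python
import pandas as pd

s = pd.Series(['01-01-17 16:30', '02-01-17 08:45'])

default = pd.to_datetime(s)                  # month-first: '02-01-17' -> 1 February
dayfirst = pd.to_datetime(s, dayfirst=True)  # day-first:   '02-01-17' -> 2 January

print(default[1])   # 2017-02-01 08:45:00
print(dayfirst[1])  # 2017-01-02 08:45:00
```

An explicit pd.to_datetime(s, format='%d-%m-%y %H:%M') would be even safer, since it fails loudly on anything unexpected.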
Posts: 107
Threads: 42
Joined: Oct 2016
It would be great if someone could help me reproduce the T-SQL join below using a pandas DataFrame join.
SELECT *
FROM [TEST].[dbo].[table] t1
join
[TEST].[dbo].[table] t2
ON (t2.CrCode = t1.DrCode and t2.Amount <= t1.Amount)
WHERE t1.Ref = 'Deduct' or
t1.Ref = 'Transfer' and
t2.Ref = 'ADD' and
t1.Ref != 'Transfer'
Thanks for your interest.
Posts: 331
Threads: 2
Joined: Feb 2017
Pandas does not support a "conditional join". It can be done by using an ordinary (or even Cartesian) join first and then applying your conditions.
In [2]: df = pd.merge(table, table, left_on='DrCode', right_on='CrCode')
In [3]: df.query("Amount_y <= Amount_x and (Ref_x == 'Deduct' or Ref_x=='Transfer' and Ref_y=='Add')")
DataFrame.query is a handy construct for applying a boolean condition to a dataframe. I have ignored the last condition of your SQL query, as
t1.Ref = 'Transfer' and t2.Ref = 'ADD' and t1.Ref != 'Transfer' is always false; likely there was a typo?
BTW, this is much simplified compared to your original task. With original vague conditions like "If there are many rows with CrAccount 9000 & DrAccount 1500, select the similar amount" or "go to next row, but ignore if the row already selected as reversal entry" it would likely need to iterate over your original dataframe, and searching/removing "candidate" row in copy of dataframe while checking all conditions.
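To make that concrete, here is a minimal runnable version of the merge-then-query approach on a four-row slice of the sample data (the slice and variable names are mine):

```python
import pandas as pd

# Four-row slice of the thread's sample data
table = pd.DataFrame({
    'Amount': [1000, 2000, 1000, 1000],
    'Ref': ['Deduct', 'Deduct', 'Add', 'Add'],
    'DrCode': [1500, 1400, 9000, 9000],
    'CrCode': [9000, 9000, 1500, 1400],
})

# Ordinary join on DrCode == CrCode (the equality part of the SQL ON clause)...
df = pd.merge(table, table, left_on='DrCode', right_on='CrCode',
              suffixes=('_x', '_y'))

# ...then the inequality and Ref conditions as a boolean query
df = df.query("Amount_y <= Amount_x and "
              "(Ref_x == 'Deduct' or Ref_x == 'Transfer') and "
              "Ref_y == 'Add'")
```

Each surviving row carries the main entry in the _x columns and its reversal in the _y columns.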
Posts: 107
Threads: 42
Joined: Oct 2016
May-11-2017, 01:46 PM
(This post was last modified: May-11-2017, 01:51 PM by klllmmm.)
Thank you very much for the solution & the points noted.
1. "and t1.Ref != 'Transfer'" was a mistake. It should be (t2.Ref = 'ADD' or t2.Ref = 'Transfer') and (t1.Ref = 'Deduct' or t1.Ref = 'Transfer').
2. I admit that the conditions in my initial question were actually vague.
3. A follow-up / additional question:
In my main dataset (not the 10 test items above), I have to remove some records by keeping only the rows with the lowest datetime difference between DateTime_x & DateTime_y, perhaps by ranking the date differences for each record matched in the left table.
The resulting df below illustrates the scenario.
import pandas as pd
import datetime as dt
table = pd.DataFrame(data = {'ID':[1,2,3,4,5,6,7,8,9,10,11,12],
'DateTime':['01-01-17 16:30','01-01-17 16:31','02-01-17 08:45','03-01-17 08:45','02-01-17 08:45','02-01-17 10:40','02-01-17 16:40','02-01-17 16:41','02-01-17 16:42','03-01-17 08:45','03-01-17 08:45','03-01-17 10:48'],
'Amount':[1000,2000,1000,500,1000,50000,4000,5000,9000,4000,5000,20000],
'Ref':['Deduct','Deduct','Add','Add','Add','Add','Transfer','Transfer','Deduct','Add','Add','Deduct'],
'DrCode':[1500,1400,9000,9000,9000,9000,1600,1700,2000,9000,9000,4000],
'CrCode':[9000,9000,1500,1500,1400,3000,2000,2000,9000,1600,1700,9000],})
table['DateTime'] = table['DateTime'].apply(lambda x:
dt.datetime.strptime(x,'%d-%m-%y %H:%M'))
df = pd.DataFrame(table, columns=['ID','DateTime', 'Amount', 'Ref', 'DrCode', 'CrCode'])
df = pd.merge(df, df, left_on='DrCode', right_on='CrCode', how='inner')
df = df.query("Amount_y <= Amount_x and (Ref_x == 'Deduct' or Ref_x=='Transfer') and (Ref_y=='Add' or Ref_y=='Transfer')")
df['DateDiff'] = df['DateTime_y'] - df['DateTime_x']
I want to keep only the record with the lowest absolute date difference for each 'ID_x', plus all records where 'DrCode_y' is not equal to 9000.
I'm expecting to do something like the SQL below (using MS-SQL 2014):
SELECT *
FROM
(select *
,row_number() over (partition by(ID_x) order by ABS(datediff)) 'DateIDRank'
from df ) as t1
WHERE t1.DateIDRank = 1 or
t1.DrCode_y != 9000
Help on how to do this with pandas is highly appreciated.
Posts: 331
Threads: 2
Joined: Feb 2017
Selecting rows with row_number = 1 over a partition can be emulated with .groupby() followed by .first():
df['absdatediff'] = abs(df.DateDiff)
first_ones = df.reset_index().sort_values('absdatediff').groupby('ID_x').first().set_index('index')
reset_index() and set_index() are used to preserve the original index for joining with the rows where df.DrCode_y == 9000.
To get a dataframe containing all "first_ones" rows and all rows with df.DrCode_y == 9000 without duplicates, pd.Index.union() can be used:
df.loc[pd.Index.union(df[df.DrCode_y == 9000].index, first_ones.index)]
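A small self-contained illustration of the two steps (the toy data and its values are mine; the column names follow the thread):

```python
import pandas as pd

# Toy joined frame: two candidate matches each for ID_x == 1 and ID_x == 2,
# with one row on the DrCode_y == 9000 side
df = pd.DataFrame({
    'ID_x': [1, 1, 2, 2],
    'DrCode_y': [1500, 9000, 1600, 1700],
    'DateDiff': pd.to_timedelta(['1 days', '-2 days', '3 days', '5 days']),
})

# Emulate row_number() over (partition by ID_x order by abs(DateDiff)) = 1:
# sort by the absolute difference, take the first row per ID_x, and
# restore the original index via reset_index()/set_index()
df['absdatediff'] = df['DateDiff'].abs()
first_ones = (df.reset_index()
                .sort_values('absdatediff')
                .groupby('ID_x')
                .first()
                .set_index('index'))

# Union of the per-ID winners with every DrCode_y == 9000 row, no duplicates
keep = df.loc[pd.Index.union(df[df.DrCode_y == 9000].index, first_ones.index)]
```

Here the closest match per ID plus the 9000-side row survive, while the farther non-9000 match for ID_x == 2 is dropped.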
Posts: 107
Threads: 42
Joined: Oct 2016
May-11-2017, 06:58 PM
(This post was last modified: May-12-2017, 01:36 AM by klllmmm.)
Thanks again for the effort & guidance.
I only changed "==" to "!="; the rest of the solution is perfect.
(May-11-2017, 06:10 PM)zivoni Wrote: df.DrCode_y == 9000