Python Forum
Select rows based on a criteria
#1
I have a dataset from which I need to select specific rows.

1. Select:
   1. If DrAccount or CrAccount contains 9000
        e.g.
       - If the row has DrAccount 1500 and CrAccount 9000, check whether there is another row with DrAccount 9000 and CrAccount 1500 (like a reversal entry).
       - If so, select both rows.
       - If there are many rows with DrAccount 9000 and CrAccount 1500, select the one with the same amount that is closest in DateTime to the main row under consideration.
       - If there is no reversal entry, ignore the main row.
       - Then go to the next row, but skip it if it was already selected as a reversal entry.

   2. If neither DrAccount nor CrAccount contains 9000
        e.g.
       - If the row has DrAccount 5000 and CrAccount 2000, check whether there is another row with DrAccount 9000 and CrAccount 5000 (like a reversal entry from 9000).
       - If so, select both rows.
       - If there are many rows with DrAccount 9000 and CrAccount 5000, select the one with the same amount that is closest in DateTime to the main row under consideration.
       - If there is no reversal entry, ignore the main row.
       - Then go to the next row, but skip it if it was already selected as a reversal entry.

Sort: keep the selected transaction pairs in the order in which they were selected.

I'm not sure how to create a function to do this. I hope someone can help.

My dataset (attached as a CSV file):

DateTime        Amount  Ref       DrCode  CrCode
01-01-17 16:30  1000    Deduct    1500    9000
01-01-17 16:31  2000    Deduct    1400    9000
02-01-17 08:45  1000    Add       9000    1500
02-01-17 08:45  1000    Add       9000    1400
02-01-17 10:40  50000   Add       9000    3000
02-01-17 16:40  4000    Transfer  1600    2000
02-01-17 16:41  5000    Transfer  1700    2000
02-01-17 16:42  9000    Deduct    2000    9000
03-01-17 08:45  4000    Add       9000    1600
03-01-17 08:45  5000    Add       9000    1700
03-01-17 10:48  20000   Deduct    4000    9000


The result I'm expecting is a table similar to this:

DateTime        Amount  Ref       DrCode  CrCode
01-01-17 16:30  1000    Deduct    1500    9000
02-01-17 08:45  1000    Add       9000    1500
01-01-17 16:31  2000    Deduct    1400    9000
02-01-17 08:45  1000    Add       9000    1400
02-01-17 16:40  4000    Transfer  1600    2000
02-01-17 16:41  5000    Transfer  1700    2000
03-01-17 08:45  4000    Add       9000    1600
03-01-17 08:45  5000    Add       9000    1700
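The selection steps above can be sketched as a row-by-row matching function. Note that the exact pairing rule here (a reversal is an unused row with DrCode 9000 and CrCode equal to the main row's DrCode, closest in time, with Deduct/Transfer rows treated as main rows) is my reading of the examples, not a confirmed specification:

```python
import pandas as pd

# Sample data from the post above.
table = pd.DataFrame({
    'DateTime': pd.to_datetime(
        ['01-01-17 16:30', '01-01-17 16:31', '02-01-17 08:45',
         '02-01-17 08:45', '02-01-17 10:40', '02-01-17 16:40',
         '02-01-17 16:41', '02-01-17 16:42', '03-01-17 08:45',
         '03-01-17 08:45', '03-01-17 10:48'], format='%d-%m-%y %H:%M'),
    'Amount': [1000, 2000, 1000, 1000, 50000, 4000, 5000, 9000, 4000, 5000, 20000],
    'Ref': ['Deduct', 'Deduct', 'Add', 'Add', 'Add', 'Transfer',
            'Transfer', 'Deduct', 'Add', 'Add', 'Deduct'],
    'DrCode': [1500, 1400, 9000, 9000, 9000, 1600, 1700, 2000, 9000, 9000, 4000],
    'CrCode': [9000, 9000, 1500, 1400, 3000, 2000, 2000, 9000, 1600, 1700, 9000],
})

def select_pairs(df):
    """Pair each Deduct/Transfer row with its closest reversal from 9000."""
    used = set()      # indices already consumed as main or reversal rows
    picked = []       # indices in selection order (main row, then reversal)
    for i, row in df.iterrows():
        if i in used or row['Ref'] == 'Add':
            continue
        # Candidate reversals: posted from account 9000 back to the main
        # row's debit account, and not already paired with another row.
        cand = df[(df['DrCode'] == 9000) &
                  (df['CrCode'] == row['DrCode']) &
                  (~df.index.isin(used)) & (df.index != i)]
        if cand.empty:
            continue  # no reversal entry: ignore the main row
        # Of several candidates, take the one closest in time.
        j = (cand['DateTime'] - row['DateTime']).abs().idxmin()
        used.update([i, j])
        picked.extend([i, j])
    return df.loc[picked]

result = select_pairs(table)   # 4 pairs, 8 rows
```

This keeps each main row next to its reversal, which orders the two Transfer pairs slightly differently from the expected table; an exact amount-matching rule is deliberately left out, since the sample output pairs a 2000 Deduct with a 1000 Add.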






Attached Files

.csv   Dataset1.csv (Size: 442 bytes / Downloads: 69)
#2
Show the code that you have so far, and tell us where you think the problem lies.
#3
# dataframe 
import pandas as pd

table = pd.DataFrame(data = {'DateTime':['01-01-17 16:30','01-01-17 16:31','02-01-17 08:45','02-01-17 08:45','02-01-17 10:40','02-01-17 16:40','02-01-17 16:41','02-01-17 16:42','03-01-17 08:45','03-01-17 08:45','03-01-17 10:48'],
                             'Amount':[1000,2000,1000,1000,50000,4000,5000,9000,4000,5000,20000],
                             'Ref':['Deduct','Deduct','Add','Add','Add','Transfer','Transfer','Deduct','Add','Add','Deduct'],
                             'DrCode':[1500,1400,9000,9000,9000,1600,1700,2000,9000,9000,4000],
                             'CrCode':[9000,9000,1500,1400,3000,2000,2000,9000,1600,1700,9000],})

# convert the DateTime field from a string/object to datetime
table['DateTime']= pd.to_datetime(table['DateTime'])
I'm nowhere near testing the criteria such as DateTime and Amount through a loop.

The code below just gives me rows whose values appear in both columns.
table[table.CrCode.isin(table.DrCode) & table.DrCode.isin(table.CrCode)]
Out[210]: 
   Amount  CrCode            DateTime  DrCode       Ref
0    1000    9000 2017-01-01 16:30:00    1500    Deduct
1    2000    9000 2017-01-01 16:31:00    1400    Deduct
2    1000    1500 2017-02-01 08:45:00    9000       Add
3    1000    1400 2017-02-01 08:45:00    9000       Add
5    4000    2000 2017-02-01 16:40:00    1600  Transfer
6    5000    2000 2017-02-01 16:41:00    1700  Transfer
7    9000    9000 2017-02-01 16:42:00    2000    Deduct
8    4000    1600 2017-03-01 08:45:00    9000       Add
9    5000    1700 2017-03-01 08:45:00    9000       Add
I hope there is a way to do this.

Thanks for your interest in this regard.
#4
Set the column order explicitly:
df = pd.DataFrame(table, columns=['DateTime', 'Amount', 'Ref', 'DrCode', 'CrCode'])
E.g. convert to pandas datetime64:
In [3]: df['DateTime'] = pd.to_datetime(df['DateTime'])
In [4]: df.dtypes
Out[4]:
DateTime    datetime64[ns]
Amount      int64
Ref         object
DrCode      int64
CrCode      int64
dtype:      object 
#5
It would be great if someone could help me reproduce the T-SQL join below using a pandas DataFrame join.

SELECT *
FROM [TEST].[dbo].[table] t1
JOIN [TEST].[dbo].[table] t2
  ON (t2.CrCode = t1.DrCode AND t2.Amount <= t1.Amount)
WHERE t1.Ref = 'Deduct' OR
      t1.Ref = 'Transfer' AND
      t2.Ref = 'ADD' AND
      t1.Ref != 'Transfer'
Thanks for your interest
#6
Pandas does not support a "conditional join". It can be done by using an ordinary (or even Cartesian) join first and then applying your conditions.
Output:
In [2]: df = pd.merge(table, table, left_on='DrCode', right_on='CrCode')

In [3]: df.query("Amount_y <= Amount_x and (Ref_x == 'Deduct' or Ref_x=='Transfer' and Ref_y=='Add')")
DataFrame.query() is a handy construct for applying a boolean condition to a dataframe. I have ignored the last condition of your SQL query, as
t1.Ref = 'Transfer' and t2.Ref = 'ADD' and t1.Ref != 'Transfer'
is always false; likely there was a typo?

BTW, this is much simplified compared to your original task. With the original vague conditions like "If there are many rows with CrAccount 9000 & DrAccount 1500, select the similar amount" or "go to the next row, but ignore if the row was already selected as a reversal entry", you would likely need to iterate over your original dataframe, searching for and removing each "candidate" row in a copy of the dataframe while checking all conditions.
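For completeness, those two commands run end to end like this, assuming the `table` DataFrame built in post #3 (`_x`/`_y` are pandas' default merge suffixes; the DateTime column is omitted here since the query does not use it):

```python
import pandas as pd

# The DataFrame from post #3 (DateTime column omitted for brevity).
table = pd.DataFrame({
    'Amount': [1000, 2000, 1000, 1000, 50000, 4000, 5000, 9000, 4000, 5000, 20000],
    'Ref': ['Deduct', 'Deduct', 'Add', 'Add', 'Add', 'Transfer',
            'Transfer', 'Deduct', 'Add', 'Add', 'Deduct'],
    'DrCode': [1500, 1400, 9000, 9000, 9000, 1600, 1700, 2000, 9000, 9000, 4000],
    'CrCode': [9000, 9000, 1500, 1400, 3000, 2000, 2000, 9000, 1600, 1700, 9000],
})

# Self-join: match each row's DrCode against every row's CrCode (the SQL ON
# clause), then apply the remaining WHERE conditions as a boolean query.
# Note: "and" binds tighter than "or", exactly as in the SQL original.
df = pd.merge(table, table, left_on='DrCode', right_on='CrCode')
matched = df.query("Amount_y <= Amount_x and "
                   "(Ref_x == 'Deduct' or Ref_x == 'Transfer' and Ref_y == 'Add')")
```

On this sample, `matched` contains six joined rows: each Deduct/Transfer row paired with every cheaper-or-equal counter-entry.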
#7
Thank you very much for the solution & the points noted.

1.  "and t1.Ref != 'Transfer'" was a mistake. It has to be (t2.Ref = 'ADD' OR t2.Ref = 'Transfer') AND (t1.Ref = 'Deduct' OR t1.Ref = 'Transfer').
2.  I admit that the conditions in my initial question were vague.
3.  Follow-up question:
    In my main dataset (not the 12 test rows below), I have to remove some records, keeping only the lowest DateTime difference between DateTime_x and DateTime_y, perhaps by ranking the date differences for each record matched in the left table.
    The resulting df below illustrates the scenario.
import pandas as pd
import datetime as dt
 
table = pd.DataFrame(data = {'ID':[1,2,3,4,5,6,7,8,9,10,11,12],
                              'DateTime':['01-01-17 16:30','01-01-17 16:31','02-01-17 08:45','03-01-17 08:45','02-01-17 08:45','02-01-17 10:40','02-01-17 16:40','02-01-17 16:41','02-01-17 16:42','03-01-17 08:45','03-01-17 08:45','03-01-17 10:48'],
                             'Amount':[1000,2000,1000,500,1000,50000,4000,5000,9000,4000,5000,20000],
                             'Ref':['Deduct','Deduct','Add','Add','Add','Add','Transfer','Transfer','Deduct','Add','Add','Deduct'],
                             'DrCode':[1500,1400,9000,9000,9000,9000,1600,1700,2000,9000,9000,4000],
                             'CrCode':[9000,9000,1500,1500,1400,3000,2000,2000,9000,1600,1700,9000],})

table['DateTime'] = table['DateTime'].apply(lambda x: 
                                    dt.datetime.strptime(x,'%d-%m-%y %H:%M'))

df = pd.DataFrame(table, columns=['ID','DateTime', 'Amount', 'Ref', 'DrCode', 'CrCode'])

df = pd.merge(df, df, left_on='DrCode', right_on='CrCode', how='inner')
df = df.query("Amount_y <= Amount_x and (Ref_x == 'Deduct' or Ref_x=='Transfer') and (Ref_y=='Add' or Ref_y=='Transfer')")

df['DateDiff'] = df['DateTime_y']-df['DateTime_x']
I want to keep only the record with the lowest absolute date difference for each 'ID_x', together with the records where 'DrCode_y' is not equal to 9000.

I'm expecting to do something like the SQL below (using MS SQL 2014):
SELECT *
FROM (SELECT *,
             ROW_NUMBER() OVER (PARTITION BY ID_x ORDER BY ABS(DateDiff)) AS 'DateIDRank'
      FROM df) AS t1
WHERE t1.DateIDRank = 1 OR
      t1.DrCode_y != 9000
A help on how to do this with pandas is highly appreciated.

#8
Selecting rows with row_number() = 1 over a partition can be emulated with .groupby() followed by .first():
df['absdatediff'] = abs(df.DateDiff)
first_ones = df.reset_index().sort_values('absdatediff').groupby('ID_x').first().set_index('index')
reset_index() and set_index() are used to preserve the original index for joining with the rows where df.DrCode_y == 9000.

To get a dataframe containing all "first_ones" rows and all rows with df.DrCode_y == 9000 without eventual duplicates, pd.Index.union() can be used:
 df.loc[pd.Index.union(df[df.DrCode_y == 9000].index, first_ones.index)]
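Putting the conditional join from post #7 and the groupby emulation above together, the full pipeline can be sketched as follows; this uses `DrCode_y != 9000`, matching the WHERE clause of the SQL in the question, rather than `== 9000`:

```python
import pandas as pd

# The 12-row DataFrame from post #7.
table = pd.DataFrame({
    'ID': range(1, 13),
    'DateTime': pd.to_datetime(
        ['01-01-17 16:30', '01-01-17 16:31', '02-01-17 08:45',
         '03-01-17 08:45', '02-01-17 08:45', '02-01-17 10:40',
         '02-01-17 16:40', '02-01-17 16:41', '02-01-17 16:42',
         '03-01-17 08:45', '03-01-17 08:45', '03-01-17 10:48'],
        format='%d-%m-%y %H:%M'),
    'Amount': [1000, 2000, 1000, 500, 1000, 50000, 4000, 5000, 9000, 4000, 5000, 20000],
    'Ref': ['Deduct', 'Deduct', 'Add', 'Add', 'Add', 'Add',
            'Transfer', 'Transfer', 'Deduct', 'Add', 'Add', 'Deduct'],
    'DrCode': [1500, 1400, 9000, 9000, 9000, 9000, 1600, 1700, 2000, 9000, 9000, 4000],
    'CrCode': [9000, 9000, 1500, 1500, 1400, 3000, 2000, 2000, 9000, 1600, 1700, 9000],
})

# Conditional join from post #7: self-join plus boolean query.
df = pd.merge(table, table, left_on='DrCode', right_on='CrCode')
df = df.query("Amount_y <= Amount_x and (Ref_x == 'Deduct' or Ref_x == 'Transfer') "
              "and (Ref_y == 'Add' or Ref_y == 'Transfer')").copy()
df['absdatediff'] = (df['DateTime_y'] - df['DateTime_x']).abs()

# row_number() = 1 per ID_x partition: sort by the ranking key, take the
# first row of each group; reset/set_index preserves the merged index.
first_ones = (df.reset_index().sort_values('absdatediff')
                .groupby('ID_x').first().set_index('index'))

# Union with the rows kept unconditionally (DrCode_y != 9000), no duplicates.
keep = df.loc[df[df.DrCode_y != 9000].index.union(first_ones.index)]
```

On this sample the result keeps one best-ranked match per ID_x plus both non-9000 counter-entries for ID 9, six rows in total.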
#9
Thanks again for the effort & guidance.

I only changed "==" to "!="; the rest of the solution is perfect.
(May-11-2017, 06:10 PM)zivoni Wrote: df.DrCode_y == 9000


