Select rows based on a criteria - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: Data Science (https://python-forum.io/forum-44.html) +--- Thread: Select rows based on a criteria (/thread-3189.html) |
Select rows based on a criteria - klllmmm - May-04-2017 I have a dataset which i need to select specific rows. 1. Select- 1. If DrAccount or CrAccount contain 9000 eg- -If the line contain DrAccount 1500 & CrAccount as 9000 check whether it has another row with CrAccount 9000 & DrAccount 1500(like a reversal entry). -If so select both rows. -If there are many rows with CrAccount 9000 & DrAccount 1500, select the similar amount & which is most recent by datetime to the main row considering. -If there are no reversal entry ignore main row. - Then go to next row, but ignore if the row already selected as reversal entry. 2. If DrAccount or CrAccount not contain 9000 eg- -If the line contain DrAccount 5000 & CrAccount as 2000 check whether it has another row with CrAccount 9000 & DrAccount 1500(like a reversal entry from 9000). -If so select both rows. -If there are many rows with CrAccount 9000 & DrAccount 1500, select the similar amount & which is most recent by datetime to the main row considering. - If there are no reversal entry ignore main row. - Then go to next row, but ignore if the row already selected as reversal entry. Sort - Keep the selected transaction paires as it was selected Not sure how to create a function to do this. Hope someone can help on this. My dataset - Attached a CSV file DateTime Amount Ref DrCode CrCode 01-01-17 16:30 1000 Deduct 1500 9000 01-01-17 16:31 2000 Deduct 1400 9000 02-01-17 08:45 1000 Add 9000 1500 02-01-17 08:45 1000 Add 9000 1400 02-01-17 10:40 50000 Add 9000 3000 02-01-17 16:40 4000 Tranfer 1600 2000 02-01-17 16:41 5000 Tranfer 1700 2000 02-01-17 16:42 9000 Deduct 2000 9000 03-01-17 08:45 4000 Add 9000 1600 03-01-17 08:45 5000 Add 9000 1700 03-01-17 10:48 20000 Deduct 4000 9000 Result I'm expecting is a table similar to this DateTime Amount Ref DrCode CrCode 01-01-17 16:30 1000 Deduct 1500 9000 02-01-17 08:45 1000 Add 9000 1500 01-01-17 16:31 2000 Deduct 1400 9000 02-01-17 08:45 1000 Add 9000 1400 02-01-17 16:40 4000 Tranfer 1600 2000 02-01-17 16:41 5000 Tranfer 1700 2000 03-01-17 08:45 4000 Add 9000 1600 03-01-17 08:45 5000 Add 9000 1700 RE: Select rows based on a criteria - Larz60+ - May-04-2017 show the code that you have so far, and where you think the problem lies. RE: Select rows based on a criteria - klllmmm - May-07-2017 # dataframe import pandas as pd table = pd.DataFrame(data = {'DateTime':['01-01-17 16:30','01-01-17 16:31','02-01-17 08:45','02-01-17 08:45','02-01-17 10:40','02-01-17 16:40','02-01-17 16:41','02-01-17 16:42','03-01-17 08:45','03-01-17 08:45','03-01-17 10:48'], 'Amount':[1000,2000,1000,1000,50000,4000,5000,9000,4000,5000,20000], 'Ref':['Deduct','Deduct','Add','Add','Add','Transfer','Transfer','Deduct','Add','Add','Deduct'], 'DrCode':[1500,1400,9000,9000,9000,1600,1700,2000,9000,9000,4000], 'CrCode':[9000,9000,1500,1400,3000,2000,2000,9000,1600,1700,9000],}) # convert the DateTime field from an string/object to datetime table['DateTime']= pd.to_datetime(table['DateTime'])i'm no where near the test of the crterias such as datetime & Amount, through a loop. below code just gives me similar values in two coulmns. table[table.CrCode.isin(table.DrCode) & table.DrCode.isin(table.CrCode)] Out[210]: Amount CrCode DateTime DrCode Ref 0 1000 9000 2017-01-01 16:30:00 1500 Deduct 1 2000 9000 2017-01-01 16:31:00 1400 Deduct 2 1000 1500 2017-02-01 08:45:00 9000 Add 3 1000 1400 2017-02-01 08:45:00 9000 Add 5 4000 2000 2017-02-01 16:40:00 1600 Transfer 6 5000 2000 2017-02-01 16:41:00 1700 Transfer 7 9000 9000 2017-02-01 16:42:00 2000 Deduct 8 4000 1600 2017-03-01 08:45:00 9000 Add 9 5000 1700 2017-03-01 08:45:00 9000 AddI hope there should be a way to do this. Thanks for your interest in this regard. RE: Select rows based on a criteria - snippsat - May-07-2017 Set columns. df = pd.DataFrame(table, columns=['DateTime', 'Amount', 'Ref', 'DrCode', 'CrCode'])[Image: xW66Og.jpg] Eg convert to pandas datetime 64. in [3]: df['DateTime'] = pd.to_datetime(df['DateTime']) In [4]: df.dtypes Out[4]: DateTime datetime64[ns] Amount int64 Ref object DrCode int64 CrCode int64 dtype: object RE: Select rows based on a criteria - klllmmm - May-08-2017 Its great if someone can help me to reproduce the below T-SQL join using panadas dataframe join. SELECT * FROM [TEST].[dbo].[table] t1 join [TEST].[dbo].[table] t2 ON (t2.CrCode = t1.DrCode and t2.Amount <= t1.Amount) WHERE t1.Ref = 'Deduct' or t1.Ref = 'Transfer' and t2.Ref = 'ADD' and t1.Ref != 'Transfer'Thanks for your interest RE: Select rows based on a criteria - zivoni - May-11-2017 Pandas does not support "conditional join". It can be done by using ordinary (or even cartesian) join first and then applying your conditions.
dataframe.query is handy construct to apply boolean condition to a dataframe. I have ignored last condition of your sql query, ast1.Ref = 'Transfer' and t2.Ref = 'ADD' and t1.Ref != 'Transfer'is always false, likely there was some typo? BTW, this is much simplified compared to your original task. With original vague conditions like "If there are many rows with CrAccount 9000 & DrAccount 1500, select the similar amount" or "go to next row, but ignore if the row already selected as reversal entry" it would likely need to iterate over your original dataframe, and searching/removing "candidate" row in copy of dataframe while checking all conditions. RE: Select rows based on a criteria - klllmmm - May-11-2017 Thank you very much for the solution & the points noted. 1. "and t1.Ref != 'Transfer'" was a mistake. It has to be (t2.Ref = 'ADD' or t2.Ref = 'Transfer') and (t1.Ref = 'Deduct' or t1.Ref = 'Transfer') 2. I admit the fact that the conditions were actually vague in my initial quiz. 3. Follow up or additional quiz In my main dataset(not this 10 test items above), i have to take out some records by keeping only lowest datetime difference values between datetime_X & datetime_Y may be by ranking the date differences for each records matched in left table. below resulting df illustrates the scenario. import pandas as pd import datetime as dt table = pd.DataFrame(data = {'ID':[1,2,3,4,5,6,7,8,9,10,11,12], 'DateTime':['01-01-17 16:30','01-01-17 16:31','02-01-17 08:45','03-01-17 08:45','02-01-17 08:45','02-01-17 10:40','02-01-17 16:40','02-01-17 16:41','02-01-17 16:42','03-01-17 08:45','03-01-17 08:45','03-01-17 10:48'], 'Amount':[1000,2000,1000,500,1000,50000,4000,5000,9000,4000,5000,20000], 'Ref':['Deduct','Deduct','Add','Add','Add','Add','Transfer','Transfer','Deduct','Add','Add','Deduct'], 'DrCode':[1500,1400,9000,9000,9000,9000,1600,1700,2000,9000,9000,4000], 'CrCode':[9000,9000,1500,1500,1400,3000,2000,2000,9000,1600,1700,9000],}) table['DateTime'] = table['DateTime'].apply(lambda x: dt.datetime.strptime(x,'%d-%m-%y %H:%M')) df = pd.DataFrame(table, columns=['ID','DateTime', 'Amount', 'Ref', 'DrCode', 'CrCode']) df = pd.merge(df, df, left_on='DrCode', right_on='CrCode', how='inner') df = df.query("Amount_y <= Amount_x and (Ref_x == 'Deduct' or Ref_x=='Transfer') and (Ref_y=='Add' or Ref_y=='Transfer')") df['DateDiff'] = df['DateTime_y']-df['DateTime_x']I want to keep only the records with lowest absolute value of date difference for each 'ID_x' records where the 'DrCode_y' is not equal to 9000. I'm expecting to do some thing like below SQL (using MS-SQL 2014) SELECT * FROM (select * ,row_number() over (partition by(ID_x) order by ABS(datediff)) 'DateIDRank' from df ) as t1 WHERE t1.DateIDRank = 1 or t1.DrCode_y != 9000A help on how to do this with pandas is highly appreciated. RE: Select rows based on a criteria - zivoni - May-11-2017 Selecting rows with row number=1 over partition can be emulated with .groupby() followed by .first() :df['absdatediff'] = abs(df.DateDiff) first_ones = df.reset_index().sort_values('absdatediff').groupby('ID_x').first().set_index('index')reset_index() and set_index is used to preserve original index for joining with rows with df.CrCode_y == 9000 To get dataframe containing all "first_ones" rows and all rows with df.CrCode == 9000 without eventual duplicities pd.Index.union() can be used:df.loc[pd.Index.union(df[df.DrCode_y == 9000].index, first_ones.index)] RE: Select rows based on a criteria - klllmmm - May-11-2017 Thanks again for the effort & guidance. I only changed "==" to "!=", rest of the solution is perfect. (May-11-2017, 06:10 PM)zivoni Wrote: df.DrCode_y == 9000 |