Python Forum
Need Help With Filtering Data For Excel Files Using Pandas
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Need Help With Filtering Data For Excel Files Using Pandas
#1
Hi There,

I am trying to filter Data, for Excel Files in Pandas. Based on the Column Value i.e. String Value.

I managed to get, some data displayed before, but have gone wrong somewhere. As only the Chosen Column Names display, now when I run the Code, in Python 3.7.

Here is the Website Link to the XLS File, if someone could download the File, and tell me where I am going wrong :-

http://web.archive.org/web/2009080423493...6AA506.xls

Here is my latest Code :-

import pandas as pd

data = pd.read_excel (r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')

df = pd.DataFrame(data, columns= ['Venue','A/C','DISPLAY/','Date'])

df[(df['Venue'].str.contains('[a-zA-Z]') & df['A/C'].str.contains('DHS')) & (df['A/C'].str.contains('DSS')) & (df['A/C'].str.contains('DAK')) & (df['A/C'].str.contains('HS')) & (df['A/C'] != 'L') & (df['DISPLAY/'].str.contains('DISPLAY'))]
As you can see from the Code, I want data to display, based on Column Values.

Any help would be much appreciated.

Regards

Eddie
Reply
#2
if you look at the output of df['A/C'].value_counts(),
you can find that there is no DSS in A/C column. And only one DHS value there. So, logical and of all these conditions yield an empty dataframe.
Reply
#3
Many thanks for reply scidam, I changed my Code to the following, and the Data is almost displaying How I want :-

import pandas as pd
 
data = pd.read_excel (r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')
 
df = pd.DataFrame(data, columns= ['Venue','A/C','DISPLAY/','Date'])
 
df[(df['Venue'].str.contains('[a-zA-Z]') & (df['DISPLAY/'].str.contains('DISPLAY') & df['A/C'].str.contains("DHS|DAK|HS")) & (df['A/C'] != 'LHS') & (df['A/C'] != 'LANC'))]
How do I Code, so that only the rows show for DAK and HS, where the Venue is the same above and below, i.e. Two KEMBLE one Above and below, when one One row in the A/C Column says DAK and the row below or above says HS, normally below. Ideally I would like to change, DAK and HS to DHS and only one Venue row, i.e. not two KEMBLES just one in one Row. Any Ideas ?
Reply
#4
(Aug-02-2019, 04:59 PM)eddywinch82 Wrote: DAK and HS to DHS and only one Venue row, i.e. not two KEMBLES just one in one Row. Any Ideas ?
Not sure I understand entire problem, but if you need to replace DAK and HS to DHS, you can do this as follows:

df.loc[df['A/C'] == 'DAK', 'A/C'] = 'DHS'
df.loc[df['A/C'] == 'HS', 'A/C'] = 'DHS'
Hope that helps.
Reply
#5
Hi scidam, many thanks for your help, sorry I have not described, what I mean to say very well.

I mean how do I change all off :-

VENUE DAK DISPLAY 2008-09-10 00:00:00
VENUE HS DISPLAY 2008-09-10 00:00:00

Or

VENUE HS DISPLAY 2008-09-06 00:00:00
VENUE DAK DISPLAY 2008-09-06 00:00:00

Or

VENUE DAK DISPLAY 2008-05-25 00:00:00
VENUE SPIT DISPLAY 2008-05-25 00:00:00

And the other way around, like with DAK and HS

VENUE same both lines i.e. GUERNSEY i.e. :-

GUERNSEY
GUERNSEY

To :-

VENUE DHS DISPLAY 2008-09-10 00:00:00

And the DAK and SPIT Ones to :-

VENUE DS DISPLAY 2008-09-10 00:00:00

Also the one case where it is :-

DHS i.e. :- WIGTON DHS DISPLAY 2008-07-26 00:00:00

It doesn't display at all, when I run the above Code, for some reason ?

Also some Displays i.e. WINDERMERE And SUNDERLAND that Have HS and DAK only show DAK when I use :-
df['A/C'].str.contains("DAK"))
in my code, but don't show at all, when I use
df['A/C'].str.contains("DHS|DAK|HS"))
I don't know why that is the case ?

And also only one, DAK and SPIT DISPLAY entry, displays i.e SCARBOROUGH, when I use :-
df['A/C'].str.contains("DHS|DAK|HS|SPIT"))
Regards

Eddie
Reply
#6
Just a note to the Moderators, should this thread, be in the Data Science Forum, instead of this Forum ? Could it be moved, to that Forum instead ?
Reply
#7
Can anyone help me ?
Reply
#8
I have sorted out, the issue about only certain rows being shown.

I found a post where it said how to set, the maximum number of rows shown.

So I typed :-

pd.options.display.max_rows = 1000
And the other required Data, now displays.

I have noticed that in the BID Column, which I now include. In the Data I wan't to keep, the Bid Numbers are the same in both rows, i.e. the same VENUE . So what should I type, to remove rows, where there is only 1 of a Bid Number ? Except where in the A/C Column the values are DHS or DSS ?

So How do I drop the rows, where there
Reply
#9
I typed the following :-

df['BID'].value_counts()


To get all occurances of a particular BID Number. What do I type, so that all Rows, With a Bid Number value_count of 1 I remove leaving the remaining rows, i.e. only the Duplicates ?

This is my latest Python Code :-

import pandas as pd
  
xls = pd.ExcelFile(r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')
  
data = pd.read_excel(xls, sheet_name="Sheet1")
    
pd.options.display.max_rows = 1000

df = pd.DataFrame(data, columns= ['Venue','A/C','DISPLAY/','Date','BID'])
  
df[(df['Venue'].str.contains('[a-zA-Z]') & (df['DISPLAY/'].str.contains('DISPLAY') & df['A/C'].str.contains("DHS|DAK|HS|SPIT")) & (df['A/C'] != 'LHS') & (df['A/C'] != 'LANC'))]
Reply
#10
I included the following Code :-
df.groupby('BID').filter(lambda x : len(x)>1) 
at the end of my Python Code.

To filter out rows, that have only 1 of each BID Number, it does that, but it now includes the other Data which I wanted excluding, how do I use this Groupby Code, in the rest of my Code, to achieve the result I wan't ?

The current order is :-
import pandas as pd
  
xls = pd.ExcelFile(r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')
  
data = pd.read_excel(xls, sheet_name="Sheet1")
    
pd.options.display.max_rows = 1000

df = pd.DataFrame(data, columns= ['Venue','A/C','DISPLAY/','Date','BID'])
  
df[(df['Venue'].str.contains('[a-zA-Z]') & (df['DISPLAY/'].str.contains('DISPLAY') & df['A/C'].str.contains("DHS|DAK|HS|SPIT")) & (df['A/C'] != 'LHS') & (df['A/C'] != 'LANC'))] 

df.groupby('BID').filter(lambda x : len(x)>1) 
I have added a code onto the end of my Python Code, My Code now ends :-
df[(df['Venue'].str.contains('[a-zA-Z]') & (df['DISPLAY/'].str.contains('DISPLAY') & df['A/C'].str.contains("DHS|DAK|HS|SPIT")) & (df['A/C'] != 'LHS') & (df['A/C'] != 'LANC')) & df['BID'].value_counts(">1")]
But even with the latest Code, added on the end, Some of the relevant data is being missed out , where have I gone wrong ?

I have # out the groupby part of my Code.

And I get the Following Traceback Error :-

Error:
c:\python37\lib\site-packages\ipykernel_launcher.py:11: UserWarning: Boolean Series key will be reindexed to match DataFrame index. # This is added back by InteractiveShellApp.init_path()
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Pandas keep existing format of Excel AlphaInc 2 881 Jan-11-2024, 03:44 AM
Last Post: plonkarchivist
  Grouping in pandas/multi-index data frame Aleqsie 3 607 Jan-06-2024, 03:55 PM
Last Post: deanhystad
  Filtering Data Frame, with another value NewBiee 9 1,330 Aug-21-2023, 10:53 AM
Last Post: NewBiee
  Data Sorting and filtering(From an Excel File) PY_ALM 0 1,012 Jan-09-2023, 08:14 PM
Last Post: PY_ALM
Smile How to further boost the data read write speed using pandas tjk9501 1 1,228 Nov-14-2022, 01:46 PM
Last Post: jefsummers
  Pandas Dataframe Filtering based on rows mvdlm 0 1,396 Apr-02-2022, 06:39 PM
Last Post: mvdlm
Thumbs Up can't access data from URL in pandas/jupyter notebook aaanoushka 1 1,830 Feb-13-2022, 01:19 PM
Last Post: jefsummers
  Help with poorly formatted excel data armitron121 1 1,694 Jan-13-2022, 07:31 AM
Last Post: paul18fr
Question Sorting data with pandas TheZaind 4 2,295 Nov-22-2021, 07:33 PM
Last Post: aserian
  Exporting data frame to excel dyerlee91 0 1,604 Oct-05-2021, 11:34 AM
Last Post: dyerlee91

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020