Bottom Page

Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
 Need Help With Filtering Data For Excel Files Using Pandas
#1
Hi There,

I am trying to filter Data, for Excel Files in Pandas. Based on the Column Value i.e. String Value.

I managed to get, some data displayed before, but have gone wrong somewhere. As only the Chosen Column Names display, now when I run the Code, in Python 3.7.

Here is the Website Link to the XLS File, if someone could download the File, and tell me where I am going wrong :-

http://web.archive.org/web/2009080423493...6AA506.xls

Here is my latest Code :-

import pandas as pd

data = pd.read_excel (r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')

df = pd.DataFrame(data, columns= ['Venue','A/C','DISPLAY/','Date'])

df[(df['Venue'].str.contains('[a-zA-Z]') & df['A/C'].str.contains('DHS')) & (df['A/C'].str.contains('DSS')) & (df['A/C'].str.contains('DAK')) & (df['A/C'].str.contains('HS')) & (df['A/C'] != 'L') & (df['DISPLAY/'].str.contains('DISPLAY'))]
As you can see from the Code, I want data to display, based on Column Values.

Any help would be much appreciated.

Regards

Eddie
Quote
#2
if you look at the output of df['A/C'].value_counts(),
you can find that there is no DSS in A/C column. And only one DHS value there. So, logical and of all these conditions yield an empty dataframe.
Quote
#3
Many thanks for reply scidam, I changed my Code to the following, and the Data is almost displaying How I want :-

import pandas as pd
 
data = pd.read_excel (r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')
 
df = pd.DataFrame(data, columns= ['Venue','A/C','DISPLAY/','Date'])
 
df[(df['Venue'].str.contains('[a-zA-Z]') & (df['DISPLAY/'].str.contains('DISPLAY') & df['A/C'].str.contains("DHS|DAK|HS")) & (df['A/C'] != 'LHS') & (df['A/C'] != 'LANC'))]
How do I Code, so that only the rows show for DAK and HS, where the Venue is the same above and below, i.e. Two KEMBLE one Above and below, when one One row in the A/C Column says DAK and the row below or above says HS, normally below. Ideally I would like to change, DAK and HS to DHS and only one Venue row, i.e. not two KEMBLES just one in one Row. Any Ideas ?
Quote
#4
(Aug-02-2019, 04:59 PM)eddywinch82 Wrote: DAK and HS to DHS and only one Venue row, i.e. not two KEMBLES just one in one Row. Any Ideas ?
Not sure I understand entire problem, but if you need to replace DAK and HS to DHS, you can do this as follows:

df.loc[df['A/C'] == 'DAK', 'A/C'] = 'DHS'
df.loc[df['A/C'] == 'HS', 'A/C'] = 'DHS'
Hope that helps.
Quote
#5
Hi scidam, many thanks for your help, sorry I have not described, what I mean to say very well.

I mean how do I change all off :-

VENUE DAK DISPLAY 2008-09-10 00:00:00
VENUE HS DISPLAY 2008-09-10 00:00:00

Or

VENUE HS DISPLAY 2008-09-06 00:00:00
VENUE DAK DISPLAY 2008-09-06 00:00:00

Or

VENUE DAK DISPLAY 2008-05-25 00:00:00
VENUE SPIT DISPLAY 2008-05-25 00:00:00

And the other way around, like with DAK and HS

VENUE same both lines i.e. GUERNSEY i.e. :-

GUERNSEY
GUERNSEY

To :-

VENUE DHS DISPLAY 2008-09-10 00:00:00

And the DAK and SPIT Ones to :-

VENUE DS DISPLAY 2008-09-10 00:00:00

Also the one case where it is :-

DHS i.e. :- WIGTON DHS DISPLAY 2008-07-26 00:00:00

It doesn't display at all, when I run the above Code, for some reason ?

Also some Displays i.e. WINDERMERE And SUNDERLAND that Have HS and DAK only show DAK when I use :-
df['A/C'].str.contains("DAK"))
in my code, but don't show at all, when I use
df['A/C'].str.contains("DHS|DAK|HS"))
I don't know why that is the case ?

And also only one, DAK and SPIT DISPLAY entry, displays i.e SCARBOROUGH, when I use :-
df['A/C'].str.contains("DHS|DAK|HS|SPIT"))
Regards

Eddie
Quote
#6
Just a note to the Moderators, should this thread, be in the Data Science Forum, instead of this Forum ? Could it be moved, to that Forum instead ?
Quote
#7
Can anyone help me ?
Quote
#8
I have sorted out, the issue about only certain rows being shown.

I found a post where it said how to set, the maximum number of rows shown.

So I typed :-

pd.options.display.max_rows = 1000
And the other required Data, now displays.

I have noticed that in the BID Column, which I now include. In the Data I wan't to keep, the Bid Numbers are the same in both rows, i.e. the same VENUE . So what should I type, to remove rows, where there is only 1 of a Bid Number ? Except where in the A/C Column the values are DHS or DSS ?

So How do I drop the rows, where there
Quote
#9
I typed the following :-

df['BID'].value_counts()


To get all occurances of a particular BID Number. What do I type, so that all Rows, With a Bid Number value_count of 1 I remove leaving the remaining rows, i.e. only the Duplicates ?

This is my latest Python Code :-

import pandas as pd
  
xls = pd.ExcelFile(r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')
  
data = pd.read_excel(xls, sheet_name="Sheet1")
    
pd.options.display.max_rows = 1000

df = pd.DataFrame(data, columns= ['Venue','A/C','DISPLAY/','Date','BID'])
  
df[(df['Venue'].str.contains('[a-zA-Z]') & (df['DISPLAY/'].str.contains('DISPLAY') & df['A/C'].str.contains("DHS|DAK|HS|SPIT")) & (df['A/C'] != 'LHS') & (df['A/C'] != 'LANC'))]
Quote
#10
I included the following Code :-
df.groupby('BID').filter(lambda x : len(x)>1) 
at the end of my Python Code.

To filter out rows, that have only 1 of each BID Number, it does that, but it now includes the other Data which I wanted excluding, how do I use this Groupby Code, in the rest of my Code, to achieve the result I wan't ?

The current order is :-
import pandas as pd
  
xls = pd.ExcelFile(r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')
  
data = pd.read_excel(xls, sheet_name="Sheet1")
    
pd.options.display.max_rows = 1000

df = pd.DataFrame(data, columns= ['Venue','A/C','DISPLAY/','Date','BID'])
  
df[(df['Venue'].str.contains('[a-zA-Z]') & (df['DISPLAY/'].str.contains('DISPLAY') & df['A/C'].str.contains("DHS|DAK|HS|SPIT")) & (df['A/C'] != 'LHS') & (df['A/C'] != 'LANC'))] 

df.groupby('BID').filter(lambda x : len(x)>1) 
I have added a code onto the end of my Python Code, My Code now ends :-
df[(df['Venue'].str.contains('[a-zA-Z]') & (df['DISPLAY/'].str.contains('DISPLAY') & df['A/C'].str.contains("DHS|DAK|HS|SPIT")) & (df['A/C'] != 'LHS') & (df['A/C'] != 'LANC')) & df['BID'].value_counts(">1")]
But even with the latest Code, added on the end, Some of the relevant data is being missed out , where have I gone wrong ?

I have # out the groupby part of my Code.

And I get the Following Traceback Error :-

Error:
c:\python37\lib\site-packages\ipykernel_launcher.py:11: UserWarning: Boolean Series key will be reindexed to match DataFrame index. # This is added back by InteractiveShellApp.init_path()
Quote

Top Page

Possibly Related Threads...
Thread Author Replies Views Last Post
  pandas DataReader error on all data sources glidecode 5 7,003 Sep-25-2019, 02:10 PM
Last Post: perfringo
  Loop pandas data frame by position ? Johnse 1 241 Sep-06-2019, 12:26 AM
Last Post: scidam
  Ask for machine learning Python example with 2 data files user5566b 2 169 Sep-05-2019, 12:15 PM
Last Post: user5566b
  pandas data frame dervast 1 219 Aug-28-2019, 12:40 PM
Last Post: ThomasL
  Insert Pandas Data Frame into Teradata DB kylenater 0 855 Jul-19-2019, 04:53 PM
Last Post: kylenater
  Loading .csv data using Pandas zaki424160 1 429 Jul-15-2019, 09:48 AM
Last Post: perfringo
  How to use the excel filename as a value to populate new column, using Pandas? Dequanharrison 5 542 Jun-26-2019, 11:11 PM
Last Post: Dequanharrison
  pandas writing to excel .. help anna 0 280 Jun-20-2019, 06:34 AM
Last Post: anna
  Aligning excel data gat 1 302 Jun-17-2019, 07:05 PM
Last Post: michalmonday
  [pandas] Convert categorical data to numbers pradeep_as400 1 438 Jun-15-2019, 08:27 AM
Last Post: ThomasL

Forum Jump:


Users browsing this thread: 1 Guest(s)