Posts: 218
Threads: 27
Joined: May 2018
Hi There,
I am trying to filter Data, for Excel Files in Pandas. Based on the Column Value i.e. String Value.
I managed to get, some data displayed before, but have gone wrong somewhere. As only the Chosen Column Names display, now when I run the Code, in Python 3.7.
Here is the Website Link to the XLS File, if someone could download the File, and tell me where I am going wrong :-
http://web.archive.org/web/2009080423493...6AA506.xls
Here is my latest Code :-
import pandas as pd
data = pd.read_excel (r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')
df = pd.DataFrame(data, columns= ['Venue','A/C','DISPLAY/','Date'])
df[(df['Venue'].str.contains('[a-zA-Z]') & df['A/C'].str.contains('DHS')) & (df['A/C'].str.contains('DSS')) & (df['A/C'].str.contains('DAK')) & (df['A/C'].str.contains('HS')) & (df['A/C'] != 'L') & (df['DISPLAY/'].str.contains('DISPLAY'))] As you can see from the Code, I want data to display, based on Column Values.
Any help would be much appreciated.
Regards
Eddie
Posts: 817
Threads: 1
Joined: Mar 2018
if you look at the output of df['A/C'].value_counts() ,
you can find that there is no DSS in A/C column. And only one DHS value there. So, logical and of all these conditions yield an empty dataframe.
Posts: 218
Threads: 27
Joined: May 2018
Aug-02-2019, 04:59 PM
(This post was last modified: Aug-02-2019, 04:59 PM by eddywinch82.)
Many thanks for reply scidam, I changed my Code to the following, and the Data is almost displaying How I want :-
import pandas as pd
data = pd.read_excel (r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')
df = pd.DataFrame(data, columns= ['Venue','A/C','DISPLAY/','Date'])
df[(df['Venue'].str.contains('[a-zA-Z]') & (df['DISPLAY/'].str.contains('DISPLAY') & df['A/C'].str.contains("DHS|DAK|HS")) & (df['A/C'] != 'LHS') & (df['A/C'] != 'LANC'))] How do I Code, so that only the rows show for DAK and HS, where the Venue is the same above and below, i.e. Two KEMBLE one Above and below, when one One row in the A/C Column says DAK and the row below or above says HS, normally below. Ideally I would like to change, DAK and HS to DHS and only one Venue row, i.e. not two KEMBLES just one in one Row. Any Ideas ?
Posts: 817
Threads: 1
Joined: Mar 2018
(Aug-02-2019, 04:59 PM)eddywinch82 Wrote: DAK and HS to DHS and only one Venue row, i.e. not two KEMBLES just one in one Row. Any Ideas ? Not sure I understand entire problem, but if you need to replace DAK and HS to DHS, you can do this as follows:
df.loc[df['A/C'] == 'DAK', 'A/C'] = 'DHS'
df.loc[df['A/C'] == 'HS', 'A/C'] = 'DHS' Hope that helps.
Posts: 218
Threads: 27
Joined: May 2018
Aug-03-2019, 02:49 PM
(This post was last modified: Aug-03-2019, 08:11 PM by eddywinch82.)
Hi scidam, many thanks for your help, sorry I have not described, what I mean to say very well.
I mean how do I change all off :-
VENUE DAK DISPLAY 2008-09-10 00:00:00
VENUE HS DISPLAY 2008-09-10 00:00:00
Or
VENUE HS DISPLAY 2008-09-06 00:00:00
VENUE DAK DISPLAY 2008-09-06 00:00:00
Or
VENUE DAK DISPLAY 2008-05-25 00:00:00
VENUE SPIT DISPLAY 2008-05-25 00:00:00
And the other way around, like with DAK and HS
VENUE same both lines i.e. GUERNSEY i.e. :-
GUERNSEY
GUERNSEY
To :-
VENUE DHS DISPLAY 2008-09-10 00:00:00
And the DAK and SPIT Ones to :-
VENUE DS DISPLAY 2008-09-10 00:00:00
Also the one case where it is :-
DHS i.e. :- WIGTON DHS DISPLAY 2008-07-26 00:00:00
It doesn't display at all, when I run the above Code, for some reason ?
Also some Displays i.e. WINDERMERE And SUNDERLAND that Have HS and DAK only show DAK when I use :- df['A/C'].str.contains("DAK")) in my code, but don't show at all, when I use df['A/C'].str.contains("DHS|DAK|HS")) I don't know why that is the case ?
And also only one, DAK and SPIT DISPLAY entry, displays i.e SCARBOROUGH, when I use :- df['A/C'].str.contains("DHS|DAK|HS|SPIT")) Regards
Eddie
Posts: 218
Threads: 27
Joined: May 2018
Aug-04-2019, 09:29 AM
(This post was last modified: Aug-04-2019, 09:33 AM by eddywinch82.)
Just a note to the Moderators, should this thread, be in the Data Science Forum, instead of this Forum ? Could it be moved, to that Forum instead ?
Posts: 218
Threads: 27
Joined: May 2018
Posts: 218
Threads: 27
Joined: May 2018
Aug-05-2019, 02:55 PM
(This post was last modified: Aug-05-2019, 02:55 PM by eddywinch82.)
I have sorted out, the issue about only certain rows being shown.
I found a post where it said how to set, the maximum number of rows shown.
So I typed :-
pd.options.display.max_rows = 1000 And the other required Data, now displays.
I have noticed that in the BID Column, which I now include. In the Data I wan't to keep, the Bid Numbers are the same in both rows, i.e. the same VENUE . So what should I type, to remove rows, where there is only 1 of a Bid Number ? Except where in the A/C Column the values are DHS or DSS ?
So How do I drop the rows, where there
Posts: 218
Threads: 27
Joined: May 2018
Aug-05-2019, 07:54 PM
(This post was last modified: Aug-05-2019, 07:54 PM by eddywinch82.)
I typed the following :-
df['BID'].value_counts()
To get all occurances of a particular BID Number. What do I type, so that all Rows, With a Bid Number value_count of 1 I remove leaving the remaining rows, i.e. only the Duplicates ?
This is my latest Python Code :-
import pandas as pd
xls = pd.ExcelFile(r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')
data = pd.read_excel(xls, sheet_name="Sheet1")
pd.options.display.max_rows = 1000
df = pd.DataFrame(data, columns= ['Venue','A/C','DISPLAY/','Date','BID'])
df[(df['Venue'].str.contains('[a-zA-Z]') & (df['DISPLAY/'].str.contains('DISPLAY') & df['A/C'].str.contains("DHS|DAK|HS|SPIT")) & (df['A/C'] != 'LHS') & (df['A/C'] != 'LANC'))]
Posts: 218
Threads: 27
Joined: May 2018
Aug-06-2019, 03:44 PM
(This post was last modified: Aug-06-2019, 03:45 PM by eddywinch82.)
I included the following Code :- df.groupby('BID').filter(lambda x : len(x)>1) at the end of my Python Code.
To filter out rows, that have only 1 of each BID Number, it does that, but it now includes the other Data which I wanted excluding, how do I use this Groupby Code, in the rest of my Code, to achieve the result I wan't ?
The current order is :- import pandas as pd
xls = pd.ExcelFile(r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')
data = pd.read_excel(xls, sheet_name="Sheet1")
pd.options.display.max_rows = 1000
df = pd.DataFrame(data, columns= ['Venue','A/C','DISPLAY/','Date','BID'])
df[(df['Venue'].str.contains('[a-zA-Z]') & (df['DISPLAY/'].str.contains('DISPLAY') & df['A/C'].str.contains("DHS|DAK|HS|SPIT")) & (df['A/C'] != 'LHS') & (df['A/C'] != 'LANC'))]
df.groupby('BID').filter(lambda x : len(x)>1) I have added a code onto the end of my Python Code, My Code now ends :- df[(df['Venue'].str.contains('[a-zA-Z]') & (df['DISPLAY/'].str.contains('DISPLAY') & df['A/C'].str.contains("DHS|DAK|HS|SPIT")) & (df['A/C'] != 'LHS') & (df['A/C'] != 'LANC')) & df['BID'].value_counts(">1")] But even with the latest Code, added on the end, Some of the relevant data is being missed out , where have I gone wrong ?
I have # out the groupby part of my Code.
And I get the Following Traceback Error :-
Error: c:\python37\lib\site-packages\ipykernel_launcher.py:11: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
# This is added back by InteractiveShellApp.init_path()
|