Python Forum
PANDAS: DataFrame | White Spaces & Special Character Removal
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
PANDAS: DataFrame | White Spaces & Special Character Removal
#1
Hi All,

Just to give you an idea, I'm new to python & coding in general, Tongue , however I've been in IT for 14+ years...

I'm building an automated task to clean CSV data produced by one of our systems. Reason being, we use a automation application that sends out emails to customers & references that data to build the emails in question.

However the data produced by the system sometimes has white spaces or full stop's present in a number of the columns & rows produced Confused (No idea as to why??). I manually go through this data every morning & clean it - records ranging from 20,000 > 2000... Sad

I'm getting a little sick & tired of doing this... & I know there's always a smarter way to work; & I guess learning to code Big Grin is the next best thing, so this is win win situation, once i get the hang of it!

CSV Example - Layout

Letter | Our_Reference | Fullname | Contact Name | Mailing_Add1 | Mailing_Pcode | Client Name | Client Reference | InboundTel | Text11 | Text6
TempLa | 12168001 | Joe Blog | 07528945863 | 2 House | rg1 7pt | Client1 | 9613905571 | 0753422499 | DISCON | 0118954
TempLa | 12168002 | Joe Blog | 07528945863 | 2 House | rg1 7pt | Client2 | 9613905572 | 0753422499 | DISCON | 0 1 954
TempLa | 12168003 | Joe Blog | 07528945863 | 2 House | rg1 7pt | Client3 | 9613905573 | 0753 42 24 | DISCON | 0118954
TempLa | 12168004 | Joe Blog | 07528945863 | 2 House | rg1 7pt | Client4 | 9613905574 | 0753422499 | . | 0118954
TempLa | 12168005 | Joe Blog | 07528945863 | 2 House | rg1 7pt | Client5 | 9613905575 | 0753422499 | DISCON | 0118954

The ERROR I'm running into is below; Please note, this happens at column "Text6".

Traceback (most recent call last):
  File "C:/Users/username/PycharmProjects/pythonProject/ManipulateCSVDataNRemovesErrors.py", line 19, in <module>
    df['Text6'] = df['Text6'].str.replace(" ", "")
  File "C:\Users\username\PycharmProjects\pythonProject\venv\lib\site-packages\pandas\core\generic.py", line 5126, in __getattr__
    return object.__getattribute__(self, name)
  File "C:\Users\username\PycharmProjects\pythonProject\venv\lib\site-packages\pandas\core\accessor.py", line 187, in __get__
    accessor_obj = self._accessor(obj)
  File "C:\Users\username\PycharmProjects\pythonProject\venv\lib\site-packages\pandas\core\strings.py", line 2100, in __init__
    self._inferred_dtype = self._validate(data)
  File "C:\Users\username\PycharmProjects\pythonProject\venv\lib\site-packages\pandas\core\strings.py", line 2157, in _validate
    raise AttributeError("Can only use .str accessor with string values!")
AttributeError: Can only use .str accessor with string values!
The code I've built is below; please be gentle, I know this is probably not the best way to arrange code or even design it.. Blush

import pandas as pd


data = pd.read_csv(r'EMAIL9th Sep 2020_000000.csv')
df = pd.DataFrame(data, columns=['Letter Name', 'Our_Reference', 'Debtors Fullname', 'Contact Name', 'Salutation',
                                 'Mailing_Add1', 'Mailing_Add2', 'Mailing_Add3', 'Mailing_Add4', 'Mailing_Add5',
                                 'Mailing_Pcode', 'Client Name', 'Client Reference', 'Collector', 'Currency',
                                 'Balance', 'Initial Payment', 'No. Of Payments', 'Amount Agreed', 'Payable',
                                 'Starting From', 'Final Payment', 'LG Principal', 'LG Interest', 'LG Court Costs',
                                 'LG Solictor Cost', 'LG Warrant Fee', 'LG Total Due', 'Creditor Name',
                                 'Letter Header', 'Collector', 'Debtor Email', 'Inbound Tel', 'Text11', 'Text12',
                                 'Text1', 'Text2', 'Text3', 'Text4', 'Text5', 'Text6', 'Text7', 'Text8', 'Text9',
                                 'Text10', 'Date1', 'Date2', 'Date3', 'Date4', 'Date5', 'Date6', 'Number1',
                                 'Number2', 'Number3', 'Number4'])


df['Inbound Tel'] = df['Inbound Tel'].str.replace(" ", "")
df['Text11'] = df['Text11'].str.replace(".", "")
df['Text6'] = df['Text6'].str.replace(" ", "")

df.to_csv(r'EMAIL9th Sep 2020_000000_Mod.CSV', index=False)
print(df)
Now I understand this error "AttributeError: Can only use .str accessor with string values!" or I think I do, sting's are variables that hold characters & an integer's are variables that hold numerical values.

The three specified columns "Inbound Tel", "Text11" & "Text6" are set to general formatting when opened in excel (however, this is always by default)... So my question is, why does the code work on column "Inbound Tel" & not "Text6"?

Ever article or solution on Google when it comes to white spaces has pointed me to the following;
df['ColunmName'] = df['ColunmName'].str.replace(" ", "")
.

I've also tried
df['ColunmName'] = df['ColunmName'].__int__.replace(" ", "")
, this was unsuccessful... The following does not work either
df['ColunmName'] = df['ColunmName'].replace(" ", "")
.

However the following does work
df['ColunmName'] = df['ColunmName'].replace(".", "")
on 'Text11' Huh

I have also changed the data to RAW text & then imported it into excel, saved it as a CSV & it has produced a different result... The code now stops at "Inbound Tel"; Please note, below ERROR is now at "Inbound Tel" Huh

Traceback (most recent call last): 
  File "C:/Users/username/PycharmProjects/pythonProject/ManipulateCSVDataNRemovesErrors.py", line 17, in <module>
    df['Inbound Tel'] = df['Inbound Tel'].str.replace(" ", "")
  File "C:\Users\username\PycharmProjects\pythonProject\venv\lib\site-packages\pandas\core\generic.py", line 5126, in __getattr__
    return object.__getattribute__(self, name)
  File "C:\Users\username\PycharmProjects\pythonProject\venv\lib\site-packages\pandas\core\accessor.py", line 187, in __get__
    accessor_obj = self._accessor(obj)
  File "C:\Users\username\PycharmProjects\pythonProject\venv\lib\site-packages\pandas\core\strings.py", line 2100, in __init__
    self._inferred_dtype = self._validate(data)
  File "C:\Users\username\PycharmProjects\pythonProject\venv\lib\site-packages\pandas\core\strings.py", line 2157, in _validate
    raise AttributeError("Can only use .str accessor with string values!")
AttributeError: Can only use .str accessor with string values!
Huh Cry

I've arrived at the conclusion that I need to format the CSV file to the correct formatting style before running the code I've created, I'm not sure where to start though - possibly "Styler.apply", but I've also now be left in limbo due to the above errors & all three columns being set to general...

Any help or advise on what the heck is going on would be would be much appreciated!

Look forward to your response Smile

Kind Regards,
Richard
Reply
#2
Hi traibr,

Try the following, I hope they work for you :-

df['Inbound Tel'] = df['Inbound Tel'].str.replace({" ": ""})
Or

df['Inbound Tel'] = df['Inbound Tel'].replace({" ": ""})
And

df['Text6'] = df['Text6'].str.replace({" ": ""})
Or

df['Text6'] = df['Text6'].replace({" ": ""})
Or insert the Line :-

df.columns = ['Inbound Tel', 'Text6']
Above these Lines in your Code :-

df['Inbound Tel'] = df['Inbound Tel'].str.replace(" ", "")
df['Text11'] = df['Text11'].str.replace(".", "")
df['Text6'] = df['Text6'].str.replace(" ", "")
Hash out the Inbound Tel, and Text6 Lines of the above lines of Code, and use the following line of Code instead,
in place of the hashed out lines of Code :-

df.columns = df.columns.str.replace(" ", "")
Or

df.columns = df.columns.str.replace(' ', '')
I hope some of those work for you.

Best Regards

Eddie ))
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  HTML Decoder pandas dataframe column mbrown009 3 931 Sep-29-2023, 05:56 PM
Last Post: deanhystad
  Use pandas to obtain cartesian product between a dataframe of int and equations? haihal 0 1,080 Jan-06-2023, 10:53 PM
Last Post: haihal
  Pandas Dataframe Filtering based on rows mvdlm 0 1,381 Apr-02-2022, 06:39 PM
Last Post: mvdlm
  Pandas dataframe: calculate metrics by year mcva 1 2,251 Mar-02-2022, 08:22 AM
Last Post: mcva
  Pandas dataframe comparing anto5 0 1,228 Jan-30-2022, 10:21 AM
Last Post: anto5
  PANDAS: DataFrame | Replace and others questions moduki1 2 1,745 Jan-10-2022, 07:19 PM
Last Post: moduki1
  PANDAS: DataFrame | Saving the wrong value moduki1 0 1,515 Jan-10-2022, 04:42 PM
Last Post: moduki1
  update values in one dataframe based on another dataframe - Pandas iliasb 2 9,036 Aug-14-2021, 12:38 PM
Last Post: jefsummers
  empty row in pandas dataframe rwahdan 3 2,395 Jun-22-2021, 07:57 PM
Last Post: snippsat
  Outliers remain in the scatterplot even after removal d8a988 0 1,261 Mar-12-2021, 12:58 PM
Last Post: d8a988

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020