Python Forum

Full Version: PANDAS: DataFrame | White Spaces & Special Character Removal
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Hi All,

Just to give you an idea, I'm new to python & coding in general, Tongue , however I've been in IT for 14+ years...

I'm building an automated task to clean CSV data produced by one of our systems. Reason being, we use a automation application that sends out emails to customers & references that data to build the emails in question.

However the data produced by the system sometimes has white spaces or full stop's present in a number of the columns & rows produced Confused (No idea as to why??). I manually go through this data every morning & clean it - records ranging from 20,000 > 2000... Sad

I'm getting a little sick & tired of doing this... & I know there's always a smarter way to work; & I guess learning to code Big Grin is the next best thing, so this is win win situation, once i get the hang of it!

CSV Example - Layout

Letter | Our_Reference | Fullname | Contact Name | Mailing_Add1 | Mailing_Pcode | Client Name | Client Reference | InboundTel | Text11 | Text6
TempLa | 12168001 | Joe Blog | 07528945863 | 2 House | rg1 7pt | Client1 | 9613905571 | 0753422499 | DISCON | 0118954
TempLa | 12168002 | Joe Blog | 07528945863 | 2 House | rg1 7pt | Client2 | 9613905572 | 0753422499 | DISCON | 0 1 954
TempLa | 12168003 | Joe Blog | 07528945863 | 2 House | rg1 7pt | Client3 | 9613905573 | 0753 42 24 | DISCON | 0118954
TempLa | 12168004 | Joe Blog | 07528945863 | 2 House | rg1 7pt | Client4 | 9613905574 | 0753422499 | . | 0118954
TempLa | 12168005 | Joe Blog | 07528945863 | 2 House | rg1 7pt | Client5 | 9613905575 | 0753422499 | DISCON | 0118954

The ERROR I'm running into is below; Please note, this happens at column "Text6".

Traceback (most recent call last):
  File "C:/Users/username/PycharmProjects/pythonProject/ManipulateCSVDataNRemovesErrors.py", line 19, in <module>
    df['Text6'] = df['Text6'].str.replace(" ", "")
  File "C:\Users\username\PycharmProjects\pythonProject\venv\lib\site-packages\pandas\core\generic.py", line 5126, in __getattr__
    return object.__getattribute__(self, name)
  File "C:\Users\username\PycharmProjects\pythonProject\venv\lib\site-packages\pandas\core\accessor.py", line 187, in __get__
    accessor_obj = self._accessor(obj)
  File "C:\Users\username\PycharmProjects\pythonProject\venv\lib\site-packages\pandas\core\strings.py", line 2100, in __init__
    self._inferred_dtype = self._validate(data)
  File "C:\Users\username\PycharmProjects\pythonProject\venv\lib\site-packages\pandas\core\strings.py", line 2157, in _validate
    raise AttributeError("Can only use .str accessor with string values!")
AttributeError: Can only use .str accessor with string values!
The code I've built is below; please be gentle, I know this is probably not the best way to arrange code or even design it.. Blush

import pandas as pd


data = pd.read_csv(r'EMAIL9th Sep 2020_000000.csv')
df = pd.DataFrame(data, columns=['Letter Name', 'Our_Reference', 'Debtors Fullname', 'Contact Name', 'Salutation',
                                 'Mailing_Add1', 'Mailing_Add2', 'Mailing_Add3', 'Mailing_Add4', 'Mailing_Add5',
                                 'Mailing_Pcode', 'Client Name', 'Client Reference', 'Collector', 'Currency',
                                 'Balance', 'Initial Payment', 'No. Of Payments', 'Amount Agreed', 'Payable',
                                 'Starting From', 'Final Payment', 'LG Principal', 'LG Interest', 'LG Court Costs',
                                 'LG Solictor Cost', 'LG Warrant Fee', 'LG Total Due', 'Creditor Name',
                                 'Letter Header', 'Collector', 'Debtor Email', 'Inbound Tel', 'Text11', 'Text12',
                                 'Text1', 'Text2', 'Text3', 'Text4', 'Text5', 'Text6', 'Text7', 'Text8', 'Text9',
                                 'Text10', 'Date1', 'Date2', 'Date3', 'Date4', 'Date5', 'Date6', 'Number1',
                                 'Number2', 'Number3', 'Number4'])


df['Inbound Tel'] = df['Inbound Tel'].str.replace(" ", "")
df['Text11'] = df['Text11'].str.replace(".", "")
df['Text6'] = df['Text6'].str.replace(" ", "")

df.to_csv(r'EMAIL9th Sep 2020_000000_Mod.CSV', index=False)
print(df)
Now I understand this error "AttributeError: Can only use .str accessor with string values!" or I think I do, sting's are variables that hold characters & an integer's are variables that hold numerical values.

The three specified columns "Inbound Tel", "Text11" & "Text6" are set to general formatting when opened in excel (however, this is always by default)... So my question is, why does the code work on column "Inbound Tel" & not "Text6"?

Ever article or solution on Google when it comes to white spaces has pointed me to the following;
df['ColunmName'] = df['ColunmName'].str.replace(" ", "")
.

I've also tried
df['ColunmName'] = df['ColunmName'].__int__.replace(" ", "")
, this was unsuccessful... The following does not work either
df['ColunmName'] = df['ColunmName'].replace(" ", "")
.

However the following does work
df['ColunmName'] = df['ColunmName'].replace(".", "")
on 'Text11' Huh

I have also changed the data to RAW text & then imported it into excel, saved it as a CSV & it has produced a different result... The code now stops at "Inbound Tel"; Please note, below ERROR is now at "Inbound Tel" Huh

Traceback (most recent call last): 
  File "C:/Users/username/PycharmProjects/pythonProject/ManipulateCSVDataNRemovesErrors.py", line 17, in <module>
    df['Inbound Tel'] = df['Inbound Tel'].str.replace(" ", "")
  File "C:\Users\username\PycharmProjects\pythonProject\venv\lib\site-packages\pandas\core\generic.py", line 5126, in __getattr__
    return object.__getattribute__(self, name)
  File "C:\Users\username\PycharmProjects\pythonProject\venv\lib\site-packages\pandas\core\accessor.py", line 187, in __get__
    accessor_obj = self._accessor(obj)
  File "C:\Users\username\PycharmProjects\pythonProject\venv\lib\site-packages\pandas\core\strings.py", line 2100, in __init__
    self._inferred_dtype = self._validate(data)
  File "C:\Users\username\PycharmProjects\pythonProject\venv\lib\site-packages\pandas\core\strings.py", line 2157, in _validate
    raise AttributeError("Can only use .str accessor with string values!")
AttributeError: Can only use .str accessor with string values!
Huh Cry

I've arrived at the conclusion that I need to format the CSV file to the correct formatting style before running the code I've created, I'm not sure where to start though - possibly "Styler.apply", but I've also now be left in limbo due to the above errors & all three columns being set to general...

Any help or advise on what the heck is going on would be would be much appreciated!

Look forward to your response Smile

Kind Regards,
Richard
Hi traibr,

Try the following, I hope they work for you :-

df['Inbound Tel'] = df['Inbound Tel'].str.replace({" ": ""})
Or

df['Inbound Tel'] = df['Inbound Tel'].replace({" ": ""})
And

df['Text6'] = df['Text6'].str.replace({" ": ""})
Or

df['Text6'] = df['Text6'].replace({" ": ""})
Or insert the Line :-

df.columns = ['Inbound Tel', 'Text6']
Above these Lines in your Code :-

df['Inbound Tel'] = df['Inbound Tel'].str.replace(" ", "")
df['Text11'] = df['Text11'].str.replace(".", "")
df['Text6'] = df['Text6'].str.replace(" ", "")
Hash out the Inbound Tel, and Text6 Lines of the above lines of Code, and use the following line of Code instead,
in place of the hashed out lines of Code :-

df.columns = df.columns.str.replace(" ", "")
Or

df.columns = df.columns.str.replace(' ', '')
I hope some of those work for you.

Best Regards

Eddie ))