Python Forum

Full Version: How to do a mass replace within a CSV file?
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
I have the following code that im using to open a CSV and then create a new filtered csv for importing into excel.

# THIS OPENS THE NEWLY CLEAN DATA IN ORDER TO REMOVE OLD RECORDS
df = pd.read_csv("Dates.txt", usecols=range(3), header=None, names=["Date", "Time", "Comment"])
# THIS CONVERTS THE DATE COLUMN INTO A DATETIME FORMAT
df['DT'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
# HERE YOU NEED TO PROVIDE THE DATE YOU WANT TO KEEP GOING FORWARD
mask = (df['DT'] > '2022-04-28  7:06 AM')
# THIS RETURNS ALL ROWS GREATER THAN THE DATE PROVIDED ABOVE
ndf = df.loc[mask]
The above works great and does what is needed, NOW we are trying to do a mass replace for a key string value so that it can update the pivot table we have in excel when the data is imported.

In the 3rd column of the file ("Comment") there is a value for each record that has 2 different values based on the row, so just need to replace this string:

Someone Else:    

with

Completed:
And replace this string:

Davids:

with

Dave:
Is it possible to do a find and replace or do we have to do it per line?

Currently we open the file in NotePad++ and just do a find and replace, but would love to have that built into the creation of the new file so we dont have to remember to do that step.
Which OS? There are tools for doing this kind of thing. You don't have to write any code.
im in using windows, the point of doing it within the code is so that no one has to touch the file multiple times..

What tools are you referring to?
On linux you could use tools like sed or awk to process one file at a time, all files in a folder, or all files in a folder and any subdirectories. With Windows I am at a loss.
On Windows, you can look at Windows Subsystem for Linux to run those kinds of tools.

I think it's well worth learning these things because they're made for these tasks. Those tools are present on any typical UNIX system (Mac OS as well as Linux, for example).
When you already use Pandas,then do of course Pandas have method that dos this task.
Test.
import pandas as pd

df = pd.read_csv("data_test.csv")
>>> df
                  Date       Comment
0  2022-04-28  7:06 AM            NO
1  2022-04-28  7:06 AM  Someone Else
2  2022-04-28  7:06 AM        Davids
3  2022-04-28  7:06 AM            NO
4  2022-04-28  7:06 AM  Someone Else
5  2022-04-28  7:06 AM        Davids
6  2022-04-28  7:06 AM  Someone Else
>>> 
>>> replace_values = {'Someone Else': 'Completed', 'Davids': 'David'} 
>>> df = df.replace({"Comment": replace_values})
>>> df
                  Date    Comment
0  2022-04-28  7:06 AM         NO
1  2022-04-28  7:06 AM  Completed
2  2022-04-28  7:06 AM      David
3  2022-04-28  7:06 AM         NO
4  2022-04-28  7:06 AM  Completed
5  2022-04-28  7:06 AM      David
6  2022-04-28  7:06 AM  Completed
Read more about this in link under,a lot of methods accept Regex so can do all kind more advance task than this.
How to use Regex in Pandas
It's also task that can do plain Python,a ok training task to write some lines that fix this.
I tried implementing the above in my existing code and nothing gets replaced.

# THIS OPENS THE NEWLY CLEAN DATA IN ORDER TO REMOVE OLD RECORDS
df = pd.read_csv("Dates.txt", usecols=range(3), header=None, names=["Date", "Time", "Comment"])


replace_values = {'Someone Else': 'Dan', 'This one':'Ros'}
df = df.replace({"Comment": replace_values})


# THIS CONVERTS THE DATE COLUMN INTO A DATETIME FORMAT
df['DT'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])

# HERE YOU NEED TO PROVIDE THE DATE YOU WANT TO KEEP GOING FORWARD
mask = (df['DT'] > '2022-05-02  7:06 AM')

# THIS RETURNS ALL ROWS GREATER THAN THE DATE PROVIDED ABOVE
df = df.loc[mask]

# DROP TEMP COLUMN BEFORE WRITING CSV
df['DT'].drop
Everything in this code works except the replace lines. Did i miss something? Im replacing after reading in the CSV..
(May-09-2022, 01:56 PM)cubangt Wrote: [ -> ]I tried implementing the above in my existing code and nothing gets replaced.
How dos the DataFrame look?
Use print(df.head()) is Dafaframe is big.
In my Example so do Comment column only contained the word that need to replaced,
if it's a word in i middle of sentence then need to write Regex for that.
So yes in my "comment" column its a free form text field, so it can be anything from 1 word to a whole paragraph

I ran the above suggestions and this is what is returned:

Date Time Comment
0 1/18/22 2:22 PM Someone Else: so for a Christmas back then
1 1/18/22 2:25 PM Someone Else: <Media omitted>
2 1/18/22 2:32 PM Davids: i dont care about the gifts

So im guessing Regex is my only option...??
(May-09-2022, 05:50 PM)cubangt Wrote: [ -> ]So im guessing Regex is my only option...??
Yes it's the easiest option for this can this and df.replace take Regex with parameter regex=True.
So with Regex is word boundary character \b used when want to match a exact word only.
To make a example.
>>> df
                  Date                                        Comment
0  2022-04-28  7:06 AM                                             NO
1  2022-04-28  7:06 AM  PM Someone Else: so for a Christmas back then
2  2022-04-28  7:06 AM                                Davids are cool
3  2022-04-28  7:06 AM                                             NO
4  2022-04-28  7:06 AM                         hello Someone Else 123
5  2022-04-28  7:06 AM                            Welcome Davids 9999
6  2022-04-28  7:06 AM                                   Someone Else
>>> 
>>> replace_values = {'Someone Else': 'Completed', 'Davids': 'David'}
>>> replace_values = {rf'\b{k}\b': v for k, v in replace_values.items()}
>>> df.replace({"Comment": replace_values}, regex=True)
                  Date                                     Comment
0  2022-04-28  7:06 AM                                          NO
1  2022-04-28  7:06 AM  PM Completed: so for a Christmas back then
2  2022-04-28  7:06 AM                              David are cool
3  2022-04-28  7:06 AM                                          NO
4  2022-04-28  7:06 AM                         hello Completed 123
5  2022-04-28  7:06 AM                          Welcome David 9999
6  2022-04-28  7:06 AM                                   Completed