Python Forum
Running search/replace across Polars dataframe columns efficiently
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Running search/replace across Polars dataframe columns efficiently
#1
I'm looking to learn more about Polars and hope to leverage Polars more aggressively where possible. With this in mind I'd like to work through a series of questions that help me to arrive at a robust and efficient solution to a specific problem I need to address.

I need to run a search/replace operation on a number of columns in a Polars dataframe, by matching existing column contents with a dictionary key, and then using the key's corresponding value as the column replacement value in the dataframe. I'd like to do this as efficiently as possible as the the dataframe can contain circa 1m records and the search/replace operation needs to run across 9 or more columns (sourced from a SQLite database). The search (key) and replacement (value) pairs can be as many as 20-30k key/value pairs.

I've created the dataframe and dictionary as follows:

import polars as pl
uri = "sqlite:///tmp/amg/dbtemplate.db"

# get search/replace records, convert to dict and drop dataframe
q = ("SELECT current_val, replacement_val FROM corrections ORDER BY replacement_val, current_val")
df = pl.read_database_uri(query=q, uri=uri)
replacements_dict = dict(zip(df["current_val"], df["replacement_val"]))
del df

# get source data to be transformed
q = ("SELECT rowid AS alib_rowid, col1, col2, col3, col4, col5, col6, col7, col8, col9 FROM source")
df = pl.read_database_uri(query=q, uri=uri)
This gets me as far as having a search/replace dictionary and a dataframe to run the search/ replace operation against.

An added complication is that any of the columns in the df can comprise multiple strings, delimited by '\\'. So, when comparing the value in a column, the comparison needs to be against the individual words, not the concatenated string. For example, if a column in the dataframe contains:
"first string\\second string", the search / replace operation should check on first string and second string, making any necessary replacements and then rewriting the column value as the delimited string "1st val\\2nd val".

To deal with the search/replace operation using Pandas I previously had an inner function as follows:
def convert_dfrow(row, delim: str = r"\\"):

    result = []

    for item in row:
        if not pd.isna(item):
            item = delim.join(replacements_dict.get(x, x) for x in item.split(delim))

        result.append(item)

    return result
For the purposes of discussion, here's a populated Polars DF and dictionary:
df = pl.DataFrame({
    "col1": ["first string\\second string", "third string\\fourth string"],
    "col2": ["another\\one", "example\\case"]
})


replace_dict = {
    "first string": "1st val",
    "second string": "2nd val",
    "third string": "3rd val",
    "fourth string": "4th val",
    "another": "another_replacement",
    "one": "one_replacement"
}
When using Pandas the transformation of the dataframe was accomplished by calling the inner function
df1[['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8', 'col9']] = df[['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8', 'col9']].apply(convert_dfrow)
Polar's API Reference explicitly warns about polars.DataFrame.apply(): 'This method is much slower than the native expressions API. Only use it if you cannot implement your logic otherwise".

Two questions:
  1. Should I be creating a dictionary as I have above or would it be more efficient to leverage a dataframe for the search / replace criteria and comparison operation?
  2. Is there a better way to accomplish what I want using the native expressions API?

Hopefully someone more familiar with Polars can provide some input/feedback?
Reply
#2
Can I ask, is it the way I've asked, or what I've asked that's caused nobody to respond?
Reply
#3
I have never heard of polars! But then, there must be lots of modules I have never heard of!

Looking it up in a search engine, it seems like a hard sell!

Sooooo much better and faster than all other dfs! Really? Or just salestalk?

Maybe ask the question you have about pandas, then convert that answer to polars, if polars really is so easy and fast!
Reply
#4
Ok, same df’s, facts and questions, only I’m using Pandas?
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Converting a json file to a dataframe with rows and columns eyavuz21 13 14,349 Jan-29-2023, 03:59 PM
Last Post: eyavuz21
  How to read in mulitple files efficiently garynewport 3 1,618 Jan-27-2023, 10:44 AM
Last Post: DeaD_EyE
  Replace columns indexes reading a XSLX file Larry1888 2 1,758 Nov-18-2022, 10:16 PM
Last Post: Pedroski55
  Replace for loop to search index position illmattic 5 2,359 Sep-03-2022, 04:04 PM
Last Post: illmattic
  Nested for loops: Iterating over columns of a DataFrame to plot on subplots dm222 0 2,975 Aug-19-2022, 11:07 AM
Last Post: dm222
  How to efficiently average same entries of lists in a list xquad 5 3,287 Dec-17-2021, 04:44 PM
Last Post: xquad
  Apply fillna to multiple columns in dataframe rraillon 2 3,974 Aug-05-2021, 01:11 PM
Last Post: rraillon
  How to rename dataframe columns based on the content in an index? ar_mahdavi 2 3,308 Jun-07-2021, 06:09 AM
Last Post: ricslato
  Cloning a directory and using a .CSV file as a reference to search and replace bg25lam 2 2,900 May-31-2021, 07:00 AM
Last Post: bowlofred
  How to sum across variable columns in a dataframe rennerom 2 4,119 Jan-31-2021, 05:44 PM
Last Post: rennerom

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020