Oct-22-2024, 07:47 AM
(This post was last modified: Oct-22-2024, 07:47 AM by hobbycoder.)
I'm looking to learn more about Polars and hope to leverage it more aggressively where possible. With this in mind, I'd like to work through a series of questions that will help me arrive at a robust and efficient solution to a specific problem I need to address.
I need to run a search/replace operation on a number of columns in a Polars dataframe, matching existing column contents against dictionary keys and using each key's corresponding value as the replacement value in the dataframe. I'd like to do this as efficiently as possible, as the dataframe can contain circa 1m records and the search/replace operation needs to run across 9 or more columns (sourced from a SQLite database). There can be as many as 20-30k search (key) / replacement (value) pairs.
I've created the dataframe and dictionary as follows:
import polars as pl

uri = "sqlite:///tmp/amg/dbtemplate.db"

# get search/replace records, convert to dict and drop dataframe
q = "SELECT current_val, replacement_val FROM corrections ORDER BY replacement_val, current_val"
df = pl.read_database_uri(query=q, uri=uri)
replacements_dict = dict(zip(df["current_val"], df["replacement_val"]))
del df

# get source data to be transformed
q = "SELECT rowid AS alib_rowid, col1, col2, col3, col4, col5, col6, col7, col8, col9 FROM source"
df = pl.read_database_uri(query=q, uri=uri)

This gets me as far as having a search/replace dictionary and a dataframe to run the search/replace operation against.
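As a side note, I believe the same mapping could be built straight from the frame's rows, since iter_rows() yields (current_val, replacement_val) tuples, though I don't know whether it makes any practical difference:

# alternative way to build the same mapping from the two-column frame
replacements_dict = dict(df.iter_rows())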
An added complication is that any of the columns in the df can contain multiple strings delimited by '\\'. So when comparing the value in a column, the comparison needs to be made against the individual substrings, not the concatenated string. For example, if a column in the dataframe contains "first string\\second string", the search/replace operation should check "first string" and "second string" individually, make any necessary replacements, and then rewrite the column value as the delimited string "1st val\\2nd val".
To deal with the search/replace operation using Pandas, I previously used the following inner function:
import pandas as pd

def convert_dfrow(row, delim: str = r"\\"):
    result = []
    for item in row:
        if not pd.isna(item):
            # split the cell, replace any piece found in the mapping, rejoin
            item = delim.join(replacements_dict.get(x, x) for x in item.split(delim))
        result.append(item)
    return result

For the purposes of discussion, here's a populated Polars DF and dictionary:
df = pl.DataFrame({
    "col1": ["first string\\second string", "third string\\fourth string"],
    "col2": ["another\\one", "example\\case"],
})

replacements_dict = {
    "first string": "1st val",
    "second string": "2nd val",
    "third string": "3rd val",
    "fourth string": "4th val",
    "another": "another_replacement",
    "one": "one_replacement",
}

When using Pandas, the transformation of the dataframe was accomplished by calling the inner function:
df1[['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8', 'col9']] = df[['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8', 'col9']].apply(convert_dfrow)

Polars' API Reference explicitly warns about polars.DataFrame.apply(): "This method is much slower than the native expressions API. Only use it if you cannot implement your logic otherwise."
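From reading the expressions docs, I suspect the native-expression shape is something along these lines, though this is an untested sketch and I'm not sure it's idiomatic (it assumes Expr.replace accepts a mapping and passes unmatched values through unchanged):

# untested sketch: split each cell on the delimiter, replace matching
# elements via the mapping (Expr.replace leaves non-matches as-is),
# then rejoin with the same delimiter
cols = ["col1", "col2"]
df = df.with_columns(
    pl.col(cols)
    .str.split(r"\\")
    .list.eval(pl.element().replace(replacements_dict))
    .list.join(r"\\")
)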
Two questions:
- Should I be creating a dictionary as I have above, or would it be more efficient to leverage a dataframe for the search/replace criteria and comparison operation?
- Is there a better way to accomplish what I want using the native expressions API?
Hopefully someone more familiar with Polars can provide some input/feedback?