Oct-22-2024, 07:47 AM
(This post was last modified: Oct-22-2024, 07:47 AM by hobbycoder.)
I'm looking to learn more about Polars and hope to leverage it more aggressively where possible. With this in mind, I'd like to work through a series of questions that will help me arrive at a robust and efficient solution to a specific problem I need to address.
I need to run a search/replace operation on a number of columns in a Polars dataframe, matching existing column contents against dictionary keys and using each key's corresponding value as the replacement value in the dataframe. I'd like to do this as efficiently as possible, as the dataframe can contain circa 1m records and the search/replace operation needs to run across 9 or more columns (sourced from a SQLite database). There can be as many as 20-30k search (key) / replacement (value) pairs.
I've created the dataframe and dictionary as follows:
import polars as pl

uri = "sqlite:///tmp/amg/dbtemplate.db"

# get search/replace records, convert to dict and drop dataframe
q = "SELECT current_val, replacement_val FROM corrections ORDER BY replacement_val, current_val"
df = pl.read_database_uri(query=q, uri=uri)
replacements_dict = dict(zip(df["current_val"], df["replacement_val"]))
del df

# get source data to be transformed
q = "SELECT rowid AS alib_rowid, col1, col2, col3, col4, col5, col6, col7, col8, col9 FROM source"
df = pl.read_database_uri(query=q, uri=uri)

This gets me as far as having a search/replace dictionary and a dataframe to run the search/replace operation against.
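As a side note, I believe the same mapping could be built straight from the frame's rows, since iter_rows() yields (current_val, replacement_val) tuples, though I don't know whether it makes any practical difference:

# alternative way to build the same mapping from the two-column frame
replacements_dict = dict(df.iter_rows())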
An added complication is that any of the columns in the df can contain multiple strings delimited by '\\'. So when comparing the value in a column, the comparison needs to be made against the individual substrings, not the concatenated string. For example, if a column in the dataframe contains "first string\\second string", the search/replace operation should check "first string" and "second string" individually, make any necessary replacements, and then rewrite the column value as the delimited string "1st val\\2nd val".
To deal with the search/replace operation using Pandas, I previously used the following inner function:
import pandas as pd

def convert_dfrow(row, delim: str = r"\\"):
    result = []
    for item in row:
        if not pd.isna(item):
            # split the cell, replace any piece found in the mapping, rejoin
            item = delim.join(replacements_dict.get(x, x) for x in item.split(delim))
        result.append(item)
    return result

For the purposes of discussion, here's a populated Polars DF and dictionary:
df = pl.DataFrame({
    "col1": ["first string\\second string", "third string\\fourth string"],
    "col2": ["another\\one", "example\\case"],
})

replacements_dict = {
    "first string": "1st val",
    "second string": "2nd val",
    "third string": "3rd val",
    "fourth string": "4th val",
    "another": "another_replacement",
    "one": "one_replacement",
}

When using Pandas, the transformation of the dataframe was accomplished by calling the inner function:
df1[['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8', 'col9']] = df[['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8', 'col9']].apply(convert_dfrow)

Polars' API Reference explicitly warns about polars.DataFrame.apply(): "This method is much slower than the native expressions API. Only use it if you cannot implement your logic otherwise."
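From reading the expressions docs, I suspect the native-expression shape is something along these lines, though this is an untested sketch and I'm not sure it's idiomatic (it assumes Expr.replace accepts a mapping and passes unmatched values through unchanged):

# untested sketch: split each cell on the delimiter, replace matching
# elements via the mapping (Expr.replace leaves non-matches as-is),
# then rejoin with the same delimiter
cols = ["col1", "col2"]
df = df.with_columns(
    pl.col(cols)
    .str.split(r"\\")
    .list.eval(pl.element().replace(replacements_dict))
    .list.join(r"\\")
)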
Two questions:
- Should I be creating a dictionary as I have above, or would it be more efficient to leverage a dataframe for the search/replace criteria and comparison operation?
- Is there a better way to accomplish what I want using the native expressions API?
Hopefully someone more familiar with Polars can provide some input/feedback?