Jul-05-2022, 05:45 PM
I don't like the index idea. It is relatively slow and doesn't handle the case where the status name is not in the list. A better solution is to use a dictionary, which is both faster and copes better with unexpected status names.
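To make the unexpected-name point concrete, here is a minimal sketch. The "UNKNOWN" value and the -1 sentinel are made up for illustration and are not from the original data. map() with a dictionary gives NaN for any name missing from the dictionary, while list.index() raises ValueError for the same input.

```python
import pandas as pd

states = {"NORMAL": 0, "BROKEN": 1, "RECOVERING": 2}
df = pd.DataFrame({"State": ["NORMAL", "BROKEN", "UNKNOWN", "RECOVERING", ""]})

# map() with a dict yields NaN for names that are not keys, so it never raises.
codes = df["State"].map(states)

# Hypothetical choice: flag unexpected names with -1 and keep an integer dtype.
df["code"] = codes.fillna(-1).astype(int)

# The index approach raises ValueError for a name that is not in the list.
try:
    list(states).index("UNKNOWN")
    index_failed = False
except ValueError:
    index_failed = True

print(df)
print("index raised ValueError:", index_failed)
```

Whether NaN or a sentinel such as -1 is the better marker depends on what the column is used for later.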
I tried using map with your if statement and with a dictionary. The dictionary is slightly faster. I also tried using apply instead of map, and they are about the same.
The only way I could figure out to vectorize the substitution is using replace().
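One other vectorized route that might be worth a look (not something I timed in the tests, so treat it as a sketch) is a pandas Categorical. It only works here on the assumption that the wanted codes are 0, 1, 2 in the same order as the dictionary keys. Names that are not in the category list get the code -1.

```python
import pandas as pd

states = {"NORMAL": 0, "BROKEN": 1, "RECOVERING": 2}
df = pd.DataFrame({"State": ["RECOVERING", "", "NORMAL", "BROKEN"]})

# Assumption: the dict values equal the key positions (0, 1, 2),
# so the category position can stand in for the status code.
cat = pd.Categorical(df["State"], categories=list(states))

# .codes is an integer array; values outside the categories become -1.
df["cat code"] = cat.codes

print(df)
```

If the codes were not simply the key positions, this shortcut would not apply and the dictionary map would be the safer choice.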
Here are my tests. Printed times are how long it took to create a new column of 20,000 values. I had to make special accommodations to prevent the index method from crashing:
import pandas as pd
import numpy as np
from random import choice
from time import time

states = {"NORMAL": 0, "BROKEN": 1, "RECOVERING": 2}
keys = list(states.keys()) + [""]  # Add an invalid state
df = pd.DataFrame({"State": [choice(keys) for _ in range(20000)]})

# map with a chained if/else expression
start = time()
df["if"] = df["State"].map(
    lambda x: 0 if x == "NORMAL"
    else 1 if x == "BROKEN"
    else 2 if x == "RECOVERING"
    else np.nan
)
print("if map", time() - start)

# map with a dictionary lookup
start = time()
df["dict"] = df["State"].map(states)
print("dict map", time() - start)

# apply with the same if/else expression
start = time()
df["if apply"] = df["State"].apply(
    lambda x: 0 if x == "NORMAL"
    else 1 if x == "BROKEN"
    else 2 if x == "RECOVERING"
    else np.nan
)
print("if apply", time() - start)

# list.index lookup; works only because "" was added to keys above
start = time()
df["index map"] = df["State"].map(lambda x: keys.index(x))
print("index map", time() - start)

# vectorized substitution, one replace() call per status name
start = time()
df["replace"] = df["State"].replace("NORMAL", 0)
df["replace"] = df["replace"].replace("BROKEN", 1)
df["replace"] = df["replace"].replace("RECOVERING", 2)
print("replace", time() - start)

print(df[:10])
Output:
if map 0.0060176849365234375
dict map 0.0010302066802978516
if apply 0.007014036178588867
index map 0.005976438522338867
replace 0.003970146179199219
        State   if  dict  if apply  index map replace
0  RECOVERING  2.0   2.0       2.0          2       2
1              NaN   NaN       NaN          3
2              NaN   NaN       NaN          3
3  RECOVERING  2.0   2.0       2.0          2       2
4              NaN   NaN       NaN          3
5      NORMAL  0.0   0.0       0.0          0       0
6      NORMAL  0.0   0.0       0.0          0       0
7  RECOVERING  2.0   2.0       2.0          2       2
8      BROKEN  1.0   1.0       1.0          1       1
9      BROKEN  1.0   1.0       1.0          1       1
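A caveat on the numbers: each time above comes from a single run measured with time(), and readings of a few milliseconds are noisy. A rough sketch of a more repeatable comparison using the standard timeit module follows. The best_of helper and the repeat counts are my own choices for illustration, and the results will differ by machine and pandas version.

```python
import timeit
from random import choice

import pandas as pd

states = {"NORMAL": 0, "BROKEN": 1, "RECOVERING": 2}
keys = list(states) + [""]  # Add an invalid state, as in the test above
df = pd.DataFrame({"State": [choice(keys) for _ in range(20000)]})


def best_of(func, repeat=5, number=10):
    """Hypothetical helper: best average time per call over several repeats."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


dict_time = best_of(lambda: df["State"].map(states))
index_time = best_of(lambda: df["State"].map(lambda x: keys.index(x)))

print(f"dict map  {dict_time:.6f}")
print(f"index map {index_time:.6f}")
```

Taking the minimum over several repeats reduces the effect of other activity on the machine, which a single start/stop measurement cannot do.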