I need to convert a column that indicates machine status (normal, broken or recovering) to a numeric representation. This seems easy enough, but I want to do it in one line of Python 3 code if possible. It would be something like this:
sensor_data['label'] = sensor_data['machine_status'].map(lambda label: 0 if label == 'NORMAL' else 1)
I found this online, and I want to use it because it is only one line.
The pump has two states in the final version: normal or broken; broken includes recovering because the pump is still not functional when recovering.
I believe that this one line of Python code can handle all 220320 values in the column.
My question is, is it on the right track? Is there an even easier way to do it?
Any help appreciated.
Respectfully,
LZ
Personally, I would use index() and a look-up table like this:
sensor_data = {'machine_status': 'BROKEN'}
sensor_data ['label'] = 'NBR'.index (sensor_data ['machine_status'][0])
print (sensor_data)
Output:
{'machine_status': 'BROKEN', 'label': 1}
I don't like the index idea. It is relatively slow and doesn't handle the case where the status name is not in the list. A better solution is to use a dictionary, which is both faster and handles unexpected status names better.
I tried using map with your if statement and with a dictionary. The dictionary is slightly faster. I also tried using apply instead of map, and they are about the same.
The only way I could figure out to vectorize the substitution is using replace().
Here are my tests. Printed times are how long it took to create a new column of 20,000 values. I had to make special accommodations to prevent the index method from crashing:
import pandas as pd
import numpy as np
from random import choice
from time import time
states = {"NORMAL": 0, "BROKEN": 1, "RECOVERING": 2}
keys = list(states.keys()) + [""] # Add an invalid state
df = pd.DataFrame({"State": [choice(keys) for _ in range(20000)]})
start = time()
df["if"] = df["State"].map(
lambda x: 0
if x == "NORMAL"
else 1
if x == "BROKEN"
else 2
if x == "RECOVERING"
else np.NaN
)
print("if map", time() - start)
start = time()
df["dict"] = df["State"].map(states)
print("dict map", time() - start)
start = time()
df["if apply"] = df["State"].apply(
lambda x: 0
if x == "NORMAL"
else 1
if x == "BROKEN"
else 2
if x == "RECOVERING"
else np.NaN
)
print("if apply", time() - start)
start = time()
df["index map"] = df["State"].map(lambda x: keys.index(x))
print("index map", time() - start)
start = time()
df["replace"] = df["State"].replace("NORMAL", 0)
df["replace"] = df["replace"].replace("BROKEN", 1)
df["replace"] = df["replace"].replace("RECOVERING", 2)
print("replace", time() - start)
print(df[:10])
Output:
if map 0.0060176849365234375
dict map 0.0010302066802978516
if apply 0.007014036178588867
index map 0.005976438522338867
replace 0.003970146179199219
        State   if  dict  if apply  index map  replace
0  RECOVERING  2.0   2.0       2.0          2        2
1              NaN   NaN       NaN          3
2              NaN   NaN       NaN          3
3  RECOVERING  2.0   2.0       2.0          2        2
4              NaN   NaN       NaN          3
5      NORMAL  0.0   0.0       0.0          0        0
6      NORMAL  0.0   0.0       0.0          0        0
7  RECOVERING  2.0   2.0       2.0          2        2
8      BROKEN  1.0   1.0       1.0          1        1
9      BROKEN  1.0   1.0       1.0          1        1
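For the two-state case from the original question, the substitution can probably also be vectorized without replace(), by comparing the whole column at once. This is only a minimal sketch: the sample rows are made up, and the column names machine_status and label are taken from the first post.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names follow the original question.
sensor_data = pd.DataFrame(
    {"machine_status": ["NORMAL", "BROKEN", "RECOVERING", "NORMAL"]}
)

# Compare the whole column at once, then cast the booleans to integers.
sensor_data["label"] = (sensor_data["machine_status"] != "NORMAL").astype(int)

# np.where gives the same result and reads like the if/else one-liner.
sensor_data["label_where"] = np.where(
    sensor_data["machine_status"] == "NORMAL", 0, 1
)

print(sensor_data)
```

Note that both variants treat anything that is not NORMAL as 1, including empty or unexpected statuses.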
I see and understand these ideas. But what is wrong with my one-line proposal?
It seems simple and fast. It uses a lambda function, which I know little about, hence the post. Will it process all the entries in the column? There are 220320 of them.
Please understand that there are only two states: 1 and 0.
Although three states are listed in the machine status column, I consider broken and recovering to be a single state and normal to be the other.
Respectfully,
LZ
This is one line.
df["if"] = df["State"].map(
lambda x: 0
if x == "NORMAL"
else 1
if x == "BROKEN"
else 2
if x == "RECOVERING"
else np.NaN
)
I thought I read in your initial post that the pump had 3 states, normal, broken, recovering, but that was a mistake on my part. BashBedlam made the same mistake so that's who I'm going to blame. Is there a guarantee that the status will always be either NORMAL or BROKEN? I allow for there being no state or an unexpected state.
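If unexpected states are a concern, one option is a dictionary that maps both BROKEN and RECOVERING to 1. The sketch below is an assumption about how that could look, not a tested solution for the real data set: the sample rows, the "UNKNOWN" status and the name status_to_label are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data, including one unexpected status.
sensor_data = pd.DataFrame(
    {"machine_status": ["NORMAL", "BROKEN", "RECOVERING", "UNKNOWN"]}
)

# BROKEN and RECOVERING both map to 1: the pump is not functional in either.
status_to_label = {"NORMAL": 0, "BROKEN": 1, "RECOVERING": 1}

# Statuses missing from the dictionary become NaN instead of silently becoming 1.
sensor_data["label"] = sensor_data["machine_status"].map(status_to_label)

# Count the rows that did not match any known status.
unexpected = sensor_data["label"].isna().sum()

print(sensor_data)
print("unexpected statuses:", unexpected)
```

Because of the NaN, the label column ends up as floats here; if every status is known, it stays integer.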
If you don't like lambdas, use functions.
def status_to_number(status):
    if status == "NORMAL":
        return 0
    return 1

sensor_data['label'] = sensor_data['machine_status'].map(status_to_number)
Lambda expressions are just a way of writing unnamed functions (with a few additional limitations).
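To make that concrete, here is a small sketch in plain Python (no pandas) comparing a named function with the equivalent lambda. The name status_to_number_lambda and the sample list are made up for the example.

```python
# A named function that returns 0 for NORMAL and 1 for anything else.
def status_to_number(status):
    return 0 if status == "NORMAL" else 1

# The same logic written as a lambda and bound to a name for comparison.
status_to_number_lambda = lambda status: 0 if status == "NORMAL" else 1

statuses = ["NORMAL", "BROKEN", "RECOVERING"]

# Both callables produce the same result for every status.
named = [status_to_number(s) for s in statuses]
unnamed = [status_to_number_lambda(s) for s in statuses]

print(named, unnamed)
```

Either callable can be passed to map() on the column; the lambda just saves writing the def.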
(Jul-05-2022, 06:06 PM) Led_Zeppelin wrote: It seems simple and fast. It is using lambda function of which, I know little, hence the post. Will it do all the entries in the column? There are 220320 of them
It's just that the lambda and map are unnecessary. You could just do this:
sensor_data ['label'] = 0 if sensor_data ['machine_status'] == 'NORMAL' else 1
Also, if you are doing 220320 of them you will need to put your one-liner in some kind of loop.
I think you are missing that df is a dataframe. This does not work because df["State"] is a series, not an element of a list or value in a dictionary.
import pandas as pd
df = pd.DataFrame({"State": ["NORMAL", "BROKEN"]})
df["label"] = 0 if df["State"] == "NORMAL" else 1
Error:
Traceback (most recent call last):
File "...", line 3, in <module>
df["label"] = 0 if df["State"] == "NORMAL" else 1
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
"put your one-liner in some kind of loop" is essentially what "map()" and apply() are doing