Python Forum
A better explanation of the last post
#1
I think my description of the problem in the
"changing order of columns" thread might be a little
confusing, so I am adding a screenshot.

The code has changed slightly, but that is all that has changed.

Putting the file with the three non-numeric columns through the code causes
the column with no data to move from sensor 15 to sensor 16. That should
not happen. How do I fix this and still have a scaled and normalized dataframe?

Respectfully,

LZ

#2
You can either search for NaN or split each row on the column boundaries. Since the columns all appear to be the same width, this should be simple to do, i.e. pseudo-code
start, stop = 0, length
for col in range(however_many):
    this_col = row[start:stop]
    start = stop
    stop += length
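For example, a runnable version of that pseudo-code; the row text and the 10-character field width below are made-up values for illustration:

```python
# Split one fixed-width row into equal-width fields; an all-blank
# field marks a sensor with no reading. Row text and width are made up.
row = "  15.05353" + " " * 10 + "  461.8781"
width = 10

fields = []
start, stop = 0, width
for _ in range(len(row) // width):
    fields.append(row[start:stop].strip())
    start = stop
    stop += width

print(fields)  # ['15.05353', '', '461.8781'] - the empty string is the missing reading
```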
#3
I do not understand your solution. I am seeing an error where originally the column with no data is 15; after being put through the StandardScaler function, it is now column 16. I wanted to scale and normalize the numeric data in the columns. I did not expect the column with no data to jump from 15 to 16.

That just suggests to me something is wrong.

Why is it doing this, and how do I prevent it?

Respectfully,

LZ
#4
(Jul-14-2022, 02:18 PM)Led_Zeppelin Wrote: Why is it doing this and how to prevent?
I don't know what you start with to get the NaN column.
You should post a shorter code sample that we can run.
It is not hard to fix as it is now; here is a shorter code sample that can be tested out.
import pandas as pd
import numpy as np

df = pd.DataFrame(
    [np.arange(1, 5)] * 4,  # four identical rows, one per index label
    index=["a", "b", "c", "d"],
    columns=[f'sensor_{idx:02d}' for idx in range(1, 5)],
)
df.insert(2, "sensor_099", [np.nan] * 4)
>>> df
   sensor_01  sensor_02  sensor_099  sensor_03  sensor_04
a          1          2         NaN          3          4
b          1          2         NaN          3          4
c          1          2         NaN          3          4
d          1          2         NaN          3          4
Remove the NaN column, and if the remaining names then end up wrong (say sensor_03 and sensor_04 should now be sensor_04 and sensor_05), fix it by making new column names.
>>> df = df.dropna(axis='columns')
>>> df
   sensor_01  sensor_02  sensor_03  sensor_04
a          1          2          3          4
b          1          2          3          4
c          1          2          3          4
d          1          2          3          4
>>> # If the column names are wrong after removing the NaN column, this fixes them
>>> df.columns = [f'sensor_{idx:02d}' for idx in range(1,5)]
>>> df
   sensor_01  sensor_02  sensor_03  sensor_04
a          1          2          3          4
b          1          2          3          4
c          1          2          3          4
d          1          2          3          4
#5
"What we got here is a failure to communicate"

import numpy as np
import pandas as pd
# %matplotlib inline  # notebook magic; must stay commented out in a plain .py script

from sklearn.preprocessing import StandardScaler

from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose

import time
from tqdm import tqdm 
from scipy  import stats

df = pd.read_csv("shortened_sensor.csv")

print('here')

pd.set_option("display.max_rows", None, "display.max_columns", None)

df.head()

#df = df.head(5)

#df.to_csv("shortened_sensor.csv", index = False)

#df = df.head(5)

# Find Duplicate Values
# Results will be the list of duplicate values
# If no duplicate values, nothing will list.

df[df['timestamp'].duplicated(keep=False)]

df.isnull().sum()

df['machine_status'].value_counts()

# Convert timestamp column into data type into datetime

df['timestamp'] = pd.to_datetime(df['timestamp'])

# Create a Series
time_period = pd.Series([], dtype=object)

# Assign values to series
for i in tqdm(range(df.shape[0])):
    if (df["timestamp"][i].hour >= 4) and (df["timestamp"][i].hour < 10):
        time_period[i]="Morning"  
    elif (df["timestamp"][i].hour >= 10) and (df["timestamp"][i].hour < 16):
        time_period[i]="Noon"  
    elif (df["timestamp"][i].hour >= 16) and (df["timestamp"][i].hour < 22):   
        time_period[i]="Evening"
    else:    
        time_period[i]="Night"

# Insert new column time period
df.insert(2, 'time_period', time_period)

# The columns sensor_00, sensor_06, sensor_07, sensor_08, sensor_09, sensor_51:
# missing values are filled with the median value of the respective columns

df['sensor_00'].fillna(df['sensor_00'].median(), inplace=True)
df['sensor_06'].fillna(df['sensor_06'].median(), inplace=True)
df['sensor_07'].fillna(df['sensor_07'].median(), inplace=True)
df['sensor_08'].fillna(df['sensor_08'].median(), inplace=True)
df['sensor_09'].fillna(df['sensor_09'].median(), inplace=True)
df['sensor_51'].fillna(df['sensor_51'].median(), inplace=True)

df['sensor_01'].fillna(df['sensor_01'].median(), inplace=True)
df['sensor_02'].fillna(df['sensor_02'].median(), inplace=True)
df['sensor_03'].fillna(df['sensor_03'].median(), inplace=True)
df['sensor_04'].fillna(df['sensor_04'].median(), inplace=True)
df['sensor_05'].fillna(df['sensor_05'].median(), inplace=True)
df['sensor_10'].fillna(df['sensor_10'].median(), inplace=True)


df

df1 = df.copy()

df.drop(["timestamp","time_period","machine_status"], axis = 1, inplace=True)

df.head()

scaler=StandardScaler()
df=scaler.fit_transform(df)

df

columns = [f'sensor_{idx:02d}' for idx in range(53)]
df2 = pd.DataFrame(df, columns=columns)

df2.head()
Now I have attached a shortened version of my csv file. The longer version would take over 20 minutes to run and would have added nothing; it is unnecessary to show my point. It is what I have been trying to say since the topmost post. This is my fault for trying to find a shortcut to explain the issue.

To begin, I have 52 columns, with the 15th column having no data. The columns are numbered starting at 0 and going to 51, for a total of 52 columns.

Now when I run it through the StandardScaler, I get the (I hope) transformed data; the data are now scaled and normalized as well.

But wait a minute! There is a glaring inconsistency. The column with no data has moved from column 15 to column 16. That might be okay if the columns were numbered from 1 to 52, because I would still get 52 columns. But this is not the case.

So somewhere the column with no data has jumped from 15 to 16, and I have added a new column!

I am still numbering from 0 as the first column, and now the column with no data has jumped from column 15 to column 16. The columns run from sensor_00 to sensor_52, or 53 columns.

I am giving you the straight Python file that I copied from a Jupyter notebook; it took a while to create a shortened version.

I would put it in a notebook if I were you. I am also including a shortened csv file that can be run with this code.

Where did the additional column come from and why is it there?

Any help appreciated.

Respectfully,

LZ

Attached Files
.csv   shortened_sensor.csv (Size: 3.02 KB)
#6
(Jul-15-2022, 08:59 PM)Led_Zeppelin Wrote: Where did the additional column come from and why is it there?
It's not hard to figure out when you post code.
Load shortened_sensor.csv into csv-viewer-online.
Then you will see that sensor_15 has no readings, and pandas will give it a NaN value, as it should.
Quote:So somewhere the column with no data has jumped from 15 to 16, and I have added a new column!
There is no jump: sensor_15 gets NaN, and sensor_14 and sensor_16 hold the correct data as it is in the .csv file.
Output:
   sensor_09  sensor_10  sensor_11  sensor_12  sensor_13  sensor_14  \
0   15.05353   37.22740   47.52422   31.11716   1.681353   419.5747
1   15.05353   37.22740   47.52422   31.11716   1.681353   419.5747
2   15.01013   37.86777   48.17723   32.08894   1.708474   420.8480
3   15.08247   38.57977   48.65607   31.67221   1.579427   420.7494
4   15.08247   39.48939   49.06298   31.95202   1.683831   419.8926

   sensor_15  sensor_16  sensor_17  sensor_18  sensor_19  sensor_20  \
0        NaN   461.8781   466.3284   2.565284   665.3993   398.9862
1        NaN   461.8781   466.3284   2.565284   665.3993   398.9862
2        NaN   462.7798   459.6364   2.500062   666.2234   399.9418
3        NaN   462.8980   460.8858   2.509521   666.0114   399.1046
4        NaN   461.4906   468.2206   2.604785   663.2111   400.5426
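You can see the same behaviour with a tiny inline CSV. This snippet is my own sketch (built from made-up values in the same shape as the attached file, not the file itself); pandas keeps the empty column exactly where the header says it is:

```python
import io
import pandas as pd

# Two rows where sensor_15 has nothing between the commas.
csv_text = """sensor_14,sensor_15,sensor_16
419.5747,,461.8781
419.5747,,461.8781
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df)
# sensor_15 is read as NaN and stays in its own column; nothing jumps.
```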
#7
I think that I understand your explanation. My first question is: how do I fix this?

sensor_15 has no data, not sensor_16. I have created a new column, the 53rd column, but what is it made up of? I only have so much data, and it is all spoken for. What is in this new column?

I really want column 15 to be the column with no data, not column 16.

Respectfully,

LZ
#8
I scaled down the csv file and removed superfluous processing and the problem becomes immediately obvious.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

"""Contents of data2.csv file
Unnamed: 0,timestamp,sensor_00,sensor_01,sensor_02
0,4/1/2018 0:00,419.5747,,461.8781
1,4/1/2018 0:01,419.5747,,461.8781
"""
df = pd.read_csv("data2.csv")
df1 = df.copy()
df1.drop(["timestamp"], axis = 1, inplace=True)
df2 = StandardScaler().fit_transform(df1)

columns = [f'sensor_{idx:02d}' for idx in range(4)]
df2 = pd.DataFrame(df2, columns=columns)

print("df")
print(df1)
print("\ndf2")
print(df2)
Output:
df
   Unnamed: 0  sensor_00  sensor_01  sensor_02
0           0   419.5747        NaN   461.8781
1           1   419.5747        NaN   461.8781

df2
   sensor_00  sensor_01  sensor_02  sensor_03
0       -1.0        0.0        NaN        0.0
1        1.0        0.0        NaN        0.0
When you removed extra columns from df you forgot to remove "Unnamed: 0". This becomes "sensor_00" in your new dataframe.
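One way to avoid both problems at once, sketched under the same assumptions as the data2.csv sample above: drop "Unnamed: 0" along with the other non-sensor columns, and reuse the surviving labels instead of regenerating them with `range()`:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same shape as the data2.csv sample above, built inline for illustration.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "timestamp": ["4/1/2018 0:00", "4/1/2018 0:01"],
    "sensor_00": [419.5747, 419.5747],
    "sensor_01": [np.nan, np.nan],
    "sensor_02": [461.8781, 461.8781],
})
df = df.drop(columns=["Unnamed: 0", "timestamp"])  # drop ALL non-sensor columns
scaled = StandardScaler().fit_transform(df)
df2 = pd.DataFrame(scaled, columns=df.columns)     # reuse the real labels

print(df2.columns.tolist())  # ['sensor_00', 'sensor_01', 'sensor_02']
```

Because the labels come from the frame that was actually scaled, the NaN column keeps its original name no matter how many columns were dropped beforehand.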

This is ghastly.
df['sensor_00'].fillna(df['sensor_00'].median(), inplace=True)
df['sensor_06'].fillna(df['sensor_06'].median(), inplace=True)
df['sensor_07'].fillna(df['sensor_07'].median(), inplace=True)
df['sensor_08'].fillna(df['sensor_08'].median(), inplace=True)
df['sensor_09'].fillna(df['sensor_09'].median(), inplace=True)
df['sensor_51'].fillna(df['sensor_51'].median(), inplace=True)
 
df['sensor_01'].fillna(df['sensor_01'].median(), inplace=True)
df['sensor_02'].fillna(df['sensor_02'].median(), inplace=True)
df['sensor_03'].fillna(df['sensor_03'].median(), inplace=True)
df['sensor_04'].fillna(df['sensor_04'].median(), inplace=True)
df['sensor_05'].fillna(df['sensor_05'].median(), inplace=True)
All lines are identical except the label. Copy/paste should be the tool of last resort as its primary function is to generate errors that are difficult to track down.
for sensor in (6, 7, 8, 9, 51, 1, 2, 3, 4, 5):
    label = f"sensor_{sensor:02}"
    df[label].fillna(df[label].median(), inplace=True)
#9
Over a month ago you wrote

for sensor in (6, 7, 8, 9, 51, 1, 2, 3, 4, 5):
    label = f"sensor_{sensor:02}"
    df[label].fillna(df[label].median(), inplace=True)
Now my question is in several parts. Shouldn't all single-digit sensor
numbers be preceded by a 0?

Also, does it matter if all the numbers are in sequence or not?

Something like

for sensor in (01, 02, 03, 04, 05, 06, 07, 08, 09, 51):
    label = f"sensor_{sensor:02}"
    df[label].fillna(df[label].median(), inplace=True)
These are not the actual sensors that I chose, but all I am really interested in here is the syntax anyway.

Respectfully,

LZ
#10
I don't understand what you are asking with this:
Quote:Shouldn't all single-digit sensor numbers be preceded by a 0?
Are you asking if the code will pad single digit sensor numbers with a leading "0"? To that the answer is yes.
Are you asking why I am not padding all single digit sensor numbers (0 through 9)? This is an example. I forgot sensor_00.
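A quick check of that padding behaviour, using the same format spec as the loop above:

```python
# The :02 format spec pads the number to width 2 with a leading zero
# when needed, and leaves two-digit numbers alone.
labels = [f"sensor_{n:02}" for n in (0, 3, 51)]
print(labels)  # ['sensor_00', 'sensor_03', 'sensor_51']
```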

Quote:Also, does it matter if all the numbers are in sequence or not?
No, the order does not matter, as the example at the end of this post shows. But your snippet is not going to be allowed: you cannot have leading zeros in integer literals. Use regular numbers (no leading zero) and let the f-string pad the string, adding a zero when necessary.
for sensor in (01, 02, 03, 04, 05, 06, 07, 08, 09, 51):  # SyntaxError: leading zeros
    label = f"sensor_{sensor:02}"
Or you could specify the sensors as strings. This would work. It is more typing and no better than using ints, but it will work.
for sensor in ("01", "02", "03", "04", "05", "06", "07", "08", "09", "51"):
    label = f"sensor_{sensor}"
You should not be asking questions like this. This is the kind of thing where you should write a short program to test out your ideas, or in this case, your understanding of my ideas.
import numpy as np
import pandas as pd

# Make some data that includes NAN's
data = {
    "sensor_01": [np.nan, 2.0, 3.0, 4.0, 5.0],
    "sensor_02": [2.0, np.nan, 4.0, 5.0, 6.0],
    "sensor_03": [3.0, 4.0, np.nan, 6.0, 7.0],
    "sensor_04": [4.0, 5.0, 6.0, np.nan, 8.0],
}

df = pd.DataFrame(data)
print(df)

print(df.median())  # Print out the median values for each column

# Replace NAN's with column median values
for sensor in (3, 1):  # Leaving out sensors 02 and 04.  Will they get processed?
    label = f"sensor_{sensor:02}"
    df[label].fillna(df[label].median(), inplace=True)

print(df)
Output:
   sensor_01  sensor_02  sensor_03  sensor_04
0        NaN        2.0        3.0        4.0
1        2.0        NaN        4.0        5.0
2        3.0        4.0        NaN        6.0
3        4.0        5.0        6.0        NaN
4        5.0        6.0        7.0        8.0

sensor_01    3.5
sensor_02    4.5
sensor_03    5.0
sensor_04    5.5
dtype: float64

   sensor_01  sensor_02  sensor_03  sensor_04
0        3.5        2.0        3.0        4.0
1        2.0        NaN        4.0        5.0
2        3.0        4.0        5.0        6.0
3        4.0        5.0        6.0        NaN
4        5.0        6.0        7.0        8.0
As you can see, NaN in sensor_03 and sensor_01 have been replaced by median values from that column. NaN's in sensor_02 and sensor_04 were not replaced.

I do this with everything new that I try. First I write a short program to learn how it works. Then I write some test program to test boundary and unusual conditions (What if all the values are NaN?). Only after I think I really understand the new package or function do I add it to my main program.

Another benefit of writing a test program is you get to know when I screw up in an example. Finding and fixing mistakes in others' code will help you become better at finding your own mistakes. Why do you think I answer so many posts? With each one I learn something and I get a little better. Give me a decade and I'll be pretty good.

In case you were wondering, the median of (NaN, NaN, NaN, NaN, NaN) is NaN.


