Python Forum
A better explanation of the last post
#1
I think my description of the problem in the
"changing order of columns" thread might be a little
confusing, so I am adding a screenshot.

The code has changed slightly, but that is all that has changed.

Putting the file with the three non-numeric columns through the code causes
the column with no data to move from sensor 15 to sensor 16. That should
not happen. How do I fix this and still have a scaled and normalized dataframe?

Respectfully,

LZ

#2
You can either search for NaN or split each row on the column boundaries. Since the columns all appear to be the same width, this should be simple to do, i.e. pseudo-code
start, stop = 0, length
for col in range(however_many):
    this_col = row[start:stop]
    start = stop
    stop += length
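For example, a runnable version of that pseudo-code; the row text and the 10-character field width below are made-up values for illustration:

```python
# Split one fixed-width row into equal-width fields; an all-blank
# field marks a sensor with no reading. Row text and width are made up.
row = "  15.05353" + " " * 10 + "  461.8781"
width = 10

fields = []
start, stop = 0, width
for _ in range(len(row) // width):
    fields.append(row[start:stop].strip())
    start = stop
    stop += width

print(fields)  # ['15.05353', '', '461.8781'] - the empty string is the missing reading
```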
#3
I do not understand your solution. I am seeing an error where originally the column with no data is 15; after being put through the StandardScaler function, it is now column 16. I wanted to scale and normalize the numeric data in the columns. I did not expect the column with no data to jump from 15 to 16.

That just suggests to me something is wrong.

Why is it doing this, and how do I prevent it?

Respectfully,

LZ
#4
(Jul-14-2022, 02:18 PM)Led_Zeppelin Wrote: Why is it doing this and how to prevent?
I don't know what you start with to get the NaN column.
You should post a shorter code sample that we can run.
It is not hard to fix as it is now; here is a shorter code sample that can be tested out.
import pandas as pd
import numpy as np

df = pd.DataFrame(
    [np.arange(1, 5)] * 4,  # four identical rows, one per index label
    index=["a", "b", "c", "d"],
    columns=[f'sensor_{idx:02d}' for idx in range(1, 5)],
)
df.insert(2, "sensor_099", [np.nan] * 4)
>>> df
   sensor_01  sensor_02  sensor_099  sensor_03  sensor_04
a          1          2         NaN          3          4
b          1          2         NaN          3          4
c          1          2         NaN          3          4
d          1          2         NaN          3          4
Remove the NaN column, and if the remaining names then end up wrong (say sensor_03 and sensor_04 should now be sensor_04 and sensor_05), fix it by making new column names.
>>> df = df.dropna(axis='columns')
>>> df
   sensor_01  sensor_02  sensor_03  sensor_04
a          1          2          3          4
b          1          2          3          4
c          1          2          3          4
d          1          2          3          4
>>> # If the column names are wrong after removing the NaN column, this fixes them
>>> df.columns = [f'sensor_{idx:02d}' for idx in range(1,5)]
>>> df
   sensor_01  sensor_02  sensor_03  sensor_04
a          1          2          3          4
b          1          2          3          4
c          1          2          3          4
d          1          2          3          4
#5
"What we got here is a failure to communicate"

import numpy as np
import pandas as pd
# %matplotlib inline  # notebook magic; must stay commented out in a plain .py script

from sklearn.preprocessing import StandardScaler

from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose

import time
from tqdm import tqdm 
from scipy  import stats

df = pd.read_csv("shortened_sensor.csv")

print('here')

pd.set_option("display.max_rows", None, "display.max_columns", None)

df.head()

#df = df.head(5)

#df.to_csv("shortened_sensor.csv", index = False)

#df = df.head(5)

# Find Duplicate Values
# Results will be the list of duplicate values
# If no duplicate values, nothing will list.

df[df['timestamp'].duplicated(keep=False)]

df.isnull().sum()

df['machine_status'].value_counts()

# Convert timestamp column into data type into datetime

df['timestamp'] = pd.to_datetime(df['timestamp'])

# Create a Series
time_period = pd.Series([], dtype=object)

# Assign values to series
for i in tqdm(range(df.shape[0])):
    if (df["timestamp"][i].hour >= 4) and (df["timestamp"][i].hour < 10):
        time_period[i]="Morning"  
    elif (df["timestamp"][i].hour >= 10) and (df["timestamp"][i].hour < 16):
        time_period[i]="Noon"  
    elif (df["timestamp"][i].hour >= 16) and (df["timestamp"][i].hour < 22):   
        time_period[i]="Evening"
    else:    
        time_period[i]="Night"

# Insert new column time period
df.insert(2, 'time_period', time_period)

# The columns sensor_00, sensor_06, sensor_07, sensor_08, sensor_09, sensor_51:
# missing values are filled with the median value of the respective columns

df['sensor_00'].fillna(df['sensor_00'].median(), inplace=True)
df['sensor_06'].fillna(df['sensor_06'].median(), inplace=True)
df['sensor_07'].fillna(df['sensor_07'].median(), inplace=True)
df['sensor_08'].fillna(df['sensor_08'].median(), inplace=True)
df['sensor_09'].fillna(df['sensor_09'].median(), inplace=True)
df['sensor_51'].fillna(df['sensor_51'].median(), inplace=True)

df['sensor_01'].fillna(df['sensor_01'].median(), inplace=True)
df['sensor_02'].fillna(df['sensor_02'].median(), inplace=True)
df['sensor_03'].fillna(df['sensor_03'].median(), inplace=True)
df['sensor_04'].fillna(df['sensor_04'].median(), inplace=True)
df['sensor_05'].fillna(df['sensor_05'].median(), inplace=True)
df['sensor_10'].fillna(df['sensor_10'].median(), inplace=True)


df

df1 = df.copy()

df.drop(["timestamp","time_period","machine_status"], axis = 1, inplace=True)

df.head()

scaler=StandardScaler()
df=scaler.fit_transform(df)

df

columns = [f'sensor_{idx:02d}' for idx in range(53)]
df2 = pd.DataFrame(df, columns=columns)

df2.head()
Now I have attached a shortened version of my csv file. The longer version would take over 20 minutes to run and would have added nothing; it is unnecessary to show my point. It is what I have been trying to say since the topmost post. This is my fault for trying to find a shortcut to explain the issue.

To begin, I have 52 columns, with the 15th column having no data. The columns are numbered starting at 0 and going to 51, for a total of 52 columns.

Now when I run it through the StandardScaler, I get the (I hope) transformed data; the data are now scaled and normalized as well.

But wait a minute! There is a glaring inconsistency. The column with no data has moved from column 15 to column 16. That might be okay if the columns were numbered from 1 to 52, because I would still get 52 columns. But this is not the case.

So somewhere the column with no data has jumped from 15 to 16, and I have added a new column!

I am still numbering from 0 as the first column, and now the column with no data has jumped from column 15 to column 16. The columns run from sensor_00 to sensor_52, or 53 columns.

I am giving you the straight Python file that I copied from a Jupyter notebook; it took a while to create a shortened version.

I would put it in a notebook if I were you. I am also including a shortened csv file that can be run with this code.

Where did the additional column come from and why is it there?

Any help appreciated.

Respectfully,

LZ

Attached Files
.csv   shortened_sensor.csv (Size: 3.02 KB)
#6
(Jul-15-2022, 08:59 PM)Led_Zeppelin Wrote: Where did the additional column come from and why is it there?
It's not hard to figure out when you post code.
Load shortened_sensor.csv into csv-viewer-online.
Then you will see that sensor_15 has no readings, and pandas will give it a NaN value, as it should.
Quote:So somewhere the column with no data has jumped from 15 to 16, and I have added a new column!
There is no jump: sensor_15 gets NaN, and sensor_14 and sensor_16 hold the correct data as it is in the .csv file.
Output:
   sensor_09  sensor_10  sensor_11  sensor_12  sensor_13  sensor_14  \
0   15.05353   37.22740   47.52422   31.11716   1.681353   419.5747
1   15.05353   37.22740   47.52422   31.11716   1.681353   419.5747
2   15.01013   37.86777   48.17723   32.08894   1.708474   420.8480
3   15.08247   38.57977   48.65607   31.67221   1.579427   420.7494
4   15.08247   39.48939   49.06298   31.95202   1.683831   419.8926

   sensor_15  sensor_16  sensor_17  sensor_18  sensor_19  sensor_20  \
0        NaN   461.8781   466.3284   2.565284   665.3993   398.9862
1        NaN   461.8781   466.3284   2.565284   665.3993   398.9862
2        NaN   462.7798   459.6364   2.500062   666.2234   399.9418
3        NaN   462.8980   460.8858   2.509521   666.0114   399.1046
4        NaN   461.4906   468.2206   2.604785   663.2111   400.5426
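You can see the same behaviour with a tiny inline CSV. This snippet is my own sketch (built from made-up values in the same shape as the attached file, not the file itself); pandas keeps the empty column exactly where the header says it is:

```python
import io
import pandas as pd

# Two rows where sensor_15 has nothing between the commas.
csv_text = """sensor_14,sensor_15,sensor_16
419.5747,,461.8781
419.5747,,461.8781
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df)
# sensor_15 is read as NaN and stays in its own column; nothing jumps.
```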
#7
I think that I understand your explanation. My first question is: how do I fix this?

sensor_15 has no data, not sensor_16. I have created a new column, the 53rd column, but what is it made up of? I only have so much data, and it is all spoken for. What is in this new column?

I really want column 15 to be the column with no data, not column 16.

Respectfully,

LZ
#8
I scaled down the csv file and removed superfluous processing and the problem becomes immediately obvious.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

"""Contents of data2.csv file
Unnamed: 0,timestamp,sensor_00,sensor_01,sensor_02
0,4/1/2018 0:00,419.5747,,461.8781
1,4/1/2018 0:01,419.5747,,461.8781
"""
df = pd.read_csv("data2.csv")
df1 = df.copy()
df1.drop(["timestamp"], axis = 1, inplace=True)
df2 = StandardScaler().fit_transform(df1)

columns = [f'sensor_{idx:02d}' for idx in range(4)]
df2 = pd.DataFrame(df2, columns=columns)

print("df")
print(df1)
print("\ndf2")
print(df2)
Output:
df
   Unnamed: 0  sensor_00  sensor_01  sensor_02
0           0   419.5747        NaN   461.8781
1           1   419.5747        NaN   461.8781

df2
   sensor_00  sensor_01  sensor_02  sensor_03
0       -1.0        0.0        NaN        0.0
1        1.0        0.0        NaN        0.0
When you removed extra columns from df you forgot to remove "Unnamed: 0". This becomes "sensor_00" in your new dataframe.
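One way to avoid both problems at once, sketched under the same assumptions as the data2.csv sample above: drop "Unnamed: 0" along with the other non-sensor columns, and reuse the surviving labels instead of regenerating them with `range()`:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same shape as the data2.csv sample above, built inline for illustration.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "timestamp": ["4/1/2018 0:00", "4/1/2018 0:01"],
    "sensor_00": [419.5747, 419.5747],
    "sensor_01": [np.nan, np.nan],
    "sensor_02": [461.8781, 461.8781],
})
df = df.drop(columns=["Unnamed: 0", "timestamp"])  # drop ALL non-sensor columns
scaled = StandardScaler().fit_transform(df)
df2 = pd.DataFrame(scaled, columns=df.columns)     # reuse the real labels

print(df2.columns.tolist())  # ['sensor_00', 'sensor_01', 'sensor_02']
```

Because the labels come from the frame that was actually scaled, the NaN column keeps its original name no matter how many columns were dropped beforehand.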

This is ghastly.
df['sensor_00'].fillna(df['sensor_00'].median(), inplace=True)
df['sensor_06'].fillna(df['sensor_06'].median(), inplace=True)
df['sensor_07'].fillna(df['sensor_07'].median(), inplace=True)
df['sensor_08'].fillna(df['sensor_08'].median(), inplace=True)
df['sensor_09'].fillna(df['sensor_09'].median(), inplace=True)
df['sensor_51'].fillna(df['sensor_51'].median(), inplace=True)
 
df['sensor_01'].fillna(df['sensor_01'].median(), inplace=True)
df['sensor_02'].fillna(df['sensor_02'].median(), inplace=True)
df['sensor_03'].fillna(df['sensor_03'].median(), inplace=True)
df['sensor_04'].fillna(df['sensor_04'].median(), inplace=True)
df['sensor_05'].fillna(df['sensor_05'].median(), inplace=True)
All lines are identical except the label. Copy/paste should be the tool of last resort as its primary function is to generate errors that are difficult to track down.
for sensor in (6, 7, 8, 9, 51, 1, 2, 3, 4, 5):
    label = f"sensor_{sensor:02}"
    df[label].fillna(df[label].median(), inplace=True)
#9
Over a month ago you wrote

for sensor in (6, 7, 8, 9, 51, 1, 2, 3, 4, 5):
    label = f"sensor_{sensor:02}"
    df[label].fillna(df[label].median(), inplace=True)
Now my question is in several parts. Shouldn't all single-digit sensor
numbers be preceded by a 0?

Also, does it matter if all the numbers are in sequence or not?

Something like

for sensor in (01, 02, 03, 04, 05, 06, 07, 08, 09, 51):
    label = f"sensor_{sensor:02}"
    df[label].fillna(df[label].median(), inplace=True)
These are not the actual sensors that I chose, but all I am really interested in here is the syntax anyway.

Respectfully,

LZ
#10
I don't understand what you are asking with this:
Quote:Shouldn't all single-digit sensor numbers be preceded by a 0?
Are you asking if the code will pad single digit sensor numbers with a leading "0"? To that the answer is yes.
Are you asking why I am not padding all single digit sensor numbers (0 through 9)? This is an example. I forgot sensor_00.
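A quick check of that padding behaviour, using the same format spec as the loop above:

```python
# The :02 format spec pads the number to width 2 with a leading zero
# when needed, and leaves two-digit numbers alone.
labels = [f"sensor_{n:02}" for n in (0, 3, 51)]
print(labels)  # ['sensor_00', 'sensor_03', 'sensor_51']
```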

Quote:Also, does it matter if all the numbers are in sequence or not?
No, the order does not matter, as the example at the end of this post shows. But your snippet is not going to be allowed: you cannot have leading zeros in integer literals. Use regular numbers (no leading zero) and let the f-string pad the string, adding a zero when necessary.
for sensor in (01, 02, 03, 04, 05, 06, 07, 08, 09, 51):  # SyntaxError: leading zeros
    label = f"sensor_{sensor:02}"
Or you could specify the sensors as strings. This would work. It is more typing and no better than using ints, but it will work.
for sensor in ("01", "02", "03", "04", "05", "06", "07", "08", "09", "51"):
    label = f"sensor_{sensor}"
You should not be asking questions like this. This is the kind of thing where you should write a short program to test out your ideas, or in this case, your understanding of my ideas.
import numpy as np
import pandas as pd

# Make some data that includes NAN's
data = {
    "sensor_01": [np.nan, 2.0, 3.0, 4.0, 5.0],
    "sensor_02": [2.0, np.nan, 4.0, 5.0, 6.0],
    "sensor_03": [3.0, 4.0, np.nan, 6.0, 7.0],
    "sensor_04": [4.0, 5.0, 6.0, np.nan, 8.0],
}

df = pd.DataFrame(data)
print(df)

print(df.median())  # Print out the median values for each column

# Replace NAN's with column median values
for sensor in (3, 1):  # Leaving out sensors 02 and 04.  Will they get processed?
    label = f"sensor_{sensor:02}"
    df[label].fillna(df[label].median(), inplace=True)

print(df)
Output:
   sensor_01  sensor_02  sensor_03  sensor_04
0        NaN        2.0        3.0        4.0
1        2.0        NaN        4.0        5.0
2        3.0        4.0        NaN        6.0
3        4.0        5.0        6.0        NaN
4        5.0        6.0        7.0        8.0

sensor_01    3.5
sensor_02    4.5
sensor_03    5.0
sensor_04    5.5
dtype: float64

   sensor_01  sensor_02  sensor_03  sensor_04
0        3.5        2.0        3.0        4.0
1        2.0        NaN        4.0        5.0
2        3.0        4.0        5.0        6.0
3        4.0        5.0        6.0        NaN
4        5.0        6.0        7.0        8.0
As you can see, NaN in sensor_03 and sensor_01 have been replaced by median values from that column. NaN's in sensor_02 and sensor_04 were not replaced.

I do this with everything new that I try. First I write a short program to learn how it works. Then I write some test program to test boundary and unusual conditions (What if all the values are NaN?). Only after I think I really understand the new package or function do I add it to my main program.

Another benefit of writing a test program is you get to know when I screw up in an example. Finding and fixing mistakes in others' code will help you become better at finding your own mistakes. Why do you think I answer so many posts? With each one I learn something and I get a little better. Give me a decade and I'll be pretty good.

In case you were wondering, the median of (NaN, NaN, NaN, NaN, NaN) is NaN.


