Python Forum
Replacing values for specific columns in a pandas data structure
#1
Hello!

I'm trying to replace invalid values (x < -3 or x > 12) with NaNs in a pandas DataFrame. The first two columns contain ids and names respectively and should not be modified. I want to write general code for data with an unknown number of columns: I know that the first two columns are ids and names, but I don't know how many columns of numeric data I will have to deal with.

My question is: how can I modify the data, replacing invalid values with NaNs from the third column onwards (range(2, Columns))?
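A minimal sketch of that idea on a toy frame (the ids, names and grades below are made up): slice everything from the third column on with .iloc[:, 2:], and keep only in-range values with .where, which turns everything outside the condition into NaN.

```python
import numpy as np
import pandas as pd

# Toy frame: id, name, then an unknown number of grade columns
df = pd.DataFrame({
    0: [101, 102],
    1: ["Ann", "Bob"],
    2: [7.0, 99.0],    # 99 is out of range
    3: [-5.0, 12.0],   # -5 is out of range
})

grades = df.iloc[:, 2:]                       # everything from the third column on
df.iloc[:, 2:] = grades.where((grades >= -3) & (grades <= 12))
print(df)
```

The ids and names are never part of the comparison, so they survive untouched regardless of how many grade columns follow them.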

The code below replaces all names and ids with NaN, but does not do anything to the numeric data.

Data[(Data >= 12) & (Data <= -3)] = np.nan
The code below replaces all of rows 2 to 6 with NaN values, because .loc with range(2, Columns) selects rows by label, not columns.

Data[(Data.loc[range(2,Columns)] >= 12) & (Data.loc[range(2,Columns)] <= -3)] = np.nan
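A quick way to see why that attempt touches rows rather than columns (toy frame, made-up values): .loc[range(...)] treats the range as row labels, while .iloc[:, range(...)] treats it as column positions.

```python
import pandas as pd

df = pd.DataFrame([[1, "a", 10], [2, "b", 20], [3, "c", 30]])

# .loc[range(2, 3)] picks ROWS with labels 2..2
rows = df.loc[range(2, 3)]
# .iloc[:, range(2, 3)] picks COLUMNS at positions 2..2
cols = df.iloc[:, range(2, 3)]
print(rows.shape, cols.shape)
```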
I have attached the csv file.

Kind regards

The entire code:

import numpy as np
import chardet
import pandas as pd
 
# Open and read the csv file to detect its encoding
with open('testfilex1w.csv', 'rb') as f:
    Result = chardet.detect(f.read())
 
# Use pandas to read the csv file with the detected encoding
Data = pd.read_csv('testfilex1w.csv', encoding=Result['encoding'], header=None)
 
# Display duplicate rows (by student id, column 0) found in the DataFrame
print(Data[Data.duplicated([0], keep=False)])
 
# Drop duplicated rows based on ids (column 0) and names (column 1)
Data = Data.drop_duplicates([0], keep='last')
Data = Data.drop_duplicates([1], keep='last')
 
# Compute number of rows and columns of the original data
Columns = len(Data.columns)
Rows = len(Data.index)
 
# Create a selection of columns to group the data
Ids = Data.iloc[:, 0]
Names = Data.iloc[:, 1]
 
# The number of grade columns is unknown in the assignment, so the code
# must work with any number of columns
Grades = Data.iloc[:, 2:]
 
# Compute number of rows and columns of the grades
Gradecols = len(Grades.columns)
Graderows = len(Grades.index)
 
# Counters to keep track of rows where invalid data might exist
Rowcount = 0
Gradecount = 0
 
# Work on the underlying NumPy array of grades... because I can't figure
# out how to iterate over a pandas DataFrame :(
Gradesnp = Grades.values
 
# "Memory" array of ones for indexing: one entry per row, set to zero
# whenever that row contains an out-of-range value, so that those rows can
# be removed from the original grades data later
Valid = np.ones(Graderows)
 
# Find occurrences of out-of-range values and flag the corresponding rows
for row in Gradesnp:
    for item in row:
        if item < -3.0 or item > 12.0:
            print("Invalid grade found! Grade was {} in line {}.".format(item, Rowcount))
            Gradecount += 1
            Valid[Rowcount] = 0
    Rowcount += 1
 
# Boolean indexing: rows whose Valid entry is 1 are kept, the rest excluded
#Validgrades = Gradesnp[Valid == 1, :]
 
# Earlier attempts:
#Data[(Data >= 12) & (Data <= -3)] = np.nan
#Data[(Data.iloc[:, 2:] >= 12) & (Data.iloc[:, 2:] <= -3)] = np.nan
#Data[(Data.loc[range(2, Columns)] >= 12) & (Data.loc[range(2, Columns)] <= -3)] = np.nan
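As an aside, the double loop and the Valid bookkeeping above can be collapsed into one vectorized step (a sketch on made-up numeric grades): build a per-row boolean with .all(axis=1) and use it to filter.

```python
import pandas as pd

grades = pd.DataFrame([[7.0, 10.0], [4.0, 99.0], [12.0, -3.0]])

# A row is valid when every grade lies in [-3, 12]
valid = ((grades >= -3) & (grades <= 12)).all(axis=1)
valid_grades = grades[valid]
print(valid_grades)
```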

Attached Files

.csv   testfilex1w.csv (Size: 745 bytes / Downloads: 166)
#2
You guys are slow.

#Replace invalid grades with NaNs (valid grades lie in [-3, 12])

Data.iloc[:, 2:] = Data.iloc[:, 2:].where((Data.iloc[:, 2:] >= -3) & (Data.iloc[:, 2:] <= 12))
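The same thing can be phrased the other way around with .mask, which sets NaN where the condition IS true; a minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [1], 1: ["x"], 2: [13.0], 3: [7.0]})

# NaN out grades below -3 or above 12, leaving ids and names alone
g = df.iloc[:, 2:]
df.iloc[:, 2:] = g.mask((g < -3) | (g > 12))
print(df)
```

Building the condition from the sliced grades (rather than the whole frame) matters: comparing the name column against a number raises a TypeError.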

