Nov-26-2017, 11:42 PM
Hello!
Im trying to replace invalid values ( x< -3 and x >12) with 'nan's in a pandas data structure . The first two columns consist of ids and names respectively, and should not be modified. I want to make a general code for data with an unknown amount of column values, I know that the first two columns are ids and names but don't know the amount columns of number data that I will have to deal with.
My question is: How would I be able to modify data, replacing invalid values with nans for the third column and onwards (range(2,Columns)) ?
Using below mentioned code replaces all names and ids with nan, but does not do anything to the number data.
Using below mentioned code replaces all rows, from 2 to 6, with nan values.
I have uploaded a csv file..
Kind regards
The entirety of the code;
Im trying to replace invalid values ( x< -3 and x >12) with 'nan's in a pandas data structure . The first two columns consist of ids and names respectively, and should not be modified. I want to make a general code for data with an unknown amount of column values, I know that the first two columns are ids and names but don't know the amount columns of number data that I will have to deal with.
My question is: How would I be able to modify data, replacing invalid values with nans for the third column and onwards (range(2,Columns)) ?
Using below mentioned code replaces all names and ids with nan, but does not do anything to the number data.
1 |
Data[(Data> = 12 ) & (Data< = - 3 ] = np.nan |
1 |
Data[(Data.loc[ range ( 2 ,Columns)] > = 12 ) & (Data.loc[ range ( 2 ,Columns)]< = - 3 )] = np.nan |
Kind regards
The entirety of the code;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 |
import numpy as np import chardet import pandas as pd # Open and read csv file with open ( 'testfilex1w.csv' , 'rb' ) as f: #detect encoding of csv file, assigning encoding to -Result- Result = chardet.detect(f.read()) # Use panda to read csv file relative to above detected encoding Data = pd.read_csv( 'testfilex1w.csv' , encoding = Result[ 'encoding' ], header = None ) #Display duplicates (student ids) found in pd.dataframe -Data- print (Data[Data.duplicated([ 0 ], keep = False )]) # Drop duplicated rows based on IDs[0] and Names[1], IDS and Names are defined # below Data = Data.drop_duplicates([ 0 ], keep = 'last' ) Data = Data.drop_duplicates([ 1 ], keep = 'last' ) # Compute number of rows and columns of original data Columns = len (Data.columns) Rows = len (Data.index) # Create a selection of columns to group data Ids = Data.iloc[:, 0 ] Names = Data.iloc[:, 1 ] # The amount of grades columns is unknown in the assignment, I must therefore # create code that will work with x amount of columns Grades = Data.iloc[:, range ( 2 ,Columns)] # Compute number of rows and columns of dupliate ridden grades Gradecols = len (Grades.columns) Graderows = len (Grades.index) Elements = (Gradecols * Graderows) # Create a "memory" array of ones for indexing, in order to remove rows from # original grades data by modifying this array with zeros based on below for/ if loop. #Valid = pd.DataFrame(np.ones((len(Grades.index),len(Grades.columns)))) # Create variable to keep track of rows, where invalid data might exist Rowcount = 0 Colcount = 0 Gradecount = 0 # Create matrix array of grades... because I cant figure out how to iterate over # a pandas dataframe :( Gradesnp = Grades.values Valid = np.ones(Elements) # For loop with if statements in order to find occurences of values out of # range and modify the memory array with zeros so that I can remove out of range # values from orginal grades data. for row in Gradesnp: for item in row: if (item < - 3.0 or item > 12.0 ): print ( "Invalid grade found! Grade was {} in Line {}." . format (item, Rowcount)) #modify -Valid- array per index, with zero if the if statement is satisfied. Gradecount + = 1 Valid[Rowcount] = 0 Rowcount + = 1 # Through the use of Boolean indexing, the data array is modified by the # memory vector(-Valid-) so that values corresponding to 0 are excluded and values #corresponding to 1 are included. #Validgrades=Gradesnp[Valid==1,:] #Data[(Data>=12) & (Data<=-3] = np.nan #Data[(Data.iloc[:, range(2,Columns)]>=12) & (Data.iloc[:, range(2,Columns)]<=-3)] = np.nan #Data.loc[range(2,Columns)] = np.nan #Data.loc[range(2,Columns)] #Data[(Data.loc[range(2,Columns)] >=12) & (Data.loc[range(2,Columns)]<=-3)] = np.nan |
Attached Files