I've just got back on the laptop after an extended break and can't seem to complete this simple task! I've got a large dataset and, while going about my preprocessing, I can't seem to mark and drop the rows containing a '?'. I've tried repeatedly to replace the '?'s with NaN so I can drop them, but nothing seems to affect the dataset whatsoever.
Most answers I've found drop rows when a value occurs in a particular column, but I don't want to go through each column; I'd rather handle the entire dataset at once. Also, my columns have different types, mixed between float and object, which may be causing some friction.
Here's what I've tried:
train = pre_train.replace('?', 'np.Nan')
train = pre_train.replace({'?': np.nan}).dropna()
train = pre_train.replace({to_replace = "?", value = "NaN"})
train = pre_train.where(pre_train != '?', other = 'NaN')
None of these work, so any help is appreciated. Below I'll post a small segment of what the dataset looks like (note there are more columns). If I do the opposite and try to rid my df of all rows that contain an element that is not '?', I do manage to clear the df, so I'm really confused by this!
I can't seem to work out how to edit my original post, so if a mod could join these two together I would be eternally grateful!
It would be better if I had the original data, or if you provided a minimal reproducible example.
Everything seems to work fine for me; look at the following example:
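import numpy as np
import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3], 'y': ['one', '?', 'two']})
# Set 'y' to NaN wherever it contains a literal '?' (regex=False turns off pattern matching)
df.loc[df.y.str.contains('?', regex=False), 'y'] = np.nan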
I get a type error with this. I did upload an image, though I guess it didn't work.
Find an example of the data here:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country prediction
17 32.0 Private 186824.0 HS-grad 9.0 Never-married Machine-op-inspct Unmarried White Male 0.0 0.0 40.0 United-States <=50K
18 38.0 Private 28887.0 11th 7.0 Married-civ-spouse Sales Husband White Male 0.0 0.0 50.0 United-States <=50K
19 43.0 Self-emp-not-inc 292175.0 Masters 14.0 Divorced Exec-managerial Unmarried White Female 0.0 0.0 45.0 United-States >50K
20 40.0 Private 193524.0 Doctorate 16.0 Married-civ-spouse Prof-specialty Husband White Male 0.0 0.0 60.0 United-States >50K
21 54.0 Private 302146.0 HS-grad 9.0 Separated Other-service Unmarried Black Female 0.0 0.0 20.0 United-States <=50K
22 35.0 Federal-gov 76845.0 9th 5.0 Married-civ-spouse Farming-fishing Husband Black Male 0.0 0.0 40.0 United-States <=50K
23 43.0 Private 117037.0 11th 7.0 Married-civ-spouse Transport-moving Husband White Male 0.0 2042.0 40.0 United-States <=50K
24 59.0 Private 109015.0 HS-grad 9.0 Divorced Tech-support Unmarried White Female 0.0 0.0 40.0 United-States <=50K
25 56.0 Local-gov 216851.0 Bachelors 13.0 Married-civ-spouse Tech-support Husband White Male 0.0 0.0 40.0 United-States >50K
26 19.0 Private 168294.0 HS-grad 9.0 Never-married Craft-repair Own-child White Male 0.0 0.0 40.0 United-States <=50K
27 54.0 ? 180211.0 Some-college 10.0 Married-civ-spouse ? Husband Asian-Pac-Islander Male 0.0 0.0 60.0 South >50K
28 39.0 Private 367260.0 HS-grad 9.0 Divorced Exec-managerial Not-in-family White Male 0.0 0.0 80.0 United-States <=50K
29 49.0 Private 193366.0 HS-grad 9.0 Married-civ-spouse Craft-repair Husband White Male 0.0 0.0 40.0 United-States <=50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Refer to index 27 to see the '?'.
I would like to search the entire df rather than a single column. Or would it be necessary to iterate through each column? That seems kind of un-Pythonic, though.
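A whole-DataFrame search is possible without looping over columns by hand. The following is only a minimal sketch under assumptions: the three-row sample frame is made up (column names borrowed from the data above), and it assumes the '?' cells may carry stray whitespace, which would stop an exact match on '?' from finding them. It compares every cell as a stripped string and drops any row where at least one cell equals '?'.

```python
import pandas as pd

# Hypothetical sample; column names borrowed from the data above.
df = pd.DataFrame({
    "age": [54.0, 39.0, 40.0],
    "workclass": ["?", "Private", "Private"],
    "native-country": ["South", "United-States", " ?"],
})

# Compare every cell as a stripped string so float and object columns are treated alike.
as_text = df.astype(str).apply(lambda col: col.str.strip())

# True for each row that has '?' in at least one column.
mask = as_text.eq("?").any(axis=1)

# Keep only the rows without any '?'.
cleaned = df[~mask]
```

The original frame is left untouched, so the mask can be inspected first (for example with mask.sum()) before deciding to drop anything.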
OK, previously I had uploaded the file wrong; it contained spaces everywhere.
Here it is again:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country prediction
2 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
3 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
4 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
5 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
6 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
7 37 Private 284582 Masters 14 Married-civ-spouse Exec-managerial Wife White Female 0 0 40 United-States <=50K
8 49 Private 160187 9th 5 Married-spouse-absent Other-service Not-in-family Black Female 0 0 16 Jamaica <=50K
9 52 Self-emp-not-inc 209642 HS-grad 9 Married-civ-spouse Exec-managerial Husband White Male 0 0 45 United-States >50K
10 31 Private 45781 Masters 14 Never-married Prof-specialty Not-in-family White Female 14084 0 50 United-States >50K
11 42 Private 159449 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 5178 0 40 United-States >50K
12 37 Private 280464 Some-college 10 Married-civ-spouse Exec-managerial Husband Black Male 0 0 80 United-States >50K
13 30 State-gov 141297 Bachelors 13 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 India >50K
14 23 Private 122272 Bachelors 13 Never-married Adm-clerical Own-child White Female 0 0 30 United-States <=50K
15 32 Private 205019 Assoc-acdm 12 Never-married Sales Not-in-family Black Male 0 0 50 United-States <=50K
16 40 Private 121772 Assoc-voc 11 Married-civ-spouse Craft-repair Husband Asian-Pac-Islander Male 0 0 40 ? >50K
Refer to [16] for the '?'. Now when I use your code, it either removes all content or replaces the entire dataframe with 'nan'.
Did you try something like this?
import numpy as np
df.loc[df['native-country'].str.contains('?', regex=False), 'native-country'] = np.nan
(Aug-16-2019, 01:12 AM) scidam wrote: Did you try something like this?
import numpy as np
df.loc[df['native-country'].str.contains('?', regex=False), 'native-country'] = np.nan
It worked wonderfully, thank you
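For anyone finding this thread later: if the data comes from a delimited text file, the '?' markers can also be turned into NaN at load time. This is a hedged sketch, not what was used above. The inline CSV text is invented for illustration, and it assumes a comma-separated file with a space after each comma. na_values and skipinitialspace are standard pd.read_csv parameters.

```python
import io
import pandas as pd

# Made-up CSV text standing in for the real file (assumption: comma-separated,
# with a space after each delimiter).
raw = io.StringIO(
    "age,workclass,native-country\n"
    "39, State-gov, United-States\n"
    "54, ?, South\n"
    "40, Private, ?\n"
)

# skipinitialspace drops the space after each comma; na_values marks '?' as missing.
df = pd.read_csv(raw, na_values="?", skipinitialspace=True)

# Drop every row that has a missing value in any column.
cleaned = df.dropna()
```

With the markers read in as NaN, a single dropna() covers all columns at once, whatever their dtype.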