can't get rid of '?' within my df!

brkolvr · (This post was last modified: Aug-13-2019, 05:29 PM by brkolvr.)

Just got back on the laptop after an extended break and can't seem to complete this simple task! I've got a large dataset, and while preprocessing it I can't seem to mark and drop the rows containing a '?'. I've tried repeatedly to replace the '?'s with NaN so I can drop them, but nothing seems to affect the dataset at all.

Most solutions I've found drop rows when a value occurs in a particular column, but I don't want to go through each column; I'd rather handle the entire dataset at once. My columns also have different types (a mix of float and object), which is perhaps causing some friction.

Here's what I've tried:

train = pre_train.replace('?', 'np.Nan')

train = pre_train.replace({'?': np.nan}).dropna()

train = pre_train.replace({to_replace = "?", value = "NaN"})

train = pre_train.where(pre_train != '?', other = 'NaN')

I can't get any of these to work, so any help is appreciated. I'll include a little segment of what the dataset looks like (note there are more columns). If I do the opposite and try to drop all rows that contain an element that is not '?', I do manage to clear the df, so I'm really confused by this!

I can't seem to work out how to edit my original post, so if a mod could join these two together I would be eternally grateful!

**scidam** · Aug-13-2019, 11:16 PM

It would be better if I had the original data, or if you provided a minimal reproducible example.
Everything seems to work fine for me; look at the following example:

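import numpy as np
import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3], 'y': ['one', '?', 'two']})
# Set 'y' to NaN wherever it contains a literal '?' (regex=False: no pattern matching)
df.loc[df['y'].str.contains('?', regex=False), 'y'] = np.nan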

brkolvr · (This post was last modified: Aug-14-2019, 02:33 PM by brkolvr.)

I get a type error with this. I did upload an image, though I guess it didn't work.

Find example of data here:


	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	prediction

17	32.0	Private	186824.0	HS-grad	9.0	Never-married	Machine-op-inspct	Unmarried	White	Male	0.0	0.0	40.0	United-States	<=50K
18	38.0	Private	28887.0	11th	7.0	Married-civ-spouse	Sales	Husband	White	Male	0.0	0.0	50.0	United-States	<=50K
19	43.0	Self-emp-not-inc	292175.0	Masters	14.0	Divorced	Exec-managerial	Unmarried	White	Female	0.0	0.0	45.0	United-States	>50K
20	40.0	Private	193524.0	Doctorate	16.0	Married-civ-spouse	Prof-specialty	Husband	White	Male	0.0	0.0	60.0	United-States	>50K
21	54.0	Private	302146.0	HS-grad	9.0	Separated	Other-service	Unmarried	Black	Female	0.0	0.0	20.0	United-States	<=50K
22	35.0	Federal-gov	76845.0	9th	5.0	Married-civ-spouse	Farming-fishing	Husband	Black	Male	0.0	0.0	40.0	United-States	<=50K
23	43.0	Private	117037.0	11th	7.0	Married-civ-spouse	Transport-moving	Husband	White	Male	0.0	2042.0	40.0	United-States	<=50K
24	59.0	Private	109015.0	HS-grad	9.0	Divorced	Tech-support	Unmarried	White	Female	0.0	0.0	40.0	United-States	<=50K
25	56.0	Local-gov	216851.0	Bachelors	13.0	Married-civ-spouse	Tech-support	Husband	White	Male	0.0	0.0	40.0	United-States	>50K
26	19.0	Private	168294.0	HS-grad	9.0	Never-married	Craft-repair	Own-child	White	Male	0.0	0.0	40.0	United-States	<=50K
27	54.0	?	180211.0	Some-college	10.0	Married-civ-spouse	?	Husband	Asian-Pac-Islander	Male	0.0	0.0	60.0	South	>50K
28	39.0	Private	367260.0	HS-grad	9.0	Divorced	Exec-managerial	Not-in-family	White	Male	0.0	0.0	80.0	United-States	<=50K
29	49.0	Private	193366.0	HS-grad	9.0	Married-civ-spouse	Craft-repair	Husband	White	Male	0.0	0.0	40.0	United-States	<=50K
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...

Refer to index 27 to see '?'

I would like to search the entire df rather than a single column. Or would it be necessary to iterate through each column? That seems kinda unpythonic, though.

brkolvr · Aug-15-2019, 05:07 PM

OK, previously I had uploaded the file wrong; it contained spaces everywhere.

Here it is again:


	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	prediction
2	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	0	40	United-States	<=50K
3	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	13	United-States	<=50K
4	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	0	40	United-States	<=50K
5	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	0	40	United-States	<=50K
6	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	0	40	Cuba	<=50K
7	37	Private	284582	Masters	14	Married-civ-spouse	Exec-managerial	Wife	White	Female	0	0	40	United-States	<=50K
8	49	Private	160187	9th	5	Married-spouse-absent	Other-service	Not-in-family	Black	Female	0	0	16	Jamaica	<=50K
9	52	Self-emp-not-inc	209642	HS-grad	9	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	45	United-States	>50K
10	31	Private	45781	Masters	14	Never-married	Prof-specialty	Not-in-family	White	Female	14084	0	50	United-States	>50K
11	42	Private	159449	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	5178	0	40	United-States	>50K
12	37	Private	280464	Some-college	10	Married-civ-spouse	Exec-managerial	Husband	Black	Male	0	0	80	United-States	>50K
13	30	State-gov	141297	Bachelors	13	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	India	>50K
14	23	Private	122272	Bachelors	13	Never-married	Adm-clerical	Own-child	White	Female	0	0	30	United-States	<=50K
15	32	Private	205019	Assoc-acdm	12	Never-married	Sales	Not-in-family	Black	Male	0	0	50	United-States	<=50K
16	40	Private	121772	Assoc-voc	11	Married-civ-spouse	Craft-repair	Husband	Asian-Pac-Islander	Male	0	0	40	?	>50K

Refer to [16] for '?'. Now when I use your code it either removes all content or replaces the entire dataframe with 'nan'.

**scidam** · Aug-16-2019, 01:12 AM

Did you try something like this?

# assumes: import numpy as np
df.loc[df['native-country'].str.contains('?', regex=False), 'native-country'] = np.nan
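
For the whole-DataFrame case asked about earlier, here is a minimal sketch rather than a definitive fix. It assumes the cells may really be ' ?' with stray spaces (as the "spaces everywhere" remark suggests), which would explain why an exact match on '?' changed nothing. The toy frame and the name `clean` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
df = pd.DataFrame({
    'age': [39.0, 54.0, 40.0, 50.0],
    'workclass': ['State-gov', '?', 'Private', 'Self-emp-not-inc'],
    'native-country': ['United-States', 'South', ' ?', 'United-States'],
})

# Replace any cell that is just '?' (optionally padded with whitespace) by NaN.
# With regex=True only string cells are tested, so the float column is untouched.
marked = df.replace(r'^\s*\?\s*$', np.nan, regex=True)

# Drop every row that now holds at least one NaN, in any column.
clean = marked.dropna()
```

Note that `dropna()` also removes rows that were already missing a value for other reasons, so it may be worth checking `marked.isna().sum()` before dropping.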

brkolvr · Aug-16-2019, 01:36 AM

(Aug-16-2019, 01:12 AM)scidam Wrote: Did you try something like this?
df.loc[df['native-country'].str.contains('?', regex=False), 'native-country'] = np.nan

It worked wonderfully, thank you
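
For anyone landing here with the same problem: if the '?' markers and the extra spaces come straight from the file, it may be simpler to deal with them while loading. A rough sketch, assuming a comma-separated file with a space after each comma; the inline text below is a made-up stand-in for the real file, and `train` just mirrors the name used in the first post.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: comma-separated, space after each comma.
raw = io.StringIO(
    "age, workclass, native-country\n"
    "39, State-gov, United-States\n"
    "54, ?, South\n"
    "40, Private, ?\n"
)

# skipinitialspace drops the blank after each delimiter,
# and na_values tells the parser to read '?' as a missing value.
train = pd.read_csv(raw, skipinitialspace=True, na_values='?')

# Rows with a missing value in any column are dropped in one go.
train = train.dropna()
```

With a real file, the `io.StringIO` object would be swapped for the file path.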
