Hi,
I'm working on data cleaning for a while (already posted some questions about this

) and this one is really bit "complicated" for me(beginner :p).
So this is how my raw file looks like:
csv.file
How can I drop those rows that are not in a dateformat(the column type is still an object NOT datetime64[ns]!!!)
I tried this approach :
df= pd.read_csv('file.csv', header = None)
for index, row in df.iterrows():
if df[df[1].apply(lambda x: type(x)==int)]:
df.drop(index, inplace=True)
but didnt work
Thks
Karlito
Your
csv.file
link dos not work.
Can do it like this,it's not
type()
in pandas but
dtypes
.
>>> import pandas as pd
>>> df = pd.DataFrame({"x": ["a", "b", "c"], "y": [1, 2, 3], "z": ["d", "e", "f"]})
>>> df
x y z
0 a 1 d
1 b 2 e
2 c 3 f
>>> df.dtypes
x object
y int64
z object
dtype: object
>>> df = df.select_dtypes(exclude=['object'])
>>> df
y
0 1
1 2
2 3
An other approach is to convert to correct types if that's needed.
>>> df = pd.DataFrame({"x": ["4", "5", "6"], "y": [1, 2, 3], "z": ["d", "e", "f"]})
>>> df
x y z
0 4 1 d
1 5 2 e
2 6 3 f
>>> df.dtypes
x object
y int64
z object
dtype: object
>>> df['x'] = df['x'].astype('int')
>>> df.dtypes
x int32 # Now integer
y int64
z object
dtype: object
(Oct-26-2019, 11:57 AM)snippsat Wrote: [ -> ]Your csv.file
link dos not work.
Can do it like this,it's not type()
in pandas but dtypes
.
>>> import pandas as pd
>>> df = pd.DataFrame({"x": ["a", "b", "c"], "y": [1, 2, 3], "z": ["d", "e", "f"]})
>>> df
x y z
0 a 1 d
1 b 2 e
2 c 3 f
>>> df.dtypes
x object
y int64
z object
dtype: object
>>> df = df.select_dtypes(exclude=['object'])
>>> df
y
0 1
1 2
2 3
An other approach is to convert to correct types if that's needed.
>>> df = pd.DataFrame({"x": ["4", "5", "6"], "y": [1, 2, 3], "z": ["d", "e", "f"]})
>>> df
x y z
0 4 1 d
1 5 2 e
2 6 3 f
>>> df.dtypes
x object
y int64
z object
dtype: object
>>> df['x'] = df['x'].astype('int')
>>> df.dtypes
x int32 # Now integer
y int64
z object
dtype: object
Hi Thks for replying. I want to drop rows not columns ... hier ist the file
file.csv
I want to drop rows that doesnt start/look like a date (although the dtypes is still an object and not dateimt64[ns])
Thks
Karlito
(Oct-26-2019, 02:14 PM)karlito Wrote: [ -> ]I want to drop rows that doesnt start/look like a date
Ok,is better if you post a sample of csv file and not image.
Can not use data from a image to test with.
>>> import pandas as pd
>>> df = pd.DataFrame({'date': ['12.06.2017', '2003', '114999999', '20.06.2017', '2.08.2018', '554777777'],'value': range(6)})
>>> df
date value
0 12.06.2017 0
1 2003 1
2 114999999 2
3 20.06.2017 3
4 2.08.2018 4
5 554777777 5
# Make NaT of row to be dropped
>>> pd.to_datetime(df['date'], format='%d.%m.%Y', errors='coerce')
0 2017-06-12
1 NaT
2 NaT
3 2017-06-20
4 2018-08-02
5 NaT
Name: date, dtype: datetime64[ns]
# Apply
>>> df['date'] = pd.to_datetime(df['date'], format='%d.%m.%Y', errors='coerce')
>>> df = df.dropna()
>>> df
date value
0 2017-06-12 0
3 2017-06-20 3
4 2018-08-02 4
# Turn date around to original format if that's needed
>>> df['date'] = df['date'].dt.strftime('%d.%m.%Y')
>>> df
date value
0 12.06.2017 0
3 20.06.2017 3
4 02.08.2018 4
(Oct-26-2019, 04:51 PM)snippsat Wrote: [ -> ] (Oct-26-2019, 02:14 PM)karlito Wrote: [ -> ]I want to drop rows that doesnt start/look like a date
Ok,is better if you post a sample of csv file and not image.
Can not use data from a image to test with.
>>> import pandas as pd
>>> df = pd.DataFrame({'date': ['12.06.2017', '2003', '114999999', '20.06.2017', '2.08.2018', '554777777'],'value': range(6)})
>>> df
date value
0 12.06.2017 0
1 2003 1
2 114999999 2
3 20.06.2017 3
4 2.08.2018 4
5 554777777 5
# Make NaT of row to be dropped
>>> pd.to_datetime(df['date'], format='%d.%m.%Y', errors='coerce')
0 2017-06-12
1 NaT
2 NaT
3 2017-06-20
4 2018-08-02
5 NaT
Name: date, dtype: datetime64[ns]
# Apply
>>> df['date'] = pd.to_datetime(df['date'], format='%d.%m.%Y', errors='coerce')
>>> df = df.dropna()
>>> df
date value
0 2017-06-12 0
3 2017-06-20 3
4 2018-08-02 4
# Turn date around to original format if that's needed
>>> df['date'] = df['date'].dt.strftime('%d.%m.%Y')
>>> df
date value
0 12.06.2017 0
3 20.06.2017 3
4 02.08.2018 4
Hi Snippsat,
Thks for your help but I tried it and it doesn't work
link :
error
import pandas as pd
df = pd.DataFrame({0: ['09.05.2017 13:56', '1494331179', '1494331625', '09.05.2017 14:11', '944006550', '03.07.2017 16:50'],1: range(6)})
df
0 1
0 09.05.2017 13:56 0
1 1494331179 1
2 1494331625 2
3 09.05.2017 14:11 3
4 944006550 4
5 03.07.2017 16:50 5
>>> import pandas as pd
>>> df = pd.DataFrame({0: ['09.05.2017 13:56', '1494331179', '1494331625', '09.05.2017 14:11', '944006550', '03.07.2017 16:50'],1: range(6)})
>>> pd.to_datetime(df[0], format='%d.%m.%Y %H:%M', errors='coerce')
0 2017-05-09 13:56:00
1 NaT
2 NaT
3 2017-05-09 14:11:00
4 NaT
5 2017-07-03 16:50:00
Name: 0, dtype: datetime64[ns]
You are missing %H
:
%M
(Oct-28-2019, 10:55 AM)snippsat Wrote: [ -> ]>>> import pandas as pd
>>> df = pd.DataFrame({0: ['09.05.2017 13:56', '1494331179', '1494331625', '09.05.2017 14:11', '944006550', '03.07.2017 16:50'],1: range(6)})
>>> pd.to_datetime(df[0], format='%d.%m.%Y %H:%M', errors='coerce')
0 2017-05-09 13:56:00
1 NaT
2 NaT
3 2017-05-09 14:11:00
4 NaT
5 2017-07-03 16:50:00
Name: 0, dtype: datetime64[ns]
You are missing %H:
%M
Thks ... sometimes I think the mistake is too big without taking the time to analyze my code. sorry about that.