drop rows that doesnt matched

karlito · (This post was last modified: Oct-25-2019, 05:58 PM by karlito.)

Hi,
I'm working on data cleaning for a while (already posted some questions about this Wink

) and this one is really bit "complicated" for me(beginner :p).
So this is how my raw file looks like: csv.file
How can I drop those rows that are not in a dateformat(the column type is still an object NOT datetime64[ns]!!!)

I tried this approach :

df= pd.read_csv('file.csv', header = None)
for index, row in df.iterrows():
    if df[df[1].apply(lambda x: type(x)==int)]:
        df.drop(index, inplace=True)

but didnt work
Thks
Karlito

***snippsat*** · (This post was last modified: Oct-26-2019, 11:57 AM by snippsat.)

Your csv.file link dos not work.
Can do it like this,it's not type() in pandas but dtypes.

>>> import pandas as pd

>>> df = pd.DataFrame({"x": ["a", "b", "c"], "y": [1, 2, 3], "z": ["d", "e", "f"]})

>>> df
   x  y  z
0  a  1  d
1  b  2  e
2  c  3  f

>>> df.dtypes
x    object
y     int64
z    object
dtype: object

>>> df = df.select_dtypes(exclude=['object'])
>>> df
   y
0  1
1  2
2  3

An other approach is to convert to correct types if that's needed.

>>> df = pd.DataFrame({"x": ["4", "5", "6"], "y": [1, 2, 3], "z": ["d", "e", "f"]})

>>> df
   x  y  z
0  4  1  d
1  5  2  e
2  6  3  f

>>> df.dtypes
x    object
y     int64
z    object
dtype: object

>>> df['x'] = df['x'].astype('int')
>>> df.dtypes
x     int32 # Now integer
y     int64
z    object
dtype: object

karlito · Oct-26-2019, 02:14 PM

(Oct-26-2019, 11:57 AM)snippsat Wrote: Your csv.file link dos not work.
Can do it like this,it's not type() in pandas but dtypes.

>>> import pandas as pd

>>> df = pd.DataFrame({"x": ["a", "b", "c"], "y": [1, 2, 3], "z": ["d", "e", "f"]})

>>> df
   x  y  z
0  a  1  d
1  b  2  e
2  c  3  f

>>> df.dtypes
x    object
y     int64
z    object
dtype: object

>>> df = df.select_dtypes(exclude=['object'])
>>> df
   y
0  1
1  2
2  3

An other approach is to convert to correct types if that's needed.

>>> df = pd.DataFrame({"x": ["4", "5", "6"], "y": [1, 2, 3], "z": ["d", "e", "f"]})

>>> df
   x  y  z
0  4  1  d
1  5  2  e
2  6  3  f

>>> df.dtypes
x    object
y     int64
z    object
dtype: object

>>> df['x'] = df['x'].astype('int')
>>> df.dtypes
x     int32 # Now integer
y     int64
z    object
dtype: object

Hi Thks for replying. I want to drop rows not columns ... hier ist the file file.csv

I want to drop rows that doesnt start/look like a date (although the dtypes is still an object and not dateimt64[ns])
Thks
Karlito

***snippsat*** · (This post was last modified: Oct-26-2019, 04:51 PM by snippsat.)

(Oct-26-2019, 02:14 PM)karlito Wrote: I want to drop rows that doesnt start/look like a date

Ok,is better if you post a sample of csv file and not image.
Can not use data from a image to test with.

>>> import pandas as pd

>>> df = pd.DataFrame({'date': ['12.06.2017', '2003', '114999999', '20.06.2017', '2.08.2018', '554777777'],'value': range(6)})
>>> df
         date  value
0  12.06.2017      0
1        2003      1
2   114999999      2
3  20.06.2017      3
4   2.08.2018      4
5   554777777      5

# Make NaT of row to be dropped
>>> pd.to_datetime(df['date'], format='%d.%m.%Y', errors='coerce')
0   2017-06-12
1          NaT
2          NaT
3   2017-06-20
4   2018-08-02
5          NaT
Name: date, dtype: datetime64[ns]

# Apply
>>> df['date'] = pd.to_datetime(df['date'], format='%d.%m.%Y', errors='coerce')
>>> df = df.dropna()
>>> df
        date  value
0 2017-06-12      0
3 2017-06-20      3
4 2018-08-02      4

# Turn date around to original format if that's needed
>>> df['date'] = df['date'].dt.strftime('%d.%m.%Y')
>>> df
         date  value
0  12.06.2017      0
3  20.06.2017      3
4  02.08.2018      4

karlito · (This post was last modified: Oct-28-2019, 10:20 AM by karlito.)

(Oct-26-2019, 04:51 PM)snippsat Wrote:

(Oct-26-2019, 02:14 PM)karlito Wrote: I want to drop rows that doesnt start/look like a date

Ok,is better if you post a sample of csv file and not image.
Can not use data from a image to test with.

>>> import pandas as pd

>>> df = pd.DataFrame({'date': ['12.06.2017', '2003', '114999999', '20.06.2017', '2.08.2018', '554777777'],'value': range(6)})
>>> df
         date  value
0  12.06.2017      0
1        2003      1
2   114999999      2
3  20.06.2017      3
4   2.08.2018      4
5   554777777      5

# Make NaT of row to be dropped
>>> pd.to_datetime(df['date'], format='%d.%m.%Y', errors='coerce')
0   2017-06-12
1          NaT
2          NaT
3   2017-06-20
4   2018-08-02
5          NaT
Name: date, dtype: datetime64[ns]

# Apply
>>> df['date'] = pd.to_datetime(df['date'], format='%d.%m.%Y', errors='coerce')
>>> df = df.dropna()
>>> df
        date  value
0 2017-06-12      0
3 2017-06-20      3
4 2018-08-02      4

# Turn date around to original format if that's needed
>>> df['date'] = df['date'].dt.strftime('%d.%m.%Y')
>>> df
         date  value
0  12.06.2017      0
3  20.06.2017      3
4  02.08.2018      4

Hi Snippsat,

Thks for your help but I tried it and it doesn't work
link : error

import pandas as pd
 
df = pd.DataFrame({0: ['09.05.2017 13:56', '1494331179', '1494331625', '09.05.2017 14:11', '944006550', '03.07.2017 16:50'],1: range(6)})
df

	              0	    1
0	09.05.2017 13:56	0
1	1494331179	        1
2	1494331625	        2
3	09.05.2017 14:11	3
4	944006550	        4
5	03.07.2017 16:50	5

***snippsat*** · Oct-28-2019, 10:55 AM

>>> import pandas as pd

>>> df = pd.DataFrame({0: ['09.05.2017 13:56', '1494331179', '1494331625', '09.05.2017 14:11', '944006550', '03.07.2017 16:50'],1: range(6)})
>>> pd.to_datetime(df[0], format='%d.%m.%Y %H:%M', errors='coerce')
0   2017-05-09 13:56:00
1                   NaT
2                   NaT
3   2017-05-09 14:11:00
4                   NaT
5   2017-07-03 16:50:00
Name: 0, dtype: datetime64[ns]

You are missing %H:%M

karlito · Oct-28-2019, 12:21 PM

(Oct-28-2019, 10:55 AM)snippsat Wrote:

>>> import pandas as pd

>>> df = pd.DataFrame({0: ['09.05.2017 13:56', '1494331179', '1494331625', '09.05.2017 14:11', '944006550', '03.07.2017 16:50'],1: range(6)})
>>> pd.to_datetime(df[0], format='%d.%m.%Y %H:%M', errors='coerce')
0   2017-05-09 13:56:00
1                   NaT
2                   NaT
3   2017-05-09 14:11:00
4                   NaT
5   2017-07-03 16:50:00
Name: 0, dtype: datetime64[ns]

You are missing %H:%M

Thks ... sometimes I think the mistake is too big without taking the time to analyze my code. sorry about that.

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Drop rows if a set of columns has a value	dervast	1	2,006	Sep-12-2019, 04:18 PM Last Post: sd_0912
	Drop rows from data with zero value	Devilish	3	3,771	Dec-27-2018, 02:06 AM Last Post: Devilish

drop rows that doesnt matched

User Panel Messages

Announcements