Hi Pandas experts,
I got some messy data with dates like:
04/20/2001; 04/20/11; 4/20/09; 4/3/01
Mar-20-2001; Dec 20, 2009; April 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2001; 20 April 2001; 20 Mar. 2007; 20 March, 2009
Mar 21th, 2009; Mar 21st, 2009; Feb 22nd, 2002
Feb 2009; Sep 2009; Nov 2016
6/2008; 12/2009
2003; 2012
I want to do this via regex to extract the year only but struggle to combine them into one big extract.
I expected one column and..well, no NaN's :)
Does someone know what I am doing wrong?
I got some messy data with dates like:
04/20/2001; 04/20/11; 4/20/09; 4/3/01
Mar-20-2001; Dec 20, 2009; April 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2001; 20 April 2001; 20 Mar. 2007; 20 March, 2009
Mar 21th, 2009; Mar 21st, 2009; Feb 22nd, 2002
Feb 2009; Sep 2009; Nov 2016
6/2008; 12/2009
2003; 2012
I want to do this via regex to extract the year only but struggle to combine them into one big extract.
def dates(): df1 = df['text'].str.extract('^[0,1]?\d{1}\/(([0-2]?\d{1})|([3][0,1]{1}))\/(([1]{1}[9]{1}[9]{1}\d{1})|([2-9]{1}\d{3}))$') df2 = df['text'].str.extract('(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\,*\s\s*\d{4}$|^(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\,*\s\d{4}$|^(January|February|March|April|May|June|July|August|September|October|November|December)\,*\s\d{4}$|^(january|february|march|april|may|june|july|august|september|october|november|december)\,*\s\d{4}$') df3 = df['text'].str.extract('^((0[1-9])|(1[0-2]))\/(\d{4})$') df4 = df['text'].str.extract('^(19[0-8][0-9]|199[0-9]|200[0-9]|201[0-8])$') df5 = df['text'].str.extract('^(((0[1-9]|[12]\d|3[01])\/(0[13578]|1[02])\/(\d{2}))|((0[1-9]|[12]\d|30)\/(0[13456789]|1[012])\/(\d{2}))|((0[1-9]|1\d|2[0-8])\/02\/(\d{2}))|(29\/02\/((0[48]|[2468][048]|[13579][26])|(00))))$') df_all = pd.concat([df1, df2, df3, df4, df5]) return df_all dates()The issue is that my data frame now has 16 columns and all are NaN.
I expected one column and..well, no NaN's :)
Does someone know what I am doing wrong?