df and regex, NaN and df.concat - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: Homework (https://python-forum.io/forum-9.html) +--- Thread: df and regex, NaN and df.concat (/thread-7330.html) |
df and regex, NaN and df.concat - metalray - Jan-04-2018 Hi Pandas experts, I got some messy data with dates like: 04/20/2001; 04/20/11; 4/20/09; 4/3/01 Mar-20-2001; Dec 20, 2009; April 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2001; 20 April 2001; 20 Mar. 2007; 20 March, 2009 Mar 21th, 2009; Mar 21st, 2009; Feb 22nd, 2002 Feb 2009; Sep 2009; Nov 2016 6/2008; 12/2009 2003; 2012 I want to do this via regex to extract the year only but struggle to combine them into one big extract. def dates(): df1 = df['text'].str.extract('^[0,1]?\d{1}\/(([0-2]?\d{1})|([3][0,1]{1}))\/(([1]{1}[9]{1}[9]{1}\d{1})|([2-9]{1}\d{3}))$') df2 = df['text'].str.extract('(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\,*\s\s*\d{4}$|^(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\,*\s\d{4}$|^(January|February|March|April|May|June|July|August|September|October|November|December)\,*\s\d{4}$|^(january|february|march|april|may|june|july|august|september|october|november|december)\,*\s\d{4}$') df3 = df['text'].str.extract('^((0[1-9])|(1[0-2]))\/(\d{4})$') df4 = df['text'].str.extract('^(19[0-8][0-9]|199[0-9]|200[0-9]|201[0-8])$') df5 = df['text'].str.extract('^(((0[1-9]|[12]\d|3[01])\/(0[13578]|1[02])\/(\d{2}))|((0[1-9]|[12]\d|30)\/(0[13456789]|1[012])\/(\d{2}))|((0[1-9]|1\d|2[0-8])\/02\/(\d{2}))|(29\/02\/((0[48]|[2468][048]|[13579][26])|(00))))$') df_all = pd.concat([df1, df2, df3, df4, df5]) return df_all dates()The issue is that my data frame now has 16 columns and all are NaN. I expected one column and..well, no NaN's :) Does someone know what I am doing wrong? RE: df and regex, NaN and df.concat - metalray - Jan-05-2018 OK I figured out using findall works without errors but the result is still nothing. I must do something wrong to with my regex patterns (found them on regexlib.com) def dates(): df1 = df['text'].str.findall(r'^[0,1]?\d{1}\/(([0-2]?\d{1})|([3][0,1]{1}))\/(([1]{1}[9]{1}[9]{1}\d{1})|([2-9]{1}\d{3}))$') df2 = df['text'].str.findall(r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\,*\s\s*\d{4}$|^(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\,*\s\d{4}$|^(January|February|March|April|May|June|July|August|September|October|November|December)\,*\s\d{4}$|^(january|february|march|april|may|june|july|august|september|october|november|december)\,*\s\d{4}$') df3 = df['text'].str.findall(r'^((0[1-9])|(1[0-2]))\/(\d{4})$') df4 = df['text'].str.findall(r'^(19[0-8][0-9]|199[0-9]|200[0-9]|201[0-8])$') df5 = df['text'].str.findall(r'^(((0[1-9]|[12]\d|3[01])\/(0[13578]|1[02])\/(\d{2}))|((0[1-9]|[12]\d|30)\/(0[13456789]|1[012])\/(\d{2}))|((0[1-9]|1\d|2[0-8])\/02\/(\d{2}))|(29\/02\/((0[48]|[2468][048]|[13579][26])|(00))))$') return print(df_all) dates() RE: df and regex, NaN and df.concat - buran - Jan-05-2018 here is nice site to test your regex patterns. https://regex101.com/ you will see that your patterns do not match anything. here is first pattern with some changes https://regex101.com/r/dER5xH/1 note that using named groups could help and will make your code more readble Actually, because you just want the year, pattern can be updated to catch all possible cases from your example r"199\d{1}|20\d{2}|\d{2}(?![/,\-\w ])" https://regex101.com/r/naVdDj/1 if you wish you may add named group, as is the entire match is what you need RE: df and regex, NaN and df.concat - metalray - Jan-10-2018 Hi buran, many thanks for the great input. I started using named groups straight away. Below is my code. import re def date_sorter(): test = re.compile(r'(?P<word>\b\w+\b)') m = test.search( '(((( Lots of punctuation )))' ) print(m.group('word') ) #m.group(1) year = re.compile(r'199\d{1}|20\d{2}|\d{2}(?![/,\-\w ])') yearsearch = year.search('it was a great year starting on 1/1/1987') yearsearch = year.search(df['text'].str.extract) #yearsearch.group('1') #print(yearsearch.group('1')) #month day year with month in digits #month day year with month in words #month year with month in words #month year with month in digits #year in digits return date_sorter()I have a dataframe column df['text'] that I need to search. When I write test.search(df['text']) I get the error that a string is expected. How can I search through the dataframe column? RE: df and regex, NaN and df.concat - buran - Jan-10-2018 https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html RE: df and regex, NaN and df.concat - metalray - Jan-22-2018 Hi buran, you recommended .search, which works for type _sre.SRE_Pattern. If I use yearsearchdf = df['text'].str.extract(r'199\d{1}|20\d{2}|\d{2}(?![/,\-\w ])') I only get the error "pattern contains no capture groups" RE: df and regex, NaN and df.concat - buran - Jan-22-2018 (Jan-10-2018, 09:29 AM)metalray Wrote: I started using named groups straight away. (Jan-22-2018, 03:12 PM)metalray Wrote: I only get the error "pattern contains no capture groups" the pattern r'199\d{1}|20\d{2}|\d{2}(?![/,\-\w ])' has no capture groupstry with (?P<year>19\d{2}|20\d{2}|\d{2}(?![/,\-\w ]))
RE: df and regex, NaN and df.concat - metalray - Feb-06-2018 Hi buran, thanks a lot! that worked great. I got some dates that are only 2 digits like 89 or 72. I know that all of those should be prefixed with a 19. So what do you suggest, looping through the series to prefix all 2 digit series elements with 19 or can I also do that with a regex? RE: df and regex, NaN and df.concat - buran - Feb-06-2018 I would convert them in the dataframe. In any case I think you need to convert this column to datetime values, as the strings maybe in different format (not only 2/4 digit value) |