df and regex, NaN and df.concat

df and regex, NaN and df.concat - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Homework (https://python-forum.io/forum-9.html)
+--- Thread: df and regex, NaN and df.concat (/thread-7330.html)

df and regex, NaN and df.concat - metalray - Jan-04-2018

Hi Pandas experts,

I got some messy data with dates like:

04/20/2001; 04/20/11; 4/20/09; 4/3/01
Mar-20-2001; Dec 20, 2009; April 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2001; 20 April 2001; 20 Mar. 2007; 20 March, 2009
Mar 21th, 2009; Mar 21st, 2009; Feb 22nd, 2002
Feb 2009; Sep 2009; Nov 2016
6/2008; 12/2009
2003; 2012

I want to do this via regex to extract the year only but struggle to combine them into one big extract.

def dates():

    df1 = df['text'].str.extract('^[0,1]?\d{1}\/(([0-2]?\d{1})|([3][0,1]{1}))\/(([1]{1}[9]{1}[9]{1}\d{1})|([2-9]{1}\d{3}))$')
    df2 = df['text'].str.extract('(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\,*\s\s*\d{4}$|^(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\,*\s\d{4}$|^(January|February|March|April|May|June|July|August|September|October|November|December)\,*\s\d{4}$|^(january|february|march|april|may|june|july|august|september|october|november|december)\,*\s\d{4}$')
    df3 = df['text'].str.extract('^((0[1-9])|(1[0-2]))\/(\d{4})$')
    df4 = df['text'].str.extract('^(19[0-8][0-9]|199[0-9]|200[0-9]|201[0-8])$')   
    df5 = df['text'].str.extract('^(((0[1-9]|[12]\d|3[01])\/(0[13578]|1[02])\/(\d{2}))|((0[1-9]|[12]\d|30)\/(0[13456789]|1[012])\/(\d{2}))|((0[1-9]|1\d|2[0-8])\/02\/(\d{2}))|(29\/02\/((0[48]|[2468][048]|[13579][26])|(00))))$')
    
    df_all = pd.concat([df1, df2, df3, df4, df5])
    return df_all
dates()

The issue is that my data frame now has 16 columns and all are NaN.
I expected one column and..well, no NaN's :)

Does someone know what I am doing wrong?

RE: df and regex, NaN and df.concat - metalray - Jan-05-2018

OK I figured out using findall works without errors but the result is still nothing.
I must do something wrong to with my regex patterns (found them on regexlib.com)

def dates():
    
    df1 = df['text'].str.findall(r'^[0,1]?\d{1}\/(([0-2]?\d{1})|([3][0,1]{1}))\/(([1]{1}[9]{1}[9]{1}\d{1})|([2-9]{1}\d{3}))$')
    df2 = df['text'].str.findall(r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\,*\s\s*\d{4}$|^(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\,*\s\d{4}$|^(January|February|March|April|May|June|July|August|September|October|November|December)\,*\s\d{4}$|^(january|february|march|april|may|june|july|august|september|october|november|december)\,*\s\d{4}$')
    df3 = df['text'].str.findall(r'^((0[1-9])|(1[0-2]))\/(\d{4})$')
    df4 = df['text'].str.findall(r'^(19[0-8][0-9]|199[0-9]|200[0-9]|201[0-8])$')   
    df5 = df['text'].str.findall(r'^(((0[1-9]|[12]\d|3[01])\/(0[13578]|1[02])\/(\d{2}))|((0[1-9]|[12]\d|30)\/(0[13456789]|1[012])\/(\d{2}))|((0[1-9]|1\d|2[0-8])\/02\/(\d{2}))|(29\/02\/((0[48]|[2468][048]|[13579][26])|(00))))$')
    
    return print(df_all)
dates()

RE: df and regex, NaN and df.concat - buran - Jan-05-2018

here is nice site to test your regex patterns.
https://regex101.com/

you will see that your patterns do not match anything.

here is first pattern with some changes
https://regex101.com/r/dER5xH/1

note that using named groups could help and will make your code more readble

Actually, because you just want the year, pattern can be updated to catch all possible cases from your example
r"199\d{1}|20\d{2}|\d{2}(?![/,\-\w ])"
https://regex101.com/r/naVdDj/1
if you wish you may add named group, as is the entire match is what you need

RE: df and regex, NaN and df.concat - metalray - Jan-10-2018

Hi buran,

many thanks for the great input.
I started using named groups straight away.

Below is my code.

import re
def date_sorter():

    test = re.compile(r'(?P<word>\b\w+\b)')
    m = test.search( '(((( Lots of punctuation )))' )
    print(m.group('word') ) #m.group(1)
    
        
    year = re.compile(r'199\d{1}|20\d{2}|\d{2}(?![/,\-\w ])')
    yearsearch = year.search('it was a great year starting on 1/1/1987')
    yearsearch = year.search(df['text'].str.extract)
    #yearsearch.group('1') 
    #print(yearsearch.group('1'))
    
    #month day year with month in digits
    #month day year with month in words
    #month year with month in words
    #month year with month in digits
    #year in digits

    return 
date_sorter()

I have a dataframe column df['text'] that I need to search.
When I write test.search(df['text']) I get the error that a string is expected. How can I search through the dataframe column?

RE: df and regex, NaN and df.concat - buran - Jan-10-2018

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html

RE: df and regex, NaN and df.concat - metalray - Jan-22-2018

Hi buran,

you recommended .search, which works for type _sre.SRE_Pattern.

If I use yearsearchdf = df['text'].str.extract(r'199\d{1}|20\d{2}|\d{2}(?![/,\-\w ])')

I only get the error "pattern contains no capture groups"

RE: df and regex, NaN and df.concat - buran - Jan-22-2018

(Jan-10-2018, 09:29 AM)metalray Wrote: I started using named groups straight away.

(Jan-22-2018, 03:12 PM)metalray Wrote: I only get the error "pattern contains no capture groups"

the pattern r'199\d{1}|20\d{2}|\d{2}(?![/,\-\w ])' has no capture groups
try with (?P<year>19\d{2}|20\d{2}|\d{2}(?![/,\-\w ]))

RE: df and regex, NaN and df.concat - metalray - Feb-06-2018

Hi buran,
thanks a lot! that worked great.
I got some dates that are only 2 digits like 89 or 72.
I know that all of those should be prefixed with a 19.
So what do you suggest, looping through the series to prefix all 2 digit series elements with 19 or can I also do that with a regex?

RE: df and regex, NaN and df.concat - buran - Feb-06-2018

I would convert them in the dataframe.
In any case I think you need to convert this column to datetime values, as the strings maybe in different format (not only 2/4 digit value)