Python Forum
df and regex, NaN and df.concat
#1
Hi Pandas experts,

I got some messy data with dates like:

04/20/2001; 04/20/11; 4/20/09; 4/3/01
Mar-20-2001; Dec 20, 2009; April 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2001; 20 April 2001; 20 Mar. 2007; 20 March, 2009
Mar 21th, 2009; Mar 21st, 2009; Feb 22nd, 2002
Feb 2009; Sep 2009; Nov 2016
6/2008; 12/2009
2003; 2012


I want to extract just the year with regexes, but I am struggling to combine the separate patterns into one big extract.

import pandas as pd

def dates():
    # df is my dataframe with a 'text' column
    df1 = df['text'].str.extract('^[0,1]?\d{1}\/(([0-2]?\d{1})|([3][0,1]{1}))\/(([1]{1}[9]{1}[9]{1}\d{1})|([2-9]{1}\d{3}))$')
    df2 = df['text'].str.extract('(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\,*\s\s*\d{4}$|^(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\,*\s\d{4}$|^(January|February|March|April|May|June|July|August|September|October|November|December)\,*\s\d{4}$|^(january|february|march|april|may|june|july|august|september|october|november|december)\,*\s\d{4}$')
    df3 = df['text'].str.extract('^((0[1-9])|(1[0-2]))\/(\d{4})$')
    df4 = df['text'].str.extract('^(19[0-8][0-9]|199[0-9]|200[0-9]|201[0-8])$')   
    df5 = df['text'].str.extract('^(((0[1-9]|[12]\d|3[01])\/(0[13578]|1[02])\/(\d{2}))|((0[1-9]|[12]\d|30)\/(0[13456789]|1[012])\/(\d{2}))|((0[1-9]|1\d|2[0-8])\/02\/(\d{2}))|(29\/02\/((0[48]|[2468][048]|[13579][26])|(00))))$')
    
    df_all = pd.concat([df1, df2, df3, df4, df5])
    return df_all
dates()
The issue is that my dataframe now has 16 columns and all values are NaN.
I expected one column and, well, no NaNs :)

Does someone know what I am doing wrong?
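For reference, here is a tiny reproduction of the behaviour with a made-up mini frame (names and pattern are just for illustration, not my real data):

```python
import pandas as pd

# Made-up mini frame standing in for my real data
df = pd.DataFrame({'text': ['04/20/2009', 'Mar 20, 2009', '6/2008', '2003']})

# Every capture group in the pattern becomes its own output column,
# and rows that do not match at all come back as NaN.
out = df['text'].str.extract(r'^(\d{1,2})/(\d{1,2})/(\d{4})$')
# out has 3 columns (one per group); only the first row matched,
# so the other rows are entirely NaN. Concatenating five frames
# like this unions all their columns, hence the 16 NaN columns.
```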
Reply
#2
OK, I figured out that using findall works without errors, but the result is still empty.
I must be doing something wrong with my regex patterns (I found them on regexlib.com).

def dates():
    
    df1 = df['text'].str.findall(r'^[0,1]?\d{1}\/(([0-2]?\d{1})|([3][0,1]{1}))\/(([1]{1}[9]{1}[9]{1}\d{1})|([2-9]{1}\d{3}))$')
    df2 = df['text'].str.findall(r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\,*\s\s*\d{4}$|^(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\,*\s\d{4}$|^(January|February|March|April|May|June|July|August|September|October|November|December)\,*\s\d{4}$|^(january|february|march|april|may|june|july|august|september|october|november|december)\,*\s\d{4}$')
    df3 = df['text'].str.findall(r'^((0[1-9])|(1[0-2]))\/(\d{4})$')
    df4 = df['text'].str.findall(r'^(19[0-8][0-9]|199[0-9]|200[0-9]|201[0-8])$')   
    df5 = df['text'].str.findall(r'^(((0[1-9]|[12]\d|3[01])\/(0[13578]|1[02])\/(\d{2}))|((0[1-9]|[12]\d|30)\/(0[13456789]|1[012])\/(\d{2}))|((0[1-9]|1\d|2[0-8])\/02\/(\d{2}))|(29\/02\/((0[48]|[2468][048]|[13579][26])|(00))))$')
    
    df_all = pd.concat([df1, df2, df3, df4, df5])
    return df_all
dates()
Reply
#3
Here is a nice site to test your regex patterns:
https://regex101.com/

You will see that your patterns do not match anything.
Here is the first pattern with some changes:
https://regex101.com/r/dER5xH/1

Note that using named groups could help and will make your code more readable.
Actually, because you just want the year, the pattern can be reduced to one that catches all the cases from your example:
r"199\d{1}|20\d{2}|\d{2}(?![/,\-\w ])"
https://regex101.com/r/naVdDj/1
If you wish you may add a named group; as is, the entire match is what you need.
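A quick sanity check of that pattern against a few of your sample strings (plain re here, just a sketch):

```python
import re

# The suggested pattern: 4-digit years, or a trailing 2-digit year
pattern = re.compile(r'199\d{1}|20\d{2}|\d{2}(?![/,\-\w ])')

samples = ['04/20/2001', 'Mar-20-2001', 'Dec 20, 2009',
           '6/2008', '2003', '04/20/11']
years = [pattern.findall(s) for s in samples]
# Each sample yields exactly one year, including the 2-digit '11'
```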
Reply
#4
Hi buran,

many thanks for the great input.
I started using named groups straight away.

Below is my code.

import re
def date_sorter():

    test = re.compile(r'(?P<word>\b\w+\b)')
    m = test.search( '(((( Lots of punctuation )))' )
    print(m.group('word') ) #m.group(1)
    
        
    year = re.compile(r'199\d{1}|20\d{2}|\d{2}(?![/,\-\w ])')
    yearsearch = year.search('it was a great year starting on 1/1/1987')
    yearsearch = year.search(df['text'].str.extract)  # this line fails
    #yearsearch.group('1') 
    #print(yearsearch.group('1'))
    
    #month day year with month in digits
    #month day year with month in words
    #month year with month in words
    #month year with month in digits
    #year in digits

    return 
date_sorter()
I have a dataframe column df['text'] that I need to search.
When I write test.search(df['text']) I get the error that a string is expected. How can I search through the dataframe column?
Reply
#5
https://pandas.pydata.org/pandas-docs/st...tract.html
Reply
#6
Hi buran,

you recommended .search, which works for type _sre.SRE_Pattern.

If I use yearsearchdf = df['text'].str.extract(r'199\d{1}|20\d{2}|\d{2}(?![/,\-\w ])')

I only get the error "pattern contains no capture groups"
Reply
#7
(Jan-10-2018, 09:29 AM)metalray Wrote: I started using named groups straight away.
(Jan-22-2018, 03:12 PM)metalray Wrote: I only get the error "pattern contains no capture groups"

the pattern r'199\d{1}|20\d{2}|\d{2}(?![/,\-\w ])' has no capture groups
try with (?P<year>19\d{2}|20\d{2}|\d{2}(?![/,\-\w ]))
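For example (a minimal sketch with a made-up mini frame):

```python
import pandas as pd

# Made-up mini frame; the named group becomes the column label
df = pd.DataFrame({'text': ['04/20/2001', 'Feb 2009', '6/2008', '2003']})

# With a named capture group, str.extract returns a 'year' column
years = df['text'].str.extract(r'(?P<year>19\d{2}|20\d{2}|\d{2}(?![/,\-\w ]))')
```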
Reply
#8
Hi buran,
thanks a lot! that worked great.
I have some dates that are only 2 digits, like 89 or 72.
I know that all of those should be prefixed with 19.
So what do you suggest: looping through the series to prefix all 2-digit elements with 19, or can I also do that with a regex?
Reply
#9
I would convert them in the dataframe.
In any case I think you need to convert this column to datetime values, as the strings may be in different formats (not only 2- vs 4-digit years).
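A sketch of what I mean, assuming you already have the extracted year strings (sample values made up):

```python
import pandas as pd

# Assumed input: the extracted 'year' strings, some 2-digit, some 4-digit
years = pd.Series(['89', '72', '2009', '1998'])

# Prefix bare 2-digit values with '19' via a vectorized regex replace,
# then parse the whole column as datetimes in one go.
fixed = years.str.replace(r'^(\d{2})$', r'19\1', regex=True)
dates = pd.to_datetime(fixed, format='%Y')
```

No explicit loop needed; str.replace handles the whole series at once.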
Reply

