Python Forum
regex.findall and data frame
#1
Dear Python Experts,

I have been working on my date extractor but noticed two problems.

1. nothing gets extracted from my test data string
2. when I try to use a data frame like df['mytest'] I get a conversion error:
TypeError: expected string or bytes-like object

Can someone help me with those two issues?

import re
def date_sorter():
    #test with df['text']

    #type of date:
    #November 1940
    #Mar, 1975
    #04/01/1988
    #AFeb 1977 this ought to parse out to 1977/02/01
    #2June, 1999
    #Decemeber 2015

    text = 'This is my text I will search through for patterns 01/01/1984, November 1940, 2June, 1999, Mar 2001, Decemeber 2015'

    #month/day/year, month in Digits
    pattern1 = '^[0,1]?\d{1}\/(([0-2]?\d{1})|([3][0,1]{1}))\/(([1]{1}[9]{1}[9]{1}\d{1})|([2-9]{1}\d{3}))$'

    #month...day...year, month in Letter
    pattern2 = '^(?:(((Jan(uary)?|Ma(r(ch)?|y)|Jul(y)?|Aug(ust)?|Oct(ober)?|Dec(ember)?)\ 31)|((Jan(uary)?|Ma(r(ch)?|y)|Apr(il)?|Ju((ly?)|(ne?))|Aug(ust)?|Oct(ober)?|(Sept|Nov|Dec)(ember)?)\ (0?[1-9]|([12]\d)|30))|(Feb(ruary)?\ (0?[1-9]|1\d|2[0-8]|(29(?=,\ ((1[6-9]|[2-9]\d)(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00)))))))\,\ ((1[6-9]|[2-9]\d)\d{2}))'
    #http://regexlib.com/REDetails.aspx?regexp_id=404

    #day...month...year, month in Letter , day is optional
    pattern3 = '(0[1-9]|[12][0-9]|3[01])\s(J(anuary|uly)|Ma(rch|y)|August|(Octo|Decem)ber)\s[1-9][0-9]{3}| (0[1-9]|[12][0-9]|30)\s(April|June|(Sept|Nov)ember)\s[1-9][0-9]{3}| (0[1-9]|1[0-9]|2[0-8])\sFebruary\s[1-9][0-9]{3}| 29\sFebruary\s((0[48]|[2468][048]|[13579][26])00|[0-9]{2}(0[48]|[2468][048]|[13579][26]))'

    #month/year, month in Digits
    pattern4 ='^((0[1-9])|(1[0-2]))\/(\d{4})$'

    #yyyy
    pattern5 = '\s\((\d{4})\)$' # or '\b(19|20)\d{2}\b'

    #Note that if year in 5) and 4) are different, use 5) instead.

    full_pattern = '{}|{}|{}|{}|{}'.format(pattern1, pattern2, pattern3, pattern4, pattern5)
    regex = re.compile(full_pattern)
    extracted_values = re.findall(regex, text)
    print('----------')
    print(extracted_values)
    print('----------')

date_sorter()
#2
Some of those regexes are truly infernal, and debugging the result of chaining them together can be a nightmare.

Some recommendations:
- Always use r'' strings for regexes in Python, or you will suffer from escape-character syndrome.
- Even if you copied the regex from somewhere, remove useless details like {1} or [3]. Matching exactly once is the default, and a character class containing a single ordinary character is the same as that character. For example, your first pattern reduces to r'[01]?\d/(?:[0-2]\d|3[01])/(?:19\d\d|[2-9]\d{3})'
- Some of your patterns include ^ and $, which match only at the beginning and end of the string. I do not think that is what you want when searching inside a longer text.
- Rather than building one monster regex, I would process the patterns one by one, in order. That way you can also detect bad dates like 02/29/1999 or 31 June 2005 that might otherwise be accepted by one of your other patterns.

Take a look at the datetime module; it can help you a lot.
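
The one-by-one approach could look roughly like the sketch below. It is only an illustration: the two patterns, the PATTERNS list, and the extract_dates helper are simplified, hypothetical stand-ins and not the patterns from the original post. Each regex is a raw string without ^ or $, and datetime.strptime does the calendar check, so impossible dates are rejected instead of being silently accepted. Note that %B relies on English month names in the current locale.

```python
import re
from datetime import datetime

# Hypothetical (regex, strptime format) pairs, tried in order.
# Raw strings keep the backslashes intact; there is no ^ or $,
# so a date can be found anywhere inside a longer text.
PATTERNS = [
    (re.compile(r'\b\d{1,2}/\d{1,2}/\d{4}\b'), '%m/%d/%Y'),    # e.g. 04/01/1988
    (re.compile(r'\b\d{1,2} [A-Za-z]+ \d{4}\b'), '%d %B %Y'),  # e.g. 31 June 2005
]

def extract_dates(text):
    """Return (valid, rejected): parsed datetimes, and strings that
    matched a pattern but are not real calendar dates."""
    valid, rejected = [], []
    for regex, fmt in PATTERNS:
        for match in regex.finditer(text):
            candidate = match.group(0)
            try:
                valid.append(datetime.strptime(candidate, fmt))
            except ValueError:
                # strptime refuses dates such as Feb 29 in a non-leap year
                rejected.append(candidate)
    return valid, rejected

valid, rejected = extract_dates('Seen on 01/01/1984, 02/29/1999 and 31 June 2005.')
print(valid)
print(rejected)
```

Keeping the format string next to each pattern means a new date style only needs one more entry in the list.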
#3
Hi killerrex,

Many thanks for your reply.
I actually don't plan to write any regex myself, but to re-use what is already out there.
After adding the r prefix to my regex patterns and removing the ^ and $,
I still don't find any matches.
I am certain something else is wrong with my code apart from the regex patterns. They should at least find the simple
dates that I put in my test string.
#4
There are several problems here. Do you want to work inside Pandas, or take the text out of Pandas?
The task is unclear.
(May-06-2018, 03:53 PM)metalray Wrote: 2. when I try to use a data frame like df['mytest'] I get a conversion error:
TypeError: expected string or bytes-like object
When you use Pandas and index a column like that, you do not get back a string but a pandas Series.
Pandas has built-in regex support (available via str), so you don't need to convert to text with df['text'].to_string(index=False).
If you convert to text, you only get the data for that column and lose the rest of the DataFrame.
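
A minimal sketch of the difference, assuming pandas is installed. The DataFrame contents and the simple numeric pattern are made up for illustration; only the column name 'text' comes from this thread.

```python
import re
import pandas as pd

# Hypothetical stand-in for the DataFrame discussed in the thread.
df = pd.DataFrame({'text': ['Seen on 04/01/1988 and 12/25/1990', 'no date here']})

pattern = r'\d{1,2}/\d{1,2}/\d{4}'

# re.findall expects a single string, so handing it the whole column
# raises the TypeError quoted above.
try:
    re.findall(pattern, df['text'])
except TypeError as exc:
    print('re.findall on a Series fails:', exc)

# The .str accessor applies the regex row by row and returns a Series
# of lists, which can be stored next to the other columns.
df['dates'] = df['text'].str.findall(pattern)
print(df['dates'].tolist())
```

Because the result is stored as a new column, the extracted dates stay aligned with the rest of the row.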
#5
Hi snippsat,

Thanks for your reply.

If I use the Pandas regex support via str, I don't know how to apply multiple regex patterns.

stringserach = df['text'].str.extract(pattern1,pattern2,pattern3,pattern4,pattern5)

does not work.

You are correct, I have two issues. First, none of the patterns works, and second, even if they did work, I can't use df['mytest'] as input.
#6
stringsearch2 = df['text'].str.findall(pattern1,pattern2) does not work either.
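
One likely reason for the failure: str.extract and str.findall each take a single pattern, and the second positional argument is interpreted as regex flags, not as another pattern. A possible workaround is to join the alternatives into one pattern with |. The sketch below uses two simplified, hypothetical patterns (not the ones from the first post) and checks the combined pattern with the plain re module; the same string could then be passed as the one pattern argument, e.g. df['text'].str.findall(combined), which is not run here.

```python
import re

# Simplified, hypothetical stand-ins for the thread's patterns.
numeric = r'\d{1,2}/\d{1,2}/\d{4}'      # e.g. 01/01/1984
month_year = r'[A-Z][a-z]+,? \d{4}'     # e.g. November 1940 or Mar, 1975

# Wrap every alternative in a non-capturing group (?:...) before joining.
# With capturing groups, findall would return tuples of groups
# instead of the full matched text.
combined = '|'.join('(?:{})'.format(p) for p in (numeric, month_year))

text = 'Patterns: 01/01/1984, November 1940, Mar 2001'
matches = re.findall(combined, text)
print(matches)
```

Order matters in an alternation: the regex engine takes the first alternative that matches at a given position, so more specific patterns should come first.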
#7
(May-06-2018, 05:43 PM)killerrex Wrote: your first pattern reduces to r'[01]?\d/(?:[0-2]\d|3[01])/(?:19\d\d|[2-9]\d{3})'


OK, I have made progress by looping over the data frame.
That is good enough for now; I can append the results to a list later. But I am still struggling with my
regular expressions.
I need to extract dates like "September. 21, 2012", "July 25, 1998" or "Oct 18, 1980", and I cannot find the
right regular expression for that.


Code:
text = 'This is my text I will search through for patterns September. 21, 2012. As well as July 25, 1998 and a date like Oct 18, 1980'
 
pattern1 = '^(?:(((Jan(uary)?|Ma(r(ch)?|y)|Jul(y)?|Aug(ust)?|Oct(ober)?|Dec(ember)?)\ 31)|((Jan(uary)?|Ma(r(ch)?|y)|Apr(il)?|Ju((ly?)|(ne?))|Aug(ust)?|Oct(ober)?|(Sept|Nov|Dec)(ember)?)\ (0?[1-9]|([12]\d)|30))|(Feb(ruary)?\ (0?[1-9]|1\d|2[0-8]|(29(?=,\ ((1[6-9]|[2-9]\d)(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00)))))))\,\ ((1[6-9]|[2-9]\d)\d{2}))'

full_pattern = '{}'.format(pattern1)
regex = re.compile(full_pattern)
extracted_values = re.findall(regex, text)   
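
For these three formats, a much shorter pattern may be enough. The pattern above starts with ^, so it can only match at the very start of the text, and it has no place for the period after "September". The sketch below is one possible alternative, not a tested replacement for the regexlib pattern: it accepts any capitalized word of 3 to 9 letters as a month candidate and lets datetime.strptime decide whether it is a real date. It assumes English month names in the current locale.

```python
import re
from datetime import datetime

text = ('This is my text I will search through for patterns September. 21, 2012. '
        'As well as July 25, 1998 and a date like Oct 18, 1980')

# Month word, optional period, day, comma, four-digit year.
# Raw string, no ^ or $, so dates are found anywhere in the text.
pattern = re.compile(r'\b([A-Z][a-z]{2,8})\.? (\d{1,2}), (\d{4})\b')

dates = []
for month, day, year in pattern.findall(text):
    # %b expects the abbreviated month name, so keep the first three letters.
    try:
        dates.append(datetime.strptime('{} {} {}'.format(month[:3], day, year), '%b %d %Y'))
    except ValueError:
        pass  # the word was not a month, or the day does not exist

print(dates)
```

Letting strptime do the validation keeps the regex short; the leap-year and days-per-month logic that makes the long pattern so hard to read is handled by the datetime module instead.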