Thread: regex.findall and data frame (/thread-9953.html) - Python Forum (https://python-forum.io), Forum: Homework
regex.findall and data frame - metalray - May-06-2018

Dear Python Experts,

I have been working on my date extractor but noticed 2 problems:
1. Nothing gets extracted from my test data string.
2. When I try to use a data frame like df['mytest'] I get a conversion error: TypeError: expected string or bytes-like object

Can someone help me with those two issues?

Code:
import re

def date_sorter():
    #test with df['text']

    #type of date:
    #November 1940
    #Mar, 1975
    #04/01/1988
    #AFeb 1977 this ought to parse out to 1977/02/01
    #2June, 1999
    #Decemeber 2015

    text = 'This is my text I will search through for patterns 01/01/1984, November 1940, 2June, 1999, Mar 2001, Decemeber 2015'

    #month/day/year, month in Digits
    pattern1 = '^[0,1]?\d{1}\/(([0-2]?\d{1})|([3][0,1]{1}))\/(([1]{1}[9]{1}[9]{1}\d{1})|([2-9]{1}\d{3}))$'

    #month...day...year, month in Letter
    pattern2 = '^(?:(((Jan(uary)?|Ma(r(ch)?|y)|Jul(y)?|Aug(ust)?|Oct(ober)?|Dec(ember)?)\ 31)|((Jan(uary)?|Ma(r(ch)?|y)|Apr(il)?|Ju((ly?)|(ne?))|Aug(ust)?|Oct(ober)?|(Sept|Nov|Dec)(ember)?)\ (0?[1-9]|([12]\d)|30))|(Feb(ruary)?\ (0?[1-9]|1\d|2[0-8]|(29(?=,\ ((1[6-9]|[2-9]\d)(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00)))))))\,\ ((1[6-9]|[2-9]\d)\d{2}))'
    #http://regexlib.com/REDetails.aspx?regexp_id=404

    #day...month...year, month in Letter , day is optional
    pattern3 = '(0[1-9]|[12][0-9]|3[01])\s(J(anuary|uly)|Ma(rch|y)|August|(Octo|Decem)ber)\s[1-9][0-9]{3}| (0[1-9]|[12][0-9]|30)\s(April|June|(Sept|Nov)ember)\s[1-9][0-9]{3}| (0[1-9]|1[0-9]|2[0-8])\sFebruary\s[1-9][0-9]{3}| 29\sFebruary\s((0[48]|[2468][048]|[13579][26])00|[0-9]{2}(0[48]|[2468][048]|[13579][26]))'

    #month/year, month in Digits
    pattern4 ='^((0[1-9])|(1[0-2]))\/(\d{4})$'

    #yyyy
    pattern5 = '\s\((\d{4})\)$' # or '\b(19|20)\d{2}\b'

    #Note that if year in 5) and 4) are different, use 5) instead.
    full_pattern = '{}|{}|{}|{}|{}'.format(pattern1, pattern2, pattern3, pattern3, pattern4, pattern5)
    regex = re.compile(full_pattern)
    extracted_values = re.findall(regex, text)
    print('----------')
    print(extracted_values)
    print('----------')

date_sorter()

RE: regex.findall and data frame - killerrex - May-06-2018

Some of those regexes are truly infernal, and debugging the result of chaining them can be a nightmare. Some recommendations:
- Always use r'' strings for regexes in Python, or you will suffer from escape char syndrome.
- Even if you have copied the regex from somewhere, remove useless details like {1} or [3]. Matching once is the default, and a character class holding a single ordinary character is the same as the character itself. For example your first pattern reduces to r'[01]?\d/(?:[0-2]\d|3[01])/(?:19\d\d|[2-9]\d{3})'
- Some of your patterns include ^ and $, which only match at the beginning or end of the string, and I do not think that is what you want.
- Rather than building the monster regex, I would process the patterns one by one in order, so you can also detect bad dates like 02/29/1999 or 31 June 2005 that might still be matched by another of your patterns. Take a look at the datetime module, it can help you a lot.

RE: regex.findall and data frame - metalray - May-08-2018

Hi killerrex,
Many thanks for your reply. I actually don't plan to code any regex myself but to re-use what's out there already. After adding the r prefix to my regex patterns and removing the ^ and $ I still don't find any matches. I am certain something else is wrong with my code apart from the regex patterns. They should at least find the simple dates that I put in my test string.

RE: regex.findall and data frame - snippsat - May-08-2018

There are several problems here. Will you use Pandas, or will you take the text out of Pandas? The task is unclear.

(May-06-2018, 03:53 PM)metalray Wrote: 2.
when I try to use a data frame like df['mytest'] I get a conversion error:

When you use Pandas, that does not return a string but a pandas Series. Pandas has regex built in (available via str), so you don't need to convert to text (df['text'].to_string(index=False)). If you convert to text you only get the data for that column and lose the rest of the DataFrame data.

RE: regex.findall and data frame - metalray - May-09-2018

Hi snippsat,
Thanks for your reply. If I use the Pandas regex via str, then I don't know how to apply multiple regex patterns.
stringserach = df['text'].str.extract(pattern1,pattern2,pattern3,pattern4,pattern5)
does not work. You are correct, I have two issues. First, none of the patterns works, and second, even if they worked, I can't get df['mytest'] as input.

RE: regex.findall and data frame - metalray - May-11-2018

stringsearch2 = df['text'].str.findall(pattern1,pattern2)
does not work either.

RE: regex.findall and data frame - metalray - May-15-2018

(May-06-2018, 05:43 PM)killerrex Wrote: your first pattern reduces to r'[01]?\d/(?:[0-2]\d|3[01])/(?:19\d\d|[2-9]\d{3})'

OK, I have progressed by looping through the data frame. That is good for now. I can append the results to a list later, but I struggle with my regular expressions. I need to extract a date like "September. 21, 2012" or "July 25, 1998" or "Oct 18, 1980" and I struggle to find the right regular expression for that.

Code:
text = 'This is my text I will search through for patterns September. 21, 2012. As well as July 25, 1998 and a date like Oct 18, 1980'
pattern1 = '^(?:(((Jan(uary)?|Ma(r(ch)?|y)|Jul(y)?|Aug(ust)?|Oct(ober)?|Dec(ember)?)\ 31)|((Jan(uary)?|Ma(r(ch)?|y)|Apr(il)?|Ju((ly?)|(ne?))|Aug(ust)?|Oct(ober)?|(Sept|Nov|Dec)(ember)?)\ (0?[1-9]|([12]\d)|30))|(Feb(ruary)?\ (0?[1-9]|1\d|2[0-8]|(29(?=,\ ((1[6-9]|[2-9]\d)(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00)))))))\,\ ((1[6-9]|[2-9]\d)\d{2}))'
full_pattern = '{}'.format(pattern1)
regex = re.compile(full_pattern)
extracted_values = re.findall(regex, text)
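
A minimal sketch for the last question, following the advice given earlier in the thread (raw strings, no ^ or $, simple patterns, validation with datetime). Two facts explain the empty result: without re.MULTILINE a leading ^ only matches at the very start of the string, and the test text starts with "This", so the pattern can never match. Also, re.findall returns tuples of captured groups when the pattern contains groups. The pattern MONTH_DAY_YEAR and the helper find_dates are hypothetical names made up for this example, not something from the thread, and the sketch assumes an English locale for the %b month names.

```python
import re
from datetime import datetime

# Hypothetical pattern (an assumption, not from the thread):
# capitalised word of 3 to 9 letters, optional '.', day, optional ',', 4-digit year.
MONTH_DAY_YEAR = re.compile(r'\b([A-Z][a-z]{2,8})\.?\s+(\d{1,2}),?\s+(\d{4})\b')


def find_dates(text):
    """Return a datetime for every 'Month day, year' match that is a real date."""
    found = []
    # With three capturing groups, findall yields (month, day, year) tuples.
    for month, day, year in MONTH_DAY_YEAR.findall(text):
        # %b expects the abbreviated month name, so keep the first three letters.
        candidate = '{} {} {}'.format(month[:3], day, year)
        try:
            found.append(datetime.strptime(candidate, '%b %d %Y'))
        except ValueError:
            # Not a month name, or an impossible date such as Feb 30.
            continue
    return found


text = ('This is my text I will search through for patterns September. 21, 2012. '
        'As well as July 25, 1998 and a date like Oct 18, 1980')
dates = find_dates(text)
print([d.strftime('%Y/%m/%d') for d in dates])
# prints ['2012/09/21', '1998/07/25', '1980/10/18']
```

Letting datetime.strptime reject bad candidates keeps the regex short, because the pattern no longer has to encode month lengths or leap years. For a data frame column, pandas offers Series.str.findall, which takes a single pattern, so several patterns would have to be joined with | into one string or applied one at a time.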