May-06-2018, 03:53 PM
Dear Python Experts,
I have been working on my date extractor but noticed 2 problems.
1. nothing gets extracted from my test data string
2. when I try to use a data frame like df['mytest'] I get a conversion error:
TypeError: expected string or bytes-like object
Can someone help me with those two issues?
import re

def date_sorter():
    #test with df['text']
    #type of date:
    #November 1940
    #Mar, 1975
    #04/01/1988
    #AFeb 1977 this ought to parse out to 1977/02/01
    #2June, 1999
    #Decemeber 2015
    text = 'This is my text I will search through for patterns 01/01/1984, November 1940, 2June, 1999, Mar 2001, Decemeber 2015'

    #month/day/year, month in Digits
    pattern1 = '^[0,1]?\d{1}\/(([0-2]?\d{1})|([3][0,1]{1}))\/(([1]{1}[9]{1}[9]{1}\d{1})|([2-9]{1}\d{3}))$'

    #month...day...year, month in Letter
    pattern2 = '^(?:(((Jan(uary)?|Ma(r(ch)?|y)|Jul(y)?|Aug(ust)?|Oct(ober)?|Dec(ember)?)\ 31)|((Jan(uary)?|Ma(r(ch)?|y)|Apr(il)?|Ju((ly?)|(ne?))|Aug(ust)?|Oct(ober)?|(Sept|Nov|Dec)(ember)?)\ (0?[1-9]|([12]\d)|30))|(Feb(ruary)?\ (0?[1-9]|1\d|2[0-8]|(29(?=,\ ((1[6-9]|[2-9]\d)(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00)))))))\,\ ((1[6-9]|[2-9]\d)\d{2}))'

    #day...month...year, month in Letter , day is optional
    pattern3 = '(0[1-9]|[12][0-9]|3[01])\s(J(anuary|uly)|Ma(rch|y)|August|(Octo|Decem)ber)\s[1-9][0-9]{3}| (0[1-9]|[12][0-9]|30)\s(April|June|(Sept|Nov)ember)\s[1-9][0-9]{3}| (0[1-9]|1[0-9]|2[0-8])\sFebruary\s[1-9][0-9]{3}| 29\sFebruary\s((0[48]|[2468][048]|[13579][26])00|[0-9]{2}(0[48]|[2468][048]|[13579][26]))'

    #month/year, month in Digits
    pattern4 = '^((0[1-9])|(1[0-2]))\/(\d{4})$'

    #yyyy
    pattern5 = '\s\((\d{4})\)$' # or '\b(19|20)\d{2}\b'
    #Note that if year in 5) and 4) are different, use 5) instead.

    full_pattern = '{}|{}|{}|{}|{}'.format(pattern1, pattern2, pattern3, pattern3, pattern4, pattern5)
    regex = re.compile(full_pattern)
    extracted_values = re.findall(regex, text)
    print('----------')
    print(extracted_values)
    print('----------')

date_sorter()
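A much smaller reproduction of both symptoms may help narrow things down. This is only a sketch under stated assumptions: the mm/dd/yyyy pattern is a simplified stand-in (not one of my patterns above), and a plain Python list stands in for the df['mytest'] column. It suggests that the ^ and $ anchors may be why nothing matches inside a longer sentence, and that the re functions expect a single string rather than a whole column.

```python
import re

# Hypothetical, simplified stand-in pattern (mm/dd/yyyy), not the patterns from the post.
anchored = re.compile(r'^\d{1,2}/\d{1,2}/\d{4}$')
unanchored = re.compile(r'\b\d{1,2}/\d{1,2}/\d{4}\b')

text = 'patterns 01/01/1984, November 1940'

# ^ and $ demand that the whole string is a date, so nothing is found here.
anchored_hits = anchored.findall(text)
# Without the anchors, the date embedded in the sentence is found.
unanchored_hits = unanchored.findall(text)

# Assumption: a list stands in for a DataFrame column; re accepts one str at a time.
column = ['visit on 04/01/1988', 'no date here']
try:
    unanchored.findall(column)
    error = None
except TypeError as exc:
    error = exc  # "expected string or bytes-like object"

# Applying the regex to each element separately avoids the TypeError.
per_row = [unanchored.findall(row) for row in column]

print(anchored_hits)
print(unanchored_hits)
print(per_row)
print(type(error).__name__)
```

With a real pandas column, the per-element step would presumably go through something like Series.apply or the Series.str accessor rather than a list comprehension; I have not verified that against my data.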