Python Forum

Full Version: Tokenize with RegEx python homework
Hello everyone,
I have a tokenization exercise and I'm not allowed to use NLTK. I'm stuck on the regex. I have two problems: the quotation marks " are not recognized as tokens, and abbreviations such as "Mr." and "Ms." should each be a single token, but in my output "Mr." comes out as 'Mr', '.'. The rest seems to be fine.



text = re.compile (r'[n]'[\w]+|[\w]+(?!')(?:[A-Za-mo-z](?='))?|(?<=\s)[\w](?=)|[^\s\w'][A-Z]?\w+|[;.,!?:]|\')
As an example, a text like:
'Mr. Brown opened the door and said with a smile, "I can\'t believe it! It\'s such a pleasure to see you!"'
should give an output like
Output:
['Mr.', 'Brown', 'opened', 'the', 'door', 'and', 'said', 'with', 'a', 'smile', ',', '"', 'I', 'ca', "n't", 'believe', 'it', '!', 'It', "'s", 'such', 'a', 'pleasure', 'to', 'see', 'you', '!', '"']
I hope you understand my problem and can help me out.
Best regards and thanks in advance,
I cannot run the above line of code. Python throws a syntax error:
>>> text = re.compile (r'[n]'[\w]+|[\w]+(?!')(?:[A-Za-mo-z](?='))?|(?<=\s)[\w](?=)|[^\s\w'][A-Z]?\w+|[;.,!?:]|\')
  File "<stdin>", line 1
    text = re.compile (r'[n]'[\w]+|[\w]+(?!')(?:[A-Za-mo-z](?='))?|(?<=\s)[\w](?=)|[^\s\w'][A-Z]?\w+|[;.,!?:]|\')
                                                                                                                ^
SyntaxError: unexpected character after line continuation character
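A likely cause of this error: the pattern is written as a raw string delimited by single quotes, so the first unescaped ' inside it ends the string literal. Everything after that is parsed as Python code, and a backslash outside a string is treated as a line-continuation character. A minimal sketch of one way around it is to delimit the pattern with double quotes (or triple quotes). The name apostrophe_part and the sample text are made up for illustration and are not the poster's pattern.

```python
import re

# A double-quoted raw string can contain apostrophes without ending the literal.
# (apostrophe_part is an illustrative name, not part of the original pattern.)
apostrophe_part = re.compile(r"'\w+")

print(apostrophe_part.findall("It's John's"))  # ["'s", "'s"]
```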
I noticed that it no longer works, and I cannot find the error. How would you write it, given the example output above? There is probably a less messy way, but I have been struggling for days and it is still not working.
The main points are:

- Punctuation such as commas, colons, etc. gets its own token
- The period in “Mr.”, “Mrs.”, “Ms.”, and “Dr.” should not receive its own token
- The word parts “n’t”, “’ll”, “’d”, “’ve”, “’m”, and “’re” get their own token
- Possessives (e.g. “John’s”) should be treated as two tokens, with the second token starting at the apostrophe.
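
The rules above could be sketched with a single alternation in which the more specific cases come first, so that re.findall picks them before falling back to plain words and punctuation. This is only one possible approach and makes no claim to be what the exercise expects. The names TOKEN_RE and tokenize are hypothetical, and accepting the curly apostrophe (U+2019) alongside the straight one is an assumption.

```python
import re

# Hypothetical names (TOKEN_RE, tokenize); a sketch, not the exercise's reference solution.
# Alternatives are tried left to right, so the special cases are listed first.
TOKEN_RE = re.compile(r"""
    \b(?:Mr|Mrs|Ms|Dr)\.             # title plus its period as one token
  | n['\u2019]t\b                    # the n't part of a contraction
  | ['\u2019](?:ll|d|ve|m|re|s)\b    # 'll, 'd, 've, 'm, 're and possessive 's
  | \w+?(?=n['\u2019]t\b)            # word stem right before n't (ca in can't)
  | \w+                              # any other word
  | [^\w\s]                          # any other single punctuation character
""", re.VERBOSE)


def tokenize(text):
    """Return the list of tokens found in text; whitespace is skipped."""
    return TOKEN_RE.findall(text)


sample = 'Mr. Brown opened the door and said with a smile, "I can\'t believe it!"'
print(tokenize(sample))
```

Because the pattern contains no capturing groups, findall returns the full text of each match. The lazy `\w+?` with a lookahead is what splits "can't" into "ca" and "n't"; without it, `\w+` would consume "can" and leave "'t" behind.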