Tokenize with RegEx python homework

Tokenize with RegEx python homework - pythoncrazy1 - Dec-12-2018

Hello everyone,
I have a tokenization exercise and I'm not allowed to use NLTK. I'm stuck on the regex and have two problems: quotation marks (") are not recognized as tokens, and titles such as "Mr." and "Ms." should each be a single token, whereas in my output "Mr." appears as 'Mr', '.'. The rest seems to be fine.

text = re.compile (r'[n]'[\w]+|[\w]+(?!')(?:[A-Za-mo-z](?='))?|(?<=\s)[\w](?=)|[^\s\w'][A-Z]?\w+|[;.,!?:]|\')

As an example, a text like:
'Mr. Brown opened the door and said with a smile, "I can\'t believe it! It\'s such a pleasure to see you!"'
should give an output like

Output: 
['Mr.', 'Brown', 'opened', 'the', 'door', 'and', 'said', 'with', 'a',
'smile', ',', '"', 'I', 'ca', "n't", 'believe', 'it', '!', 'It', "'s",
'such', 'a', 'pleasure', 'to', 'see', 'you', '!', '"']

I hope my problem is clear and that you can help me out.
Best regards and thanks in advance,

RE: Tokenize with RegEx python homework - Gribouillis - Dec-13-2018

I cannot run the above line of code. Python throws a syntax error:

>>> text = re.compile (r'[n]'[\w]+|[\w]+(?!')(?:[A-Za-mo-z](?='))?|(?<=\s)[\w](?=)|[^\s\w'][A-Z]?\w+|[;.,!?:]|\')
  File "<stdin>", line 1
    text = re.compile (r'[n]'[\w]+|[\w]+(?!')(?:[A-Za-mo-z](?='))?|(?<=\s)[\w](?=)|[^\s\w'][A-Z]?\w+|[;.,!?:]|\')
                                                                                                                ^
SyntaxError: unexpected character after line continuation character
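
A note on the traceback: the error appears to come from the string literal ending early. r'[n]' is already a complete raw string, so everything after its closing apostrophe is parsed as Python code, and the first backslash outside the string is read as a line continuation. One way around this, sketched below, is a triple-quoted raw string, in which both ' and " can appear unescaped. The pattern and the name quote_aware are hypothetical and only illustrate the quoting; this is not the corrected homework regex.

```python
import re

# Triple-quoted raw string: apostrophes and double quotes need no escaping.
# Illustrative pattern only: a run of word characters, or a single ' or ".
quote_aware = re.compile(r"""\w+|['"]""")

tokens = quote_aware.findall('say "it\'s"')
print(tokens)
```

With this quoting the line compiles, and both kinds of quotation mark come back as tokens of their own.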

RE: Tokenize with RegEx python homework - pythoncrazy1 - Dec-13-2018

I noticed that it no longer works, and I cannot find the error. How would you write it, given the example output above? There is probably a less messy way, but I have been struggling for days and cannot get it to work.
The main points are:

- Punctuation such as commas, colons, etc. gets its own token
- The period in “Mr.”, “Mrs.”, “Ms.”, and “Dr.” should not receive its own token
- The word parts “n’t”, “’ll”, “’d”, “’ve”, “’m”, and “’re” get their own token
- Possessives (e.g. “John’s”) should be treated as two tokens, with the second token starting at the apostrophe.
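
The four rules above can be sketched as a single alternation in which the more specific cases come before the general ones. This is only one possible approach, not necessarily what the exercise expects: the names TOKEN_RE and tokenize are made up for illustration, and accepting the typographic apostrophe ’ alongside ' is an assumption.

```python
import re

# Hypothetical pattern: alternatives are tried left to right, so the more
# specific rules come first. Both ' and the typographic ’ are accepted.
TOKEN_RE = re.compile(r"""
    \b(?:Mr|Mrs|Ms|Dr)\.          # titles keep their period
  | \w+?(?=n['’]t\b)              # stem before n't, e.g. ca in can't
  | n['’]t\b                      # the n't part itself
  | ['’](?:ll|d|ve|m|re|s)\b      # 'll, 'd, 've, 'm, 're and possessive 's
  | \w+                           # ordinary words
  | [^\w\s]                       # any other single punctuation character
""", re.VERBOSE)


def tokenize(text):
    """Return the list of tokens found in text (whitespace is skipped)."""
    return TOKEN_RE.findall(text)


sample = ('Mr. Brown opened the door and said with a smile, '
          '"I can\'t believe it! It\'s such a pleasure to see you!"')
print(tokenize(sample))
```

The lazy \w+? with a lookahead is what stops the word stem just before n't, so that "can't" is split as 'ca' and "n't". Because the pattern has no capturing groups, findall returns the whole matches in order.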