Tokenize with RegEx python homework - Python Forum (https://python-forum.io), thread /thread-14701.html
Tokenize with RegEx python homework - pythoncrazy1 - Dec-12-2018

Hello everyone,

I have a tokenizing exercise and I'm not allowed to use nltk. I'm kind of stuck on the regex. I have two problems: the quotation marks "" are not recognized as tokens, and "Mr.", "Ms." and so on should each be one single token, while in my output Mr. appears as 'Mr', '.'. The rest seems to be fine.

text = re.compile (r'[n]'[\w]+|[\w]+(?!')(?:[A-Za-mo-z](?='))?|(?<=\s)[\w](?=)|[^\s\w'][A-Z]?\w+|[;.,!?:]|\')

As an example, a text like:

' Mr. Brown opened the door and said with a smile , "I can\'t believe it! It\'s such a pleasure to see you!" '

should give an output like

I hope you understood my problem and can help me out.

Best regards and thanks in advance,

RE: Tokenize with RegEx python homework - Gribouillis - Dec-13-2018

I cannot run the above line of code. Python throws a syntax error:

>>> text = re.compile (r'[n]'[\w]+|[\w]+(?!')(?:[A-Za-mo-z](?='))?|(?<=\s)[\w](?=)|[^\s\w'][A-Z]?\w+|[;.,!?:]|\')
  File "<stdin>", line 1
    text = re.compile (r'[n]'[\w]+|[\w]+(?!')(?:[A-Za-mo-z](?='))?|(?<=\s)[\w](?=)|[^\s\w'][A-Z]?\w+|[;.,!?:]|\')
    ^
SyntaxError: unexpected character after line continuation character

RE: Tokenize with RegEx python homework - pythoncrazy1 - Dec-13-2018

I noticed that it is not working now and I cannot find the error. How would you write it, given the output example above? There is probably a less messy way, but I have been struggling for days and it is still not working. The main points are:

- Punctuation such as commas, colons, etc. gets its own token
- Punctuation in “Mr.”, “Mrs.”, “Ms.”, and “Dr.” should not receive its own token
- The word parts “n’t,” “’ll,” “’d,” “’ve,” “’m,” and “’re” get their own token
- Possessives (i.e. “John’s”) should be treated as two tokens, with the second token starting at the apostrophe.
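
One possible way to cover the four rules above, offered as a sketch rather than the expected homework solution. Two things to note. First, the syntax error quoted in the thread comes from the apostrophes inside the pattern: they end the r'...' literal early, so Python tries to parse the rest of the regex as code. Putting the pattern in a triple-quoted raw string avoids that. Second, the names TOKEN_RE and tokenize are made up for this example, and splitting "can't" into "ca" + "n't" is an assumption about the expected output, which the original post does not show.

```python
import re

# Alternatives are tried left to right, so the special cases come first.
TOKEN_RE = re.compile(r"""
    \b(?:Mr|Mrs|Ms|Dr)\.          # titles keep their trailing period
  | \w+?(?=n['’]t\b)              # word stem in front of n't, e.g. ca|n't
  | n['’]t\b                      # the n't part itself
  | ['’](?:ll|d|ve|m|re|s)\b      # 'll, 'd, 've, 'm, 're and possessive 's
  | \w+                           # any other word or number
  | [^\w\s]                       # any other non-space character, one per token
""", re.VERBOSE)


def tokenize(text):
    """Return the list of tokens found in text (whitespace is dropped)."""
    return TOKEN_RE.findall(text)


sample = 'Mr. Brown opened the door and said with a smile, "I can\'t believe it! It\'s such a pleasure to see you!"'
print(tokenize(sample))
```

With this pattern the double quotes come out as separate tokens because they fall through to the last alternative, and "Mr." stays in one piece because the title alternative is tried before the generic word and punctuation rules. Both the straight apostrophe and the typographic one (’) are accepted.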