Python Forum
Tokenize with RegEx python homework
#1
Hello everyone,
I have a Tokenize exercise and i'm not allowed to use the nltk. I'm kind of stuck on the regex. I am having problems with the quotation marks "" that are not recognized as tokens and also with "Mr. , Ms.", this should be considered as one single token while in my output Mr. appears as 'Mr', '.'. The rest seems to be fine but i am having these two problems.



text = re.compile (r'[n]'[\w]+|[\w]+(?!')(?:[A-Za-mo-z](?='))?|(?<=\s)[\w](?=)|[^\s\w'][A-Z]?\w+|[;.,!?:]|\')
As an example, a text like: 'Mr. Brown opened the door and said with a smile, "I can\'t believe it! It\'s such a pleasure to see you!"'
should give an output like
Output:
['Mr.', 'Brown', 'opened', 'the', 'door', 'and', 'said', 'with', 'a', 'smile', ',', '"', 'I', 'ca', "n't", 'believe', 'it', '!', 'It', "'s", 'such', 'a', 'pleasure', 'to', 'see', 'you', '!', '"']
I hope my problem is clear and that you can help me out.
Best regards and thanks in advance,
#2
I cannot run the above line of code. Python throws a syntax error:
>>> text = re.compile (r'[n]'[\w]+|[\w]+(?!')(?:[A-Za-mo-z](?='))?|(?<=\s)[\w](?=)|[^\s\w'][A-Z]?\w+|[;.,!?:]|\')
  File "<stdin>", line 1
    text = re.compile (r'[n]'[\w]+|[\w]+(?!')(?:[A-Za-mo-z](?='))?|(?<=\s)[\w](?=)|[^\s\w'][A-Z]?\w+|[;.,!?:]|\')
                                                                                                                ^
SyntaxError: unexpected character after line continuation character
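The error comes from the quoting rather than from the regex itself: the literal that starts with r' is closed by the next ', so the rest of the pattern is read as Python source, where a backslash outside a string is a line-continuation character. A minimal sketch of one way around it, assuming the pattern needs to contain apostrophes, is to delimit the raw string with double quotes (or triple quotes when both kinds of quote appear). The two small patterns below are illustrative only and are not the pattern from the post.

```python
import re

# Double quotes delimit the raw string, so the apostrophe inside the
# pattern no longer ends the literal early.
apostrophe_pattern = re.compile(r"\w+|'")

# Triple quotes allow both ' and " inside the same raw string.
quote_pattern = re.compile(r"""['"]""")

print(apostrophe_pattern.findall("it's"))   # ['it', "'", 's']
print(quote_pattern.findall('say "hi"'))    # ['"', '"']
```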
#3
I noticed that it is not working now, and I cannot find the error. How would you write it, given the example output above? There is probably a less messy way, but I have been struggling for days and it is still not working.
The main points are:

- Punctuation such as commas, colons, etc. gets its own token
- Punctuation in “Mr.”, “Mrs.”, “Ms.”, and “Dr.” should not receive its own token
- The word parts “n’t,” “’ll,” “’d,” “’ve,” “’m,” and “’re” get their own token
- Possessives (i.e. “John’s”) should be treated as two tokens, with the second token starting at the apostrophe.
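The rules above could be sketched as one alternation passed to re.findall, with the most specific alternatives listed first. This is only one possible approach, not a reference solution: the names TOKEN_RE and tokenize are made up for illustration, and accepting both the straight (') and curly (’) apostrophe is an assumption about the input.

```python
import re

# Alternatives are tried left to right, so specific cases come first.
TOKEN_RE = re.compile(r"""
    \b(?:Mr|Mrs|Ms|Dr)\.          # titles keep their period
  | n['’]t\b                      # the n't part of a contraction
  | ['’](?:ll|d|ve|m|re|s)\b      # 'll, 'd, 've, 'm, 're and possessive 's
  | \w+?(?=n['’]t\b)              # word stem before n't, e.g. ca + n't
  | \w+                           # any other word
  | [^\w\s]                       # any other single punctuation character
""", re.VERBOSE)


def tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_RE.findall(text)


sample = 'Mr. Brown opened the door and said with a smile, "I can\'t believe it!"'
print(tokenize(sample))
```

The pattern is written in a triple-quoted raw string so that it can contain both kinds of quotation mark, and re.VERBOSE allows a comment on each alternative. The lazy \w+? with a lookahead is what splits "can't" into "ca" and "n't"; without it, \w+ would take "can" first.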