Dec-24-2017, 04:46 PM
I am trying to tokenize the words of a Word document, Doc.docx, which contains the sentence "This is a doc file". But unfortunately, each token is getting prefixed with the letter 'u':

from nltk.tokenize import word_tokenize
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

Text = getText('Doc.docx')
words = word_tokenize(Text)
print(words)
Output : [u'This', u'is', u'a', u'doc', u'file']
Expected Output : ['This', 'is', 'a', 'doc', 'file']
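A note that may help: the u prefix is not part of the token text at all; it is just how Python 2 displays unicode strings when a list is printed. The tokens themselves are fine, and under Python 3 (where str is already unicode) the same list prints without any prefix. A minimal check, using the token list from the output above:

```python
# Tokens exactly as shown in the question's output.
words = [u'This', u'is', u'a', u'doc', u'file']

# Under Python 3, u'...' and '...' are the same type, so the
# u prefix disappears from the printed repr.
print(words)  # ['This', 'is', 'a', 'doc', 'file']

# The 'u' was never inside the string itself.
assert words[0] == 'This'
assert u'This' == 'This'
```

So nothing needs to change in the tokenizing code itself; running the script under Python 3, or simply ignoring the repr prefix under Python 2, gives the expected result.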