Python Forum
Trouble importing text from a .docx file
#1
I am trying to tokenize the words of a Word document, Doc.docx, which contains the sentence "This is a doc file". Unfortunately, each token is prefixed with the letter 'u'.

from nltk.tokenize import word_tokenize
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

Text = getText('Doc.docx')
words = word_tokenize(Text)
print(words)
Output: [u'This', u'is', u'a', u'doc', u'file']
Expected output: ['This', 'is', 'a', 'doc', 'file']
#2
 But unfortunately, each token is getting prefixed with a letter 'u'
The 'u' prefix simply indicates that each token is a Unicode string. Try this to get rid of it:

Text = getText('Doc.docx')
words = word_tokenize(Text)
# Python 2 only: map() returns a list here; str() raises UnicodeEncodeError on non-ASCII tokens
words = map(str, words)
print(words)
In Python 3 every string is Unicode, so you won't see this at all (in fact, it's not even an issue). Use Python 3, or the trick above if you are on Python 2.
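To illustrate the Python 3 point, here is a minimal sketch, assuming Python 3.3 or later. The token 'caf\u00e9' is a made-up example and does not come from Doc.docx; the variable names are only for illustration.

```python
# Python 3 sketch; 'caf\u00e9' is a made-up token, not taken from Doc.docx.
token = 'caf\u00e9'

# Every text string is a Unicode str, so no conversion step is needed.
type_name = type(token).__name__

# The u'' prefix is still accepted (Python 3.3+) and yields the same type.
same_literal = (u'file' == 'file')

# Bytes are a separate type, produced only by an explicit encode().
encoded = token.encode('utf-8')
decoded = encoded.decode('utf-8')

print(type_name, same_literal, len(token), len(encoded), decoded == token)
```

The length differs between the str and its UTF-8 bytes because the accented character takes two bytes, which is the kind of distinction Python 3 makes explicit.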
#3
That shouldn't be an issue. When you actually do something with those tokens/words, the 'u' won't be there anyway.
#4
As mentioned, you do not need to do anything with those Unicode strings.
All string methods work the same on Unicode strings, and when you use print the u won't be there.
>>> lst = [u'This', u'is', u'a', u'doc', u'file']
>>> for item in lst:
...     print(item)
...     
This
is
a
doc
file

>>> lst[0].upper()
u'THIS'
>>> # print and u is gone
>>> print(lst[0].upper())
THIS
You should be using Python 3; as mentioned, all strings there are Unicode.
Unicode was one of the biggest changes in Python 3.
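The reason the prefix shows up in the list output but not when printing single items is that printing a list displays the repr() of each element, while printing a string displays its str() form. A rough sketch of that difference, assuming Python 3 and a hard-coded list standing in for the word_tokenize() result:

```python
# Hypothetical token list standing in for word_tokenize(Text).
words = ['This', 'is', 'a', 'doc', 'file']

# print() on a list shows repr() of each element, quotes included
# (on Python 2 the repr of a unicode string also carries the u prefix).
list_display = repr(words)

# Joining the tokens gives their plain str() form, with no quotes or prefix.
joined = ', '.join(words)

print(list_display)
print(joined)
```

So if the goal is only clean display, joining or looping over the tokens is enough; no conversion of the tokens themselves is needed.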