Python Forum
Trouble importing text from a .docx file
#1
I am trying to tokenize the words of a Word document, Doc.docx, which contains the sentence "This is a doc file". Unfortunately, each token is prefixed with the letter 'u'.

from nltk.tokenize import word_tokenize
import docx

def getText(filename):
    # Read the .docx file and collect the text of every paragraph
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

Text = getText('Doc.docx')
words = word_tokenize(Text)
print(words)
Output: [u'This', u'is', u'a', u'doc', u'file']
Expected output: ['This', 'is', 'a', 'doc', 'file']
#2
 But unfortunately, each token is getting prefixed with a letter 'u'
The 'u' prefix simply indicates that each token is a Unicode string. Try this to get rid of it:

Text = getText('Doc.docx')
words = word_tokenize(Text)
words = map(str, words)
print(words)
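In Python 3 every string is Unicode, so you won't see this prefix (in fact, it's not even an issue). Use Python 3, or the trick above if you are on Python 2. Note that in Python 2, str() raises UnicodeEncodeError for tokens that contain non-ASCII characters.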
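To illustrate the Python 3 point, here is a minimal sketch that can be run without nltk or python-docx. The token list is hard-coded as a stand-in for the output of word_tokenize, which is an assumption made only for this example.

```python
# Hard-coded stand-in for the tokens returned by word_tokenize (assumption).
# The u'' prefix is still accepted in Python 3.3+, but it is redundant there.
tokens = [u'This', u'is', u'a', u'doc', u'file']
plain = ['This', 'is', 'a', 'doc', 'file']

# In Python 3, u'...' and '...' literals produce the same type: str.
same = tokens == plain
all_str = all(type(t) is str for t in tokens)

# The list's printed form therefore carries no u prefix in Python 3.
shown = repr(tokens)
print(same, all_str, shown)
```

On Python 3 both checks hold and the printed list has no prefix, so no conversion step is needed.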
#3
That shouldn't be an issue. When you actually do something with those tokens/words, the 'u' won't be there anyway.
#4
As mentioned, you do not need to do anything with those Unicode strings.
All string methods work the same on Unicode strings, and when you print a string the u won't be there.
>>> lst = [u'This', u'is', u'a', u'doc', u'file']
>>> for item in lst:
...     print(item)
...     
This
is
a
doc
file

>>> lst[0].upper()
u'THIS'
>>> # print and u is gone
>>> print(lst[0].upper())
THIS
You should be using Python 3; as mentioned, all strings there are Unicode.
Unicode was one of the biggest changes in Python 3.
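The reason the u shows up only when the whole list is printed is that printing a list displays the repr() of each element, while printing a single string displays the string itself. A small sketch of that difference, using the same hard-coded list as above (the variable names are just for illustration):

```python
lst = [u'This', u'is', u'a', u'doc', u'file']

# str() of a list is built from the repr() of each element,
# which is where quotes (and, on Python 2, the u prefix) come from.
list_display = str(lst)
rebuilt = "[" + ", ".join(repr(item) for item in lst) + "]"
print(list_display == rebuilt)

# Joining the tokens gives one plain string with no quotes or prefix.
sentence = " ".join(lst)
print(sentence)
```

So if the goal is clean display, joining or looping over the tokens is enough; the tokens themselves do not need converting.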