Read and tokenize doc/docx?

JP_ROMANO · Oct-25-2018, 03:12 PM

Hello... I'm trying to see if Python can, given two filepath/names, read and tokenize the content of those files. This seems like it should be fairly easy, but I can't seem to get anything to work the way I'd expect.

I'd like to start with a 'file open' dialog, and this is all I can get working (with a great deal of help):

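from tkinter import Tk, filedialog

root = Tk()
root.withdraw()  # hide the empty main window; only the dialog is needed
# Ask the user to pick a file; an empty string is returned if the dialog is cancelled
root.filename = filedialog.askopenfilename(initialdir="/", title="Select file", filetypes=(("Word documents", "*.docx"), ("all files", "*.*")))
print("For the first document, you selected:  " + root.filename)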

Now I'd like to take the content of each file and create a tokenized string, where each token is a single word. So the input reads "This is the content of my file" and I need
"This" "is" "the" "content" "of" "my" "file"
or
This,is,the,content,of,my,file

Maybe in Python that would be a list?
This is about as far as I've gotten:

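# Document is provided by the python-docx package, which is imported as docx
from docx import Document
import re
# Placeholder paths; substitute the real locations of the two files
document1 = Document('mydrive\\mypath\\filename1.docx')
document2 = Document('mydrive\\mypath\\filename2.docx')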

I found some info on tokenizing, but it makes very little sense to a person like me who can't understand a lick of Python.
https://docs.python.org/2/library/tokenize.html

Thanks for any guidance!

JP_ROMANO · Oct-25-2018, 04:15 PM

Quick follow up
It looks like we cannot use DOCX because it requires changes to ini files, or path variables or something, and while I don't mind making the changes to my pc, I cannot say the same for the general population that will be using the final product. So, back to the drawing board!

**buran** · (This post was last modified: Oct-25-2018, 04:22 PM by buran.)

(Oct-25-2018, 04:15 PM)JP_ROMANO Wrote: It looks like we cannot use DOCX because it requires changes to ini files, or path variables or something

There is no need to make any changes to work with docx. Can you elaborate on what you mean?
Regarding tokenizing:
Check NLTK (https://www.nltk.org/), e.g. the very first example on the site.
A naive approach would be to use the str.split() method.
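
A minimal sketch of that naive approach, using the sample sentence from the first post. str.split() with no arguments splits on any run of whitespace and returns a list. The regular expression is only an assumption about what should count as a word (letters, digits, and apostrophes); it is not how NLTK tokenizes.

```python
import re

text = "This is the content of my file."

# str.split() with no arguments splits on any run of whitespace
tokens = text.split()
print(tokens)

# With split(), punctuation stays attached to the neighbouring word.
# A regular expression can keep only the word characters instead.
words = re.findall(r"[A-Za-z0-9']+", text)
print(words)

# The comma-separated form asked for in the first post
print(",".join(words))
```

Either result is an ordinary Python list, so it can be indexed, sliced, or compared against the list built from the second file.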

JP_ROMANO · Oct-25-2018, 05:08 PM

buran - it seems that internal users here in our office get
"pip install fails with connection error: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:598)" when they try to install docx.

Maybe it's an environmental thing - the three of us testing were only able to get it to load after changes to our .ini files

**nilamo** · (This post was last modified: Oct-25-2018, 05:30 PM by nilamo.)

That looks like an ssl error, not a docx error. Do they have an old version of openssl installed? Or, if it's windows, an old version of python/pip/windows/anything?

You could install docx in a virtual environment, storing the package locally, and just distribute it along with your program, instead of installing globally on each computer.

JP_ROMANO · (This post was last modified: Oct-25-2018, 06:14 PM by JP_ROMANO.)

One of the users, who is much more experienced than I am, was able to get past the SSL error, but then ran into another set of exceptions - and the deeper we go, the more problems we see.

So, we're going to bail on this approach - but surely there is some other way that python can, right out of the box, read a word document, right?

Quote: Regarding tokenize:
check NLTK. https://www.nltk.org/
e.g. the very first example on the site

-- I tried, but got more errors (I won't bore you with all the details) and ultimately wasn't able to get even the first sample of code to do anything.

But I think maybe I can use the VBA code this was meant to replace to generate txt files, then shell out to a Python script and take it from there, which I'll try to figure out in the next few days. That may be more reasonable than trying to have a larger user community installing things, manipulating configurations and so on. I need to make this work straight out of the box, so maybe a new approach is warranted.

Thanks again for your time and help!

**buran** · (This post was last modified: Oct-25-2018, 06:25 PM by buran.)

(Oct-25-2018, 05:32 PM)JP_ROMANO Wrote: So, we're going to bail on this approach - but surely there is some other way that python can, right out of the box, read a word document, right?

No, there is no out of the box way...
1. docx
2. pywin32
3. ctypes
4. use some native xml library (docx is just bunch of zip-ed xml files), but you need to implement all xml work on your own.
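
To illustrate option 4, here is a rough sketch that uses only the standard library (zipfile and xml.etree.ElementTree). It assumes a regular .docx file whose main text lives in word/document.xml; the function name read_docx_paragraphs is made up for this example. It does not handle the old binary .doc format, and text inside text boxes may be counted more than once because nested paragraphs are not treated specially.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_paragraphs(path):
    """Return the text of each paragraph in a .docx file (standard library only)."""
    # A .docx file is a zip archive; the body text is stored in word/document.xml
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's visible text is spread over one or more <w:t> elements
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs
```

Calling " ".join(read_docx_paragraphs(path)).split() on a file would then give a flat list of words without installing anything.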

As you don't say what this set of exceptions is, we cannot help. Also, if the problem is with installation, you can download a wheel from Gohlke and install it, e.g. pip install docx-0.2.4-py2.py3-none-any.whl

Also, docx is hosted on PyPI (https://python-docx.readthedocs.io/en/la...ml#install). I don't think there is a problem with the PyPI SSL certificate.

**Larz60+** · Oct-25-2018, 06:24 PM

You can use docx, see: https://automatetheboringstuff.com/chapter13/

JP_ROMANO · Oct-25-2018, 07:10 PM

Larz60 - thanks for the link, I'll give it a read through when my blood pressure returns to a normal state :-)

Buran - thank you, again, for the 4 points, I'll send them over to my coworkers who live in python.
But if you're interested in the bigger picture, I'm happy to give some of the details.
... We have files that are written transcripts from audio/visual conference calls. They are all MS Word files, and we have 2 per call, each from one of two sources (generally, one of them has html/xml markup and the other does not, but not always). We need to evaluate the quality of them, and doing so requires that one be considered a master or golden copy, and the other be scored based on closeness to it. I have an excel/vba/sql platform that I wrote with a UI, file import, scoring process, and some other bells and whistles (e.g., it finds where the actual call content begins, and excludes things like operator's instructions; re-calibration, for situations where chunks of words are missed, added, or just wrong; db storage and reporting on the associated metadata; etc.) At the end of the day, the 'other' file has a score, and the differences between the files, the types of potential errors identified, and the number of speaker transitions are recorded.
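
For what it's worth, a very rough idea of what "closeness to the golden copy" could look like in Python is the standard library's difflib, run on word lists. This is only a sketch with made-up sample sentences; the similarity measure is an assumption and is not the scoring method of the VBA platform.

```python
from difflib import SequenceMatcher

# Made-up sample text; in practice these would be the tokenized transcripts
golden = "thank you all for joining the call today".split()
other = "thank you for joining our call today".split()

# autojunk=False disables the popularity heuristic, so frequent words are
# never treated as junk when comparing long transcripts
matcher = SequenceMatcher(None, golden, other, autojunk=False)

# ratio() is 2 * matches / total tokens, a value between 0.0 and 1.0
score = matcher.ratio()
print(f"similarity: {score:.3f}")

# Collect every stretch where the two word lists disagree
differences = [
    (tag, golden[i1:i2], other[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]
for tag, expected, found in differences:
    print(tag, expected, found)
```

The opcode tags (replace, delete, insert) map loosely onto wrong, missed, and added words, which is the kind of breakdown described above.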

What my approach does not yet do is ignore things that are literal differences, but are fairly insignificant. For example, if one file says "can not" and the other "cannot" or one has cardinal numbers and the other ordinal, we can consider them to be the same.
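
One way this could be sketched in Python is to normalize both texts before comparing them. The rules below are hypothetical and cover only the two examples just mentioned ("can not" vs. "cannot", and digit ordinals such as "3rd" vs. "3"); spelled-out numbers like "third" are not handled.

```python
import re

# Hypothetical rule: a number followed by an ordinal suffix, e.g. "3rd"
ORDINAL_SUFFIX = re.compile(r"^(\d+)(?:st|nd|rd|th)$")


def normalize(text):
    """Lowercase, tokenize, and fold a few insignificant differences."""
    # Treat "can not" and "cannot" as the same word
    text = re.sub(r"\bcan not\b", "cannot", text.lower())
    tokens = re.findall(r"[a-z0-9']+", text)
    # Fold digit ordinals onto cardinals: "3rd" becomes "3"
    return [ORDINAL_SUFFIX.sub(r"\1", tok) for tok in tokens]


print(normalize("We can not attend the 3rd call"))
print(normalize("We cannot attend the 3 call"))
```

Both calls produce the same token list, so a comparison run on the normalized lists would no longer flag these as differences.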

My thought was that there is probably something already done in some python library that we could leverage, so I pulled some of my peers who are regular python users to help out. I'm just trying to get my hands into the early part of the process, in what I thought would be fairly simple, so they could focus on reproducing the VBA code (or achieving similar results) relating to the text alignment, re-calibration, and the more challenging stuff. Ultimately, I'd like to be able to ditch the UI and run scripts against file pairs sitting on a server somewhere, but we're light years from that at the moment.

I hope that helps clarify a bit. Even though using python was my suggestion, I think we're just barking up the wrong tree until we get to that later phase where we don't need or want to use a UI.

Anyway, I thank you again for your time and the info you provided!

**buran** · (This post was last modified: Oct-25-2018, 07:37 PM by buran.)

NLP is not my domain, but look at https://www.kdnuggets.com/2018/07/compar...aries.html

Based on my experience with VBA, it shouldn't be difficult to convert to Python even without all these NLP libraries. For someone who is an expert in the field and who can utilize them, it should be even easier.
