Python Forum
Read and tokenize doc/docx?
#1
Hello... I'm trying to see if Python can, given two filepath/names, read and tokenize the content of those files. This seems like it should be fairly easy, but I can't seem to get anything to work the way I'd expect.

I'd like to start with a 'file open' dialog, and this is all I can get working (with a great deal of help):
from tkinter import Tk, filedialog

root = Tk()
# Show the standard "open file" dialog and remember the chosen path.
root.filename = filedialog.askopenfilename(
    initialdir="/",
    title="Select file",
    filetypes=(("all files", "*.*"),),
)
print("For the first document, you selected:  " + root.filename)
Now I'd like to take the content of each file and create a tokenized string, where each token is a single word. So the input reads "This is the content of my file" and I need
"This" "is" "the" "content" "of" "my" "file"
or
This,is,the,content,of,my,file

Maybe in python that would be a list?
This is about as far as I've gotten
from docx import Document
import re
document1 = Document('mydrive\\mypath\\filename1.docx')
document2 = Document('mydrive\\mypath\\filename2.docx')
I found some info on tokenizing, but it makes very little sense to a person like me who can't understand a lick of Python.
https://docs.python.org/2/library/tokenize.html

Thanks for any guidance!
#2
Quick follow up
It looks like we cannot use DOCX because it requires changes to ini files, or path variables or something, and while I don't mind making the changes to my pc, I cannot say the same for the general population that will be using the final product. So, back to the drawing board!
#3
(Oct-25-2018, 04:15 PM)JP_ROMANO Wrote: It looks like we cannot use DOCX because it requires changes to ini files, or path variables or something
There is no need to change anything to work with docx. Can you elaborate on what you mean?
Regarding tokenizing:
Check NLTK: https://www.nltk.org/
See, e.g., the very first example on the site.
A naive approach would be to use the str.split() method.
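As a minimal sketch of that naive approach, using the sample sentence from the first post: str.split() with no arguments breaks a string on whitespace and returns a list. The regular-expression variant is an optional extra (the names WORD_RE and tokenize are made up for this illustration) that keeps punctuation from sticking to the words.

```python
import re

text = "This is the content of my file"

# str.split() with no arguments splits on any run of whitespace.
tokens = text.split()

# Illustrative alternative: keep runs of letters, digits and apostrophes,
# so commas and full stops are not glued to the neighbouring word.
WORD_RE = re.compile(r"[A-Za-z0-9']+")


def tokenize(s):
    """Return the words in s as a list of strings."""
    return WORD_RE.findall(s)


print(tokens)            # the words as a Python list
print(",".join(tokens))  # the same words as one comma-separated string
print(tokenize("Hello, world! It's fine."))
```

So yes, in Python the natural result is a list of strings, and ",".join(...) turns it back into the comma-separated form if that is what you need.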
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

#4
buran - it seems that internal users here in our office get
"pip install fails with connection error: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:598)" when they try to install docx

Maybe it's an environmental thing - the three of us testing were only able to get it to load after changes to our .ini files
#5
That looks like an ssl error, not a docx error. Do they have an old version of openssl installed? Or, if it's windows, an old version of python/pip/windows/anything?

You could install docx in a virtual environment, storing the package locally, and just distribute it along with your program, instead of installing globally on each computer.
#6
One of the users, who is much more experienced than I am, was able to get past the SSL error, but then ran into another set of exceptions - and the deeper we go, the more problems we see.

So, we're going to bail on this approach - but surely there is some other way that python can, right out of the box, read a word document, right?

Quote:Regarding tokenize:
check NLTK. https://www.nltk.org/
e.g. the very first example on the site

-- I tried, but got more errors (I won't bore you with all the details) and ultimately wasn't able to get even the first sample of code to do anything.

But I think maybe I can use the VBA code this was meant to replace to generate txt files, then shell out to a Python script and take it from there, which I'll try to figure out in the next few days. That may be more reasonable than trying to have a larger user community installing things, manipulating configurations and so on. I need to try to make this work straight out of the box, so maybe a new approach is warranted.

Thanks again for your time and help!
#7
(Oct-25-2018, 05:32 PM)JP_ROMANO Wrote: So, we're going to bail on this approach - but surely there is some other way that python can, right out of the box, read a word document, right?
No, there is no out-of-the-box way. The options are:
1. docx
2. pywin32
3. ctypes
4. use some native XML library (a .docx file is just a bunch of zipped XML files), but you would need to implement all the XML work on your own.
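Option 4 can be sketched with nothing but the standard library, which may matter if installing packages is the obstacle. The sketch below assumes a regular .docx file (not the older binary .doc format) whose main text sits in word/document.xml inside the zip archive. The function names are invented for this example, and it ignores headers, footers, footnotes and any special handling of text boxes.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by the main WordprocessingML elements (<w:p>, <w:t>, ...).
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file):
    """Return the text of each paragraph in a .docx file as a list of strings."""
    with zipfile.ZipFile(path_or_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # The visible text is stored in <w:t> elements inside each paragraph.
        pieces = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return paragraphs


def docx_tokens(path_or_file):
    """Return every whitespace-separated word in the document, in order."""
    return " ".join(docx_paragraphs(path_or_file)).split()
```

Calling docx_tokens() with the path to a .docx file returns a flat list of words. Anything beyond plain body text (styles, tracked changes, embedded objects) is exactly the "XML work on your own" mentioned above.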

Since you don't say what this set of exceptions is, we cannot help. Also, if the problem is with installation, you can download a wheel from Gohlke and install it, e.g. pip install docx-0.2.4-py2.py3-none-any.whl

Also, docx is hosted on PyPI (https://python-docx.readthedocs.io/en/la...ml#install). I don't think there is a problem with the PyPI SSL certificate.
#8
You can use docx, see: https://automatetheboringstuff.com/chapter13/
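For reference, a small sketch of what that looks like for the tokenizing question. It assumes the package that provides "from docx import Document" is installed (it is published on PyPI under the name python-docx). The helper names are invented for this example. Document(...).paragraphs covers body paragraphs only, so text inside tables would need separate handling.

```python
def paragraphs_to_tokens(paragraphs):
    """Turn an iterable of objects with a .text attribute into a list of words."""
    tokens = []
    for paragraph in paragraphs:
        tokens.extend(paragraph.text.split())
    return tokens


def read_docx_tokens(path):
    """Open a .docx file with python-docx and return its words as a list."""
    # Imported here so paragraphs_to_tokens() works even without python-docx.
    from docx import Document  # installed with: pip install python-docx

    return paragraphs_to_tokens(Document(path).paragraphs)
```

With the paths from the first post, read_docx_tokens('mydrive\\mypath\\filename1.docx') would give the word list for the first file, and the same call on the second path gives the list to compare it with.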
#9
Lazr60 - thanks for the link, I'll give it a read through when my blood pressure returns to a normal state :-)

Buran - thank you, again, for the 4 points, I'll send them over to my coworkers who live in python.
But if you're interested in the bigger picture, I'm happy to give some of the details.
... We have files that are written transcripts from audio/visual conference calls. They are all MS Word files, and we have 2 per call, each from one of two sources (generally, one of them has html/xml markup and the other does not, but not always). We need to evaluate the quality of them, and doing so requires that one be considered a master or golden copy, and the other be scored based on closeness to it. I have an excel/vba/sql platform that I wrote with a UI, file import, scoring process, and some other bells and whistles (e.g., it finds where the actual call content begins, and excludes things like operator's instructions; re-calibration, for situations where chunks of words are missed, added, or just wrong; db storage and reporting on the associated metadata; etc.). At the end of the day, the 'other' file has a score, and the differences between the files, the types of potential errors identified, and the number of speaker transitions are recorded.

What my approach does not yet do is ignore things that are literal differences, but are fairly insignificant. For example, if one file says "can not" and the other "cannot" or one has cardinal numbers and the other ordinal, we can consider them to be the same.
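One possible way to do that kind of "insignificant difference" handling, sketched with the standard library only: normalize both texts before comparing them, then score the token lists with difflib.SequenceMatcher. This is not the scoring used in the VBA platform described above; the EQUIVALENTS table, the function names and the choice of difflib are all assumptions for illustration. It only handles digit ordinals such as "3rd", not spelled-out ones such as "third".

```python
import re
from difflib import SequenceMatcher

# Hypothetical rewrite rules; extend with whatever pairs should count as equal.
EQUIVALENTS = [
    (re.compile(r"\bcan\s+not\b"), "cannot"),
    (re.compile(r"\b(\d+)(?:st|nd|rd|th)\b"), r"\1"),  # "3rd" -> "3"
]


def normalize(text):
    """Lower-case the text, apply the rewrite rules, and split into words."""
    text = text.lower()
    for pattern, replacement in EQUIVALENTS:
        text = pattern.sub(replacement, text)
    return re.findall(r"[a-z0-9']+", text)


def closeness(master_text, other_text):
    """Return a similarity ratio between 0.0 and 1.0 for the two texts."""
    master = normalize(master_text)
    other = normalize(other_text)
    # autojunk=False stops difflib from ignoring very frequent words.
    return SequenceMatcher(None, master, other, autojunk=False).ratio()
```

With these rules, "We can not attend the 3rd call" and "we cannot attend the 3 call" normalize to the same token list, so they compare as identical, while genuinely different words still lower the ratio.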

My thought was that there is probably something already done in some python library that we could leverage, so I pulled some of my peers who are regular python users to help out. I'm just trying to get my hands into the early part of the process, in what I thought would be fairly simple, so they could focus on reproducing the VBA code (or achieving similar results) relating to the text alignment, re-calibration, and the more challenging stuff. Ultimately, I'd like to be able to ditch the UI and run scripts against file pairs sitting on a server somewhere, but we're light years from that at the moment.

I hope that helps clarify a bit. Even though using python was my suggestion, I think we're just barking up the wrong tree until we get to that later phase where we don't need or want to use a UI.

Anyway, I thank you again for your time and the info you provided!
#10
NLP is not my domain, but look at https://www.kdnuggets.com/2018/07/compar...aries.html

Based on my experience with VBA, it shouldn't be difficult to convert to Python even without all these NLP libraries. For someone who is an expert in the field and who can utilize them, it should be even easier.