Python Forum

Full Version: Comparing 2 Files - Step 1, import and remove tags
I have a project I completed (mostly) in VBA, but I don't think VBA is great for larger data sets. I'm thinking that maybe I should try to use Python for the 'engine' and just keep the VBA side for the UI and distribution. However, as seems to be the case with everything I try in Python, I just can't make it work. Maybe if I can get past the first step, I'll be able to move forward on my own. So if any of you can assist, I'd certainly appreciate it.

For the first step, all I'm trying to do is load 2 Word documents and remove the HTML/XML tags. I've tried https://www.tutorialspoint.com/python/py...cument.htm but can't get past PIP INSTALL DOCX. I've tried Beautiful Soup 4 but get errors like " looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup." Then "'"%s" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.' % markup)" and on and on it goes. I've been through at least half a dozen "this is easy and should work" tutorials.

All I want to do is open two files and remove the HTML/XML. Surely there has to be something out there that I can just plug the file paths and names into and see the results?

Thanks for any assistance.
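The Beautiful Soup message quoted above generally means that a file name was passed where the markup itself was expected. As a rough sketch of the "read the file first, then parse its contents" idea, here is a version that uses the standard library's html.parser instead of Beautiful Soup (a swapped-in technique, not what the original poster ran). It assumes the input is a plain-text HTML or XML file; TagStripper, strip_tags and strip_tags_from_file are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser
from pathlib import Path
import tempfile


class TagStripper(HTMLParser):
    """Hypothetical helper: keeps the text between tags, drops the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called by HTMLParser for every run of text outside a tag.
        self.chunks.append(data)


def strip_tags(markup):
    """Return the text content of an HTML/XML string."""
    parser = TagStripper()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


def strip_tags_from_file(path):
    """Read the file, then pass its contents (not its name) to the parser."""
    return strip_tags(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    # Demo with a throwaway file so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as folder:
        sample = Path(folder) / "sample.html"
        sample.write_text("<p>Hello <b>world</b></p>", encoding="utf-8")
        print(strip_tags_from_file(sample))
```

Note that this only applies to files that really are text markup. A .docx file is a zip archive, so it needs a different approach.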
(Oct-03-2018, 01:16 AM)JP_ROMANO Wrote: but can't get past PIP INSTALL DOCX.
That will install the old docx module. The command you want is pip install python-docx.
python-docx
# Check the pip version (18.0 is the newest)
C:\Users\Tom
λ pip -V
pip 18.0 from c:\python37\lib\site-packages\pip (python 3.7)

C:\Users\Tom
λ pip install python-docx
Collecting python-docx
  Downloading 
......
Building wheels for collected packages: python-docx
Successfully installed python-docx-0.8.7
λ ptpython
>>> from docx import Document

>>> document = Document('python_div.docx')
>>> for para in document.paragraphs:
...     print(para.text)
Thanks! Unfortunately, when I run pip -V I get
"pip 9.0.1 from c:\program files\anaconda3\lib\site-packages (python 3.5)"
When I run pip install python-docx I get
"Collecting python-docx
Could not find a version that satisfies the requirement python-docx (from versions: )
No matching distribution found for python-docx"

Scratch that last entry. Somebody from my engineering team just sent me a .ini to use which allowed me to proceed. The code you posted worked! Python can actually import the contents of a file!
(Oct-03-2018, 12:10 PM)JP_ROMANO Wrote: Scratch that last entry. Somebody from my engineering team just sent me a .ini to use which allowed me to proceed. The code you posted worked! Python can actually import the contents of a file!

I would suggest not using the word "import" when you talk about reading (loading) a file. In Python, import is a reserved keyword that refers to loading modules/APIs, and using it for anything else may confuse readers.
volcano63 - Thank you for that tip!
I had no idea, and "read" doesn't seem right, but if that's what it is, that's what I'll use going forward...
(Oct-03-2018, 12:50 PM)JP_ROMANO Wrote: I had no idea, and "read" doesn't seem right, but if that's what it is, that's what I'll use going forward...
All the advanced APIs that create an object from a file have to read the file's content "under the hood".

E.g., in pandas, there are functions called read_csv, read_excel (explicitly telling you what they do).
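To make the import/read distinction concrete, here is a small sketch that uses only the standard library (json rather than pandas, so nothing extra needs installing). The function name load_settings and the file name settings.json are made up for the example.

```python
import json      # "import" loads a module (code) into the program
import tempfile
from pathlib import Path


def load_settings(path):
    """Read a JSON file and return its contents as a Python object."""
    # open() plus a read is what "reading a file" means: I/O on data.
    with open(path, encoding="utf-8") as handle:
        # json.load reads the handle's content and then parses it.
        return json.load(handle)


if __name__ == "__main__":
    # Demo with a throwaway file so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as folder:
        settings_file = Path(folder) / "settings.json"
        settings_file.write_text('{"name": "demo"}', encoding="utf-8")
        print(load_settings(settings_file))
```

The import line brings in code; load_settings brings in data. Functions like read_csv follow the same pattern, just with more parsing after the read.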
Great, thank you!
I never know what functions to use and how to get them. Then when somebody helps me by directing me to them, they almost never work out of the box. So every little tidbit, like the one you just gave me, helps.

Thanks again and have a great day!
So now that I have the ability to load the two files, how do I go about removing the html/xml markup?

Thanks!
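One possible way to look at this: with python-docx, para.text already returns the paragraph's plain text, so the loop posted earlier in the thread prints the content without any XML tags. If you want to see where the markup actually lives, a .docx file is a zip archive whose main text sits in word/document.xml. The sketch below reads that part with the standard library only. It is an illustration under assumptions, not a replacement for python-docx: docx_paragraphs and make_sample_docx are hypothetical helpers, and the sample archive built here contains only the one XML part the reader needs, so Word itself would not open it.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx path or file-like object."""
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's visible text is spread over its <w:t> elements.
        pieces = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(pieces))
    return paragraphs


def make_sample_docx(paragraphs):
    """Build a minimal in-memory stand-in for a .docx (demo only)."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>" for text in paragraphs
    )
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    first = docx_paragraphs(make_sample_docx(["Same line", "Old wording"]))
    second = docx_paragraphs(make_sample_docx(["Same line", "New wording"]))
    print(first)
    print(second)
```

For real files you would pass the two paths to docx_paragraphs instead of the sample buffers. This sketch ignores tabs, line breaks, headers and footers, which python-docx handles more carefully.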