Python Forum
Comparing 2 Files - Step 1, import and remove tags
#1
I have a project I completed (mostly) in VBA, but don't think it's great for larger data sets. I'm thinking that maybe I should try to use Python for the 'engine' and just keep the VBA side for the UI and distribution. However, as seems to be the case with everything I try in python, I just can't make it work. Maybe if I can get past the first step, I'll be able to move forward on my own. So if any of you can assist, I'd certainly appreciate it.

For the first step, all I'm trying to do is import 2 Word documents and remove the HTML/XML tags. I've tried https://www.tutorialspoint.com/python/py...cument.htm but can't get past PIP INSTALL DOCX. I've tried Beautiful Soup 4 but get errors like " looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup." Then "'"%s" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.' % markup)" and on and on it goes. I've been through at least half a dozen "this is easy and should work" approaches.

All I want to do is open two files and remove the HTML/XML. Surely there has to be something out there that I can just plug the file paths and names into and see the results?

Thanks for any assistance.
#2
(Oct-03-2018, 01:16 AM)JP_ROMANO Wrote: but can't get past PIP INSTALL DOCX.
That will get the old module; the command you want is pip install python-docx.
python-docx
# Check the pip version first (18.0 is the newest)
C:\Users\Tom
λ pip -V
pip 18.0 from c:\python37\lib\site-packages\pip (python 3.7)

C:\Users\Tom
λ pip install python-docx
Collecting python-docx
  Downloading 
......
Building wheels for collected packages: python-docx
Successfully installed python-docx-0.8.7
λ ptpython
>>> from docx import Document

>>> document = Document('python_div.docx')
>>> for para in document.paragraphs:
...     print(para.text)
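
The session above can be wrapped in a small helper so the same steps run for both files. This is only a sketch: the function names paragraph_texts and load_docx_text, and the file names in the usage comment, are made up for illustration, and it assumes python-docx is installed as shown. Since para.text is already plain text, the paragraph body needs no separate tag stripping.

```python
def paragraph_texts(document):
    """Return the non-empty paragraph strings of a loaded document."""
    # para.text is plain text; python-docx does not include the XML markup.
    return [para.text for para in document.paragraphs if para.text.strip()]


def load_docx_text(path):
    """Open a .docx file with python-docx and return its paragraph text."""
    # Imported here so the helper above can be used without python-docx.
    from docx import Document  # provided by `pip install python-docx`
    return paragraph_texts(Document(path))


# Usage (hypothetical file names, replace with your own two documents):
#   first = load_docx_text("first.docx")
#   second = load_docx_text("second.docx")
```

Each call returns a list of strings, one per paragraph, which is a convenient shape for a later comparison step.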
#3
Thanks! Unfortunately, when I run pip -V I get
"pip 9.0.1 from c:\program files\anaconda3\lib\site-packages (python 3.5)"
When I run pip install python-docx I get
"Collecting python-docx
Could not find a version that satisfies the requirement python-docx (from versions: )
No matching distribution found for python-docx"

Scratch that last entry. Somebody from my engineering team just sent me a .ini to use which allowed me to proceed. The code you posted worked! Python can actually import the contents of a file!
#4
(Oct-03-2018, 12:10 PM)JP_ROMANO Wrote: Scratch that last entry. Somebody from my engineering team just sent me a .ini to use which allowed me to proceed. The code you posted worked! Python can actually import the contents of a file!

I would suggest not using the word "import" when you talk about reading (loading) a file. In Python, import is a reserved keyword that refers to loading modules, and using it for file access may confuse readers.
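
A small sketch of the difference, using only the standard library. The helper name read_text and the throwaway file sample.json are invented for the example.

```python
import json  # `import` loads a module (code), here the standard json module
import os
import tempfile


def read_text(path):
    """Read a file: open() gives a file object, .read() returns its contents."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Demonstration with a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample.json")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write('{"pages": 2}')
    contents = read_text(sample_path)  # data read from disk at run time
    parsed = json.loads(contents)      # a function from the imported module

print(contents)
print(parsed["pages"])
```

The import line makes the json module's functions available; the file's data only enters the program when read_text is called.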
Test everything in a Python shell (IPython, Azure Notebook, etc.)
  • Someone gave you advice you liked? Test it - maybe the advice was actually bad.
  • Someone gave you advice you think is bad? Test it before arguing - maybe it was good.
  • You posted a claim that something you did not test works? Be prepared to eat your hat.
#5
volcano63 - Thank you for that tip!
I had no idea, and "read" doesn't seem right, but if that's what it is, that's what I'll use going forward...
#6
(Oct-03-2018, 12:50 PM)JP_ROMANO Wrote: I had no idea, and "read" doesn't seem right, but if that's what it is, that's what I'll use going forward...
All the advanced APIs that create an object from a file have to read file content "under the hood".

E.g., in pandas, there are functions called read_csv, read_excel (explicitly telling you what they do).
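
To illustrate the "under the hood" point without pandas, here is a minimal sketch of a class that builds an object from a file. Table and from_csv are hypothetical names, not part of any library; the reading itself is done by the standard csv module.

```python
import csv
import os
import tempfile


class Table:
    """Hypothetical stand-in for a higher-level object built from a file."""

    def __init__(self, rows):
        self.rows = rows

    @classmethod
    def from_csv(cls, path):
        # Under the hood this is still an ordinary open-and-read.
        with open(path, newline="", encoding="utf-8") as handle:
            return cls(list(csv.DictReader(handle)))
```

pandas' read_csv plays the same role as from_csv here: it reads the file and hands back a ready-made object (a DataFrame) instead of raw text.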
#7
Great, thank you!
I never know what functions to use and how to get them. Then when somebody helps me by directing me to them, they almost never work out of the box. So every little tidbit, like the one you just gave me, helps.

Thanks again and have a great day!
#8
So now that I have the ability to load the two files, how do I go about removing the html/xml markup?

Thanks!
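
One possible way to approach the question above, offered as a sketch and not a definitive answer: a .docx file is a ZIP archive whose body is stored as XML in word/document.xml, so the markup can be dropped by collecting only the text nodes. The function name docx_plain_text is made up for this example. It uses just the standard library and ignores headers, footers, and footnotes, which live in other parts of the archive.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used inside word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_plain_text(path):
    """Return the paragraphs of a .docx file as a list of tag-free strings."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each <w:t> element holds a fragment of the visible text.
        pieces = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return paragraphs
```

The para.text values printed by the python-docx loop earlier in the thread are likewise plain strings without XML tags, so either route yields text that is ready to compare.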