Someone on a forum I visit daily asked for help extracting some table data from a docx file. Nothing big. There were a couple of pages of bash scripting, but I was looking for a Python solution. At first it was supposedly just a text file, and after a little more discussion we were told that the file is actually a docx document. So I tried the python-docx module and I was disappointed. It was easy to use, but it managed to get less than a third of the data. The rest was empty strings. I don't know why. Perhaps encoding? After a few attempts, I gave up. The document is in my own language, of course.
So I did some quick research, and it turned out that the docx format is just a zip file with a bunch of xml files in it.
I imported just the xml and zipfile modules and the job was done in 30 lines of code.
I asked myself why I even bothered with python-docx in the first place. I lost two hours looking for a solution.
Yes, it would have been easier if it had worked. But it didn't. Because of all that, I now know how to use the zipfile module and how to search for data in an xml file. I know what docx is and how to deal with it. I am satisfied, but not because of the python-docx module.
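To make the "docx is just a zip file" point concrete, here is a minimal sketch. It does not use the real document: it builds a tiny stand-in archive in memory (the `build_fake_docx` helper and its contents are made up for illustration) and reads it back using only zipfile and xml.etree.ElementTree. A real docx contains more files than this one.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the elements in word/document.xml
W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'


def build_fake_docx():
    """Hypothetical helper: a tiny in-memory zip that mimics a docx layout."""
    body = (
        '<w:document xmlns:w="%s"><w:body>'
        '<w:p><w:r><w:t>hello from document.xml</w:t></w:r></w:p>'
        '</w:body></w:document>' % W_NS
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as archive:
        archive.writestr('word/document.xml', body)
    buf.seek(0)
    return buf


def list_members(fileobj):
    """Return the names of the files stored inside the archive."""
    with zipfile.ZipFile(fileobj) as archive:
        return archive.namelist()


def read_text(fileobj):
    """Return the text of every w:t element in word/document.xml."""
    with zipfile.ZipFile(fileobj) as archive:
        with archive.open('word/document.xml') as doc:
            root = ET.parse(doc).getroot()
    return [t.text for t in root.iter('{%s}t' % W_NS)]


print(list_members(build_fake_docx()))
print(read_text(build_fake_docx()))
```

With a real file you would pass its path to `list_members` and `read_text` instead of the in-memory stand-in.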
Perhaps you could write a tutorial on how you used xml and zipfile to get the information.
It's strange that you were not able to extract all info. Just out of curiosity - it would be interesting if you could share the file. It's more likely to get repetitive info than not to get info in the table.
e.g.
from docx import Document

doc = Document('some_doc.docx')
# Walk every table, row by row, and print the text of each cell
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)
is it some complex table or what?
(Oct-09-2019, 01:24 PM)buran Wrote: It's strange that you were not able to extract all info. Just out of curiosity - it would be interesting if you could share the file. It's more likely to get repetitive info than not to get info in the table.
is it some complex table or what?
Not at all. There were two tables with two columns each. I used the same approach as you did here.
I'm going to ask if I can share the file, even though it was already shared with us in the other forum. Probably later this evening.
Quote:Perhaps you could write a tutorial on how you used xml and zipfile to get the information.
I could try. My English is not as good as I'd like, but I think I can do it.
(Oct-09-2019, 01:34 PM)wavic Wrote: My English is not as good as I'd like, but I think I can do it.
I'd be happy to look it over if you want a double check on the English.
Here it is.
Only the second column of the tables was needed: just the unique domain names, including the top-level domain. No http, no www, and nothing after the domain (/en/... etc.).
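As a rough sketch of that requirement (not the script used in this thread), the cleanup could be done with urllib.parse from the standard library. The function names `extract_domain` and `unique_domains` and the sample inputs are made up for illustration, and the sketch assumes the cells hold plain URLs without ports or user names.

```python
from urllib.parse import urlsplit


def extract_domain(cell_text):
    """Reduce a URL-like string to its bare domain (hypothetical helper)."""
    text = cell_text.strip()
    # urlsplit only fills in netloc when the '//' separator is present
    if '//' not in text:
        text = '//' + text
    # Host names are case-insensitive, so normalise to lower case
    host = urlsplit(text).netloc.lower()
    # Remove a leading 'www.' as a whole prefix
    if host.startswith('www.'):
        host = host[4:]
    return host


def unique_domains(cells):
    """Return the sorted, de-duplicated domains found in the given cells."""
    return sorted({extract_domain(c) for c in cells if c.strip()})


# Made-up sample data, not taken from the real document
sample = [
    'https://www.example.com/en/page',
    'www.example.org/',
    'example.net',
    'http://www.web.example/',
    'www.example.com',
]
for domain in unique_domains(sample):
    print(domain)
```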
This is what I got from the first table using python-docx (Windows 10):
Output:
www.afh.bg
www.legalcfd.com
www.ptbanc.com
www.cryptofg.com
www.payboutique.com
www.omegafx.io
It's clear now. Thanks!
I thought I was getting stupid.
Anyway, one could copy the whole document and use it as plain text; the bash proposals used this method. But the source is docx. The script I wrote does the job just fine.
I may write a tutorial about parsing docx using only the standard library, as @ichabod801 suggested.
Here is the script. Simple enough for such a simple task.
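#!/usr/bin/env python3
import sys
import xml.etree.ElementTree as et
import zipfile as zf

# WordprocessingML namespace used by the elements in word/document.xml
ns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

# A docx file is a zip archive; the document body is in word/document.xml
with zf.ZipFile(sys.argv[1]) as archive:
    with archive.open('word/document.xml') as doc:
        root = et.parse(doc).getroot()

# Collect the text of every w:t element inside every table
urls = []
for table in root.findall('.//w:tbl', ns):
    urls.extend(cell.text for cell in table.findall('.//w:t', ns) if cell.text)

domains = set()
for url in urls:
    url = url.strip()
    # Drop the scheme (http:// or https://) if there is one
    if url.startswith('http') and '//' in url:
        url = url.split('//', 1)[1]
    # Keep only the host part, before the first slash
    domain = url.split('/')[0]
    # Remove 'www.' as a prefix; lstrip('www.') would strip characters instead
    if domain.startswith('www.'):
        domain = domain[4:]
    domains.add(domain)

for domain in sorted(domains):
    print(domain)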
I had difficulty with xml's namespaces, but I finally got it.
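For anyone who hits the same namespace problem, here is a small sketch of how ElementTree treats namespaced tags. The XML string is a made-up stand-in for word/document.xml, not the real file. A search for the bare tag name finds nothing, while a prefix map or the full `{uri}tag` form both work.

```python
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
NS = {'w': W_NS}

# Made-up, heavily simplified stand-in for word/document.xml
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body><w:tbl><w:tr>'
    '<w:tc><w:p><w:r><w:t>1</w:t></w:r></w:p></w:tc>'
    '<w:tc><w:p><w:r><w:t>www.example.com</w:t></w:r></w:p></w:tc>'
    '</w:tr></w:tbl></w:body></w:document>' % W_NS
)

root = ET.fromstring(SAMPLE)

# The bare tag name does not match, because the real tag is '{uri}tbl'
no_namespace = root.findall('.//tbl')

# Option 1: pass a prefix-to-URI map as the second argument
with_prefix_map = root.findall('.//w:tbl', NS)

# Option 2: spell out the full name in '{uri}tag' form
with_full_name = root.findall('.//{%s}tbl' % W_NS)

# Text of every w:t element inside every table
cell_texts = []
for table in with_prefix_map:
    cell_texts.extend(t.text for t in table.findall('.//w:t', NS))

print(len(no_namespace), len(with_prefix_map), len(with_full_name))
print(cell_texts)
```

The prefix in the map does not have to be `w`; it only has to match the prefix used in the search path.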