Why modules?

wavic · Oct-09-2019, 11:57 AM

On a forum I visit daily, someone asked for help extracting some table data from a docx file. Nothing big. There were a couple of pages of bash scripting, but I was looking for a Python solution. At first it seemed to be just a text file; after a little more discussion we were told that the file was actually a docx document. So I tried the python-docx module and I was disappointed. It was easy to use, but it managed to get less than a third of the data. The rest was empty strings. I don't know why. Perhaps encoding...? After a few attempts, I gave up. The document is in my own language, of course.

So I did some quick research, and it turned out that the docx format is just a zip file with a bunch of XML files in it.
With just the xml and zipfile modules imported, the job was done in 30 lines of code.
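
To see the "zip full of XML" structure for yourself, something like the following should be enough. It is only a minimal sketch using the standard library; the helper name list_docx_parts is made up for this example.

```python
import zipfile


def list_docx_parts(path):
    """Return the names of the files stored inside a .docx archive."""
    # A .docx file can be opened like any other zip archive.
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()
```

Calling list_docx_parts('some_doc.docx') on a Word document should return a list of archive members, among them word/document.xml, which is the part the script later in this thread reads.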

I asked myself why I even bothered with python-docx in the first place. I lost two hours looking for a solution.
Yes, it would have been easier if it had worked. But it didn't. Because of all that, I now know how to use the zipfile module and how to search for data in an XML file. I know what docx is and how to deal with it. I am satisfied. Not because of the python-docx module.

***ichabod801*** · Oct-09-2019, 01:05 PM

Perhaps you could write a tutorial on how you used xml and zipfile to get the information.

**buran** · (This post was last modified: Oct-09-2019, 01:24 PM by buran.)

It's strange that you were not able to extract all the info. Just out of curiosity, it would be interesting if you could share the file. It's more likely to get repeated info than to get no info from the table.
e.g.

from docx import Document

doc = Document('some_doc.docx')
# Walk every table, row and cell and print the cell text.
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)

Is it some complex table or what?

wavic · Oct-09-2019, 01:34 PM

(Oct-09-2019, 01:24 PM)buran Wrote: It's strange that you were not able to extract all the info. Just out of curiosity, it would be interesting if you could share the file. It's more likely to get repeated info than to get no info from the table.
e.g.
from docx import Document

doc = Document('some_doc.docx')
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)

Is it some complex table or what?

Not at all. There were two tables with two columns each. I used the same approach as you did here.
I'm going to ask whether I can share the file, even though it was already shared with us on the other forum. Probably later this evening.

Quote:Perhaps you could write a tutorial on how you used xml and zipfile to get the information.

I could try. My English is not as good as I would like, but I think I can do it.

***ichabod801*** · Oct-09-2019, 03:30 PM

(Oct-09-2019, 01:34 PM)wavic Wrote: My English is not as good as I would like, but I think I can do it.

I'd be happy to look it over if you want a double check on the English.

wavic · (This post was last modified: Oct-09-2019, 09:42 PM by wavic.)

Here it is.

Only the second column of the tables was needed: just the unique domain names, including the top-level domain. No http, no www, and nothing after the domain (/en/... etc.).

This is what I got from the first table using python-docx (Windows 10):

Output:
www.afh.bg



www.legalcfd.com


www.ptbanc.com











www.cryptofg.com

www.payboutique.com





www.omegafx.io

**buran** · (This post was last modified: Oct-10-2019, 07:23 AM by buran.)

OK. It turns out this is a known issue caused by hyperlinks in the cell text:

https://github.com/python-openxml/python...issues/304
https://github.com/python-openxml/python...issues/406
https://github.com/python-openxml/python...-287474988

Note that for list numbering (the values in the first column) there is an open feature request: https://github.com/python-openxml/python...issues/471
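
A small illustration of why hyperlinked text can go missing, as far as I understand those issues: the runs that belong to a hyperlink sit one level deeper in the XML than ordinary runs, so a search that only looks at direct children of the paragraph never sees them. The fragment below is hand-written for this example (a real w:hyperlink element carries more attributes) and is not taken from the actual document.

```python
import xml.etree.ElementTree as et

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ns = {"w": W}

# Hand-written paragraph: one ordinary run and one run inside a hyperlink.
PARAGRAPH = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t>plain </w:t></w:r>
  <w:hyperlink>
    <w:r><w:t>www.example.com</w:t></w:r>
  </w:hyperlink>
</w:p>
""".strip()

p = et.fromstring(PARAGRAPH)

# Direct child runs only: the run nested in w:hyperlink is missed.
direct = "".join(t.text for t in p.findall("w:r/w:t", ns))

# All descendant w:t elements: the linked text is included as well.
nested = "".join(t.text for t in p.findall(".//w:t", ns))

print(repr(direct))
print(repr(nested))
```

Searching with './/w:t' is the same path the standard-library script further down uses, which would explain why it picks up the linked cells.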

I would have copied and pasted the column from the table as a multi-line string and processed it from there.
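
A rough sketch of that copy-and-paste approach, assuming the column has been pasted into a string with one URL per line. The function name extract_domains is invented for the example, and urllib.parse is just one way to pull out the host name.

```python
from urllib.parse import urlsplit


def extract_domains(text):
    """Return the sorted unique host names found in a multi-line string."""
    domains = set()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines left over from the paste
        # urlsplit only recognises the host when a '//' separator is present.
        if "//" not in line:
            line = "//" + line
        host = urlsplit(line).hostname or ""
        # Remove a leading 'www.' prefix, if there is one.
        if host.startswith("www."):
            host = host[4:]
        if host:
            domains.add(host)
    return sorted(domains)
```

For example, extract_domains("www.afh.bg\nhttp://www.legalcfd.com/en/") gives ['afh.bg', 'legalcfd.com']. The scheme and path in that second line are made up to show the stripping.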

wavic · Oct-10-2019, 11:46 AM

It's clear now. Thanks!
I thought I was getting stupid. :D

Anyway, one could copy the whole document and treat it as simple text; the bash proposals used this method. But the source is docx, and the script I wrote does the job just fine.
I may write a tutorial about parsing docx using only the standard library, as @ichabod801 suggested.

Here is the script. Simple enough for such a simple task.

#!/usr/bin/env python3

import sys
import xml.etree.ElementTree as et
import zipfile as zf

# A .docx file is a zip archive; the document body is in word/document.xml.
with zf.ZipFile(sys.argv[1]) as archive:
    with archive.open('word/document.xml') as doc:
        root = et.parse(doc).getroot()

ns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

# Collect the text of every w:t element found inside a table.
urls = []
for table in root.findall('.//w:tbl', ns):
    urls.extend(cell.text for cell in table.findall('.//w:t', ns) if cell.text)

domains = set()
for url in urls:
    url = url.strip()
    # Drop the scheme, if any, then everything after the first slash.
    if url.startswith('http'):
        url = url.split('//', 1)[-1]
    domain = url.split('/')[0]
    # Remove a leading 'www.' prefix (str.lstrip strips characters, not a prefix).
    if domain.startswith('www.'):
        domain = domain[4:]
    if domain:
        domains.add(domain)

for domain in sorted(domains):
    print(domain)

I had difficulty with XML namespaces, but I finally got it.
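
For anyone who hits the same namespace wall, here is a small sketch of the two things that tend to trip people up: ElementTree stores tags in the expanded "{uri}name" form, and the "w:" prefix only works in a search path when a prefix map is passed in. The table below is hand-written for illustration (the cell contents and the helper name column_text are invented), and it also shows one way to keep only the second column.

```python
import xml.etree.ElementTree as et

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ns = {"w": W}

# Hand-written two-column table; the second row holds a hyperlinked cell.
TABLE = f"""
<w:tbl xmlns:w="{W}">
  <w:tr>
    <w:tc><w:p><w:r><w:t>A</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>www.example.com</w:t></w:r></w:p></w:tc>
  </w:tr>
  <w:tr>
    <w:tc><w:p><w:r><w:t>B</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:hyperlink><w:r><w:t>www.example.org</w:t></w:r></w:hyperlink></w:p></w:tc>
  </w:tr>
</w:tbl>
""".strip()

tbl = et.fromstring(TABLE)

# The parsed tag is the expanded name "{uri}tbl", not the prefixed "w:tbl".
expanded_tag = tbl.tag


def column_text(table, index):
    """Return the joined text of the cell at position `index` in every row."""
    values = []
    for row in table.findall("w:tr", ns):
        cells = row.findall("w:tc", ns)
        if index < len(cells):
            # iter() needs the expanded tag name; it takes no prefix map.
            parts = (t.text or "" for t in cells[index].iter(f"{{{W}}}t"))
            values.append("".join(parts))
    return values


second_column = column_text(tbl, 1)
print(expanded_tag)
print(second_column)
```

Joining the w:t pieces per cell, rather than collecting them for the whole table, keeps a URL together even if Word has split it across several runs.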
