Why modules?
#1
Someone in a forum I visit daily asked for help extracting some table data from a docx file. Nothing big. There were a couple of pages of bash scripting proposals, but I was looking for a Python solution. At first we were told it was just a text file; after a little more discussion it turned out the file is actually a docx document. So I tried the python-docx module and was disappointed. It was easy to use, but it managed to get less than a third of the data. The rest was an empty string. I don't know why. Perhaps encoding...? After a few attempts I gave up. The document is in my own language, of course.

So I did some quick research, and it turned out that the docx format is just a zip file with a bunch of XML files in it.
Importing just the xml and zipfile modules, the job was done in 30 lines of code.
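
Just to illustrate (the file name here is only an example), listing a .docx with the standard zipfile module shows the XML parts inside:

import zipfile

# a .docx is an ordinary zip archive; list the XML parts it contains
with zipfile.ZipFile('some_doc.docx') as archive:
    for name in archive.namelist():
        print(name)  # e.g. word/document.xml, word/styles.xml, ...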

I asked myself why I even bothered with python-docx in the first place. I lost two hours looking for a solution.
Yes, it would have been easier if it had worked. But it didn't. Because of all that, I now know how to use the zipfile module and how to search for data in an XML file. I know what docx is and how to deal with it. I am satisfied, and not because of the python-docx module.
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#2
Perhaps you could write a tutorial on how you used xml and zipfile to get the information.
Craig "Ichabod" O'Brien - xenomind.com
I wish you happiness.
Recommended Tutorials: BBCode, functions, classes, text adventures
#3
It's strange that you were not able to extract all the info. Just out of curiosity, it would be interesting if you could share the file. With tables it's more likely to get repetitive info than to get no info at all.
e.g.
from docx import Document

doc = Document('some_doc.docx')
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)

Is it some complex table, or what?
"If you can't explain it to a six year old, you don't understand it yourself." - Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

#4
(Oct-09-2019, 01:24 PM)buran Wrote: It's strange that you were not able to extract all the info. Just out of curiosity, it would be interesting if you could share the file. With tables it's more likely to get repetitive info than to get no info at all.
e.g.
from docx import Document

doc = Document('some_doc.docx')
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)

Is it some complex table, or what?

Not at all. There were two tables with two columns each. I used the same approach as you did here.
I'm going to ask whether I can share the file, even though it was already shared with us in the other forum. Probably later this evening.

Quote:Perhaps you could write a tutorial on how you used xml and zipfile to get the information.
I could try. My English is not as good as I'd like, but I think I can do it.
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#5
(Oct-09-2019, 01:34 PM)wavic Wrote: My English is not as good as I'd like, but I think I can do it.

I'd be happy to look it over if you want a double check on the English.
Craig "Ichabod" O'Brien - xenomind.com
I wish you happiness.
Recommended Tutorials: BBCode, functions, classes, text adventures
#6
Here it is.

Only the second column of the tables was needed: just the unique domain names along with their top-level domains. No http://, no www., and nothing after the domain itself (/en/... and so on).

This is what I got from the first table using python-docx (Windows 10):
Output:
www.afh.bg www.legalcfd.com www.ptbanc.com www.cryptofg.com www.payboutique.com www.omegafx.io

Attached Files

.docx   Решение 1045.docx (Size: 27.61 KB / Downloads: 154)
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#7
OK. It turns out this is a known issue caused by hyperlinks in the cell text:

https://github.com/python-openxml/python...issues/304
https://github.com/python-openxml/python...issues/406
https://github.com/python-openxml/python...-287474988
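
The issue threads discuss workarounds along these lines. This is an untested sketch that reaches into python-docx internals (_tc is a private attribute), so treat it as an assumption rather than official API:

from docx import Document
from docx.oxml.ns import qn

doc = Document('some_doc.docx')
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            # collect every w:t text run, including runs nested inside w:hyperlink
            text = ''.join(t.text or '' for t in cell._tc.iter(qn('w:t')))
            print(text)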

Note that for list numbering (the values in the first column) there is an open feature request: https://github.com/python-openxml/python...issues/471

I would have copy-pasted the column from the table as a multi-line string and processed it from there.
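
Something like this, as a minimal sketch (the pasted lines below are made up, not taken from the actual file):

# hypothetical copy-paste of the second column as a multi-line string
column = """
http://www.afh.bg/en/page
www.legalcfd.com
cryptofg.com
"""

for line in column.strip().splitlines():
    print(line.strip())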
"If you can't explain it to a six year old, you don't understand it yourself." - Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

#8
It's clear now. Thanks!
I thought I was getting stupid. Big Grin

Anyway, one could copy the whole document and work with it as plain text; the bash proposals used this method. But the source is docx, and the script I wrote is doing just great.
I may write a tutorial about parsing docx using only the standard library, as @ichabod801 suggested.

Here is the script. Simple enough for such a simple task.
#!/usr/bin/env python3

import sys
import xml.etree.ElementTree as et
import zipfile as zf

# a docx file is just a zip archive; the document body lives in word/document.xml
archive = zf.ZipFile(sys.argv[1])
doc = archive.open('word/document.xml')

tree = et.parse(doc)
root = tree.getroot()

# the WordprocessingML namespace used by all w: elements
ns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

# collect the text of every w:t (text run) inside every w:tbl (table),
# skipping empty runs
urls = []
for table in root.findall('.//w:tbl', ns):
    urls.extend(t.text for t in table.findall('.//w:t', ns) if t.text)


def strip_www(host):
    # str.lstrip('www.') would strip *characters*, not the prefix,
    # and eat the leading letters of e.g. 'wikipedia.org'
    return host[4:] if host.startswith('www.') else host


domains = []
for url in urls:
    url = url.strip()
    if url.startswith('http'):
        domains.append(strip_www(url.split('//')[1].split('/')[0]))
    elif url.startswith('www'):
        domains.append(strip_www(url.split('/')[0]))
    else:
        domains.append(url.split('/')[0])

for domain in sorted(set(domains)):
    print(domain)
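For reference, it can be run like this (the script name is just an example):

python3 extract_domains.py 'Решение 1045.docx'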
I had difficulty with the XML namespaces, but I finally got it.
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org