Thread: Why modules? (Python Forum, News and Discussions, /thread-21670.html)

Why modules? - wavic - Oct-09-2019

Someone on a forum I visit daily asked for help extracting some table data from a docx file. Nothing big. There were a couple of pages of bash scripting, but I was looking for a Python solution. At first we were told it was just a text file, and after a little more discussion it turned out that the file was actually a docx document. So I tried the python-docx module and I was disappointed. It was easy to use, but it managed to get less than a third of the data. The rest was empty strings. I don't know why. Perhaps encoding...? After a few attempts, I gave up. The document is in my own language, of course.

So I did some quick research, and it turned out that the docx format is just a zip file with a bunch of xml files in it. I imported just the xml and zipfile modules, and the job was done in 30 lines of code.

I asked myself why I even bothered with python-docx in the first place. I lost two hours looking for a solution. Yes, it would have been easier if it had worked. But it didn't. Because of all that, I now know how to use the zipfile module and how to search for data in an xml file. I know what docx is and how to deal with it. I am satisfied. Not because of the python-docx module.

RE: Why modules? - ichabod801 - Oct-09-2019

Perhaps you could write a tutorial on how you used xml and zipfile to get the information.

RE: Why modules? - buran - Oct-09-2019

It's strange that you were not able to extract all info. Just out of curiosity - it would be interesting if you can share the file.
It's more likely to get repetitive info than not to get info in the table, e.g.

    from docx import Document

    doc = Document('some_doc.docx')
    # Print the text of every cell in every table of the document.
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                print(cell.text)

Is it some complex table or what?

RE: Why modules? - wavic - Oct-09-2019

(Oct-09-2019, 01:24 PM)buran Wrote: It's strange that you were not able to extract all info. Just out of curiosity - it would be interesting if you can share the file.
It's more likely to get repetitive info than not to get info in the table.

Not at all. There were two tables with two columns each. I used the same approach as you did here. I'm going to ask if I can share the file, even though it was already shared with us in the other forum. Probably later this evening.

Quote: Perhaps you could write a tutorial on how you used xml and zipfile to get the information.

I could try. My English is not as good as I want, but I think I can do it.

RE: Why modules? - ichabod801 - Oct-09-2019

(Oct-09-2019, 01:34 PM)wavic Wrote: My English is not as good as I want, but I think I can do it.

I'd be happy to look it over if you want a double check on the English.

RE: Why modules? - wavic - Oct-09-2019

Here it is. Only the second column of the tables was needed - only the domain names, including the top-level domain. Unique values only. No http, no www and nothing after the domain - /en/... etc.

This is what I've got from the first table using python-docx (Windows 10):

RE: Why modules? - buran - Oct-10-2019

OK. It turns out this is a known issue due to hyperlinks in the cell text:
https://github.com/python-openxml/python-docx/issues/304
https://github.com/python-openxml/python-docx/issues/406
https://github.com/python-openxml/python-docx/pull/377#issuecomment-287474988

Note that for list numbering (the values in the first column) there is an open feature request:
https://github.com/python-openxml/python-docx/issues/471

I would have copy-pasted the column from the table as a multi-line string and processed it from there.

RE: Why modules? - wavic - Oct-10-2019

It's clear now. Thanks! I thought I was getting stupid.

Anyway, one could copy the whole document and use it as simple text - the bash proposals used this method. But the source is docx. The script I wrote is doing just great. I may write a tutorial about parsing docx using only the standard library, as @ichabod801 suggested.

Here is the script. Simple enough for such a simple task.

    #!/usr/bin/env python3
    import sys
    import xml.etree.ElementTree as et
    import zipfile as zf

    # A docx file is a zip archive; the document body is in word/document.xml.
    with zf.ZipFile(sys.argv[1]) as archive:
        with archive.open('word/document.xml') as doc:
            root = et.parse(doc).getroot()

    ns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

    # Collect the text of every w:t element inside every table.
    urls = []
    for table in root.findall('.//w:tbl', ns):
        urls.extend(cell.text for cell in table.findall('.//w:t', ns) if cell.text)

    domains = []
    for url in urls:
        url = url.strip()
        if url.startswith('http'):
            domain = url.split('//')[1].split('/')[0]
        else:
            domain = url.split('/')[0]
        # Slice off the prefix: lstrip('www.') would strip any leading
        # 'w' and '.' characters, not just the literal 'www.'.
        if domain.startswith('www.'):
            domain = domain[4:]
        domains.append(domain)

    for domain in sorted(set(domains)):
        print(domain)

I had difficulty with xml's namespaces, but finally I've got it.
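
For anyone following the thread, here is a minimal sketch of the same standard-library idea, reading the text cell by cell so that text inside hyperlinks is kept. It is an illustration under assumptions, not the script from the thread: the names table_rows, domain_of and make_sample_docx are hypothetical, the sample archive is built in memory and holds only word/document.xml (so it is not a complete Word file), and nested tables or multi-paragraph cells are not handled specially.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
NS = {'w': W}


def table_rows(docx_file):
    """Yield each table row as a list of cell strings.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        with archive.open('word/document.xml') as part:
            root = ET.parse(part).getroot()
    for row in root.iterfind('.//w:tr', NS):
        # Join every w:t below the cell, so runs inside w:hyperlink are kept.
        yield [
            ''.join(t.text or '' for t in cell.iterfind('.//w:t', NS))
            for cell in row.iterfind('w:tc', NS)
        ]


def domain_of(url):
    """Return the host part of a URL-like string without a leading 'www.'."""
    url = url.strip()
    if '//' in url:
        url = url.split('//', 1)[1]
    host = url.split('/', 1)[0]
    # Slice the prefix off; str.lstrip('www.') strips a set of characters.
    return host[4:] if host.startswith('www.') else host


def make_sample_docx():
    """Build a tiny, hypothetical docx-like archive in memory for the demo."""
    body = (
        f'<w:document xmlns:w="{W}"><w:body><w:tbl>'
        '<w:tr><w:tc><w:p><w:r><w:t>1</w:t></w:r></w:p></w:tc>'
        '<w:tc><w:p><w:hyperlink><w:r><w:t>http://www.</w:t></w:r>'
        '<w:r><w:t>west.example/en/</w:t></w:r></w:hyperlink></w:p></w:tc></w:tr>'
        '</w:tbl></w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as archive:
        archive.writestr('word/document.xml', body)
    buffer.seek(0)
    return buffer


rows = list(table_rows(make_sample_docx()))
# Only the second column is of interest, as in the thread.
domains = sorted({domain_of(row[1]) for row in rows if len(row) > 1})
print(domains)
```

The prefix is sliced off rather than passed to lstrip because lstrip treats its argument as a set of characters: 'www.west.example'.lstrip('www.') gives 'est.example', since the 'w' of 'west' is stripped too.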