Oct-10-2019, 11:46 AM
It's clear now. Thanks!
I thought I was getting stupid.
Anyway, one could copy the whole document and use it as plain text; the Bash proposals used that method. But the source is a .docx file, and the script I wrote is doing just great.
I may write a tutorial about parsing .docx files using only the standard library, as @ichabod801 suggested.
Here is the script. Simple enough for such a simple task.
#!/usr/bin/env python3
import sys
import xml.etree.ElementTree as et
import zipfile as zf

# A .docx file is just a zip archive; the text lives in word/document.xml
archive = zf.ZipFile(sys.argv[1])
doc = archive.open('word/document.xml')
tree = et.parse(doc)
root = tree.getroot()

# WordprocessingML namespace, mapped to the 'w' prefix for findall()
ns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

urls = []
for table in root.findall('.//w:tbl', ns):
    urls.extend(cell.text for cell in table.findall('.//w:t', ns))

domains = []
for url in urls:
    if url.startswith('http'):
        domain = url.strip().split('//')[1].split('/')[0]
    else:
        domain = url.split('/')[0]
    # Note: lstrip('www.') would be wrong here, since it strips any
    # leading 'w' or '.' characters, not the literal prefix
    if domain.startswith('www.'):
        domain = domain[len('www.'):]
    domains.append(domain)

for domain in sorted(set(domains)):
    print(domain)

I had difficulty with XML namespaces, but I finally got it.
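For anyone else tripping over the namespace part: ElementTree won't match namespaced elements by their bare tag names. You pass a dict mapping a prefix of your choosing to the namespace URI, then use that prefix in the path. A minimal sketch with a made-up namespace URI (not the real WordprocessingML one):

```python
import xml.etree.ElementTree as ET

# Toy XML with a default namespace, standing in for word/document.xml
xml_data = """<doc xmlns="http://example.com/ns">
  <item>first</item>
  <item>second</item>
</doc>"""

root = ET.fromstring(xml_data)

# A bare tag name finds nothing, because the elements are namespaced
no_match = root.findall('item')  # -> []

# Map any prefix you like to the namespace URI and use it in the path
ns = {'x': 'http://example.com/ns'}
texts = [el.text for el in root.findall('x:item', ns)]  # -> ['first', 'second']
print(texts)
```

The prefix in the dict doesn't have to match the one used in the file; only the URI matters.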