Python Forum
Why modules?
#8
It's clear now. Thanks!
I thought I was getting stupid. :D

Anyway, one could copy the whole document and treat it as plain text; the bash proposals used this method. But the source is a docx file, and the script I wrote handles it just fine.
I may write a tutorial about parsing docx using only the standard library, as @ichabod801 suggested.

Here is the script. Simple enough for such a simple task.
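#!/usr/bin/env python3
"""Print the unique domains of the URLs found in the tables of a .docx file."""

import sys
import xml.etree.ElementTree as et
import zipfile as zf

# WordprocessingML namespace used in word/document.xml
ns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

# A .docx file is a zip archive; the body text lives in word/document.xml
with zf.ZipFile(sys.argv[1]) as archive:
    with archive.open('word/document.xml') as doc:
        root = et.parse(doc).getroot()

# Collect the text of every text run (w:t) inside every table (w:tbl)
urls = []
for table in root.findall('.//w:tbl', ns):
    urls.extend(cell.text.strip() for cell in table.findall('.//w:t', ns) if cell.text)

domains = set()
for url in urls:
    if url.startswith('http'):
        url = url.split('//', 1)[-1]  # drop the scheme
    domain = url.split('/')[0]
    # Remove a leading 'www.' prefix; str.lstrip('www.') would strip any
    # leading 'w' and '.' characters and mangle names like 'wikipedia.org'
    if domain.startswith('www.'):
        domain = domain[len('www.'):]
    if domain:
        domains.add(domain)

for domain in sorted(domains):
    print(domain)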
I had difficulty with XML namespaces, but I finally got it.
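For anyone who hits the same wall, here is a minimal sketch of how ElementTree treats namespaces. The SAMPLE string is a made-up stand-in for word/document.xml, not a real document, and the variable names are only for illustration. The point is that the parser stores each tag as '{uri}local', so a search has to use either a prefix map or the full qualified name.

```python
import xml.etree.ElementTree as et

W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

# Hypothetical, heavily trimmed stand-in for word/document.xml
SAMPLE = (
    '<w:document xmlns:w="%s">'
    '<w:body><w:tbl><w:tr><w:tc><w:p><w:r>'
    '<w:t>https://www.python.org/doc</w:t>'
    '</w:r></w:p></w:tc></w:tr></w:tbl></w:body>'
    '</w:document>' % W
)

root = et.fromstring(SAMPLE)

# The prefix is expanded while parsing: the tag is '{uri}document', not 'w:document'
print(root.tag)

# An unqualified search finds nothing, because no element is named plain 't'
plain = root.findall('.//t')

# Option 1: pass a prefix-to-URI map as the second argument
ns = {'w': W}
with_map = root.findall('.//w:t', ns)

# Option 2: spell out the qualified name in '{uri}local' form
with_uri = root.findall('.//{%s}t' % W)

texts = [el.text for el in with_map]
print(len(plain), texts, with_map == with_uri)
```

The prefix in the map does not have to match the one used in the file; only the URI matters, so {'x': W} with './/x:t' would find the same elements.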
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org


Messages In This Thread
Why modules? - by wavic - Oct-09-2019, 11:57 AM
RE: Why modules? - by ichabod801 - Oct-09-2019, 01:05 PM
RE: Why modules? - by buran - Oct-09-2019, 01:24 PM
RE: Why modules? - by wavic - Oct-09-2019, 01:34 PM
RE: Why modules? - by ichabod801 - Oct-09-2019, 03:30 PM
RE: Why modules? - by wavic - Oct-09-2019, 09:42 PM
RE: Why modules? - by buran - Oct-10-2019, 07:23 AM
RE: Why modules? - by wavic - Oct-10-2019, 11:46 AM
