Python Forum
Parsing Oasis Open Document format. - Printable Version



Parsing Oasis Open Document format. - Achilles - Apr-17-2020

I want to write a parser for the OASIS OpenDocument v1.2 specification that extracts only the tags and their explanations. I can parse the tags correctly, but I can't get the links that belong to each tag. Maybe you can suggest another way to approach this project. I would be grateful.

Here is a link to my code; I know it has missing parts:

https://i.stack.imgur.com/qJjxo.png

And here is the link to the document:
http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html#__RefHeading__1415340_253892949


RE: Parsing Oasis Open Document format. - buran - Apr-17-2020

Please don't post images of code. Copy and paste it inside python tags.
Please use proper tags when posting code, tracebacks, output, etc.
See the BBCode help for more info.


RE: Parsing Oasis Open Document format. - Achilles - Apr-17-2020

from bs4 import BeautifulSoup
import requests

def main():
    # Fetch the document text with requests
    req = requests.get('http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html#__RefHeading__1415340_253892949')
    soup = BeautifulSoup(req.content, "lxml")
    # Example of a target link: '<a href="#__RefHeading__1419338_253892949">19.905 xhtml:about</a>'

    containers = soup.find_all(['tr', 'td'])

    filename = "basliklar.txt"  # "basliklar" = "titles"
    headers = "baslik, link\n"  # "baslik" = "title"

    # Extract each title and the data that corresponds to it.
    # There is no data corresponding to the tag!! tag = container.next_sibling.text
    with open(filename, "w", encoding="utf-8") as f:
        f.write(headers)
        for container in containers:
            if container.next_sibling is None:
                baslik = container.text
                f.write(baslik + "\n")
            else:
                # Note: this collects every link on the page, not just this container's
                links = [link.get('href') for link in soup.find_all('a')]
                print(links)

if __name__ == "__main__":
    main()
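
One possible way to pair each tag with its link is to read the title and the href from the same `<a>` element instead of looking at siblings. This is only a minimal sketch. It assumes the entries of interest are anchors whose href starts with `#__RefHeading__`, as in the sample anchor quoted in the code comment above. The function name `extract_heading_links` and the inline sample HTML are hypothetical and only for illustration. The built-in "html.parser" backend is used here so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup


def extract_heading_links(html, prefix="#__RefHeading__"):
    """Return (title, href) pairs for anchors whose href starts with prefix."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    # href=True skips <a> elements that have no href attribute at all
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if href.startswith(prefix):
            # Collapse runs of whitespace inside the link text
            title = " ".join(a.get_text().split())
            pairs.append((title, href))
    return pairs


# Hypothetical sample input, modeled on the anchor quoted in the post
sample = """
<table>
<tr><td><a href="#__RefHeading__1419338_253892949">19.905 xhtml:about</a></td></tr>
<tr><td><a href="http://example.org/">unrelated link</a></td></tr>
</table>
"""

pairs = extract_heading_links(sample)
for title, href in pairs:
    print(f"{title}, {href}")
```

To try this on the real document, `req.content` from the code above could be passed in place of `sample`, and each pair written to the output file instead of printed. Whether every tag in the specification follows this anchor pattern would need to be checked against the page itself.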