Thread: Importing created module for web scraping with bs4 (/thread-12656.html)
Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development

Importing created module for web scraping with bs4 - Truman - Sep-05-2018

Module that I made:

    <!DOCTYPE html>
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>

Web scraping script:

    import ht_doc
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(ht_doc.html, 'html.parser')
    print(soup.prettify())

Error:

I decided to practise with the Beautiful Soup module and hit a snag right at the beginning. Why can't I import the ht_doc module?
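
A note on the import attempt above, assuming ht_doc was saved as a file of raw HTML as shown: the import statement only loads Python modules, so for `ht_doc.html` to resolve, the markup would have to live in a Python file as a string variable named `html`. The following is a minimal sketch of that idea, not the poster's actual setup. The file name ht_doc.py, the temporary directory, the shortened markup, and the helper `load_ht_doc` are all illustrative assumptions.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical content of ht_doc.py: valid Python that stores the markup
# in a string variable called `html` (shortened here for brevity).
MODULE_SOURCE = '''html = """<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
</body></html>
"""
'''


def load_ht_doc():
    """Write a throwaway ht_doc.py and import it like any other module."""
    module_dir = tempfile.mkdtemp()
    Path(module_dir, "ht_doc.py").write_text(MODULE_SOURCE, encoding="utf-8")
    sys.path.insert(0, module_dir)   # make the directory importable
    importlib.invalidate_caches()    # ensure the new file is noticed
    return importlib.import_module("ht_doc")


ht_doc = load_ht_doc()
print(ht_doc.html)
# With bs4 installed, the string can then be parsed as in the script above:
# soup = BeautifulSoup(ht_doc.html, 'html.parser')
```

In everyday use you would simply create ht_doc.py next to the scraping script by hand; the temporary directory is only there to keep the sketch self-contained.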

RE: Importing created module for web scraping with bs4 - metulburr - Sep-06-2018

You don't import the HTML doc. If you are writing the HTML in the same file as the code, you can just save the content in a variable and pass it in like this:

    html = '''
    <!DOCTYPE html>
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    '''
    ...
    soup = BeautifulSoup(html, 'html.parser')
    ...

However, if you are obtaining the content from the internet, then you should use the requests module to get the HTML:

    ...
    r = requests.get('https://www.google.com')
    soup = BeautifulSoup(r.text, 'html.parser')

If you have the HTML in a file, then you have to load that file like any other standard file and pass BeautifulSoup the whole content via fileObject.read().

More info here in our tutorials: https://python-forum.io/Thread-Web-Scraping-part-1

RE: Importing created module for web scraping with bs4 - Truman - Sep-06-2018

Thank you, I had no idea that I can't use an HTML file as a module. I will create a variable.

RE: Importing created module for web scraping with bs4 - Truman - Sep-07-2018

Here is another way to import an HTML file:

    with open("htmll.html") as fp:
        soup = BeautifulSoup(fp)
    print(soup.prettify())

RE: Importing created module for web scraping with bs4 - snippsat - Sep-07-2018

(Sep-07-2018, 09:03 PM)Truman Wrote: here is an another way to import html file:

You have to be careful with encoding when doing this, both when getting the HTML from the web (and saving it) and when opening it. It's easy to mess up Unicode, so try to always use UTF-8, in and out.

    with open("htmll.html", encoding="utf-8") as fp:
        soup = BeautifulSoup(fp, 'lxml')
    print(soup.prettify())

Requests and BS work well together on this; I explain it in more detail in Reading a html file.

RE: Importing created module for web scraping with bs4 - buran - Sep-08-2018

(Sep-07-2018, 09:03 PM)Truman Wrote: here is an another way to import html file

Try to be technically precise: you don't import the HTML file, you simply open it and pass the file object to BeautifulSoup to process it.
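
The "UTF-8 in and out" advice can be sketched as a save-then-load round trip. This is only an illustration under stated assumptions: the sample markup, the file name page.html, and the helpers `save_html` and `load_soup` are made up for the example, and the built-in 'html.parser' is used instead of 'lxml' so that nothing beyond bs4 needs to be installed.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Illustrative markup with a non-ASCII character, to make encoding matter.
HTML = (
    "<html><head><title>Café story</title></head>"
    "<body><p class='story'>Elsie, Lacie and Tillie</p></body></html>"
)


def save_html(path, markup):
    """Write the markup to disk, stating the encoding explicitly."""
    path.write_text(markup, encoding="utf-8")


def load_soup(path):
    """Open the file with the same encoding and hand the file object to bs4."""
    with open(path, encoding="utf-8") as fp:
        return BeautifulSoup(fp, "html.parser")


path = Path(tempfile.mkdtemp()) / "page.html"
save_html(path, HTML)
soup = load_soup(path)
print(soup.title.string)
```

Stating the encoding on both the write and the read avoids relying on the platform default, which is where mismatches usually creep in.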