Python Forum

Full Version: Importing created module for web scraping with bs4
module that I made:
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
 
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
 
<p class="story">...</p>
web scraping script:
import ht_doc
from bs4 import BeautifulSoup
soup = BeautifulSoup(ht_doc.html, 'html.parser')
print(soup.prettify())
Error:
C:\Python36\kodovi>busoup.py
Traceback (most recent call last):
  File "C:\Python36\kodovi\busoup.py", line 1, in <module>
    import ht_doc
ModuleNotFoundError: No module named 'ht_doc'
I decided to practise with the Beautiful Soup module and hit a snag right at the beginning. Why can't I import the ht_doc module?
You don't import the HTML document. If you are writing the HTML in the same file as the code, you can just store the content in a variable and pass that variable to BeautifulSoup, like this:

html = '''
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
  
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
  
<p class="story">...</p>
'''

...
soup = BeautifulSoup(html, 'html.parser')
...
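For completeness: the import in the question fails because Python only finds modules that are real Python files (or packages), not HTML documents. A minimal sketch of how the import could work, assuming a hypothetical ht_doc.py that stores the markup in a variable named html (the name the script in the question expects). The temp directory and the shortened markup are only there to keep the sketch self-contained.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path

# Before the file exists, Python cannot locate the module. This is the
# situation behind the ModuleNotFoundError in the question.
# (Assumes no other module named ht_doc is installed.)
found_before = importlib.util.find_spec("ht_doc") is not None

# Hypothetical ht_doc.py: a Python file that assigns the markup to a variable.
module_source = '''html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p></body></html>"""
'''

module_dir = tempfile.mkdtemp()
Path(module_dir, "ht_doc.py").write_text(module_source, encoding="utf-8")

# Make the directory importable and let the import system notice the new file.
sys.path.insert(0, module_dir)
importlib.invalidate_caches()

import ht_doc

print(found_before)
print(type(ht_doc.html).__name__)
```

With a module like this in place, ht_doc.html is an ordinary string and can be passed to BeautifulSoup exactly as in the original script.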
However, if you are obtaining the content from the internet, then you should use the requests module to get the HTML, like this:

...
r = requests.get('https://www.google.com')
soup = BeautifulSoup(r.text, 'html.parser')
If you have the HTML in a file, then you have to open that file like any other file and pass BeautifulSoup its whole content via fileObject.read().

More info here in our tutorials:
https://python-forum.io/Thread-Web-Scraping-part-1
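The file case mentioned above comes down to getting the same string you would otherwise keep in a variable. A rough sketch, assuming a hypothetical file named ht_doc.html (written to a temp directory here so the snippet runs on its own) and a shortened version of the markup:

```python
import tempfile
from pathlib import Path

html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p></body></html>
"""

# Hypothetical file name; the temp directory keeps the sketch self-contained.
path = Path(tempfile.mkdtemp()) / "ht_doc.html"
path.write_text(html, encoding="utf-8")

# Open it like any other text file and read the whole content into a string.
with open(path, encoding="utf-8") as file_object:
    text = file_object.read()

# The text read from disk matches the inline variable.
print(text == html)
```

Either string, the inline variable or the text read from disk, is what goes in as the first argument to BeautifulSoup.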
Thank you, I had no idea that I can't use an HTML file as a module. I will create a variable.
Here is another way to import an HTML file:

with open("htmll.html") as fp:
    # name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(fp, 'html.parser')
    print(soup.prettify())
(Sep-07-2018, 09:03 PM) Truman wrote: Here is another way to import an HTML file:
You have to be careful with encoding when doing this,
both when getting (and saving) the HTML from the web and when opening it.
It's easy to mess up Unicode, so try to always use UTF-8, in and out.
with open("htmll.html", encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, 'lxml')
    print(soup.prettify())
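To see why the encoding argument matters, here is a small sketch using only the standard library. The file name page.html and its content are made up for illustration, and cp1252 stands in for a platform default that is not UTF-8:

```python
import tempfile
from pathlib import Path

# Hypothetical page content with one non-ASCII character.
html = "<p>caf\u00e9</p>"

path = Path(tempfile.mkdtemp()) / "page.html"
path.write_text(html, encoding="utf-8")  # save as UTF-8

# Reading with the same encoding gives back the original text.
with open(path, encoding="utf-8") as fp:
    good = fp.read()

# Reading with a different encoding silently produces mojibake.
with open(path, encoding="cp1252") as fp:
    bad = fp.read()

print(good == html)
print(bad == html)
```

The second read does not raise an error; it just returns different characters, which is why it is safer to state encoding="utf-8" explicitly on both the write and the read.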
Requests and BS handle this well together; I explain it in more detail in the thread "Reading a html file".
(Sep-07-2018, 09:03 PM) Truman wrote: Here is another way to import an HTML file
Try to be technically precise: you don't import the HTML file, you simply open it and pass the file object to BeautifulSoup to process it.