Python Forum
Importing created module for web scraping with bs4
#1
module that I made:
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
 
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
 
<p class="story">...</p>
web scraping script:
import ht_doc
from bs4 import BeautifulSoup
soup = BeautifulSoup(ht_doc.html, 'html.parser')
print(soup.prettify())
error:
C:\Python36\kodovi>busoup.py
Traceback (most recent call last):
  File "C:\Python36\kodovi\busoup.py", line 1, in <module>
    import ht_doc
ModuleNotFoundError: No module named 'ht_doc'
I decided to practise with the Beautiful Soup module and hit a snag right at the beginning. Why can't I import the ht_doc module?
#2
You don't import the HTML doc. If you are writing the HTML in the same file as the code, you can just save the content in a variable and pass it in, like this:

html = '''
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
  
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
  
<p class="story">...</p>
'''

...
soup = BeautifulSoup(html, 'html.parser')
...
However, if you are obtaining the content from the internet, you should use the requests module to get the HTML and pass the response text to BeautifulSoup:

...
r = requests.get('https://www.google.com')
soup = BeautifulSoup(r.text, 'html.parser')
If you have the HTML in a file, load that file like any other standard file and pass BeautifulSoup the whole content via fileObject.read().

More info in our tutorials:
https://python-forum.io/Thread-Web-Scraping-part-1
#3
Thank you, I had no idea that I can't use an HTML file as a module. I will create a variable.
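For anyone who still wants the markup in its own file and a working `import ht_doc`: the import statement looks for a Python module such as ht_doc.py on the import path, so a file holding raw HTML is never picked up. A minimal sketch of the variable approach kept in a separate module follows. The temporary directory stands in for the script's folder and the markup is shortened; both are assumptions for illustration, not the original setup.

```python
import sys
import tempfile
from pathlib import Path

# Stand-in for the folder that holds the scraping script (hypothetical).
module_dir = Path(tempfile.mkdtemp())

# ht_doc.py is ordinary Python: the markup lives in a string variable named html.
module_source = (
    'html = """<html><head><title>The Dormouse\'s story</title></head>\n'
    '<body><p class="title"><b>The Dormouse\'s story</b></p></body></html>"""\n'
)
(module_dir / "ht_doc.py").write_text(module_source, encoding="utf-8")

# Make the folder importable, as the script's own folder normally is.
sys.path.insert(0, str(module_dir))

import ht_doc  # resolves now, because ht_doc.py exists on the import path

print(ht_doc.html)
# ht_doc.html can then be passed to BeautifulSoup(ht_doc.html, 'html.parser').
```

In a real project you would simply save ht_doc.py next to the script by hand; the file-writing step is only there so the sketch runs on its own.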
#4
Here is another way to import an HTML file:

from bs4 import BeautifulSoup

with open("htmll.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
    print(soup.prettify())
#5
(Sep-07-2018, 09:03 PM)Truman Wrote: here is another way to import an HTML file:
You have to be careful with encoding when doing this, both when you get the HTML from the web and save it, and when you open it again.
It's easy to mess up Unicode, so try to always use UTF-8 in and out.
from bs4 import BeautifulSoup

with open("htmll.html", encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, 'lxml')
    print(soup.prettify())
Requests and BeautifulSoup handle this well together; I explain it in more detail in the thread "Reading a html file".
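A small sketch of why the explicit encoding matters, using only the standard library. The file name sample.html and the sample markup are made up for illustration, and cp1252 is picked only as an example of a mismatched encoding.

```python
import tempfile
from pathlib import Path

# Sample markup with one non-ASCII character (e with an acute accent).
html = '<p class="story">caf\u00e9</p>'

# Hypothetical file name inside a temporary folder.
path = Path(tempfile.mkdtemp()) / "sample.html"
path.write_text(html, encoding="utf-8")  # save as UTF-8

# Reading with the same encoding gives back the original text.
with open(path, encoding="utf-8") as fp:
    good = fp.read()

# Reading with a different encoding silently garbles the accented character.
with open(path, encoding="cp1252") as fp:
    garbled = fp.read()

print(good == html)     # True
print(garbled == html)  # False
```

The second read raises no error; it just returns the wrong characters, which is why it is worth passing encoding="utf-8" both when saving and when opening.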
#6
(Sep-07-2018, 09:03 PM)Truman Wrote: here is another way to import an HTML file
Try to be technically precise: you don't import the HTML file, you simply open it and pass the file object to BeautifulSoup to process it.