Python Forum
Importing created module for web scraping with bs4




Importing created module for web scraping with bs4 - Truman - Sep-05-2018

module that I made:
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
 
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
 
<p class="story">...</p>
web scraping script:
import ht_doc
from bs4 import BeautifulSoup
soup = BeautifulSoup(ht_doc.html, 'html.parser')
print(soup.prettify())
Error:
C:\Python36\kodovi>busoup.py
Traceback (most recent call last):
  File "C:\Python36\kodovi\busoup.py", line 1, in <module>
    import ht_doc
ModuleNotFoundError: No module named 'ht_doc'
I decided to practise with the Beautiful Soup module and hit a snag right at the beginning. Why can't I import the ht_doc module?


RE: Importing created module for web scraping with bs4 - metulburr - Sep-06-2018

You don't import the HTML doc. If you are writing the HTML in the same file as the code, you can just save the content in a variable and pass it in like this:

html = '''
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
  
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
  
<p class="story">...</p>
'''

...
soup = BeautifulSoup(html, 'html.parser')
...
However, if you are obtaining the content from the internet, then you should use the requests module to get the HTML and pass the response text to BeautifulSoup:

...
r = requests.get('https://www.google.com')
soup = BeautifulSoup(r.text, 'html.parser')
If you have the HTML in a file, then you would have to open that HTML file like any other standard file and pass BeautifulSoup the whole content via fileObject.read().
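
A minimal sketch of the file-based approach, assuming the HTML has been saved to disk. The file name story.html and its contents are placeholders invented for this example, and the file is written to a temporary directory first only so the snippet runs on its own.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical sample content; in practice the file would already exist.
sample_html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class='title'><b>The Dormouse's story</b></p></body></html>"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    html_path = Path(tmp_dir) / "story.html"  # hypothetical file name
    html_path.write_text(sample_html, encoding="utf-8")

    # Open the file like any other text file and hand its contents to BeautifulSoup.
    with open(html_path, encoding="utf-8") as file_object:
        soup = BeautifulSoup(file_object.read(), "html.parser")

print(soup.title.string)
```

Reading the whole file with read() gives BeautifulSoup a plain string, the same kind of input as the variable approach shown earlier.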

More info here in our tutorials:
https://python-forum.io/Thread-Web-Scraping-part-1


RE: Importing created module for web scraping with bs4 - Truman - Sep-06-2018

Thank you, I had no idea that I can't use an HTML file as a module. I will create a variable.
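
For reference, import ht_doc only succeeds when Python finds a file named ht_doc.py on its import path, and that file has to contain Python code, for example a variable holding the HTML. The sketch below assumes such a module. It writes ht_doc.py into a temporary directory only so the example is self-contained; normally the file would be created by hand next to the script.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Contents of a hypothetical ht_doc.py: plain Python that stores the HTML in a variable.
module_source = '''
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
</body></html>
"""
'''

with tempfile.TemporaryDirectory() as module_dir:
    Path(module_dir, "ht_doc.py").write_text(module_source, encoding="utf-8")
    sys.path.insert(0, module_dir)  # make the directory importable
    importlib.invalidate_caches()   # the file was created after interpreter start
    try:
        ht_doc = importlib.import_module("ht_doc")
    finally:
        sys.path.remove(module_dir)

# ht_doc.html is now an ordinary string, so it could be passed to
# BeautifulSoup(ht_doc.html, 'html.parser') as in the original script.
print(ht_doc.html.strip().splitlines()[0])
```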


RE: Importing created module for web scraping with bs4 - Truman - Sep-07-2018

Here is another way to import an html file:

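from bs4 import BeautifulSoup

with open("htmll.html") as fp:
    # Name the parser explicitly; otherwise bs4 picks one itself and emits a warning.
    soup = BeautifulSoup(fp, 'html.parser')
    print(soup.prettify())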



RE: Importing created module for web scraping with bs4 - snippsat - Sep-07-2018

(Sep-07-2018, 09:03 PM)Truman Wrote: Here is another way to import an html file:
You have to be careful with encoding when doing this, both when getting the HTML from the web (saving it) and when opening it.
It's easy to mess up Unicode, so try to always use UTF-8 in and out.
with open("htmll.html", encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
    print(soup.prettify())
Requests and BS handle this well together; I explain it better here: Reading a html file.
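
A small standard-library sketch of why the encoding matters. It writes a snippet as UTF-8 and reads it back twice, once as UTF-8 and once as cp1252, which is used here as an assumed example of a mismatched default encoding. The sample text is made up for illustration.

```python
import tempfile
from pathlib import Path

sample_html = "<p>Café, naïve</p>"  # hypothetical content with non-ASCII characters

with tempfile.TemporaryDirectory() as tmp_dir:
    html_file = Path(tmp_dir) / "htmll.html"
    html_file.write_text(sample_html, encoding="utf-8")  # save as UTF-8

    # Reading with the same encoding gives back the original text.
    with open(html_file, encoding="utf-8") as fp:
        decoded_ok = fp.read()

    # Reading the same bytes with a different encoding silently garbles them.
    with open(html_file, encoding="cp1252") as fp:
        decoded_wrong = fp.read()

print(decoded_ok == sample_html)     # True
print(decoded_wrong == sample_html)  # False
```

No exception is raised in the mismatched case, which is why passing encoding explicitly on both the write and the read is the safer habit.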


RE: Importing created module for web scraping with bs4 - buran - Sep-08-2018

(Sep-07-2018, 09:03 PM)Truman Wrote: Here is another way to import an html file
Try to be technically precise: you don't import the HTML file, you simply open it and pass the file handle to BeautifulSoup to process it.