Python Forum

module that I made:

<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
 
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
 
<p class="story">...</p>

web scraping script:

import ht_doc
from bs4 import BeautifulSoup
soup = BeautifulSoup(ht_doc.html, 'html.parser')
print(soup.prettify())

error:

C:\Python36\kodovi>busoup.py
Traceback (most recent call last):
  File "C:\Python36\kodovi\busoup.py", line 1, in <module>
    import ht_doc
ModuleNotFoundError: No module named 'ht_doc'

I decided to practise with the Beautiful Soup module and hit a snag right at the beginning.

Why can't I import the ht_doc module?

You don't import the HTML doc. If you are writing the HTML in the same file as the code, you can just save the content in a variable and pass it in like this:

html = '''
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
  
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
  
<p class="story">...</p>
'''

...
soup = BeautifulSoup(html, 'html.parser')
...
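
As a side note on why the original `import ht_doc` failed: Python's import system only looks for importable modules (such as `.py` files) on its search path, so a file with an `.html` extension is never picked up. The sketch below is only an illustration under assumptions: it uses a temporary scratch folder in place of the script's directory, and the names `ht_page.html` and the short markup strings are made up for the demo. It shows that the `.html` file is not found as a module, while a `ht_doc.py` file that stores the markup in a variable named `html` imports fine.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path

# Hypothetical setup: a scratch folder standing in for the script's directory.
workdir = Path(tempfile.mkdtemp())
sys.path.insert(0, str(workdir))

# An .html file is not something the import system will load as a module.
(workdir / "ht_page.html").write_text("<p>hello</p>", encoding="utf-8")
importlib.invalidate_caches()
found_html = importlib.util.find_spec("ht_page") is not None

# A .py file that keeps the markup in a variable is a normal module.
(workdir / "ht_doc.py").write_text(
    'html = """<p class="title">hello</p>"""\n', encoding="utf-8"
)
importlib.invalidate_caches()
ht_doc = importlib.import_module("ht_doc")

print(found_html)   # whether "ht_page" was found as a module
print(ht_doc.html)  # the markup string stored in ht_doc.py
```

With a module like that in the same folder as the script, `BeautifulSoup(ht_doc.html, 'html.parser')` from the first post would receive a plain string, the same as the variable approach above.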

However, if you are obtaining the content from the internet, then you should use the requests module to get the HTML, and use

...
r = requests.get('https://www.google.com')
soup = BeautifulSoup(r.text, 'html.parser')

If you have the HTML in a file, then you have to open that file like any other standard file and pass BeautifulSoup its whole content via fileObject.read().

More info here in our tutorials:
https://python-forum.io/Thread-Web-Scraping-part-1

Thank you, I had no idea that I can't use an HTML file as a module. I will create a variable.

Here is another way to import an HTML file:

with open("htmll.html") as fp:
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(fp, 'html.parser')
    print(soup.prettify())

(Sep-07-2018, 09:03 PM)Truman Wrote: Here is another way to import an HTML file:

You have to be careful with encoding when doing this, both when you get the HTML from the web (and save it) and when you open it again.
It's easy to mess up Unicode, so try to always use UTF-8 in and out.

with open("htmll.html", encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, 'lxml')
    print(soup.prettify())

Requests and BS handle this well together; I explain it better here: Reading a html file.
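
To make the encoding point concrete, here is a small standard-library sketch (no BeautifulSoup needed). It assumes a made-up page string and a hypothetical file name `page.html` in a temporary folder. It saves the text as UTF-8, then reads it back once with the matching encoding and once with a mismatched one (`latin-1`, picked only for the demo).

```python
import tempfile
from pathlib import Path

# Made-up sample markup containing non-ASCII characters.
page = "<p>Café, naïve, 日本語</p>"
path = Path(tempfile.mkdtemp()) / "page.html"  # hypothetical file name

# Save with an explicit encoding instead of relying on the platform default.
with open(path, "w", encoding="utf-8") as fp:
    fp.write(page)

# Reading with the same encoding gives back the original text.
with open(path, encoding="utf-8") as fp:
    same = fp.read()

# Reading with a mismatched encoding silently produces garbled characters.
with open(path, encoding="latin-1") as fp:
    garbled = fp.read()

print(same == page)     # True
print(garbled == page)  # False
```

The mismatched read does not raise an error, which is why this kind of bug is easy to miss until the parsed text looks wrong.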

(Sep-07-2018, 09:03 PM)Truman Wrote: Here is another way to import an HTML file

Try to be technically precise: you don't import the HTML file, you simply open it and pass the file handle to BeautifulSoup to process it.

Truman

metulburr

Truman

Truman

snippsat

buran