(Aug-20-2018, 02:25 PM)peterl Wrote: I have saved a website in my HD as html file and I try to use BeautifulSoup to scrape it.
Be careful when you save the page so that the original encoding is preserved.
Always use Requests to fetch the site, and when the source is going to be given to BS, save it as bytes (wb).
import requests
url = 'https://dataquestio.github.io/web-scraping-pages/simple.html'
response = requests.get(url)
with open('simple.html', 'wb') as f_out:
    f_out.write(response.content)
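A small sketch of why saving as bytes matters, using only the standard library. The string html_text and the temporary file page.html are made up for illustration; raw stands in for what response.content would hold.

```python
import tempfile
from pathlib import Path

# Hypothetical page content; raw plays the role of response.content (bytes)
html_text = '<p>Café</p>'
raw = html_text.encode('utf-8')

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'page.html'
    # Writing in 'wb' mode stores the bytes exactly as received
    with open(path, 'wb') as f_out:
        f_out.write(raw)
    # Reading in 'rb' mode gives back the same bytes, nothing is re-encoded
    with open(path, 'rb') as f_in:
        saved = f_in.read()

round_trip_ok = (saved == raw)

# Decoding the same bytes with the wrong encoding garbles non-ASCII characters
garbled = raw.decode('latin-1')
```

Here round_trip_ok is True, while garbled no longer matches html_text. That mismatch is the kind of damage a text-mode save or read with the wrong encoding can cause.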
Now read it with BS. BS will handle the decoding to Unicode, which in this case is from UTF-8.
bs4 Wrote:Beautiful Soup uses a sub-library called Unicode, Dammit to detect a document’s encoding
from bs4 import BeautifulSoup
# Open in binary mode so BS gets the raw bytes and detects the encoding itself
with open('simple.html', 'rb') as f_in:
    page_soup = BeautifulSoup(f_in, "html.parser")
print(page_soup)
Output:
<!DOCTYPE html>
<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>
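To check which encoding BS settled on, a sketch like the following may help. It assumes bs4 is installed. The markup is invented for the example and declares its charset in a meta tag, so detection does not depend on guessing.

```python
from bs4 import BeautifulSoup

# Made-up document as bytes, similar to what a file opened with 'rb' returns
markup = (
    '<html><head><meta charset="utf-8">'
    '<title>Café</title></head><body></body></html>'
).encode('utf-8')

soup = BeautifulSoup(markup, 'html.parser')

# original_encoding is set when BS had to decode bytes itself
detected = soup.original_encoding
title_text = soup.title.string
```

If a str is passed instead of bytes, original_encoding is None, because there was nothing left for BS to decode.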
To read the page with Requests alone, use response.text:
import requests
url = 'https://dataquestio.github.io/web-scraping-pages/simple.html'
response = requests.get(url)
Usage:
# Requests reports the encoding it worked out from the response headers
>>> response.encoding
'utf-8'
>>> print(response.text)
<!DOCTYPE html>
<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>
Requests and BS together:
import requests
from bs4 import BeautifulSoup
url = 'https://dataquestio.github.io/web-scraping-pages/simple.html'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'html.parser')
print(soup.find('title').text)
Output:
A simple example page