Python Forum
Reading a html file
(Aug-20-2018, 02:25 PM)peterl Wrote: I have saved a website in my HD as html file and I try to use BeautifulSoup to scrape it.
You have to be careful when you save the page so that you keep the same encoding.
Always use Requests to fetch the site.
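To see why the encoding matters, here is a small sketch using only the standard library. The sample string and the file name `sample.html` are made up for illustration; the point is that the same bytes on disk come back intact with the right codec and garbled with the wrong one.

```python
import os
import tempfile

# Hypothetical sample markup containing non-ASCII characters
html = '<p>Café, naïve, 日本語</p>'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.html')

    # Save as bytes with a known encoding (UTF-8)
    with open(path, 'wb') as f_out:
        f_out.write(html.encode('utf-8'))

    # Reading with the same encoding gives the original text back
    with open(path, encoding='utf-8') as f_in:
        same = f_in.read()

    # Reading with a different encoding silently produces mojibake
    with open(path, encoding='latin-1') as f_in:
        garbled = f_in.read()

print(same == html)
print(garbled == html)
```

The first comparison holds and the second does not, and the wrong codec raises no error, which is why this kind of bug is easy to miss.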

If you are going to give the source to BS, save it as bytes (mode wb).
import requests

url = 'https://dataquestio.github.io/web-scraping-pages/simple.html'
response = requests.get(url)
with open('simple.html', 'wb') as f_out:
    f_out.write(response.content)
Read it with BS. BS will now handle the decoding to Unicode; in this case the source is UTF-8.
bs4 Wrote:Beautiful Soup uses a sub-library called Unicode, Dammit to detect a document’s encoding
from bs4 import BeautifulSoup

# Open in binary mode so that BS detects the encoding, not the platform default codec
with open('simple.html', 'rb') as f_in:
    page_soup = BeautifulSoup(f_in, "html.parser")
print(page_soup)
Output:
<!DOCTYPE html> <html> <head> <title>A simple example page</title> </head> <body> <p>Here is some simple content for this page.</p> </body> </html>
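A minimal sketch of the encoding detection mentioned in the bs4 quote, assuming bs4 is installed. The markup here is a made-up sample, not the page above. When BS gets bytes it records the encoding it settled on in `original_encoding`; when it gets an already decoded string there is nothing to detect, so the attribute is None.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, encoded to bytes the way a 'wb' save would store it
raw = ('<html><head><meta charset="utf-8"><title>Test</title></head>'
       '<body><p>Café</p></body></html>').encode('utf-8')

# Bytes in: BS detects the encoding itself
soup_from_bytes = BeautifulSoup(raw, 'html.parser')
print(soup_from_bytes.original_encoding)
print(soup_from_bytes.p.text)

# str in: already decoded, so no detection takes place
soup_from_str = BeautifulSoup(raw.decode('utf-8'), 'html.parser')
print(soup_from_str.original_encoding)
```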

To read it with Requests alone, use the text attribute.
import requests

url = 'https://dataquestio.github.io/web-scraping-pages/simple.html'
response = requests.get(url)
Usage:
# Requests reports the encoding it took from the HTTP response headers
>>> response.encoding
'utf-8'

>>> print(response.text)
<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>

Requests and BS together. Pass response.content (bytes) so that BS does the decoding.
import requests
from bs4 import BeautifulSoup

url = 'https://dataquestio.github.io/web-scraping-pages/simple.html'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'html.parser')
print(soup.find('title').text)
Output:
A simple example page
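The same steps could be wrapped into small functions, roughly like this. It is only a sketch: the names `parse_title` and `fetch_title` and the 10 second timeout are my own choices, not something from Requests or bs4.

```python
import requests
from bs4 import BeautifulSoup


def parse_title(raw_html: bytes) -> str:
    """Return the <title> text from raw HTML bytes, or '' if there is none."""
    soup = BeautifulSoup(raw_html, 'html.parser')
    title = soup.find('title')
    return title.get_text(strip=True) if title else ''


def fetch_title(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its title (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Fail loudly on 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    # Hand BS the bytes, not response.text, so BS detects the encoding
    return parse_title(response.content)
```

Calling `fetch_title(url)` with the url from the example above should print the same title, as long as the page is still online.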


Messages In This Thread
Reading a html file - by peterl - Aug-20-2018, 01:37 PM
RE: Reading a html file - by metulburr - Aug-20-2018, 01:48 PM
RE: Reading a html file - by peterl - Aug-20-2018, 02:25 PM
RE: Reading a html file - by snippsat - Aug-20-2018, 02:57 PM
RE: Reading a html file - by peterl - Aug-20-2018, 03:16 PM
