Reading a html file

peterl · Aug-20-2018, 01:37 PM

Hi guys,

I'm trying to read an HTML file and then print it in IDLE. The issue is that IDLE replaces < with &lt; and > with &gt;. Does anyone know how to solve that?

Thank you

***metulburr*** · (This post was last modified: Aug-20-2018, 01:49 PM by metulburr.)

Do you mean you are reading the html from a website and want to save it to a file?

I am not sure what you mean when you say you are printing it in IDLE. IDLE is a text editor with an embedded Python prompt. The only storage there is the actual file you save your code to.

Quote: < to &lt; and > to &gt;

These are the HTML entity codes for < and >, which are needed because HTML uses those characters to delimit tags: &lt; (less than) and &gt; (greater than).
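
To make the entity point concrete, here is a minimal sketch using Python's standard `html` module. The sample string is made up for illustration; it only shows how < and > map to &lt; and &gt; and back, and is not a diagnosis of what happened to the saved file.

```python
import html

# Made-up sample markup, used only for illustration.
raw = "<p>Here is some simple content for this page.</p>"

# Escaping turns the tag delimiters into entities, so a browser shows them as text.
escaped = html.escape(raw)
print(escaped)

# Unescaping reverses the step and restores the original markup.
restored = html.unescape(escaped)
print(restored)
```

If a saved file already contains &lt; and &gt; where tags should be, the markup was probably escaped when the page was saved, before any parser read it.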

peterl · Aug-20-2018, 02:25 PM

I have saved a website on my hard drive as an HTML file and I am trying to use BeautifulSoup to scrape it.
The problem is that for some reason < gets replaced with &lt; and > with &gt;.

example:
from bs4 import BeautifulSoup as soup
my_url = open('test2.html', 'r')
page_soup = soup(my_url, "html.parser")
print (page_soup)

***snippsat*** · (This post was last modified: Aug-20-2018, 02:57 PM by snippsat.)

(Aug-20-2018, 02:25 PM)peterl Wrote: I have saved a website on my hard drive as an HTML file and I am trying to use BeautifulSoup to scrape it.

You have to be careful when you save the page so that you keep the same encoding.
Always use Requests to read the site.

If you are going to give the source to BS later, save it as bytes (mode wb).

import requests

url = 'https://dataquestio.github.io/web-scraping-pages/simple.html'
response = requests.get(url)
with open('simple.html', 'wb') as f_out:
    f_out.write(response.content)
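
The same idea can be tried offline. This is only a sketch: `content` is a made-up stand-in for `response.content`, and the temporary path is hypothetical. It shows that bytes written in wb mode come back unchanged in rb mode.

```python
import os
import tempfile

# Stand-in for response.content: the page as raw UTF-8 bytes (made-up sample data).
content = "<p>café</p>".encode("utf-8")

# Hypothetical location, a fresh temporary directory so nothing is overwritten.
path = os.path.join(tempfile.mkdtemp(), "sample.html")

# 'wb' writes the bytes untouched, so no platform default encoding is applied.
with open(path, "wb") as f_out:
    f_out.write(content)

# 'rb' returns exactly the same bytes, ready to hand to a parser.
with open(path, "rb") as f_in:
    data = f_in.read()

print(data == content)
```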

Now read the file with BS. BS will handle the Unicode decoding, which in this case is UTF-8.

bs4 Wrote:Beautiful Soup uses a sub-library called Unicode, Dammit to detect a document’s encoding

from bs4 import BeautifulSoup

# Open in binary mode so BS detects the encoding itself, and close the file afterwards
with open('simple.html', 'rb') as f_in:
    page_soup = BeautifulSoup(f_in, "html.parser")
print(page_soup)

Output:
<!DOCTYPE html>

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

To read the page with Requests alone, use response.text.

import requests

url = 'https://dataquestio.github.io/web-scraping-pages/simple.html'
response = requests.get(url)

Usage:

# Requests reports the encoding it will use to decode response.text
>>> response.encoding
'utf-8'

>>> print(response.text)
<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>
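
As a rough offline illustration of why the encoding matters: `response.text` is the raw bytes in `response.content` decoded with `response.encoding`. In the sketch below, `content` is a made-up stand-in for those bytes, and latin-1 is just an example of a wrong guess.

```python
# Stand-in for response.content: the bytes a server might send (made-up sample).
content = "<p>café</p>".encode("utf-8")

# Decoding with the right encoding gives back the original text.
good = content.decode("utf-8")

# Decoding with the wrong one still "works", but garbles every non-ASCII character.
bad = content.decode("latin-1")

print(good)
print(bad)
```

This is why passing the bytes (`response.content`) to BS, as in the next example, is the safer choice when the declared encoding might be wrong.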

Requests and BS together.

import requests
from bs4 import BeautifulSoup

url = 'https://dataquestio.github.io/web-scraping-pages/simple.html'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'html.parser')
print(soup.find('title').text)

Output:
A simple example page

peterl · Aug-20-2018, 03:16 PM

Thanks that solved my problem
