Reading a html file
Reading a html file - peterl - Aug-20-2018

Hi guys, I'm trying to read an HTML file and then print it in IDLE. The issue is that IDLE replaces < with &lt; and > with &gt;. Does anyone know how to solve that? Thank you

RE: Reading a html file - metulburr - Aug-20-2018

Do you mean you are reading the HTML from a website and want to save it to a file? I am not sure what you mean when you say you are printing it in IDLE. IDLE is a text editor with an embedded Python prompt. The only storage there is the actual file you save code to.

Quote: < to &lt; and > to &gt;

These are the HTML character entities for < and >, needed because HTML uses those characters to delimit tags: &lt; (less than) and &gt; (greater than).

RE: Reading a html file - peterl - Aug-20-2018

I have saved a website to my hard drive as an HTML file and I am trying to use BeautifulSoup to scrape it. The problem is that for some reason < gets replaced with &lt; and > with &gt;. Example:

    from bs4 import BeautifulSoup as soup

    my_url = open('test2.html', 'r')
    page_soup = soup(my_url, "html.parser")
    print(page_soup)

RE: Reading a html file - snippsat - Aug-20-2018

(Aug-20-2018, 02:25 PM)peterl Wrote: I have saved a website in my HD as html file and I try to use BeautifulSoup to scrape it.

You have to be careful when you save the page so that you keep the same encoding. Always use Requests to read the site. When you are going to give the source to BS, save it as bytes (mode wb).

    import requests

    url = 'https://dataquestio.github.io/web-scraping-pages/simple.html'
    response = requests.get(url)
    # response.content is the raw bytes, so nothing is re-encoded on save
    with open('simple.html', 'wb') as f_out:
        f_out.write(response.content)

Then read it with BS. BS will now handle the Unicode decoding, which in this case is UTF-8.

bs4 Wrote: Beautiful Soup uses a sub-library called Unicode, Dammit to detect a document's encoding

    from bs4 import BeautifulSoup

    # Open in binary mode so BS sees the raw bytes and detects the encoding itself
    with open('simple.html', 'rb') as f_in:
        page_soup = BeautifulSoup(f_in, "html.parser")
    print(page_soup)
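To illustrate why saving and reading the page as bytes matters, here is a minimal sketch using only the standard library. The sample markup, the temporary file name page.html and the choice of cp1252 as the "wrong" codec are assumptions made for the demonstration, not anything taken from the thread.

```python
import os
import tempfile

# Hypothetical sample markup containing one non-ASCII character.
original = "<p>caf\u00e9</p>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "page.html")

    # Save the page as raw UTF-8 bytes, like writing response.content with 'wb'.
    with open(path, "wb") as f_out:
        f_out.write(original.encode("utf-8"))

    # Reading in text mode with a mismatched codec silently garbles the text.
    with open(path, "r", encoding="cp1252") as f_in:
        garbled = f_in.read()

    # Reading the raw bytes back keeps them intact for a later, correct decode.
    with open(path, "rb") as f_in:
        raw = f_in.read()

decoded = raw.decode("utf-8")

print(garbled)
print(decoded)
```

The bytes on disk are identical in both reads. Only the codec used to interpret them differs, which is why handing the raw bytes to a parser that detects the encoding is the safer route.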
To read it with Requests alone, use text:

    import requests

    url = 'https://dataquestio.github.io/web-scraping-pages/simple.html'
    response = requests.get(url)

Usage:

    # Requests reports the encoding it will use to decode the response
    >>> response.encoding
    'utf-8'
    >>> print(response.text)
    <!DOCTYPE html>
    <html>
        <head>
            <title>A simple example page</title>
        </head>
        <body>
            <p>Here is some simple content for this page.</p>
        </body>
    </html>

Requests and BS together:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://dataquestio.github.io/web-scraping-pages/simple.html'
    url_get = requests.get(url)
    # Pass the raw bytes and let BS detect the encoding
    soup = BeautifulSoup(url_get.content, 'html.parser')
    print(soup.find('title').text)
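On the symptom from the first post, the short sketch below shows how < and > relate to their entity forms, using the standard library's html module. The sample string is made up for illustration. If a saved file already contains escaped markup, the parser will treat it as plain text rather than as tags.

```python
import html

# Hypothetical snippet of markup used only for this demonstration.
markup = "<p>Hello</p>"

# Escaping turns the tag delimiters into character entities.
escaped = html.escape(markup)

# Unescaping reverses the conversion and restores the original characters.
restored = html.unescape(escaped)

print(escaped)
print(restored)
```

Running html.unescape on text that was escaped once gives back the original markup, which can then be passed to BeautifulSoup as usual.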
RE: Reading a html file - peterl - Aug-20-2018

Thanks, that solved my problem.