bs4 : output html content into a txt file

bs4 : output html content into a txt file - smallabc - Jan-02-2018

I'm teaching myself Python. The following code raises a UnicodeEncodeError. How should I fix it? Thanks.

import bs4, requests
#----------------------------------------------------------------------------
URL = "https://learnxinyminutes.com/docs/r"
#----------------------------------------------------------------------------
soup = bs4.BeautifulSoup(requests.get(URL).text, "lxml")
with open( r"C:\Users\User\Desktop\Test.txt" ,"w") as oFile:
    oFile.write(str(soup.html))
    oFile.close()

UnicodeEncodeError: 'cp950' codec can't encode character '\xf8' in position 20242: illegal multibyte sequence

RE: bs4 : output html content into a txt file - buran - Jan-02-2018

You are using Python 2, so change line 7 to

oFile.write(str(soup.html.encode('utf8')))

Even better would be to use Python 3 (given that you are just starting to learn Python), as support for Python 2 will end soon.
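
For readers on Python 3, here is a minimal sketch of the same encode-then-write idea. Note that in Python 3, calling str() on encoded bytes yields the "b'...'" representation rather than the text, so the bytes are written directly to a file opened in binary mode. The sample string and the file name demo_bytes.txt are assumptions for illustration; the string stands in for str(soup.html).

```python
import os
import tempfile

# Stand-in for str(soup.html); it contains the character '\xf8' from the traceback.
text = "<p>S\u00f8ren</p>"

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo_bytes.txt")

data = text.encode("utf-8")      # str -> bytes, done explicitly
with open(path, "wb") as f:      # binary mode writes the bytes unchanged
    f.write(data)

# Read the file back and decode it to confirm nothing was lost.
with open(path, "rb") as f:
    round_trip = f.read().decode("utf-8")
```

Because the file is opened with "wb", no platform default encoding is involved at all.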

RE: bs4 : output html content into a txt file - snippsat - Jan-02-2018

Like this: use content when reading the response.
Set encoding='utf-8' in open(); with prettify() there is no need for a str() conversion.
This is a Python 3 solution, which is what we are going to be stricter about recommending in 2018.
So I will not post a Python 2 solution for this.

import requests
from bs4 import BeautifulSoup

url = "https://learnxinyminutes.com/docs/r"
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')

with open('url.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(soup.prettify())
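
The 'cp950' in the traceback suggests that open() was called without an encoding and fell back to the Windows locale default, which has no mapping for '\xf8'. The following is a small sketch of that behaviour without any network access. The sample string, the helper name try_write, and the file names are assumptions for illustration only.

```python
import os
import tempfile

# Stand-in for the page text; it contains the character '\xf8' from the traceback.
html = "<html><body>\u00f8</body></html>"
out_dir = tempfile.mkdtemp()


def try_write(encoding):
    """Write html as text with the given encoding; return True on success."""
    path = os.path.join(out_dir, "demo_" + encoding + ".txt")
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(html)
    except UnicodeEncodeError:
        return False
    return True


cp950_ok = try_write("cp950")   # same codec as in the original error
utf8_ok = try_write("utf-8")    # the encoding used in the solution above
```

Passing encoding='utf-8' explicitly makes the result independent of the machine's locale settings.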