Right way to open files with different encodings? - Python Forum (https://python-forum.io), Python Coding, General Coding Help, Thread /thread-42020.html
Right way to open files with different encodings? - Winfried - Apr-23-2024

Hello,

Some of the files could be Windows (latin1, iso8859-1, cp1252), others could be utf-8. Is try/except the right way to do it?

    from bs4 import BeautifulSoup

    #with open(file, 'r') as f:
    #with open(file, 'r', encoding='utf-8') as f:
    # latin1, iso8859-1, cp1252
    with open(file, 'r', encoding='latin-1') as f:
        content_text = f.read()
    soup = BeautifulSoup(content_text, 'html.parser')

Thank you.

RE: Right way to open files with different encodings? - Gribouillis - Apr-23-2024

(Apr-23-2024, 08:49 AM)Winfried Wrote: Is try/except the right way to do it?

Normally, there is no reliable way to decode a file whose text encoding is unknown. Specialized modules such as chardet contain tools to guess the encoding of a file. That is probably the best solution, but read the FAQ of the chardet module first.

Python's standard library is not equipped with tools to guess encodings. Attempting to decode and catching exceptions will succeed in diagnosing that some encodings are not the actual encoding of the file, but a success does not mean that it is the correct encoding, and the result can be mojibake.

RE: Right way to open files with different encodings? - snippsat - Apr-23-2024

(Apr-23-2024, 08:49 AM)Winfried Wrote: Some of the files could be Windows (latin1, iso8859-1, cp1252), others could be utf-8.

As these are .html files, some advice: if you are the one creating or saving these files, there is a way with Requests and BS to always save them as utf-8. If the files already exist then, as Gribouillis says, there is chardet.

So e.g. if I have one .html file (which I made latin-1) and one in utf-8:
    λ chardetect page_latin.html
    page_latin.html: ISO-8859-1 with confidence 0.73

    G:\div_code\html_utf
    λ chardetect page_utf8.html
    page_utf8.html: utf-8 with confidence 0.7525

    from bs4 import BeautifulSoup

    # latin-1 file: the encoding has to be given explicitly
    with open('page_latin.html', encoding='latin-1') as fp:
        soup = BeautifulSoup(fp, 'lxml')
    h1_tag = soup.find('h1')
    print(h1_tag)

    # UTF-8 file: pass the encoding explicitly too, because open()
    # otherwise falls back to the locale's default encoding
    with open('html_new.html', encoding='utf-8') as fp:
        soup = BeautifulSoup(fp, 'lxml')
    h1_tag = soup.find('h1')
    print(h1_tag)

So all works as it should. If I take away encoding='latin-1' and the file is read as UTF-8 instead, it breaks with a UnicodeDecodeError.

You can also convert to utf-8, as this happens when you open a file in Beautiful Soup:

Bs4 Doc Wrote: Any HTML or XML document is written in a specific encoding like ASCII or UTF-8.

So from latin-1 to utf-8:

    from bs4 import BeautifulSoup

    # Read the raw bytes and let Beautiful Soup work out the encoding,
    # then write the parsed document back out as UTF-8
    with open('page_latin.html', 'rb') as fp, open('html_new.html', 'w', encoding='utf-8') as fp_out:
        file_out = fp.read()
        # Once a document is loaded into BS it is Unicode
        soup = BeautifulSoup(file_out, 'lxml')
        fp_out.write(soup.prettify())

    λ chardetect html_new.html
    html_new.html: utf-8 with confidence 0.7525

File used in the test; both copies are the same, just with different encodings.

    <html lang="en">
    <head>
      <title>Here is site title</title>
    </head>
    <body>
      <h1>Jalapeñod je pèle</h1>
    </body>
    </html>
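Coming back to the original try/except question, here is a minimal sketch of how that approach could look using only the standard library. It is not taken from any of the posts above: the function names `decode_html_bytes` and `read_text_guessing` are made up for illustration, and the order of candidates (UTF-8 first, then cp1252, then latin-1) is an assumption that fits the files described in the question. UTF-8 goes first because it is strict and rejects most non-UTF-8 byte sequences, while latin-1 goes last because it accepts every byte and therefore never raises. As Gribouillis points out, a decode that succeeds is still only a guess.

```python
from typing import Tuple


def decode_html_bytes(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used) for the raw bytes of a file.

    Hypothetical helper: tries strict encodings first and keeps
    latin-1 as a last resort, since latin-1 maps every byte to a
    character and so can never fail.
    """
    for encoding in ("utf-8", "cp1252"):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate is definitely wrong; try the next one.
            continue
    return raw.decode("latin-1"), "latin-1"


def read_text_guessing(path: str) -> Tuple[str, str]:
    """Read a file in binary mode and decode it with decode_html_bytes."""
    with open(path, "rb") as fp:
        return decode_html_bytes(fp.read())
```

The returned text can then be handed to BeautifulSoup as in the posts above, e.g. `text, used = read_text_guessing('page_latin.html')` followed by `BeautifulSoup(text, 'html.parser')`. Keeping the encoding name in the result makes it easy to log which files fell through to the fallback and might deserve a check with chardet.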