Oct-29-2021, 05:12 AM
Hello, everyone
I apologize for my English.
I have a Python script that extracts the complete text from a domain and every one of its subdomains, so I end up with practically the entire text of the website. The script itself runs without any problems, but the output always contains strange characters and emojis. Does anyone know how to filter them out? I have tried several times to make BeautifulSoup ignore this text, but it didn't work.
For example:
webscraper.py (Size: 2.3 KB / Downloads: 312)
bytes(text, 'utf-8').decode('utf-8', 'ignore')
My full script is attached.
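A note on the line above, with a rough sketch of one alternative. Encoding a str to UTF-8 and decoding it back returns the same text, because emojis are valid UTF-8, so nothing gets filtered. One possible approach is to drop characters by their Unicode category instead. The helper name strip_symbols and the choice of categories are assumptions made for illustration; they are not taken from the attached script.

```python
import unicodedata

# Assumption: "strange characters" means symbols, private-use, surrogate
# and unassigned code points. Adjust the set to taste.
DROP_CATEGORIES = {"So", "Co", "Cs", "Cn"}

# Invisible characters used to build emoji sequences:
# zero-width joiner and the emoji variation selector.
EMOJI_GLUE = {"\u200d", "\ufe0f"}


def is_unwanted(ch):
    """Return True if the character should be removed from the text."""
    if ch in EMOJI_GLUE:
        return True
    # Emoji skin-tone modifiers, U+1F3FB to U+1F3FF.
    if 0x1F3FB <= ord(ch) <= 0x1F3FF:
        return True
    return unicodedata.category(ch) in DROP_CATEGORIES


def strip_symbols(text):
    """Hypothetical helper: remove emojis and similar symbols from text."""
    return "".join(ch for ch in text if not is_unwanted(ch))


if __name__ == "__main__":
    sample = "Hello \U0001F600 world"
    # The round trip from the post leaves the emoji in place.
    print(bytes(sample, "utf-8").decode("utf-8", "ignore") == sample)
    print(strip_symbols(sample))
```

Letters from other scripts, punctuation and whitespace are kept. The "So" category also covers symbols such as the degree sign and the trademark sign, so those would be removed as well; if they matter, take "So" out of the set and match emoji ranges explicitly.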