It can be a difficult task because an HTML page is not structured like a document with lines and line breaks.
Rarely does someone need all the text on a web page;
more often it's a section of text that can be parsed out based on HTML tags.
There are solutions like html2text, which does a reasonably good job.
BeautifulSoup has
soup.find_all(text=True)
but that picks up a lot of unwanted text like links, JavaScript, CSS, etc.
The NLTK package had a
clean_html()
function, but they dropped it; too much trouble to maintain.
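For completeness, BeautifulSoup also has get_text(), which is what NLTK pointed users to after dropping clean_html(). A minimal sketch (the tiny html string here is made up just for demonstration) showing that it has the same problem of pulling in script text:
from bs4 import BeautifulSoup

html = '<html><body><p>Some text.</p><script>var i = 9;</script></body></html>'
soup = BeautifulSoup(html, 'lxml')
# get_text() returns every text node in the document, JavaScript included
print(soup.get_text())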
A test:
pip install html2text
I get the best result when I first pass the page through BeautifulSoup and use
soup.prettify()
to get the Unicode right.
import requests
from bs4 import BeautifulSoup
import html2text

url = 'https://www.wikipedia.org/'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
text_maker = html2text.HTML2Text()
text_maker.ignore_links = True  # drop link targets from the output
# prettify() hands html2text a cleaned-up Unicode string
text = text_maker.handle(soup.prettify())
print(text)
Output:
# 
Wikipedia
**The Free Encyclopedia**
**English** 5 592 000+ articles
**Español** 1 397 000+ artículos
**Deutsch** 2 164 000+ Artikel
**日本語** 1 099 000+ 記事
**Русский** 1 460 000+ статей
**Français** 1 966 000+ articles
**Italiano** 1 424 000+ voci
**中文** 996 000+ 條目
**Português** 995 000+ artigos
**Polski** 1 270 000+ haseł
العربية Azərbaycanca Български Bân-lâm-gú / Hō-ló-oē Беларуская
(Акадэмічная) Català Čeština Dansk Deutsch Eesti Ελληνικά English
Español Esperanto Euskara فارسی Français Galego 한국어 Հայերեն हिन्दी
Hrvatski Bahasa Indonesia Italiano עברית ქართული Latina Lietuvių Magyar
Bahasa Melayu Bahaso Minangkabau Nederlands 日本語 Norsk (Bokmål) Norsk
(Nynorsk) Нохчийн Oʻzbekcha / Ўзбекча Polski Português Қазақша / Qazaqşa
/ قازاقشا Română Русский Simple English Sinugboanong Binisaya Slovenčina
Slovenščina Српски / Srpski Srpskohrvatski / Српскохрватски Suomi Svenska
தமிழ் ภาษาไทย Türkçe Українська اردو Tiếng Việt Volapük Winaray 中文
__
__
__ Read Wikipedia in your language __
## 1 000 000+
* Deutsch
* English
* Español
* Français
* Italiano
* Nederlands
* 日本語
* Polski
* Русский
* Sinugboanong Binisaya
* Svenska
* Tiếng Việt
* Winaray
## 100 000+
* العربية
* Azərbaycanca
* Български
* Bân-lâm-gú / Hō-ló-oē
* Беларуская (Акадэмічная)
* Català
* Čeština
* Dansk
* Eesti
* Ελληνικά
* Esperanto
* Euskara
* فارسی
* Galego
* 한국어
* Հայերեն
* हिन्दी
* Hrvatski
* Bahasa Indonesia
* עברית
* ქართული
* Latina
* Lietuvių
* Magyar
* Bahasa Melayu
* Bahaso Minangkabau
* Norsk
* Bokmål
* Nynorsk
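Rarely is the whole page needed, so a sketch (assuming the front page still keeps its heading in an <h1> tag) of pulling out just one section with BeautifulSoup before handing it to html2text:
import requests
from bs4 import BeautifulSoup
import html2text

url = 'https://www.wikipedia.org/'
soup = BeautifulSoup(requests.get(url).content, 'lxml')
# Parse out only the tag that holds the wanted text, here the <h1>
section = soup.find('h1')
text_maker = html2text.HTML2Text()
text_maker.ignore_links = True
print(text_maker.handle(section.prettify()))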
Here is a quick test with
find_all(text=True)
;
it will not work well on a whole web page,
but it can work reasonably well if you first parse out the tags that hold the text you need.
from bs4 import BeautifulSoup
html = '''\
<html>
<head>
<meta charset="UTF-8">
<title>Title of the document</title>
</head>
<body>
<h1>Main page</h1>
<p>lots of text here car is fun to ride on a sunny day.</p>
<a href="https://python-forum.io/">Best Python forum</a>
<style type="text/css">
<div>This should not be visible</div>
</style>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
text = soup.find_all(text=True)
clean_up = [i for i in text if not i == '\n']
for texts in clean_up:
    print(texts.strip())
Output:
Title of the document
Main page
lots of text here car is fun to ride on a sunny day.
Best Python forum
<div>This should not be visible</div>
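One way to clean that up is to remove the tags whose text is never wanted before collecting the text nodes. A sketch (reusing the html string from the test above) with decompose():
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# Throw away <style> and <script> tags (and everything inside them) first
for tag in soup(['style', 'script']):
    tag.decompose()
clean_up = [i.strip() for i in soup.find_all(text=True) if i.strip()]
for texts in clean_up:
    print(texts)
Now the hidden <div> inside <style> no longer shows up in the output.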