Python Forum
how to get all lines and text from a webpage
#1
I am Pratheep.
I am working on a project and I have a fault in my code.

THIS IS MY CODE:
import requests
from bs4 import BeautifulSoup
the_word = 'code'
url = 'http://www.wikipedia.org'
r = requests.get(url, allow_redirects=False)
soup = BeautifulSoup(r.content, 'lxml')
words = soup.find(text=lambda text: text and the_word in text)
print(words)
    
When I run this code I only get the heading of the webpage, but I need the whole text of the webpage.

THIS IS MY OUTPUT:
None
But I need all the letters and words from the webpage.
Please help me solve the problem.
THANK YOU.
Reply
#2
It can be a difficult task, because an HTML page is not structured as a document with lines and line breaks.
Rarely does someone need all the text on a web page;
more often you want a section of text that can be parsed out based on HTML tags.

There is a solution, html2text, that does a reasonably good job.
BeautifulSoup has soup.find_all(text=True), but that also picks up a lot of non-content text like links, JavaScript, CSS, etc.
The NLTK package used to have a clean_html() function, but they dropped it; too much trouble.
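One workable approach (just a sketch, using bs4's decompose() and the stdlib html.parser backend so lxml is not required, with made-up HTML): strip the script/style tags before calling get_text(), so their contents never show up as text.

```python
from bs4 import BeautifulSoup

html = '''
<html>
  <head><title>Demo</title><script>var x = 1;</script></head>
  <body>
    <h1>Main page</h1>
    <p>Visible paragraph text.</p>
    <style>.hidden { display: none; }</style>
  </body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')
# Remove script/style elements so their contents are not treated as text
for tag in soup(['script', 'style']):
    tag.decompose()

# get_text() now returns only the human-readable text nodes
text = soup.get_text(separator=' ', strip=True)
print(text)
```

The same decompose() step works on a full page fetched with requests.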

A test:
pip install html2text
I get the best result when I first also pass the content through BeautifulSoup and hand soup.prettify() to html2text, to get the Unicode right.
import requests
from bs4 import BeautifulSoup
import html2text

url = 'https://www.wikipedia.org/'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
# ignore_links drops the link targets so only the visible text remains
text_maker = html2text.HTML2Text()
text_maker.ignore_links = True
# prettify() re-serializes the parsed tree as Unicode before conversion
text = text_maker.handle(soup.prettify())
print(text)
Sample output:
Output:
# ![Wikipedia](portal/wikipedia.org/assets/img/Wikipedia-logo-v2.png) Wikipedia **The Free Encyclopedia** **English** 5 592 000+ articles **Español** 1 397 000+ artículos **Deutsch** 2 164 000+ Artikel **日本語** 1 099 000+ 記事 **Русский** 1 460 000+ статей **Français** 1 966 000+ articles **Italiano** 1 424 000+ voci **中文** 996 000+ 條目 **Português** 995 000+ artigos **Polski** 1 270 000+ haseł العربية Azərbaycanca Български Bân-lâm-gú / Hō-ló-oē Беларуская (Акадэмічная) Català Čeština Dansk Deutsch Eesti Ελληνικά English Español Esperanto Euskara فارسی Français Galego 한국어 Հայերեն हिन्दी Hrvatski Bahasa Indonesia Italiano עברית ქართული Latina Lietuvių Magyar Bahasa Melayu Bahaso Minangkabau Nederlands 日本語 Norsk (Bokmål) Norsk (Nynorsk) Нохчийн Oʻzbekcha / Ўзбекча Polski Português Қазақша / Qazaqşa / قازاقشا Română Русский Simple English Sinugboanong Binisaya Slovenčina Slovenščina Српски / Srpski Srpskohrvatski / Српскохрватски Suomi Svenska தமிழ் ภาษาไทย Türkçe Українська اردو Tiếng Việt Volapük Winaray 中文 __ __ __ Read Wikipedia in your language __ ## 1 000 000+ * Deutsch * English * Español * Français * Italiano * Nederlands * 日本語 * Polski * Русский * Sinugboanong Binisaya * Svenska * Tiếng Việt * Winaray ## 100 000+ * العربية * Azərbaycanca * Български * Bân-lâm-gú / Hō-ló-oē * Беларуская (Акадэмічная) * Català * Čeština * Dansk * Eesti * Ελληνικά * Esperanto * Euskara * فارسی * Galego * 한국어 * Հայերեն * हिन्दी * Hrvatski * Bahasa Indonesia * עברית * ქართული * Latina * Lietuvių * Magyar * Bahasa Melayu * Bahaso Minangkabau * Norsk * Bokmål * Nynorsk

Here is a quick test with find_all(text=True).
This will not work well on a whole web page,
but it can work reasonably well if you first parse out the tags that hold the text you need.
from bs4 import BeautifulSoup

html = '''\
<html>
  <head>
    <meta charset="UTF-8">
    <title>Title of the document</title>
  </head>
  <body>
    <h1>Main page</h1>
    <p>lots of text here car is fun to ride on a sunny day.</p>
    <a href="https://python-forum.io/">Best Python forum</a>
    <style type="text/css">
      <div>This should not be visible</div>
    </style>
  </body>
</html>
'''

soup = BeautifulSoup(html, 'lxml')
text = soup.find_all(text=True)
# Drop the bare newline nodes that sit between tags
clean_up = [i for i in text if i != '\n']
for texts in clean_up:
    print(texts.strip())
Output:
Title of the document Main page lots of text here car is fun to ride on a sunny day. Best Python forum <div>This should not be visible</div>
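To show the "parse tags first" idea, a small sketch (made-up HTML) that pulls only the <p> text instead of the whole page:

```python
from bs4 import BeautifulSoup

html = '''
<html>
  <body>
    <h1>Main page</h1>
    <p>First paragraph about cars.</p>
    <p>Second paragraph about the weather.</p>
    <script>console.log("noise");</script>
  </body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')
# Narrow the search to the tags that carry the text you want,
# then call get_text() on each match
paragraphs = [p.get_text(strip=True) for p in soup.find_all('p')]
for line in paragraphs:
    print(line)
```

The script tag is never touched, so its contents never pollute the result.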
Reply
#3
from bs4 import BeautifulSoup
import requests

html = requests.get('https://www.wikipedia.org/').content
soup = BeautifulSoup(html, 'lxml')

# Print the text of every top-level element,
# skipping any node that has no .text attribute
for element in soup:
    try:
        print(element.text)
    except AttributeError:
        pass
I just tested this and it gets all the CSS too :D
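Back to the original question: soup.find(text=...) stops at the first matching text node, which is why only one thing comes back. A sketch with find_all (made-up HTML, stdlib html.parser) that collects every text node containing the word:

```python
from bs4 import BeautifulSoup

the_word = 'code'
html = '''
<html>
  <body>
    <p>Some code in a paragraph.</p>
    <p>No match here.</p>
    <pre>more code in a pre block</pre>
  </body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')
# find_all returns every matching text node, not just the first one
matches = soup.find_all(text=lambda text: text and the_word in text)
for m in matches:
    print(m.strip())
```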
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply

