It can be a difficult task because an HTML page is not structured like a document with lines and line breaks.
Rarely does someone need all the text on a web page;
more often it's a section of text that can be parsed out based on HTML tags.
There are solutions like html2text, which does a reasonably good job.
BeautifulSoup has
soup.find_all(text=True)
but that picks up a lot of unwanted text like links, JavaScript, CSS, etc.
The NLTK package had a
clean_html()
function, but they dropped it; too much trouble to maintain.
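For completeness, BeautifulSoup also has get_text(), which is what NLTK pointed users to after dropping clean_html(). A minimal sketch (the tiny html string here is made up just for demonstration) showing that it has the same problem of pulling in script text:
from bs4 import BeautifulSoup

html = '<html><body><p>Some text.</p><script>var i = 9;</script></body></html>'
soup = BeautifulSoup(html, 'lxml')
# get_text() returns every text node in the document, JavaScript included
print(soup.get_text())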
A test:
pip install html2text
I get the best result when I first pass the page through BeautifulSoup and use
soup.prettify()
to get the Unicode right.
import requests
from bs4 import BeautifulSoup
import html2text

url = 'https://www.wikipedia.org/'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
text_maker = html2text.HTML2Text()
text_maker.ignore_links = True  # drop link targets from the output
# prettify() hands html2text a cleaned-up Unicode string
text = text_maker.handle(soup.prettify())
print(text)
Output:
# 
Wikipedia
**The Free Encyclopedia**
**English** 5 592 000+ articles
**Español** 1 397 000+ artículos
**Deutsch** 2 164 000+ Artikel
**日本語** 1 099 000+ 記事
**Русский** 1 460 000+ статей
**Français** 1 966 000+ articles
**Italiano** 1 424 000+ voci
**中文** 996 000+ 條目
**Português** 995 000+ artigos
**Polski** 1 270 000+ haseł
العربية Azərbaycanca Български Bân-lâm-gú / Hō-ló-oē Беларуская
(Акадэмічная) Català Čeština Dansk Deutsch Eesti Ελληνικά English
Español Esperanto Euskara فارسی Français Galego 한국어 Հայերեն हिन्दी
Hrvatski Bahasa Indonesia Italiano עברית ქართული Latina Lietuvių Magyar
Bahasa Melayu Bahaso Minangkabau Nederlands 日本語 Norsk (Bokmål) Norsk
(Nynorsk) Нохчийн Oʻzbekcha / Ўзбекча Polski Português Қазақша / Qazaqşa
/ قازاقشا Română Русский Simple English Sinugboanong Binisaya Slovenčina
Slovenščina Српски / Srpski Srpskohrvatski / Српскохрватски Suomi Svenska
தமிழ் ภาษาไทย Türkçe Українська اردو Tiếng Việt Volapük Winaray 中文
__
__
__ Read Wikipedia in your language __
## 1 000 000+
* Deutsch
* English
* Español
* Français
* Italiano
* Nederlands
* 日本語
* Polski
* Русский
* Sinugboanong Binisaya
* Svenska
* Tiếng Việt
* Winaray
## 100 000+
* العربية
* Azərbaycanca
* Български
* Bân-lâm-gú / Hō-ló-oē
* Беларуская (Акадэмічная)
* Català
* Čeština
* Dansk
* Eesti
* Ελληνικά
* Esperanto
* Euskara
* فارسی
* Galego
* 한국어
* Հայերեն
* हिन्दी
* Hrvatski
* Bahasa Indonesia
* עברית
* ქართული
* Latina
* Lietuvių
* Magyar
* Bahasa Melayu
* Bahaso Minangkabau
* Norsk
* Bokmål
* Nynorsk
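Rarely is the whole page needed, so a sketch (assuming the front page still keeps its heading in an <h1> tag) of pulling out just one section with BeautifulSoup before handing it to html2text:
import requests
from bs4 import BeautifulSoup
import html2text

url = 'https://www.wikipedia.org/'
soup = BeautifulSoup(requests.get(url).content, 'lxml')
# Parse out only the tag that holds the wanted text, here the <h1>
section = soup.find('h1')
text_maker = html2text.HTML2Text()
text_maker.ignore_links = True
print(text_maker.handle(section.prettify()))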
Here is a quick test with
find_all(text=True)
;
it will not work well on a whole web page,
but it can work reasonably well if you first parse out the tags that hold the text you need.
from bs4 import BeautifulSoup
html = '''\
<html>
<head>
<meta charset="UTF-8">
<title>Title of the document</title>
</head>
<body>
<h1>Main page</h1>
<p>lots of text here car is fun to ride on a sunny day.</p>
<a href="https://python-forum.io/">Best Python forum</a>
<style type="text/css">
<div>This should not be visible</div>
</style>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
text = soup.find_all(text=True)
clean_up = [i for i in text if not i == '\n']
for texts in clean_up:
    print(texts.strip())
Output:
Title of the document
Main page
lots of text here car is fun to ride on a sunny day.
Best Python forum
<div>This should not be visible</div>
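One way to clean that up is to remove the tags whose text is never wanted before collecting the text nodes. A sketch (reusing the html string from the test above) with decompose():
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# Throw away <style> and <script> tags (and everything inside them) first
for tag in soup(['style', 'script']):
    tag.decompose()
clean_up = [i.strip() for i in soup.find_all(text=True) if i.strip()]
for texts in clean_up:
    print(texts)
Now the hidden <div> inside <style> no longer shows up in the output.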