Python Forum

Full Version: [split] How to find a specific word in a webpage and How to count it.
Hey, I'm new here

Can someone please explain this line:
words = soup.find(text=lambda text: text and the_word in text)
I don't understand what is happening in the lambda (I know what a lambda is).

Thanks in advance
Which version of BS did you use?
The string argument is the new name for the text argument used in earlier versions of BS.
Since v4.4.0, text has been renamed to string.

soup.find(text = func)
If the string (or text) argument is a function, it should return True or False (from the docs).
This function is applied to each text fragment within tags; if it returns True, that fragment is returned. find_all returns all such occurrences, while find stops at the first one.
The condition text=lambda text: text and the_word in text is simple: it searches for a non-empty string that contains the_word. It could be rewritten as, e.g., text = lambda x: x and the_word in x. You could try omitting the first condition (i.e. removing x and), but that can cause an error if x is, e.g., None: the_word in None raises TypeError. The extra condition (x and) guards against that.
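To see what the guard does without involving a parser, here is a minimal sketch in plain Python. The list fragments is made-up sample data standing in for values a filter function might receive, and contains_word is just a named version of the lambda. The next(...) and list-comprehension lines only model the "first match" versus "all matches" idea; none of this is Beautiful Soup's own code.

```python
the_word = 'python'

# Hypothetical sample values standing in for what a filter function might be given.
fragments = ['Hello from python', '', None, 'not present in this text']

def contains_word(text):
    # Same expression as the lambda: "and" short-circuits, so the
    # "in" test only runs when text is truthy (not None, not '').
    return text and the_word in text

results = [contains_word(f) for f in fragments]
print(results)  # [True, '', None, False]

# Model of "find": stop at the first fragment the predicate accepts.
first = next((f for f in fragments if contains_word(f)), None)
# Model of "find_all": keep every fragment the predicate accepts.
every = [f for f in fragments if contains_word(f)]

# Without the guard, a None value makes the "in" test raise TypeError.
try:
    the_word in None
    unguarded_error = None
except TypeError as exc:
    unguarded_error = exc
print(type(unguarded_error).__name__)  # TypeError
```

Note that for '' and None the expression returns the value itself rather than False. Both are falsy, so a truthiness check treats them the same as False.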
(Mar-11-2019, 11:53 PM)scidam Wrote: Since v.4.4.0 text renamed to string
Yes, I agree that the docs say that, but I don't think the rename was fully carried through.
In the docstring for 4.7.1 it still says text, but both text and string work.
>>> bs4.__version__
'4.7.1'

>>> help(soup.find)
Help on method find in module bs4.element:

find(name=None, attrs={}, recursive=True, text=None, **kwargs) method of bs4.BeautifulSoup instance
    Return only the first child of this Tag matching the given
    criteria.
Good explanation about the None stuff.

I can show an example of both, and I would also throw in a regex to make it a search over all text.
This is not the most typical use of a parser; usually you want more specific content rather than a full-text search of a web page.
from bs4 import BeautifulSoup
import re

html = '''\
<p>Hello world and and python</p>
<td>python is a good language</td>
<td>not present in this text</td>
<div>Hello from python</div>'''

soup = BeautifulSoup(html, 'lxml')
the_word = 'python'
# re.compile(".*") matches any tag name, so every tag is tested against the text filter
tags_found = soup.find_all(re.compile(".*"), text=lambda text: text and the_word in text)
print(tags_found)
print('-' * 15)
print([s.text for s in tags_found])
Output:
[<p>Hello world and and python</p>, <td>python is a good language</td>, <div>Hello from python</div>]
---------------
['Hello world and and python', 'python is a good language', 'Hello from python']

Without the lambda (an anonymous function, i.e. one with no name), now as a normal named function.
from bs4 import BeautifulSoup
import re

html = '''\
<p>Hello world and and python</p>
<td>python is a good language</td>
<td>not present in this text</td>
<div>Hello from python</div>'''

def contains_word(text):
    # text may be None (e.g. a tag with no single string), so check it before using "in"
    return text and the_word in text

soup = BeautifulSoup(html, 'lxml')
the_word = 'python'
tags_found = soup.find_all(re.compile(".*"), text=contains_word)
print(tags_found)
print('-' * 15)
print([s.text for s in tags_found])
Output:
[<p>Hello world and and python</p>, <td>python is a good language</td>, <div>Hello from python</div>]
---------------
['Hello world and and python', 'python is a good language', 'Hello from python']
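The thread title also asks how to count the word. Here is a minimal sketch, assuming the matched strings have already been collected as in the output above. The list texts is copied from that output, and count_word is a hypothetical helper, not part of Beautiful Soup. It uses a regex with word boundaries so that, for example, 'pythonic' would not be counted as 'python'.

```python
import re

# Strings copied from the output above.
texts = ['Hello world and and python',
         'python is a good language',
         'Hello from python']

def count_word(strings, word):
    # \b restricts matches to whole words; re.escape protects special characters.
    pattern = re.compile(r'\b' + re.escape(word) + r'\b')
    return sum(len(pattern.findall(s)) for s in strings)

print(count_word(texts, 'python'))  # 3
print(count_word(texts, 'and'))     # 2
```

This counts case-sensitively. Passing re.IGNORECASE to re.compile would be one way to also count 'Python'.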