[split] How to find a specific word in a webpage and How to count it.

[split] How to find a specific word in a webpage and How to count it. - marpop - Mar-11-2019

Hey, I'm new here

Can someone please explain this line:

words = soup.find(text=lambda text: text and the_word in text)

I don't understand what is happening inside the lambda (I know what a lambda is).

Thanks in advance

RE: [split] How to find a specific word in a webpage and How to count it. - scidam - Mar-11-2019

Which version of BS did you use?
The string argument is the new name for the text argument that was used in previous versions of BS.
Since v4.4.0, text has been renamed to string.

soup.find(text = func)

If the string (or text) argument is a function, it should return True or False (from the docs).
The function is applied to each text fragment within the tags; if it returns True, that fragment is returned. find_all searches for all such occurrences, while find stops at the first one.
The condition text=lambda text: text and the_word in text is simple: it searches for a non-empty string that contains the_word. It could be rewritten as, e.g., text=lambda x: x and the_word in x. You could try omitting the first condition, i.e. removing x and, but that could raise an error if x is, e.g., None, because the_word in None leads to a TypeError. The x and guard prevents this: and short-circuits, so the_word in x is never evaluated when x is None or empty.
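
To make the short-circuit behaviour concrete, here is a minimal sketch in plain Python with no BeautifulSoup involved. The names guarded and unguarded are made up for this illustration; only the guarded form appears in the thread.

```python
the_word = "python"

# Same predicate as in the thread, with the `x and` guard in place.
guarded = lambda x: x and the_word in x

# Hypothetical variant without the guard, shown only for comparison.
unguarded = lambda x: the_word in x

print(guarded("Hello from python"))  # True
print(guarded("no match here"))      # False
print(guarded(None))                 # None, which is falsy, so no match

try:
    unguarded(None)
except TypeError as exc:
    # `in` cannot be applied to None, so the unguarded version fails here.
    print("TypeError:", exc)
```

Note that the guarded version returns None (not False) for a None input; both are falsy, so the fragment is treated as a non-match either way.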

RE: [split] How to find a specific word in a webpage and How to count it. - snippsat - Mar-12-2019

(Mar-11-2019, 11:53 PM)scidam Wrote: Since v.4.4.0 text renamed to string

Yes, I agree that the docs say that, but I don't think they have actually done it.
In the docstring for 4.7.1 it still says text, but both text and string work.

>>> bs4.__version__
'4.7.1'

>>> help(soup.find)
Help on method find in module bs4.element:

find(name=None, attrs={}, recursive=True, text=None, **kwargs) method of bs4.BeautifulSoup instance
    Return only the first child of this Tag matching the given
    criteria.

Good explanation of the None stuff.

I can show an example of both, and I would also throw in a regex so that it becomes an all-text search.
This is not the most typical usage of a parser; you usually want more specific content rather than a full-text search of a web page.

from bs4 import BeautifulSoup
import re

html = '''\
<p>Hello world and and python</p>
<td>python is a good language</td>
<td>not present in this text</td>
<div>Hello from python</div>'''

soup = BeautifulSoup(html, 'lxml')
the_word = 'python'
# The regex ".*" matches every tag name; the lambda then filters on each tag's text.
tags_found = soup.find_all(re.compile(".*"), text=lambda text: text and the_word in text)
print(tags_found)
print('-' * 15)
print([s.text for s in tags_found])

Output:
[<p>Hello world and and python</p>, <td>python is a good language</td>, <div>Hello from python</div>]
---------------
['Hello world and and python', 'python is a good language', 'Hello from python']

Without the lambda (an anonymous function, i.e. one with no name); now with a normal, named function.

from bs4 import BeautifulSoup
import re

html = '''\
<p>Hello world and and python</p>
<td>python is a good language</td>
<td>not present in this text</td>
<div>Hello from python</div>'''

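the_word = 'python'

def contains_word(text):
    # Guard against None/empty text before the membership test.
    return text and the_word in text

soup = BeautifulSoup(html, 'lxml')
# The regex ".*" matches every tag name; contains_word then filters on each tag's text.
tags_found = soup.find_all(re.compile(".*"), text=contains_word)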
print(tags_found)
print('-' * 15)
print([s.text for s in tags_found])

Output:
[<p>Hello world and and python</p>, <td>python is a good language</td>, <div>Hello from python</div>]
---------------
['Hello world and and python', 'python is a good language', 'Hello from python']
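
The thread title also asks how to count the word, which the examples above stop short of. One possible approach, sketched here under some assumptions: count_word is a made-up helper name, the match is case-insensitive and whole-word only (my choice, not something from the thread), and the built-in 'html.parser' is used instead of 'lxml' so that nothing beyond bs4 needs to be installed.

```python
import re
from bs4 import BeautifulSoup

html = '''\
<p>Hello world and and python</p>
<td>python is a good language</td>
<td>not present in this text</td>
<div>Hello from python</div>'''

def count_word(markup, word):
    """Hypothetical helper: count whole-word occurrences of `word` in the page text."""
    soup = BeautifulSoup(markup, 'html.parser')
    # Join all text fragments with a space so words in adjacent tags do not run together.
    page_text = soup.get_text(' ')
    # re.escape keeps any special characters in `word` literal;
    # \b restricts matches to whole words (an assumption of this sketch).
    pattern = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
    return len(pattern.findall(page_text))

print(count_word(html, 'python'))  # 3
print(count_word(html, 'and'))     # 2
```

If substring matches should count as well (e.g. 'python' inside 'pythonic'), dropping the \b anchors or using page_text.count(word) would be the simpler route.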