Python Forum

SoupStrainer: example
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
from bs4 import SoupStrainer
def is_short_string(string):
    return len(string) < 10
only_short_strings = SoupStrainer(string=is_short_string)
print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
Error:
Traceback (most recent call last):
  File "C:\Python36\kodovi\sstrainer.py", line 19, in <module>
    print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
  File "C:\Python36\lib\site-packages\bs4\__init__.py", line 228, in __init__
    self._feed()
  File "C:\Python36\lib\site-packages\bs4\__init__.py", line 289, in _feed
    self.builder.feed(self.markup)
  File "C:\Python36\lib\site-packages\bs4\builder\_htmlparser.py", line 215, in feed
    parser.feed(markup)
  File "C:\Python36\lib\html\parser.py", line 111, in feed
    self.goahead(0)
  File "C:\Python36\lib\html\parser.py", line 171, in goahead
    k = self.parse_starttag(i)
  File "C:\Python36\lib\html\parser.py", line 345, in parse_starttag
    self.handle_starttag(tag, attrs)
  File "C:\Python36\lib\site-packages\bs4\builder\_htmlparser.py", line 90, in handle_starttag
    tag = self.soup.handle_starttag(name, None, None, attr_dict)
  File "C:\Python36\lib\site-packages\bs4\__init__.py", line 461, in handle_starttag
    or not self.parse_only.search_tag(name, attrs))):
  File "C:\Python36\lib\site-packages\bs4\element.py", line 1676, in search_tag
    if not self._matches(attr_value, match_against):
  File "C:\Python36\lib\site-packages\bs4\element.py", line 1736, in _matches
    return match_against(markup)
  File "C:\Python36\kodovi\sstrainer.py", line 17, in is_short_string
    return len(string) < 10
TypeError: object of type 'NoneType' has no len()
I assume the problem here is that the computer doesn't know what the argument string is, but I'm not sure how to solve it.
No, it's telling you that you can't calculate the length of None: the function is being called with string set to None.
You can modify it like this:
def is_short_string(string):
    if string:
        return len(string) < 10
The output is literally nothing.

By the way, it is interesting that the code in my message is taken from the BeautifulSoup docs. I'm surprised this mistake has gone unnoticed.

and with this print code
print(soup.find_all(only_short_strings))
it gives
Output:
[]
well, nothing in, nothing out!

try this:
def is_short_string(string):
    print('string{}'.format(string))
    if string:
        return len(string) < 10
    else:
        return False
By the way, you should use another name. At some point you're going to run into an error, since string is also the name of a standard library module
(e.g. 'import string').
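A minimal sketch that combines both suggestions above. The parameter name `text` and the `limit` argument are only illustrative choices, not taken from the BeautifulSoup docs. The function always returns an explicit boolean, so a None argument no longer raises:

```python
def is_short_string(text, limit=10):
    """Return True only for real strings shorter than `limit` characters."""
    # The traceback shows the filter can be called with None, so guard first.
    if text is None:
        return False
    return len(text) < limit


print(is_short_string(None))                # False
print(is_short_string("Elsie"))             # True
print(is_short_string("Once upon a time"))  # False
```

Note that this version treats an empty string as short, whereas the `if string:` check rejects it; which behaviour is wanted depends on the use case.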
I see your point: string is None (although I prefer f-strings, lol).
Now I'll have to think about how to pass html_doc to this function.
Without using function it's simple:
only_a_tags = SoupStrainer("a")
print(BeautifulSoup(html_doc, "html.parser", parse_only = only_a_tags).prettify())
The SoupStrainer example with is_short_string on their website is wrong.
I have only tested SoupStrainer a couple of times, so how useful it is can be questionable.

I can write a solution that doesn't use SoupStrainer.
It can return both the short sentences (what SoupStrainer was supposed to give back) and the words filtered by length.
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

def by_size(words, size):
    return [word for word in words if len(word) < size]

soup = BeautifulSoup(html_doc, 'html.parser')
#words = soup.text.split()
sentence = soup.text.split('\n')
print(by_size(sentence, 10))
Output:
['', '', 'Elsie,', 'Lacie and', 'Tillie;', '...', '']
words = soup.text.split()
print(by_size(words, 4))
Output:
['The', 'The', 'a', 'and', 'and', 'and', 'at', 'the', 'of', 'a', '...']
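If the empty strings in the first output are unwanted, one possible tweak is to add a truthiness check to the comprehension. This is only a sketch: `by_size_nonempty` is a hypothetical name, and `sample_text` is a small stand-in for `soup.text` so the snippet runs without bs4.

```python
def by_size_nonempty(words, size):
    # `if word` drops '' entries before the length test is applied.
    return [word for word in words if word and len(word) < size]


# Stand-in for soup.text (an assumption, not the real parsed document).
sample_text = "\nElsie,\nLacie and\nTillie;\n\n...\n"
lines = sample_text.split("\n")
print(by_size_nonempty(lines, 10))
# ['Elsie,', 'Lacie and', 'Tillie;', '...']
```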
A very BeautifulSoup. Cool

By the way, line 18 looks very pythonic. Is there any topic/page that you know that explains this "trick" more thoroughly?
(Sep-25-2018, 11:29 PM)Truman Wrote: Is there any topic/page that you know that explains this "trick" more thoroughly?
there is a detailed web scraping tutorial on our forum by snippsat
https://python-forum.io/Thread-Web-Scraping-part-1
Or were you asking about the list comprehension?
Line 18 is a list comprehension; there are many videos and tutorials on that.
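As a rough illustration of what the comprehension on line 18 does, here it is next to an equivalent explicit loop. The word list is made-up sample data, not output from the thread.

```python
words = ["The", "Dormouse's", "story", "a", "well"]  # sample data

# List comprehension form, as used in by_size():
short = [word for word in words if len(word) < 4]

# Equivalent explicit loop:
short_loop = []
for word in words:
    if len(word) < 4:
        short_loop.append(word)

print(short)                # ['The', 'a']
print(short == short_loop)  # True
```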
I recommend one of the best tutorials, by David Beazley (I can't swear to it, but I think iteration constructs such as list comprehensions are covered): http://www.dabeaz.com/generators/index.html

This video I think covers it as well: https://www.youtube.com/watch?v=D1twn9kLmYg
(If not you will get a ton of other goodies)