Python Forum

Pages: 1 2

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
from bs4 import SoupStrainer
def is_short_string(string):
    return len(string) < 10
only_short_strings = SoupStrainer(string=is_short_string)
print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())

Error:
Traceback (most recent call last):
  File "C:\Python36\kodovi\sstrainer.py", line 19, in <module>
    print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
  File "C:\Python36\lib\site-packages\bs4\__init__.py", line 228, in __init__
    self._feed()
  File "C:\Python36\lib\site-packages\bs4\__init__.py", line 289, in _feed
    self.builder.feed(self.markup)
  File "C:\Python36\lib\site-packages\bs4\builder\_htmlparser.py", line 215, in feed
    parser.feed(markup)
  File "C:\Python36\lib\html\parser.py", line 111, in feed
    self.goahead(0)
  File "C:\Python36\lib\html\parser.py", line 171, in goahead
    k = self.parse_starttag(i)
  File "C:\Python36\lib\html\parser.py", line 345, in parse_starttag
    self.handle_starttag(tag, attrs)
  File "C:\Python36\lib\site-packages\bs4\builder\_htmlparser.py", line 90, in handle_starttag
    tag = self.soup.handle_starttag(name, None, None, attr_dict)
  File "C:\Python36\lib\site-packages\bs4\__init__.py", line 461, in handle_starttag
    or not self.parse_only.search_tag(name, attrs))):
  File "C:\Python36\lib\site-packages\bs4\element.py", line 1676, in search_tag
    if not self._matches(attr_value, match_against):
  File "C:\Python36\lib\site-packages\bs4\element.py", line 1736, in _matches
    return match_against(markup)
  File "C:\Python36\kodovi\sstrainer.py", line 17, in is_short_string
    return len(string) < 10
TypeError: object of type 'NoneType' has no len()

I assume the problem is that the computer doesn't know what the argument string is, but I'm not sure how to solve it.

No, it's telling you that you can't take the length of None: your function is being called with None instead of a string.
You can modify it:

def is_short_string(string):
    if string:
        return len(string) < 10
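
For what it's worth, here is a minimal sketch of a None-safe version of that check, runnable without BeautifulSoup. It assumes, as the traceback suggests, that the filter is sometimes called with None instead of text. The parameter name text and the sample values are made up for illustration.

```python
def is_short_string(text):
    """Return True only for real strings shorter than 10 characters."""
    # None (or any other non-string) is rejected first, so len() never sees it.
    return isinstance(text, str) and len(text) < 10


# Hypothetical inputs: None stands in for the value that caused the TypeError.
samples = [None, "Elsie", "The Dormouse's story", ""]
results = [is_short_string(s) for s in samples]
print(results)
```

Unlike the version above, this one always returns True or False rather than falling through to an implicit None.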

The output is literally nothing.

By the way, the code in my post is taken straight from the BeautifulSoup docs. It surprises me that this mistake has gone unnoticed.

And with this print call

print(soup.find_all(only_short_strings))

it gives

Output:
[]

well, nothing in, nothing out!

try this:

def is_short_string(string):
    print('string{}'.format(string))
    if string:
        return len(string) < 10
    else:
        return False

By the way, you should use another name. At some point you're going to run into an error, since string is also the name of a standard library module,
e.g. 'import string'
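
A small sketch of what that name clash can look like in practice. The two function names and the sample text are made up for the example; the only real thing here is the standard library module string.

```python
import string  # the standard library module


def count_letters_shadowed(string):
    # Inside this function, `string` is the argument (a str), not the module,
    # so the module attribute used below cannot be reached.
    try:
        return sum(ch in string.ascii_letters for ch in string)
    except AttributeError as exc:
        return "error: {}".format(exc)


def count_letters(text):
    # With a different parameter name the module stays visible.
    return sum(ch in string.ascii_letters for ch in text)


shadowed = count_letters_shadowed("Elsie,")
fixed = count_letters("Elsie,")
print(shadowed)
print(fixed)
```

The first call ends up in the except branch because a str has no ascii_letters attribute; the second one counts the letters as intended.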

I see your point - string None (although I prefer to use f-string lol)
Now I'll have to think how to add html_doc to this function.
Without a filter function it's simple:

only_a_tags = SoupStrainer("a")
print(BeautifulSoup(html_doc, "html.parser", parse_only = only_a_tags).prettify())

The SoupStrainer example with is_short_string is wrong on their website.
I have only tested SoupStrainer a couple of times, so how useful it is can be questionable.

You can write a solution that doesn't use SoupStrainer.
It can filter both the sentences (what SoupStrainer gives back) and all the individual words by length.

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

def by_size(words, size):
    return [word for word in words if len(word) < size]

soup = BeautifulSoup(html_doc, 'html.parser')
# words = soup.text.split()
sentence = soup.text.split('\n')
print(by_size(sentence, 10))

Output:
['', '', 'Elsie,', 'Lacie and', 'Tillie;', '...', '']

words = soup.text.split()
print(by_size(words, 4))

Output:
['The', 'The', 'a', 'and', 'and', 'and', 'at', 'the', 'of', 'a', '...']

A very BeautifulSoup. Cool

By the way, line 18 looks very pythonic. Is there any topic/page that you know of that explains this "trick" more thoroughly?

(Sep-25-2018, 11:29 PM)Truman Wrote: Is there any topic/page that you know of that explains this "trick" more thoroughly?

there is a detailed web scraping tutorial on our forum by snippsat
https://python-forum.io/Thread-Web-Scraping-part-1

Or were you asking about the list comprehension?

Line 18 is a list comprehension; there are many videos and tutorials on that.
I recommend one of the best tutorials, by David Beazley, here (I can't swear to it, but I think list comprehensions are covered alongside generators and iterators): http://www.dabeaz.com/generators/index.html

This video I think covers it as well: https://www.youtube.com/watch?v=D1twn9kLmYg
(If not you will get a ton of other goodies)
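
In case it helps, here is a short sketch of what the list comprehension on line 18 does, written next to an equivalent for loop. by_size is the function from the post above; by_size_loop and the sample words are only there for illustration.

```python
def by_size(words, size):
    # List comprehension: build the filtered list in a single expression.
    return [word for word in words if len(word) < size]


def by_size_loop(words, size):
    # The same logic spelled out as an explicit loop.
    result = []
    for word in words:
        if len(word) < size:
            result.append(word)
    return result


sample = ["Once", "upon", "a", "time", "there", "were", "three", "little", "sisters"]
short_words = by_size(sample, 5)
print(short_words)
print(by_size_loop(sample, 5) == short_words)
```

Both functions keep only the words shorter than the given size; the comprehension is just the more compact way to write the loop.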

Truman

Larz60+

Truman

Larz60+

Truman

snippsat

Truman

metulburr

ichabod801

Larz60+