Extracting Elements From A Website List

knight2000 · Jul-19-2021, 08:08 AM

Hi all,

I'm practicing some web scraping and have come up against an obstacle that has me stuck.

I'm trying to write code that goes through a few different pages of the same website and extracts certain text from a list that is visible on each page. The challenge is that the list can have a different number of elements on each page, so I'm not sure how to handle the case where an element is not available.

Let me demo this to make it clearer...

Example html of one page in the website:

<ul class="nb-type-md nb-list-undecorated undefined">
	<li class=""><span><div class="nb-icon-small nb-inline-block nb-text-gray-200 nb-mr-2xs nb-align-middle"></div>Blue</span></li>
	<li class=""><span><div class="nb-icon-small nb-inline-block nb-text-gray-200 nb-mr-2xs nb-align-middle"></div>Designed in China</span></li>
	<li class=""><span><div class="nb-icon-small nb-inline-block nb-text-gray-200 nb-mr-2xs nb-align-middle"></div>http://www.mysupersite.com</span></li>
</ul>

And here's an example of another page in the same website:

<ul class="nb-type-md nb-list-undecorated undefined">
	<li class=""><span><div class="nb-icon-small nb-inline-block nb-text-gray-200 nb-mr-2xs nb-align-middle"></div>Green</span></li>
	<li class=""><span><div class="nb-icon-small nb-inline-block nb-text-gray-200 nb-mr-2xs nb-align-middle"></div>Designed in England</span></li>
	<li class=""><span><div class="nb-icon-small nb-inline-block nb-text-gray-200 nb-mr-2xs nb-align-middle"></div>Shadow Chrome Painted</span></li>
	<li class=""><span><div class="nb-icon-small nb-inline-block nb-text-gray-200 nb-mr-2xs nb-align-middle"></div>http://www.mydifferentsite.com</span></li>
</ul>

As you can see, the first page has 3 items, whilst the second page has 4.

So if for example I'm trying to extract the url from these two pages (ie- http://www.mysupersite.com and http://www.mydifferentsite.com), how would I go about doing that?

My latest trial:

    for wa in lists.find_all('li'):
        if wa[3] is KeyError:
            wa[2]
        else:
            wa[3]

I get:

Error:
Traceback (most recent call last):
  File "C:/Users/testscrape.py", line 28, in <module>
    if wa[3] is KeyError:
  File "C:\Users\lib\site-packages\bs4\element.py", line 1406, in __getitem__
    return self.attrs[key]
KeyError: 3

I thought an IF statement would work here, something like: if wa[3] doesn't exist, use wa[2], else use wa[3]. But I don't know how to translate that into code.

Could someone please enlighten me on how to handle these sorts of optional indexes?
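
The "if it doesn't exist, fall back" idea above can be sketched as follows. This is only an illustration on hypothetical, trimmed-down markup (the `html` string and the names `items` and `url_item` are not from the original script). The traceback shows that indexing a single tag goes through its attributes (`self.attrs[key]`), which is where `KeyError: 3` comes from. The list returned by find_all() is what supports positional indexing, and it raises IndexError when the position is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: a three-item list like the first page above.
html = """
<ul class="nb-type-md">
  <li><span>Blue</span></li>
  <li><span>Designed in China</span></li>
  <li><span>http://www.mysupersite.com</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
items = soup.find("ul").find_all("li")  # list-like ResultSet of <li> tags

# Indexing a single Tag looks up an HTML attribute, hence "KeyError: 3".
# Indexing the ResultSet is positional and raises IndexError when out of range.
try:
    url_item = items[3]
except IndexError:
    url_item = items[2]

print(url_item.text)
```

With only three items on the page, the fallback branch runs and the third item is used.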

Thanks a lot.

***snippsat*** · Jul-19-2021, 05:30 PM

(Jul-19-2021, 08:08 AM)knight2000 Wrote: So if for example I'm trying to extract the url from these two pages (ie- http://www.mysupersite.com and http://www.mydifferentsite.com), how would I go about doing that?

You can target the item exactly with a CSS selector, or use find_all() with an index.
Here is an example.

from bs4 import BeautifulSoup

html = '''\
<ul class="nb-type-md nb-list-undecorated undefined">
    <li class=""><span><div class="nb-icon-small nb-inline-block nb-text-gray-200 nb-mr-2xs nb-align-middle"></div>Green</span></li>
    <li class=""><span><div class="nb-icon-small nb-inline-block nb-text-gray-200 nb-mr-2xs nb-align-middle"></div>Designed in England</span></li>
    <li class=""><span><div class="nb-icon-small nb-inline-block nb-text-gray-200 nb-mr-2xs nb-align-middle"></div>Shadow Chrome Painted</span></li>
    <li class=""><span><div class="nb-icon-small nb-inline-block nb-text-gray-200 nb-mr-2xs nb-align-middle"></div>http://www.mydifferentsite.com</span></li>
</ul>'''

soup = BeautifulSoup(html, 'lxml')
tag = soup.select_one('ul > li:nth-child(4)')
print(tag.text)

Output:
http://www.mydifferentsite.com

With find_all():

>>> tag_ul = soup.find('ul', class_="nb-type-md")
>>> tag_li = tag_ul.find_all('li')
>>> tag_li[3].text
'http://www.mydifferentsite.com'

# Loop
>>> for item in tag_li:
...     print(item.text)
...     
Green
Designed in England
Shadow Chrome Painted
http://www.mydifferentsite.com
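
Both approaches above use a fixed position, which only fits the four-item page. If the URL is always the last item (an assumption based on the two sample pages, not something the site guarantees), a minimal sketch that works for either length could use the `:last-child` selector or a negative index. The `page_a` and `page_b` strings are hypothetical, shortened versions of the markup from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical trimmed-down versions of the two pages from the question.
page_a = ("<ul class='nb-type-md'><li>Blue</li><li>Designed in China</li>"
          "<li>http://www.mysupersite.com</li></ul>")
page_b = ("<ul class='nb-type-md'><li>Green</li><li>Designed in England</li>"
          "<li>Shadow Chrome Painted</li><li>http://www.mydifferentsite.com</li></ul>")

results = []
for html in (page_a, page_b):
    soup = BeautifulSoup(html, "html.parser")
    # CSS: the last <li> child of the <ul>, however many there are
    by_css = soup.select_one("ul.nb-type-md > li:last-child")
    # Plain Python: a negative index counts from the end of the list
    by_index = soup.find("ul", class_="nb-type-md").find_all("li")[-1]
    results.append((by_css.text, by_index.text))

for css_text, index_text in results:
    print(css_text, index_text)
```

Either form avoids checking the length first, at the cost of relying on the URL staying in the last position.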

Look at Web-Scraping part-1.

knight2000

Hi snippsat,

Thanks a lot for your explanation and direction- appreciate you taking time out to help me understand more.

Since posting, I kept trying different things and eventually came across a way to extract the url text where on one webpage it was the third item of the list and on another webpage it was the fourth item (as per my example).

I'm not sure whether this is a good or bad way of doing things, but it works and may help someone :)

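from bs4 import BeautifulSoup

# r is the response object for the page being scraped
soup = BeautifulSoup(r.text, 'html.parser')
# find() returns one tag; the ResultSet from find_all() has no find_all() of its own
lists = soup.find('ul', {'class': 'nb-type-md nb-list-undecorated undefined'})
li = lists.find_all('li')
number_elements = len(li)

# Index the list of <li> tags directly instead of looping over one item
if number_elements == 3:
    webaddress = li[2].text
elif number_elements == 4:
    webaddress = li[3].text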
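
Branching on the number of elements breaks as soon as a page has two or five items. One alternative, sketched here under the assumption that the address always starts with `http://`, `https://` or `www.` (which the two sample pages suggest but don't guarantee), is to pick the item by its content rather than its position. The helper name `pick_url` is hypothetical.

```python
def pick_url(texts):
    """Return the first entry that looks like a web address, or None."""
    for text in texts:
        cleaned = text.strip()
        if cleaned.startswith(("http://", "https://", "www.")):
            return cleaned
    return None


# In the scraper the texts would come from the list items, e.g.
#   texts = [item.text for item in lists.find_all('li')]
page_one = ["Blue", "Designed in China", "http://www.mysupersite.com"]
page_two = ["Green", "Designed in England", "Shadow Chrome Painted",
            "http://www.mydifferentsite.com"]

print(pick_url(page_one))
print(pick_url(page_two))
print(pick_url(["Red", "Designed in Japan"]))  # no address on the page
```

Returning None when nothing matches lets the calling code decide what to do with pages that have no address at all.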

Thank you for the link to learn more of the basics too- I'll try and go through as much of that as I can on the weekend. Have a good week mate.
