Python Forum

Hi there,

I'm a beginner trying to get some practice, with the little I know, by scraping a website.

To try and make my challenge clear, I'll start off by showing you all the data and then I'll explain what I'm trying to extract from it.

Output:
[<i class="icon-v5 cross"></i>
</div>, <div class="col col-1 seller-interview">
<i class="icon-v5 cross"></i>
</div>, <div class="col col-1 seller-interview">
<i class="icon-v5 check-blue"></i>
</div>, <div class="col col-1 seller-interview">
<i class="icon-v5 cross"></i>
</div>, <div class="col col-1 seller-interview">
<i class="icon-v5 check-blue"></i>
</div>, <div class="col col-1 seller-interview">
<i class="icon-v5 check-blue"></i>

I've extracted this data using the following code:

import requests
from bs4 import BeautifulSoup

url = '[my practice url]'

page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

# Get every div that holds a Seller Interview icon
seller_interview = soup.find_all("div", {"class": "col col-1 seller-interview"})
print(seller_interview)

My goal is to extract the text between the quotes after "i class=", that is, the value of the class attribute on each i tag. In other words, the text returned should be either icon-v5 cross or icon-v5 check-blue for each item.

For extracting the text inside links or spans, I've been able to use something like object.a.text or object.span.text, but as this value sits inside the 'i class' itself rather than between tags, I'm not sure how to extract it.

I've read up on the re (regex) module, and I tried something like:

matches = re.findall(r"icon-v5 cross", seller_interview)

but that gives me:

Error:
TypeError: expected string or bytes-like object

Could someone please show me how to extract the text inside the "i class" on each line?
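A note on the TypeError above: re.findall expects a single string (or bytes), while find_all returns a list-like ResultSet of tags, so the call fails before any matching happens. The sketch below reproduces this with a plain list of strings standing in for the ResultSet (the snippets variable and its contents are made up for illustration), and shows that converting to a string first makes the call run. It only explains the error and is not a recommendation to parse HTML this way.

```python
import re

# Hypothetical stand-in for the ResultSet: a plain list of HTML snippets.
snippets = [
    '<div class="col col-1 seller-interview"><i class="icon-v5 cross"></i></div>',
    '<div class="col col-1 seller-interview"><i class="icon-v5 check-blue"></i></div>',
]

# Passing the list itself raises TypeError, because re.findall needs a string.
try:
    re.findall(r"icon-v5 cross", snippets)
    raised = False
except TypeError:
    raised = True

# Converting the list to one string first lets the regex run.
matches = re.findall(r"icon-v5 [\w-]+", str(snippets))

print(raised)
print(matches)
```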

Anybody who is trying to combine HTML and re should, for starters, read the famous and opinionated "You can't parse [X]HTML with regex".

(May-26-2021, 05:39 AM) perfringo wrote: Anybody who is trying to put together HTML and re should for starters read famous and opinionated You can't parse [X]HTML with regex

Thanks perfringo.

Funny read! After reading that, it appears that using re with HTML isn't the way to do it. Is there another function, or what else would you suggest?

Not tested, but something along those lines could work:

for record in seller_interview:
    for i in record.find_all('i'):
        print(i.get('class'))

(May-26-2021, 07:49 AM) perfringo wrote: Not tested, but something along those lines could work:
for record in seller_interview:
    for i in record.find_all('i'):
        print(i.get('class'))

Thank you, perfringo. A loop nested within a loop. Yep, I wouldn't have thought about that!

I tried what you gave me and it almost worked. An example of the output was:

Output:
['icon-v5', 'cross']

So I grabbed the second string with:

for record in seller_interview:
    for i in record.find_all("i"):
        # i.get("class") returns a list such as ['icon-v5', 'cross']
        s_interview = i.get("class")
        result = s_interview[1]
        print(result)

I'm sure it's not the cleanest and could probably be done better, but it worked. :)

Thank you for your help and direction with this. Much appreciated.
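For anyone wanting a more compact version of the approach that worked above, here is a minimal sketch. It assumes the page structure matches the output shown at the top of the thread; the inline html string is made-up sample data standing in for the practice page, and the names icons, full_classes and statuses are only illustrative. It uses a CSS selector to reach the i tags directly and then reads the class list from each one.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the practice page.
html = """
<div class="col col-1 seller-interview"><i class="icon-v5 cross"></i></div>
<div class="col col-1 seller-interview"><i class="icon-v5 check-blue"></i></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select every <i> tag inside a div that carries the seller-interview class.
icons = soup.select("div.seller-interview i")

# Each i["class"] is a list such as ['icon-v5', 'cross'].
full_classes = [" ".join(i["class"]) for i in icons]  # whole attribute value
statuses = [i["class"][-1] for i in icons]            # last class only

print(full_classes)
print(statuses)
```

Taking the last class with [-1] instead of index 1 is a small design choice: it still works if the icon tag ever carries more or fewer leading classes, though that is a guess about how the site might vary.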

knight2000

perfringo
