Python Forum
Extracting the text between each "i class"
#1
Hi there,

I'm a beginner trying to get some practice with the little I know by scraping a website.

To try and make my challenge clear, I'll start off by showing you all the data and then I'll explain what I'm trying to extract from it.

Output:
[<i class="icon-v5 cross"></i> </div>, <div class="col col-1 seller-interview"> <i class="icon-v5 cross"></i> </div>, <div class="col col-1 seller-interview"> <i class="icon-v5 check-blue"></i> </div>, <div class="col col-1 seller-interview"> <i class="icon-v5 cross"></i> </div>, <div class="col col-1 seller-interview"> <i class="icon-v5 check-blue"></i> </div>, <div class="col col-1 seller-interview"> <i class="icon-v5 check-blue"></i>
I've extracted this data using the following code:

import requests
from bs4 import BeautifulSoup

url = '[my practice url]'

# Download the page and parse it
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

# Get the Seller Interview tags
seller_interview = soup.find_all("div", {"class": "col col-1 seller-interview"})
print(seller_interview)
My goal is to extract the value of the class attribute of each <i> tag, i.e. the text between the quotes after "i class=". In other words, the text returned should be either icon-v5 cross or icon-v5 check-blue for each item.

For extracting the text inside links or spans, I've been able to use something like object.a.text or object.span.text, but as this value sits inside the 'i class' attribute rather than between tags, I'm not sure how to extract it.

I've read up on the re (regex) module, and I tried something like:

matches = re.findall(r"icon-v5 cross", seller_interview)
but that gives me:

Error:
TypeError: expected string or bytes-like object
Could someone please show me how to extract the text that's inside the "i class" on each line?
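Editor's note: the TypeError above most likely comes from passing a list-like object to re.findall, which only accepts a string or bytes. The sketch below reproduces the situation with a plain list of HTML fragments standing in for the BeautifulSoup result (the sample data and the names fragments, raised and matches are assumptions for illustration, not from the original post).

```python
import re

# Hypothetical stand-in for the scraped result: a list, not a string
fragments = [
    '<div class="col col-1 seller-interview"> <i class="icon-v5 cross"></i> </div>',
    '<div class="col col-1 seller-interview"> <i class="icon-v5 check-blue"></i> </div>',
]

# re.findall needs a string (or bytes); a list raises TypeError
try:
    re.findall(r"icon-v5 cross", fragments)
    raised = False
except TypeError:
    raised = True

# Converting the list to a string first avoids the error, but this only
# finds literal occurrences and knows nothing about the HTML structure
matches = re.findall(r"icon-v5 cross", str(fragments))

print(raised)
print(matches)
```

Converting to a string makes the call run, but it is a workaround rather than a way to read attributes reliably.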
Reply
#2
Anybody who is trying to put together HTML and re should, for starters, read the famous and opinionated "You can't parse [X]HTML with regex".
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Reply
#3
(May-26-2021, 05:39 AM)perfringo Wrote: Anybody who is trying to put together HTML and re should, for starters, read the famous and opinionated "You can't parse [X]HTML with regex".

Thanks perfringo.

Funny read! After reading that, it looks like using re with HTML isn't the way to do it. Is there another function I could use, or what else would you suggest?
Reply
#4
Not tested, but something along these lines could work:

for record in seller_interview:
    for i in record.find_all('i'):
        print(i.get('class'))
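Editor's note: a minimal, self-contained sketch of the loop above, run against two sample div elements copied from the output in post #1 instead of the real page (the html string and the classes list are assumptions added for illustration). It shows that Tag.get('class') returns a list of class names rather than a single string, because class is a multi-valued attribute in BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the output shown in the question
html = """
<div class="col col-1 seller-interview"> <i class="icon-v5 cross"></i> </div>
<div class="col col-1 seller-interview"> <i class="icon-v5 check-blue"></i> </div>
"""

soup = BeautifulSoup(html, "html.parser")
seller_interview = soup.find_all("div", {"class": "col col-1 seller-interview"})

# Collect the class attribute of every <i> tag inside each matched <div>
classes = []
for record in seller_interview:
    for i in record.find_all("i"):
        classes.append(i.get("class"))

print(classes)
```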
Reply
#5
(May-26-2021, 07:49 AM)perfringo Wrote: Not tested, but something along those lines could work:

for record in seller_interview:
    for i in record.find_all('i'):
        print(i.get('class'))

Thank you perfringo. A nested loop, yep, I wouldn't have thought of that!

I tried what you gave me and it almost worked. An example of the output was:

Output:
['icon-v5', 'cross']
So I grabbed the second string with:

for record in seller_interview:
    for i in record.find_all("i"):
        s_interview = i.get("class")  # e.g. ['icon-v5', 'cross']
        result = s_interview[1]       # second class name, e.g. 'cross'
        print(result)
I'm sure it's not the cleanest and could probably be done better, but it worked.

Thank you for your help and direction with this. Much appreciated.
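Editor's note: one possible tidier variant, offered as a sketch rather than the thread's accepted answer. It uses a CSS selector via select() to reach the <i> tags directly, then joins the class list back into the full string the original question asked for (icon-v5 cross or icon-v5 check-blue). The sample html and the names full_classes and statuses are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the output shown in the question
html = """
<div class="col col-1 seller-interview"> <i class="icon-v5 cross"></i> </div>
<div class="col col-1 seller-interview"> <i class="icon-v5 check-blue"></i> </div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select every <i> inside a div that carries the seller-interview class
icons = soup.select("div.seller-interview i")

# Join the class list into one string; default to [] if the attribute is missing
full_classes = [" ".join(icon.get("class", [])) for icon in icons]

# Keep only the last class name when just the status is wanted
statuses = [name.split()[-1] for name in full_classes if name]

print(full_classes)
print(statuses)
```

Taking the last class name instead of indexing with [1] is a design choice: it still assumes the status is the final class, but it does not fail if a tag has only one class.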
Reply

