Python Forum
Scraping Issue with BS
#1
Greetings,

I am just going to show you an excerpt of code. Assume the function receives the HTML from a call to bs.BeautifulSoup. I hard-coded the URL just for testing purposes. I am trying to grab the company name and phone number for every listing on this page. I know how to paginate; that is not the issue. The element tags look something like this:

Company name: <h3 data-track-omni="XMD: Company Website Link" class="@text-gray-600 @px-2 md:@px-0 @text-lg md:@text-3xl @mb-2 md:@mb-4" data-v-671fc26a="">Doors Over Georgia </h3>

Telephone: <span class="@hidden md:@flex" data-v-671fc26a=""><span class="@font-bold" data-v-671fc26a=""> Call Now: </span> <span data-test="sp-phone-number" class="@ml-1" data-v-671fc26a="">(678) 798-3712</span></span>
1. I have tried using these classes, in different combinations, just to grab the phone number and company name, with no luck.
2. If I succeed at grabbing these two fields, I want them to reside as a dictionary inside a list, to keep the relationship between the fields of each listing. I do not want to just collect all of the company names, then all of the phone numbers, and hope they line up item by item. I have attempted to find the surrounding container collection and grab dictionary items from it.

Could I possibly just grab the fields in each listing as a dictionary, without a container object?


I have read over the entire BS documentation.
Forgive me if I have made this difficult to understand.


import bs4 as bs
import requests

HEADERS = {'user-agent': 'Mozilla/5.0'}  # stand-in; the real headers are defined elsewhere in my script

session = requests.Session()
BASE_URL = 'https://www.homeadvisor.com/c.Garage-Garage-Doors.Atlanta.GA.-12036.html'


def get_html(session, BASE_URL):
    """ Return steamy bowl of soup for BASE_URL page.  Return None if request fails """
    try:
        PARAMS = {'startingIndex': 0}
        response = session.get(BASE_URL, headers=HEADERS, params=PARAMS)

    except requests.exceptions.ConnectionError:
        print('Failed to connect to host: ' + BASE_URL)
        exit('Check your internet connection. Then try to open the URL in a web browser. Program exiting.')

    if response.status_code == 200:
        return bs.BeautifulSoup(response.text, 'lxml')

    return None


def get_listings(html):

    items = html.select_one('section.xmd-body-section')
    cards = []

    for item in items:
        cards.append(
            {
                # select() does not accept attrs, so switched to find() -- still not matching, though
                'company_name': item.find('div', attrs={"data-test": "paginated-pro-card"})
                # 'link_product': HOST + item.find('div', class_='title').find('a').get('href'),
                # 'brand': item.find('div', class_='brand').get_text(strip=True),
                # 'card_image': HOST + item.find('div', class_='image').find('img').get('src')
            }
        )

    return cards


html = get_html(session, BASE_URL)
print(get_listings(html))
This sort of looks like CMS output, based on all the inline media-query classes and data attributes.
Thanks in advance,
Matt
Reply
#2
It's easier to help if you post code (or a shorter example) that we can run.
So here is an example of the parsing part.
import requests
from bs4 import BeautifulSoup
import time

url = "https://www.homeadvisor.com/c.Garage-Garage-Doors.Atlanta.GA.-12036.html"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)\
     AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "lxml")
# time.sleep(3)
companies = soup.select(r"div.\@mb-2.md\:\@mb-0.\@hidden.md\:\@block > a > h3")
for company in companies:
    print(company.text.strip())
Output:
Doors Over Georgia Tailored Living featuring PremierGarage JJE General Construction Redrock Multi Services, LLC .....
The phone number can be grabbed in the same way.
>>> phone_number = soup.select_one(r'span.\@ml-1')
>>> phone_number
<span class="@ml-1" data-test="sp-phone-number" data-v-671fc26a="">(678) 798-3712</span>
>>> phone_number.text
'(678) 798-3712'
So now, adding the parsed data to e.g. a list of dictionaries should not be so hard to do.
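For instance, a minimal sketch of pairing the two result lists into a list of dictionaries, run here against stand-in HTML (the simplified class names are assumptions for the sketch; on the live page they would be the escaped @-classes shown above):

```python
from bs4 import BeautifulSoup

# Stand-in HTML with the same shape as the live page; class names are
# simplified here, the real page uses the escaped @-classes shown above.
html_doc = """
<section>
  <div class="card"><h3>Doors Over Georgia </h3>
    <span class="num">(678) 798-3712</span></div>
  <div class="card"><h3>JJE General Construction</h3>
    <span class="num">(866) 907-7906</span></div>
</section>
"""
soup = BeautifulSoup(html_doc, "html.parser")
companies = soup.select("div.card > h3")
phone_numbers = soup.select("span.num")
# One dictionary per listing, so the name/number relationship is kept.
listings = [
    {"title": c.text.strip(), "phone": p.text.strip()}
    for c, p in zip(companies, phone_numbers)
]
print(listings)
```

Note that zip() pairs the lists by position, so this only stays correct while every card on the page has both fields.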
Reply
#3
You always come to my rescue. I apologize for the long piece of code. All I had to do was escape the \@? Nice!

Thank you once again
Reply
#4
Actually, I tried implementing "phone_number" and "rating", but they returned None. I think the problem is that the first companies selector doesn't cover the rest of the fields. That's why it's important to have a surrounding container for each listing. I might have to go up and inspect it. The HTML on this site is a mess to begin with.

I think we need to start with the outer container for each listing which is listed below:

def get_listings(html):

    companies = html.select(r'div.\@shadow-lg')

    # Now I need to somehow drill down for the company name, phone and rating.
    # It's driving me crazy. Any help would be appreciated.

    listings = []

    for company in companies:
        listings.append(
            {
                # 'title': company.text.strip(),  # THIS WORKS
                # 'phone_number': company.select_one(r'span.\@ml-1'),  # NO WORK
                # 'rating': company.select_one(r'span.\@text-base')  # NO WORK
            }
        )

    return listings
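The container-first idea can be sketched like this against stand-in HTML (the simplified class names and the `field()` helper are assumptions for illustration; on the live page the selectors would be the escaped @-classes). `select_one()` returns None when a card lacks a field, so each lookup is guarded:

```python
from bs4 import BeautifulSoup

# Stand-in HTML mirroring the card structure; class names are simplified,
# the live page's escaped @-classes would replace them.
html_doc = """
<div class="card"><h3>Doors Over Georgia </h3>
  <span class="phone">(678) 798-3712</span><span class="rating">4.9</span></div>
<div class="card"><h3>JJE General Construction</h3>
  <span class="phone">(866) 907-7906</span></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

def field(card, selector):
    """Text of the first match inside this card, or None if absent."""
    tag = card.select_one(selector)
    return tag.text.strip() if tag else None

# Drill down inside each card, so missing fields can't shift the pairing.
listings = [
    {
        "title": field(card, "h3"),
        "phone": field(card, "span.phone"),
        "rating": field(card, "span.rating"),
    }
    for card in soup.select("div.card")
]
print(listings)
```

The second stand-in card has no rating, and it simply comes out as None instead of misaligning the other listings, which is the advantage of scoping the lookups to the container.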
Reply
#5
company.select_one() was just a demo to get one number; to get all of them, use .select(), the same as was used for companies.
If you throw in zip(), you can do both in one loop.
import requests
from bs4 import BeautifulSoup
import time

url = "https://www.homeadvisor.com/c.Garage-Garage-Doors.Atlanta.GA.-12036.html"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)\
     AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "lxml")
companies = soup.select(r"div.\@mb-2.md\:\@mb-0.\@hidden.md\:\@block > a > h3")
phone_numbers = soup.select(r'span.\@ml-1')
for company, phone in zip(companies, phone_numbers):
    print(company.text.strip(), phone.text.strip())
Output:
Doors Over Georgia (678) 798-3712 Tailored Living featuring PremierGarage (404) 946-7940 JJE General Construction (866) 907-7906 Redrock Multi Services, LLC (678) 615-1383 .....
Reply
#6
ok. I will give it a shot. Thanks.

(Dec-08-2021, 02:14 PM)snippsat Wrote: company.select_one() was just a demo to get one number to get all most use .select() same as used with companies.
Reply
#7
The last statement should be returning a list of dictionary items, but it's only returning one dictionary item.


def get_listings(html):

    companies = html.select(
        r"div.\@mb-2.md\:\@mb-0.\@hidden.md\:\@block > a > h3")
    phone_numbers = html.select(r'span.\@ml-1')
    ratings = html.select(r'span.\@text-base')

    listings = []

    for company, phone, rating in zip(companies, phone_numbers, ratings):

        listings.append(
            # If I print each one out, it gives me the dictionaries I want.
            {
                'title': company.text.strip(),
                'phone': phone.text.strip(),
                'rating': rating.text.strip()
            }
        )
        return listings


html = get_html(session, BASE_URL, 25)

print(get_listings(html))
# [{'title': 'AAA Garage Door Techs', 'phone': '(770) 696-7733', 'rating': '4.9'}]
Reply
#8
(Dec-08-2021, 07:38 PM)muzikman Wrote: Never mind. My return statement was inside of the loop.


def get_listings(html):

    companies = html.select(
        r"div.\@mb-2.md\:\@mb-0.\@hidden.md\:\@block > a > h3")
    phone_numbers = html.select(r'span.\@ml-1')
    ratings = html.select(r'span.\@text-base')

    listings = []

    for company, phone, rating in zip(companies, phone_numbers, ratings):

        listings.append(
            # If I print each one out, it gives me the dictionaries I want.
            {
                'title': company.text.strip(),
                'phone': phone.text.strip(),
                'rating': rating.text.strip()
            }
        )
    return listings  # fixed


html = get_html(session, BASE_URL, 25)

print(get_listings(html))
# [{'title': 'AAA Garage Door Techs', 'phone': '(770) 696-7733', 'rating': '4.9'}]
Reply
#9
I am trying to paginate through the listings. https://www.homeadvisor.com/c.Garage-Gar...ngIndex=50 Look at the source of the next arrow at the bottom.

<button disabled="disabled" .....>
If this attribute is present, I want to break out of my while loop.

I have tried using html.button.has_attr('disabled') and other tags. They either return False, or HTML, etc.

I would prefer something that returns the boolean True when the button has the disabled attribute.

Thanks
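A minimal sketch of the check, run against stand-in HTML (the button's `data-test="next-page"` selector is an assumption; the real attribute would come from inspecting the page). The catch with `html.button` is that it grabs the first `<button>` on the page, which is usually not the next arrow; selecting the specific button first and then calling `has_attr('disabled')` gives a clean boolean:

```python
from bs4 import BeautifulSoup

# Stand-in for the last page, where the next arrow is disabled.
last_page = BeautifulSoup(
    '<button data-test="next-page" disabled="disabled">&gt;</button>',
    "html.parser",
)
# Stand-in for an earlier page, where it is not.
mid_page = BeautifulSoup(
    '<button data-test="next-page">&gt;</button>', "html.parser"
)

def is_last_page(soup):
    """True when the next button is missing or carries the disabled attribute."""
    button = soup.select_one('button[data-test="next-page"]')  # selector assumed
    return button is None or button.has_attr("disabled")

print(is_last_page(last_page))  # True -> break out of the while loop
print(is_last_page(mid_page))   # False -> keep paginating
```

Treating a missing button the same as a disabled one also stops the loop cleanly if the page layout changes or the last page drops the arrow entirely.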
Reply

