Dec-07-2021, 11:49 AM
Greetings,
I am just going to show you an excerpt of code. Assume the function takes in the HTML from a call to bs.BeautifulSoup. I hard-coded the URL just for testing purposes. I am trying to grab the company name and phone number for all the listings on this page. I know how to paginate; that is not the issue. The element tags look something like this:
Company name:
<h3 data-track-omni="XMD: Company Website Link" class="@text-gray-600 @px-2 md:@px-0 @text-lg md:@text-3xl @mb-2 md:@mb-4" data-v-671fc26a="">Doors Over Georgia </h3>

Telephone:
<span class="@hidden md:@flex" data-v-671fc26a=""><span class="@font-bold" data-v-671fc26a=""> Call Now: </span> <span data-test="sp-phone-number" class="@ml-1" data-v-671fc26a="">(678) 798-3712</span></span>

1. I have tried using these classes, in different combinations, just to grab the phone number and company name, with no luck.
2. If I succeed at grabbing these two fields, I want each pair to be stored as a dictionary inside a list, to keep the relationship for each listing. I do not want to collect all of the company names, then all of the phone numbers, and hope they line up by position. I have tried to find the surrounding container element for each listing and build the dictionary items from it.
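For reference, the structure described in point 2 could be sketched roughly like this. It is only a sketch under assumptions: it runs on a small made-up HTML sample modeled on the tags quoted above, it assumes each listing sits inside a div with data-test="paginated-pro-card" (the attribute that appears in the code excerpt), and it assumes the listings are actually present in the HTML that requests receives, which may not hold if the page fills them in with JavaScript. The second company in the sample is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modeled on the quoted tags; the real page may differ.
SAMPLE_HTML = """
<section class="xmd-body-section">
  <div data-test="paginated-pro-card">
    <h3 data-track-omni="XMD: Company Website Link">Doors Over Georgia </h3>
    <span><span> Call Now: </span>
      <span data-test="sp-phone-number">(678) 798-3712</span></span>
  </div>
  <div data-test="paginated-pro-card">
    <h3 data-track-omni="XMD: Company Website Link">Example Garage Co </h3>
  </div>
</section>
"""


def get_listings(soup):
    """Return one dict per listing card so name and phone stay paired."""
    cards = []
    # Match on the data-test attribute rather than the long utility classes.
    for card in soup.select('div[data-test="paginated-pro-card"]'):
        name = card.select_one('h3')
        phone = card.select_one('span[data-test="sp-phone-number"]')
        cards.append({
            'company_name': name.get_text(strip=True) if name else None,
            'phone': phone.get_text(strip=True) if phone else None,
        })
    return cards


listings = get_listings(BeautifulSoup(SAMPLE_HTML, 'html.parser'))
print(listings)
```

Because both lookups are scoped to a single card, a listing with no phone number gets None instead of borrowing the number from the next listing.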
Could I possibly just grab the fields in each listing as a dictionary, without a container object?
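A possible way to do this without a container, sketched under assumptions: anchor on each company h3, look forward for the next phone span with find_next, and accept it only if that h3 is also the closest heading before the phone. The flat HTML sample and the two extra company names are made up for illustration, and the attribute filters are guesses based on the tags quoted above.

```python
from bs4 import BeautifulSoup

# Hypothetical flat sample: no wrapper per listing, second listing has no phone.
SAMPLE_HTML = """
<h3 data-track-omni="XMD: Company Website Link">Doors Over Georgia </h3>
<span><span> Call Now: </span> <span data-test="sp-phone-number">(678) 798-3712</span></span>
<h3 data-track-omni="XMD: Company Website Link">Example Garage Co </h3>
<h3 data-track-omni="XMD: Company Website Link">Sample Door Repair </h3>
<span data-test="sp-phone-number">(555) 010-0000</span>
"""

NAME = {'data-track-omni': True}          # any h3 that carries this attribute
PHONE = {'data-test': 'sp-phone-number'}


def pair_without_container(soup):
    """Pair each company heading with the phone span that follows it, if any."""
    rows = []
    for h3 in soup.find_all('h3', attrs=NAME):
        phone = h3.find_next('span', attrs=PHONE)
        # Keep the phone only if this h3 is the nearest heading before it;
        # otherwise the number belongs to a later listing.
        if phone is not None and phone.find_previous('h3', attrs=NAME) is not h3:
            phone = None
        rows.append({
            'company_name': h3.get_text(strip=True),
            'phone': phone.get_text(strip=True) if phone else None,
        })
    return rows


rows = pair_without_container(BeautifulSoup(SAMPLE_HTML, 'html.parser'))
print(rows)
```

This relies on document order alone, so it is more fragile than scoping to a container: any unrelated h3 or phone span that matches the filters would be picked up as well.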
I have read over the entire BS documentation.
Forgive me if I have made this difficult to understand.
import bs4 as bs  # assumed alias, since the excerpt calls bs.BeautifulSoup
import requests

# HEADERS is not shown in the excerpt; this value is a placeholder assumption.
HEADERS = {'User-Agent': 'Mozilla/5.0'}

session = requests.Session()
BASE_URL = 'https://www.homeadvisor.com/c.Garage-Garage-Doors.Atlanta.GA.-12036.html'


def get_html(session, BASE_URL):
    """
    Return steamy bowl of soup for BASE_URL page.
    Return None if request fails
    """
    try:
        PARAMS = {'startingIndex': 0}
        # Keep the response in its own name instead of rebinding session.
        response = session.get(BASE_URL, headers=HEADERS, params=PARAMS)
    except requests.exceptions.ConnectionError as e:
        print('Failed to connect to host: ' + BASE_URL)
        exit('Check your internet connection. Then try to open the URL in a web browser. Program exiting.')
        # exit('Could not establish connection to host. Terminating program!')

    if response.status_code == 200:
        return bs.BeautifulSoup(response.text, 'lxml')
        #html.find('p data-v-54b24b60').get_text()
        # return html.select_one('section.xmd-body-section')
    return None


def get_listings(html):
    # select() returns a list of matching tags; iterating the single tag from
    # select_one() would walk its children, text nodes included.
    items = html.select('section.xmd-body-section')
    cards = []
    for item in items:
        cards.append(
            {
                # select() takes a CSS selector string; it has no attrs argument.
                'company_name': item.select('div[data-test="paginated-pro-card"]')
                # 'link_product': HOST + item.find('div', class_='title').find('a').get('href'),
                # 'brand': item.find('div', class_='brand').get_text(strip=True),
                # 'card_image': HOST + item.find('div', class_='image').find('img').get('src')
            }
        )
    return cards


html = get_html(session, BASE_URL)
print(get_listings(html))

This sort of looks like a CMS based on all the inline media queries and data lists.
Thanks in advance,
Matt