Python Forum

Hi guys,

I am developing a program to extract hotel details from HotelsCombined. In my trials, all of the text (hotel name, location, services, etc.) can be extracted from <div class='hc_sr_summary'>. However, I only want the hotel names. How can I get just those?
In my trial, news2 = browser.find_element_by_xpath("//div[@id='u1']").text extracts only the first hotel name; the rest are not extracted.

I guess it is a loop issue, because the structure is as shown below. Please help me with this question. Thank you.

<div class='hc_sr_summary'> <= master level
<div id='uniquehotelID1' class='hc-searchresultitem'> <== child level
<div class="hc-searchresultitem__hotelsummary">
<H3 class="hc-searchresultitem__hotelname">
<a id="searchResultHeading2679577" class="hc-searchresultitem__hotelnamelink" ....>Hotel Midtown Richardson</a>

<div id='uniquehotelID2' class='hc-searchresultitem'> <== child level
<div id='uniquehotelID3' class='hc-searchresultitem'> <== child level
<div id='uniquehotelID4' class='hc-searchresultitem'> <== child level

************************

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('https://www.hotelscombined.hk/Hotels/Search?destination=place%3ATaipei&checkin=2019-05-05&checkout=2019-05-06&Rooms=1&adults_1=2&languageCode=HK&currencyCode=HKD#destination=place:Taipei&radius=0km&checkin=2019-05-05&checkout=2019-05-06&Rooms=1&adults_1=2&pageSize=15&pageIndex=1&sort=Popularity-desc&showSoldOut=false&scroll=432&HotelID=&mapState=expanded%3D0')

## Get all text from <div class='hc_sr_summary'>
news = browser.find_element_by_xpath("//div[@class='hc_sr_summary']").text
print (news)

## This gets only the first <h3> under <div class='hc_sr_summary'> rather than all of them???
news2 = browser.find_element_by_xpath("//h3[@class='hc-searchresultitem__hotelname']").text
print (news2)

#browser.close()

The find_element_by_* methods return only the first matching element.
You need the find_elements_by_* methods, which return a list of all matching elements.

Then iterate over the elements in the list and read the .text property of each one.

Locating elements docs

This is how I'd do it:
** Note ** I changed the language to English to make it easier for myself; you can change it back.

from selenium import webdriver
from bs4 import BeautifulSoup


browser = webdriver.Firefox()
# Changed language code to English (languageCode=EN)
browser.get('https://www.hotelscombined.hk/Hotels/Search?destination=place%3ATaipei&checkin=2019-05-05&checkout=2019-05-06&Rooms=1&adults_1=2&languageCode=EN&currencyCode=HKD#destination=place:Taipei&radius=0km&checkin=2019-05-05&checkout=2019-05-06&Rooms=1&adults_1=2&pageSize=15&pageIndex=1&sort=Popularity-desc&showSoldOut=false&scroll=432&HotelID=&mapState=expanded%3D0')

src = browser.page_source
soup = BeautifulSoup(src,"lxml")
hotels = soup.find('div', {'class': 'hc_sr_summary'})

hotel_names = hotels.find_all('div', {'class': 'hc-searchresultitem'})
for hname in hotel_names:
    name = hname.get('fn')
    print(f'Name: {name}')

browser.close()

results:

Output:
Name: San_Want_Residences
Name: Ximen_Taipei_DreamHouse
Name: San_Want_Hotel_Taipei
Name: Urtrip_Hotel
Name: Backpackers_Hostel_Taipei_Changchun
Name: Taipei_M_Hotel_Main_Station
Name: Diary_of_Taipei_Hotel_Main_Station
Name: Go_Sleep_Hotel_Hankou
Name: Park_Taipei_Hotel
Name: FX_Hotel_Taipei_Nanjing_East_Road_Branch
Name: Space_Inn
Name: Sunworld_Dynasty_Hotel_Taipei
Name: Green_World_Hotel_Zhonghua
Name: Just_Sleep_Ximending
Name: ECFA_Hotel_Wan_Nian

(Apr-22-2019, 12:03 PM) buran wrote: find_element_by_ methods will return just one/first element. You need to use find_elements_by_ methods that will return list of multiple elements, then you will iterate over the elements in the list and extract .text property. Locating elements docs

Hi buran, thank you for your answer, and thanks to Larz60+ for showing another way using BeautifulSoup.

I tried your method and changed "element" to "elements".
Case 1: news3 = browser.find_elements_by_xpath("//div[@class='hc_sr_summary']/div/div/h3/a").text
However, this raised AttributeError: 'list' object has no attribute 'text'

Case 2: I removed .text, so news3 = browser.find_elements_by_xpath("//div[@class='hc_sr_summary']/div/div/h3/a")
However, the result (pasted below) shows element objects instead of the text that appears on the web page.

Would you share the proper way to use find_elements_by_xpath in this case? Thanks.

[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="c8f60e25-62c6-438d-98dc-25a7e9779656", element="af4e5737-265e-4920-a2a5-cbfaa97646fa")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="c8f60e25-62c6-438d-98dc-25a7e9779656", element="1cfb5db4-0f5f-4a20-8ddb-3fa93903b60c")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="c8f60e25-62c6-438d-98dc-25a7e9779656", element="61e02d61-36c6-470c-9dad-b0e053979172")>,
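Both cases have the same cause: find_elements_by_xpath returns a plain Python list of WebElement objects, and .text belongs to each element, not to the list. The sketch below reproduces that without opening a browser. FakeElement and the sample names are hypothetical stand-ins for Selenium's WebElement, there only so the snippet runs on its own.

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement (not Selenium code)."""

    def __init__(self, text):
        self.text = text


def element_texts(elements):
    """Return the stripped .text of every element in a list of elements."""
    return [element.text.strip() for element in elements]


# Stand-in for what find_elements_by_xpath(...) returns: a plain list.
found = [FakeElement("Hotel Midtown Richardson"), FakeElement(" Cosmos Hotel Taipei ")]

# Case 1: a list has no .text attribute, so this raises AttributeError.
try:
    found.text
except AttributeError as error:
    print(f"AttributeError: {error}")

# Case 2 fixed: read .text from each element instead of printing the list.
names = element_texts(found)
for name in names:
    print(name)
```

With a real browser, the same helper would be called as element_texts(browser.find_elements_by_xpath("...")). On Selenium 4.3 and later the find_elements_by_* helpers were removed, so the list would come from browser.find_elements(By.XPATH, "...") instead.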

I use a combination of Selenium and BeautifulSoup for a couple of reasons.

In this particular instance, once the browser has run the page's JavaScript and the content is rendered, there's no need for Selenium anymore.

I find Beautiful Soup the most convenient way to traverse the DOM and scrape the data, so once the JavaScript has run, I take the page source and use Beautiful Soup to grab the desired data. That speeds up the scraping, sometimes considerably, and makes it easier to grab other data later if needed.
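To illustrate the idea of parsing the rendered page source offline, here is a minimal sketch. It swaps in the standard library's html.parser for Beautiful Soup so that it runs without extra installs. SAMPLE_HTML is a hand-written fragment modelled on the structure in the question, not the live page, and HotelNameParser and extract_hotel_names are hypothetical names. In practice the input would be browser.page_source.

```python
from html.parser import HTMLParser

# Hand-written fragment modelled on the structure in the question (assumption).
SAMPLE_HTML = """
<div class="hc_sr_summary">
  <div id="uniquehotelID1" class="hc-searchresultitem">
    <div class="hc-searchresultitem__hotelsummary">
      <h3 class="hc-searchresultitem__hotelname">
        <a class="hc-searchresultitem__hotelnamelink">Hotel Midtown Richardson</a>
      </h3>
      <p>Other details that should be ignored</p>
    </div>
  </div>
  <div id="uniquehotelID2" class="hc-searchresultitem">
    <div class="hc-searchresultitem__hotelsummary">
      <h3 class="hc-searchresultitem__hotelname">
        <a class="hc-searchresultitem__hotelnamelink">Cosmos Hotel Taipei</a>
      </h3>
    </div>
  </div>
</div>
"""

NAME_LINK_CLASS = "hc-searchresultitem__hotelnamelink"


class HotelNameParser(HTMLParser):
    """Collect the text inside every <a> that carries the hotel-name link class."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_link = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and NAME_LINK_CLASS in classes:
            self._in_link = True
            self._parts = []

    def handle_data(self, data):
        # Only keep text that appears while we are inside a name link.
        if self._in_link:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.names.append("".join(self._parts).strip())
            self._in_link = False


def extract_hotel_names(html):
    """Return the hotel names found in an HTML string, in document order."""
    parser = HotelNameParser()
    parser.feed(html)
    parser.close()
    return parser.names


hotel_names = extract_hotel_names(SAMPLE_HTML)
for name in hotel_names:
    print(f"Name: {name}")
```

Reading the link text this way gives the display name with spaces, whereas the 'fn' attribute used earlier gives the underscore-separated form.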

Like Larz, I switched the language to English:

hotel_names = browser.find_elements_by_xpath("//h3[@class='hc-searchresultitem__hotelname']")
for hotel_name in hotel_names:
    print(hotel_name.text)

Output:
Hotel Midtown Richardson
Cosmos Hotel Taipei
Energy Inn Taipei City
FN Hotel
Taipei M Hotel - Main Station
Palais De Chine
Diary of Taipei Hotel Main Station
Go Sleep Hotel - Hankou
Taipei Triple Tiger Inn
Cho Hotel
Yomi Hotel - ShuangLian
Park Taipei Hotel
FX Hotel Taipei Nanjing East Road Branch
Space Inn
Mr Lobster Secret Den design hostel

nikana

buran

Larz60+

nikana

Larz60+

buran