Python Forum

Full Version: get hotel info from hotelscombined
Hi guys,

I am developing a program to extract hotel details from hotelscombined. In my trials, all of the text (hotel name, location, services, etc.) can be extracted from <div class='hc_sr_summary'>. However, what if I only want the hotel names? How can I do that?
In my trial, news2 = browser.find_element_by_xpath("//div[@id='u1']").text extracts only the first hotel name; the rest are not extracted.

I guess it is a loop issue, because the structure is as shown below. Please help me with this question. Thank you.

<div class='hc_sr_summary'> <= master level
<div id='uniquehotelID1' class='hc-searchresultitem'> <== child level
<div class="hc-searchresultitem__hotelsummary">
<H3 class="hc-searchresultitem__hotelname">
<a id="searchResultHeading2679577" class="hc-searchresultitem__hotelnamelink" ....>Hotel Midtown Richardson</a>

<div id='uniquehotelID2' class='hc-searchresultitem'> <== child level
<div id='uniquehotelID3' class='hc-searchresultitem'> <== child level
<div id='uniquehotelID4' class='hc-searchresultitem'> <== child level

************************
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('https://www.hotelscombined.hk/Hotels/Search?destination=place%3ATaipei&checkin=2019-05-05&checkout=2019-05-06&Rooms=1&adults_1=2&languageCode=HK&currencyCode=HKD#destination=place:Taipei&radius=0km&checkin=2019-05-05&checkout=2019-05-06&Rooms=1&adults_1=2&pageSize=15&pageIndex=1&sort=Popularity-desc&showSoldOut=false&scroll=432&HotelID=&mapState=expanded%3D0')

## Get all text from <div class='hc_sr_summary'>
news = browser.find_element_by_xpath("//div[@class='hc_sr_summary']").text
print (news)

## This gets only the 1st <h3> under <div class='hc_sr_summary'> rather than all <h3> elements
news2 = browser.find_element_by_xpath("//h3[@class='hc-searchresultitem__hotelname']").text
print (news2)

#browser.close()
The find_element_by_* methods return just one element (the first match).
You need to use the find_elements_by_* methods, which return a list of all matching elements.

Then iterate over the elements in the list and read each one's .text property.

Locating elements docs
This is how I'd do it:
** Note ** I changed the language to English to make it easier for myself; you can change it back.
from selenium import webdriver
from bs4 import BeautifulSoup


browser = webdriver.Firefox()
# Changed language code to English (languageCode=EN)
browser.get('https://www.hotelscombined.hk/Hotels/Search?destination=place%3ATaipei&checkin=2019-05-05&checkout=2019-05-06&Rooms=1&adults_1=2&languageCode=EN&currencyCode=HKD#destination=place:Taipei&radius=0km&checkin=2019-05-05&checkout=2019-05-06&Rooms=1&adults_1=2&pageSize=15&pageIndex=1&sort=Popularity-desc&showSoldOut=false&scroll=432&HotelID=&mapState=expanded%3D0')

src = browser.page_source
soup = BeautifulSoup(src,"lxml")
hotels = soup.find('div', {'class': 'hc_sr_summary'})

hotel_names = hotels.find_all('div', {'class': 'hc-searchresultitem'})
for hname in hotel_names:
    name = hname.get('fn')
    print(f'Name: {name}')

browser.close()
Results:
Output:
Name: San_Want_Residences
Name: Ximen_Taipei_DreamHouse
Name: San_Want_Hotel_Taipei
Name: Urtrip_Hotel
Name: Backpackers_Hostel_Taipei_Changchun
Name: Taipei_M_Hotel_Main_Station
Name: Diary_of_Taipei_Hotel_Main_Station
Name: Go_Sleep_Hotel_Hankou
Name: Park_Taipei_Hotel
Name: FX_Hotel_Taipei_Nanjing_East_Road_Branch
Name: Space_Inn
Name: Sunworld_Dynasty_Hotel_Taipei
Name: Green_World_Hotel_Zhonghua
Name: Just_Sleep_Ximending
Name: ECFA_Hotel_Wan_Nian
(Apr-22-2019, 12:03 PM) buran wrote: find_element_by_ methods will return just one/first element. You need to use find_elements_by_ methods that will return list of multiple elements, then you will iterate over the elements in the list and extract .text property. Locating elements docs

Hi buran, thank you for your answer. Thanks also to Larz60+, who showed another way using BS.

I tried your method and changed "element" to "elements".
Case 1: news3 = browser.find_elements_by_xpath("//div[@class='hc_sr_summary']/div/div/h3/a").text
However, this raised an error: AttributeError: 'list' object has no attribute 'text'

Case 2: I removed .text, so news3 = browser.find_elements_by_xpath("//div[@class='hc_sr_summary']/div/div/h3/a")
However, printing the result showed element identifiers that do not appear on the web page, instead of the hotel names.

Would you share more about the proper way to use find_elements_by_xpath in this case? Thanks.

[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="c8f60e25-62c6-438d-98dc-25a7e9779656", element="af4e5737-265e-4920-a2a5-cbfaa97646fa")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="c8f60e25-62c6-438d-98dc-25a7e9779656", element="1cfb5db4-0f5f-4a20-8ddb-3fa93903b60c")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="c8f60e25-62c6-438d-98dc-25a7e9779656", element="61e02d61-36c6-470c-9dad-b0e053979172")>,
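Both cases come down to the same thing: find_elements_by_xpath returns a Python list of element objects, so .text has to be read from each item rather than from the list, and printing the list shows object representations rather than page text. Below is a minimal sketch of that behaviour. FakeElement is a hypothetical stand-in for Selenium's WebElement (only .text is modelled) so the snippet runs without a browser; the two names are taken from the output posted in this thread.

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement; only .text is modelled."""

    def __init__(self, text):
        self.text = text


# Stand-in for the list returned by browser.find_elements_by_xpath(...)
elements = [FakeElement("Park Taipei Hotel"), FakeElement("Space Inn")]

# Case 1: the list itself has no .text attribute, so this raises AttributeError
try:
    elements.text
except AttributeError as exc:
    error_message = str(exc)

# Case 2, fixed: read .text from each element in the list
names = [element.text for element in elements]
for name in names:
    print(name)
```

With a real browser the same list comprehension should work on the list returned by find_elements_by_xpath. In Selenium 4 the equivalent call is written browser.find_elements(By.XPATH, "..."), with By imported from selenium.webdriver.common.by.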
I use a combination of Selenium and BeautifulSoup for a couple of reasons.

In this particular instance, once the JavaScript has run and the page has been rendered, there's no need for Selenium anymore.

Beautiful Soup is the best way to traverse the DOM and scrape the data, so after all JavaScript has run, I use Beautiful Soup to grab the desired data from the page source. It speeds up the scraping (which, in some instances, can take a considerable amount of time), and makes it easier to grab other data if needed later on.
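To illustrate the "render once, then parse offline" idea, here is a rough sketch that pulls the hotel names out of a page-source string with no browser involved. It uses the standard library's html.parser instead of Beautiful Soup, purely so that it runs without third-party packages. PAGE_SOURCE is a hand-written sample modelled on the structure posted at the top of the thread (the second hotel entry is filled in for illustration); in practice the string would come from browser.page_source.

```python
from html.parser import HTMLParser

# Hand-written sample mimicking the structure posted earlier in the thread;
# in practice this string would come from browser.page_source.
PAGE_SOURCE = """
<div class="hc_sr_summary">
  <div id="uniquehotelID1" class="hc-searchresultitem">
    <div class="hc-searchresultitem__hotelsummary">
      <h3 class="hc-searchresultitem__hotelname">
        <a class="hc-searchresultitem__hotelnamelink">Hotel Midtown Richardson</a>
      </h3>
    </div>
  </div>
  <div id="uniquehotelID2" class="hc-searchresultitem">
    <div class="hc-searchresultitem__hotelsummary">
      <h3 class="hc-searchresultitem__hotelname">
        <a class="hc-searchresultitem__hotelnamelink">Park Taipei Hotel</a>
      </h3>
    </div>
  </div>
</div>
"""


class HotelNameParser(HTMLParser):
    """Collects the text of every <a class="hc-searchresultitem__hotelnamelink">."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name_link = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "hc-searchresultitem__hotelnamelink" in classes:
            self._in_name_link = True
            self.names.append("")  # start collecting a new name

    def handle_data(self, data):
        if self._in_name_link:
            self.names[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_name_link = False


parser = HotelNameParser()
parser.feed(PAGE_SOURCE)
hotel_names = [name.strip() for name in parser.names]
print(hotel_names)
```

Because the parsing step only needs a string, the browser can be closed as soon as page_source has been read, and the same saved source can be re-parsed later if other fields are needed.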
Like Larz, I switched the language to English:

hotel_names = browser.find_elements_by_xpath("//h3[@class='hc-searchresultitem__hotelname']")
for hotel_name in hotel_names:
    print(hotel_name.text)
Output:
Hotel Midtown Richardson Cosmos Hotel Taipei Energy Inn Taipei City FN Hotel Taipei M Hotel - Main Station Palais De Chine Diary of Taipei Hotel Main Station Go Sleep Hotel - Hankou Taipei Triple Tiger Inn Cho Hotel Yomi Hotel - ShuangLian Park Taipei Hotel FX Hotel Taipei Nanjing East Road Branch Space Inn Mr Lobster Secret Den design hostel