Python Forum

Hi,

I am trying to scrape the price of an item from an online store. The code works with sites such as ebay.com and amazon.com, among many others, but fails in some cases. I am using lxml with an XPath obtained from SelectorGadget. The code is below.

import requests
from lxml import html

pagecontent=requests.get("https://www.myntra.com/watches/fossil/fossil-women-rose-gold-toned-dial-watch-es3352i/759168/buy")
tree = html.fromstring(pagecontent.content)
data=tree.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "pdp-price", " " ))]')
print(data[0].text);

Here is the error. It shows that data is an empty list. I would like to know how I can resolve this issue.

Error:
Traceback (most recent call last):
  File "scrape-test.py", line 7, in <module>
    print(data[0].text);
IndexError: list index out of range

Version information: Python 3.4.3

I appreciate the cooperation of forum members.

Looks like the class you're looking for doesn't exist in the page source, but is generated by JavaScript.
Also, your xpath expression is looking specifically for ' pdp-price ', which wouldn't be found anyway.
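
One way to confirm this is to check the raw HTML that requests downloads, before any JavaScript runs. A rough sketch using only the standard library; raw_html is a made-up stand-in for pagecontent.text, and ClassCollector and class_in_source are hypothetical helper names.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every class token that appears in the raw markup."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated tokens.
                self.classes.update(value.split())


def class_in_source(page_source, class_name):
    """Return True if any element in page_source carries class_name."""
    collector = ClassCollector()
    collector.feed(page_source)
    return class_name in collector.classes


# Hypothetical stand-in for pagecontent.text: an empty mount point plus a script.
raw_html = '<div id="mountRoot"></div><script>window.__myx = {};</script>'
print(class_in_source(raw_html, "pdp-price"))
```

For the sample string this prints False, which is the situation where tree.xpath() returns an empty list and data[0] raises IndexError.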

The data does exist inside the JavaScript variable window.__myx though, so you'll probably be able to work with that.
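
A minimal sketch of pulling that variable out without a browser, assuming the page assigns a plain JSON object to window.__myx inside an inline script. The sample markup and the pdpData and price keys are made up for illustration, and extract_js_object is a hypothetical helper name; inspect the real page source to find the actual layout.

```python
import json
import re


def extract_js_object(page_source, var_name):
    """Return the JSON value assigned to var_name in page_source, or None."""
    # Find "var_name =" and skip any whitespace after the equals sign.
    match = re.search(re.escape(var_name) + r"\s*=\s*", page_source)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it.
        obj, _end = json.JSONDecoder().raw_decode(page_source, match.end())
    except ValueError:
        # The assignment was not valid JSON (for example, a JS expression).
        return None
    return obj


# Hypothetical stand-in for the downloaded page; the real key layout may differ.
sample_html = '<script>window.__myx = {"pdpData": {"price": 6646}};</script>'
data = extract_js_object(sample_html, "window.__myx")
print(data["pdpData"]["price"])
```

With the real page you would pass pagecontent.text instead of sample_html. This only works if the assigned value is strict JSON; a JavaScript object literal with unquoted keys would make the helper return None.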

As mentioned by @stranac, the data is generated by JavaScript.
lxml alone cannot read that, so the simplest way is to use Selenium.
Selenium can use different drivers; here I use PhantomJS so that no browser window is opened.
I then pass browser.page_source, which contains the JavaScript-rendered HTML, to lxml.
Then XPath can be used to, e.g., take out the price.

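from selenium import webdriver
from lxml import html

# PhantomJS is headless, so no browser window is opened
browser = webdriver.PhantomJS()
url = 'https://www.myntra.com/watches/fossil/fossil-women-rose-gold-toned-dial-watch-es3352i/759168/buy'
browser.get(url)
# page_source holds the HTML after JavaScript has rendered it
tree = html.fromstring(browser.page_source)
# Shut down the driver process once the HTML has been captured
browser.quit()
data = tree.xpath('//*[@id="mountRoot"]/div/div/main/div[2]/div[2]/div[1]/p[2]/strong')
print(data[0].text)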

Output:
Rs. 6646

mgtheboss

stranac

snippsat