Python Forum

Full Version: Need help with XPath using requests,time,urllib.request and BeautifulSoup
I have an XPath expression that I know works. Using the URL:
https://www.yellowpages.com/houston-tx/m...1657186981

and XPath:
//div[@class='sales-info']/H1[1]

Should return this:
Spector Ivan

My code is posted below. Can anyone please explain why it doesn't work here?
It works using scrapy, but I cannot multi-thread in scrapy, so I'm looking for an alternative.

Thanks.

# Proxy checker reference: https://stackoverflow.com/questions/765305/proxy-check-in-python
import concurrent.futures
import time
import urllib.request

import pandas as pd
import requests
from bs4 import BeautifulSoup
from lxml import html

url = 'https://www.yellowpages.com/houston-tx/mip/spector-ivan-11449879?lid=1001657186981'

proxy_handler = urllib.request.ProxyHandler({'http': '149.19.32.99:8082'})
opener = urllib.request.build_opener(proxy_handler)
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

pg=urllib.request.urlopen(url) 

soup = BeautifulSoup(pg,'lxml')

tree = html.fromstring(soup.prettify())
testdata = tree.xpath("//div[@class='sales-info']/H1[1]")
print('XPath data: ', testdata)
Maybe something more like...?

>>> tree.xpath("//div[@class='sales-info']/h1/text()")[0]
'\n        Spector  Ivan\n       '
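A likely reason the lowercase form matches: lxml's HTML parser normalizes element names to lowercase, while XPath name tests are case-sensitive, so `H1` finds nothing in the parsed tree. Adding `/text()` also returns the text itself rather than an Element object. A minimal sketch on an inline snippet (the `SAMPLE` markup is made up for illustration and is not the real page):

```python
from lxml import html

# Hypothetical markup standing in for the real page; note the uppercase tags.
SAMPLE = (
    "<html><body><DIV class='sales-info'>"
    "<H1>\n  Spector  Ivan\n </H1>"
    "</DIV></body></html>"
)

tree = html.fromstring(SAMPLE)

# The uppercase name test does not match, because the parsed tag is 'h1'.
upper = tree.xpath("//div[@class='sales-info']/H1[1]")
# The lowercase name test matches and returns a list of Element objects.
lower = tree.xpath("//div[@class='sales-info']/h1[1]")
# /text() returns the text nodes as strings, whitespace included.
texts = tree.xpath("//div[@class='sales-info']/h1/text()")

print(upper)
print(lower)
print(texts)
```

The text still carries the surrounding whitespace from the markup, so it usually needs a `.strip()` before use.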
Thanks but that didn't do it:
IndexError: list index out of range
Odd, I just changed that one line and it "works" for me.
...
#testdata = tree.xpath("//div[@class='sales-info']/H1[1]")
testdata = tree.xpath("//div[@class='sales-info']/h1/text()")[0]
print('XPath data: ', testdata)
Output:
XPath data: Spector Ivan
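The IndexError reported earlier means the XPath returned an empty list for the HTML that was actually downloaded. The thread does not establish why; one possibility is that the fetched page differed between the two runs. A small sketch of a guard that avoids indexing an empty result and collapses the whitespace (`first_text` and the `SAMPLE` markup are hypothetical names introduced here, not part of the original code):

```python
from lxml import html


def first_text(tree, xpath_expr, default=None):
    """Return the first XPath result as whitespace-collapsed text, or default.

    Assumes xpath_expr selects text nodes (for example, it ends in /text()).
    """
    results = tree.xpath(xpath_expr)
    if not results:
        # Nothing matched: return the default instead of raising IndexError.
        return default
    return " ".join(str(results[0]).split())


# Hypothetical markup standing in for the downloaded page.
SAMPLE = "<html><body><div class='sales-info'><h1>\n  Spector  Ivan\n </h1></div></body></html>"
tree = html.fromstring(SAMPLE)

found = first_text(tree, "//div[@class='sales-info']/h1/text()")
missing = first_text(tree, "//div[@class='no-such-class']/h1/text()")

print("XPath data:", found)
print("Missing data:", missing)
```

When the helper returns the default, printing or saving the raw HTML that was fetched is a quick way to check whether the expected `sales-info` block is present at all.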