Python Forum

Full Version: Saving html page and reloading into selenium while developing all xpaths
I wanted to save the homepage that I was scraping so that I wouldn't have to fetch it every time I made changes during development.
First I tried this (filename is a pathlib Path object):
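import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# spath (my paths module) and url are defined elsewhere
chrome_options = Options()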
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')
filename = spath.savedhtmlpath / 'homepage.html'

browser = webdriver.Chrome(chrome_options=chrome_options, executable_path=r'/home/Larz60p/Drivers//chromedriver')
#--| Parse
browser.get(url)
html = browser.page_source
with filename.open('w') as fp:
    fp.write(html)
time.sleep(2)
Then, when reading it back:
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')
filename = spath.savedhtmlpath / 'homepage.html'

path = f'file://{filename.resolve()}'
browser.get(path)
So far so good. But when I tried to extract information with xpath, I didn't get an error, yet I couldn't find what I was looking for either. I wasn't able to determine what the issue was.

So...
This is a bit of a hack, but it works without flaw:
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')

filename = spath.savedhtmlpath / 'homepage.html'
if not filename.exists():
    response = requests.get(url)
    if response.status_code == 200:
        with filename.open('wb') as fp:
            fp.write(response.content)

path = f'file://{filename.resolve()}'
browser.get(path)
I am almost satisfied using this method, especially as I will only use it for development, but my gut tells me there is a better way.

Anyone know of a better solution, or why 'html = browser.page_source' doesn't work?
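A side note on the file:// URL built with an f-string above: pathlib can produce it directly with as_uri(), which also percent-encodes characters such as spaces. This is only a sketch; the helper name file_url and the "saved html" directory are made up for illustration and are not part of the original code.

```python
from pathlib import Path


def file_url(path):
    """Return a file:// URL for a local path (hypothetical helper)."""
    # as_uri() requires an absolute path, so resolve() first.
    return Path(path).resolve().as_uri()


# Illustrative location only; the space shows the percent-encoding.
saved_page = Path("saved html") / "homepage.html"
url = file_url(saved_page)
print(url)
```

The result can be passed to browser.get() in place of the f-string version.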
(Sep-10-2018, 10:26 AM)Larz60+ Wrote: or why 'html = browser.page_source' doesn't work?
That is the normal method I use for getting the source. There might be something weird with whatever site you are going to (like an iframe). You do, of course, have to make sure that the page fully loads between browser.get and browser.page_source. I've been bitten by that so many times.
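One way to act on the advice about letting the page finish loading is sketched below. It assumes the page can be treated as loaded once two consecutive reads of the source are identical, which is a heuristic, not a guarantee. The helper name wait_until_stable and its interval and timeout defaults are made up for illustration. Selenium's own WebDriverWait is the more usual tool when you know which element to wait for.

```python
import time


def wait_until_stable(get_source, interval=0.5, timeout=15.0):
    """Poll get_source() until two consecutive reads match, then return that text.

    get_source is any zero-argument callable returning a string,
    for example: lambda: browser.page_source
    Raises TimeoutError if the text is still changing after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    previous = get_source()
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = get_source()
        if current == previous:
            # No change since the last read: assume the page has settled.
            return current
        previous = current
    raise TimeoutError("page source was still changing at the timeout")
```

With a live driver the call would be html = wait_until_stable(lambda: browser.page_source), placed between browser.get(url) and the file write.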
I tried it with a couple of pages, and I couldn't find anything using xpath.
Once I downloaded with requests and saved that, the same xpath worked fine.
Quote: There might be something weird with whatever site you are going to (like an iframe). You do, of course, have to make sure that the page fully loads between browser.get and browser.page_source. I've been bitten by that so many times.
There is something definitely weird about this site. It was built with Wix, and is full of Ajax (I think) code.
That took a while to get through, but the hack seems to work fine. I'd be curious as to why. Perhaps later I'll try to diff the two files, but there's no time for that now.
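For anyone who does get around to diffing the two saved files, here is a minimal sketch using difflib from the standard library. The function names and the two file names in the usage comment are hypothetical. Note that if the markup sits on a few very long lines, a line-based diff will be coarse.

```python
import difflib
from pathlib import Path


def diff_html_text(a_text, b_text, a_label="selenium", b_label="requests", context=2):
    """Return a unified diff of two HTML strings as a list of lines."""
    return list(
        difflib.unified_diff(
            a_text.splitlines(),
            b_text.splitlines(),
            fromfile=a_label,
            tofile=b_label,
            n=context,
            lineterm="",
        )
    )


def diff_html_files(path_a, path_b):
    """Read two saved pages and diff them; undecodable bytes are replaced."""
    a_text = Path(path_a).read_text(encoding="utf-8", errors="replace")
    b_text = Path(path_b).read_text(encoding="utf-8", errors="replace")
    return diff_html_text(a_text, b_text, str(path_a), str(path_b))


# Hypothetical usage, assuming both versions of the page were saved:
# for line in diff_html_files("homepage_selenium.html", "homepage_requests.html"):
#     print(line)
```

An empty list means the two files have identical lines.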
(Sep-10-2018, 12:05 PM)Larz60+ Wrote: It was built with wix, and full of Ajax (I think) code.
It will be a problem getting all the source from Wix.
Here’s what Wix says.
Quote:Your Wix site and all of its content is hosted exclusively on Wix’s servers, and cannot be transferred elsewhere.
Specifically, it is not possible to export or embed files, pages or sites, created using the Wix Editor or ADI, to another external destination or host.
It's a good decision. I agree with this.