Python Forum
Saving html page and reloading into selenium while developing all xpaths
#1
I wanted to save the homepage that I was scraping so that I wouldn't have to fetch it every time I was making changes during development.
First I tried this (filename is a pathlib Path object):
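import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')
filename = spath.savedhtmlpath / 'homepage.html'  # spath is my own paths module

# Selenium 4 style: the driver path goes through Service, the options through options=
browser = webdriver.Chrome(service=Service(r'/home/Larz60p/Drivers/chromedriver'), options=chrome_options)
#--| Parse
browser.get(url)
html = browser.page_source  # serialized DOM at the moment of the call
with filename.open('w', encoding='utf-8') as fp:
    fp.write(html)
time.sleep(2)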
Then, when reading the file back:
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')
filename = spath.savedhtmlpath / 'homepage.html'

# browser is a webdriver.Chrome instance created as in the first snippet
path = f'file://{filename.resolve()}'
browser.get(path)
So far so good. But when I tried to extract information with xpath, I didn't get an error, yet I couldn't find what I was looking for either. I wasn't able to determine what the issue was.
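One detail worth ruling out at this step is the file URL itself. The sketch below uses only the standard library and builds the URL with `Path.as_uri()` instead of an f-string, so spaces and other special characters in the path get percent-encoded. The helper name `to_file_url` and the example path are made up for illustration; this may well not be the cause of the xpath problem.

```python
from pathlib import Path


def to_file_url(path: Path) -> str:
    """Return a file:// URL for a local path, suitable for browser.get()."""
    # resolve() makes the path absolute; as_uri() requires an absolute path
    # and percent-encodes characters such as spaces.
    return path.resolve().as_uri()


if __name__ == "__main__":
    # Hypothetical path, only to show the encoding of the space.
    print(to_file_url(Path("saved html") / "homepage.html"))
```

With the snippets above, that would be `browser.get(to_file_url(filename))`.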

So...
This is a bit of a hack, but it works without flaw:
import requests

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')

filename = spath.savedhtmlpath / 'homepage.html'
# Download the raw HTML once; later runs reuse the saved copy
if not filename.exists():
    response = requests.get(url)
    if response.status_code == 200:
        with filename.open('wb') as fp:
            fp.write(response.content)

# browser is a webdriver.Chrome instance created as in the first snippet
path = f'file://{filename.resolve()}'
browser.get(path)
I am almost satisfied using this method, especially as I will only use it for development, but my gut tells me there is a better way.

Does anyone know of a better solution, or why 'html = browser.page_source' doesn't work?
#2
(Sep-10-2018, 10:26 AM)Larz60+ Wrote: or why 'html = browser.page_source' doesn't work?
That is the normal method I use for getting the source. There might be something weird with whatever site you are going to (like an iframe). You do of course have to make sure that the page fully loads between browser.get and browser.page_source. I've been bitten by that so many times.
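As a rough illustration of "make sure the page fully loads", here is a minimal sketch that polls the page source until two consecutive reads are identical. The helper `wait_for_stable_source` and its parameters are hypothetical, not part of Selenium; it takes any callable that returns the source, so it can be tried without a browser. An unchanged source is only a heuristic for "finished loading".

```python
import time
from typing import Callable


def wait_for_stable_source(
    get_source: Callable[[], str],
    timeout: float = 10.0,
    interval: float = 0.5,
) -> str:
    """Poll get_source() until two consecutive reads match, then return it.

    Raises TimeoutError if the source is still changing after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    previous = get_source()
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = get_source()
        if current == previous:
            # Nothing changed during one interval: treat the page as settled.
            return current
        previous = current
    raise TimeoutError("page source was still changing when the timeout expired")
```

With a live driver this would be called as `html = wait_for_stable_source(lambda: browser.page_source)`. Selenium's own `WebDriverWait`, waiting for a specific element you expect on the page, is usually the more precise option when you know what to wait for.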
#3
I tried it with a couple of pages, and I couldn't find anything using xpath.
Once I downloaded with requests and saved that, the same xpath worked fine.
Quote:There might be something weird with whatever site you are going to (like an iframe). You do of course have to make sure that the page fully loads between browser.get and browser.page_source. I've been bitten by that so many times.
There is definitely something weird about this site. It was built with Wix and is full of Ajax (I think) code.
It took a while to get through that, but the hack seems to work fine. I'm curious as to why; perhaps later I'll try to diff the two files, but there's no time for that now.
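For the diff mentioned above, a minimal sketch with the standard library's `difflib` could look like this. The function name `diff_html` and the file names in the usage note are assumptions for illustration. Minified HTML that sits on a few very long lines will produce a diff that is hard to read, so pretty-printing both files first may be needed.

```python
import difflib
from pathlib import Path


def diff_html(path_a: Path, path_b: Path, context: int = 2) -> list[str]:
    """Return a unified diff (as a list of lines) between two saved HTML files."""
    # errors="replace" keeps the comparison going if a file is not valid UTF-8.
    a_lines = path_a.read_text(encoding="utf-8", errors="replace").splitlines()
    b_lines = path_b.read_text(encoding="utf-8", errors="replace").splitlines()
    return list(
        difflib.unified_diff(
            a_lines,
            b_lines,
            fromfile=str(path_a),
            tofile=str(path_b),
            n=context,
            lineterm="",
        )
    )
```

For example, `print("\n".join(diff_html(Path("selenium.html"), Path("requests.html"))))` would show the lines that exist only in the page_source copy (prefixed with `-`) or only in the requests copy (prefixed with `+`).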
#4
(Sep-10-2018, 12:05 PM)Larz60+ Wrote: It was built with wix, and full of Ajax (I think) code.
It will be a problem getting all the source from Wix.
Here’s what Wix says.
Quote:Your Wix site and all of its content is hosted exclusively on Wix’s servers, and cannot be transferred elsewhere.
Specifically, it is not possible to export or embed files, pages or sites, created using the Wix Editor or ADI, to another external destination or host.
#5
It's a good decision. I agree with this.