I wanted to save the homepage that I was scraping so that I wouldn't have to fetch it every time I was making changes during development.
First I tried this (filename is a pathlib Path object):

```python
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# spath and url are defined elsewhere in the script
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')
filename = spath.savedhtmlpath / 'homepage.html'
# Selenium 3 style keyword arguments
browser = webdriver.Chrome(chrome_options=chrome_options,
                           executable_path=r'/home/Larz60p/Drivers//chromedriver')

#--| Parse
browser.get(url)
html = browser.page_source
with filename.open('w') as fp:
    fp.write(html)
time.sleep(2)
```

and then, when reading it back:

```python
# same imports as above; browser is created the same way
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')
filename = spath.savedhtmlpath / 'homepage.html'

# path points at the saved file (defined elsewhere)
browser.get(path)
```

So far so good. But when I tried to extract information with XPath, I didn't get an error, yet I couldn't find what I was looking for either. I wasn't able to determine what the issue was.

So...
This is a bit of a hack, but it works without flaw:

```python
import requests

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')
filename = spath.savedhtmlpath / 'homepage.html'

# Fetch the page with requests only if there is no saved copy yet
if not filename.exists():
    response = requests.get(url)
    if response.status_code == 200:
        with filename.open('wb') as fp:
            fp.write(response.content)

browser.get(path)
```

I am almost satisfied with this method, especially as I will only use it during development, but my gut tells me there is a better way.
Does anyone know of a better solution, or why 'html = browser.page_source' doesn't work?
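For reference, the fetch-once-then-reuse step can be sketched on its own, without Selenium. This is only a minimal sketch under assumptions: `cached_html`, `local_uri`, and the `fetch` callable are hypothetical names that are not part of my script, and the fake fetcher stands in for a real download.

```python
import tempfile
from pathlib import Path
from typing import Callable


def cached_html(filename: Path, fetch: Callable[[], bytes]) -> bytes:
    """Return the saved page, calling fetch() only when no copy exists yet."""
    if not filename.exists():
        # Create the target directory on first use, then store the raw bytes
        filename.parent.mkdir(parents=True, exist_ok=True)
        filename.write_bytes(fetch())
    return filename.read_bytes()


def local_uri(filename: Path) -> str:
    """Build a file:// URI for the saved page (as_uri needs an absolute path)."""
    return filename.resolve().as_uri()


if __name__ == "__main__":
    calls = []

    def fake_fetch() -> bytes:
        # Hypothetical stand-in for a real download of the homepage
        calls.append(1)
        return b"<html><body><h1>Home</h1></body></html>"

    with tempfile.TemporaryDirectory() as tmp:
        page = Path(tmp) / "saved" / "homepage.html"
        cached_html(page, fake_fetch)   # first call fetches and saves
        cached_html(page, fake_fetch)   # second call reads the saved copy
        print(len(calls))
        print(local_uri(page))
```

In the real script, `fetch` could be something like `lambda: requests.get(url).content`, and the string from `local_uri(filename)` is the kind of value that could be handed to `browser.get()`. Writing bytes rather than text avoids any re-encoding between download and save, which may or may not be related to the `page_source` problem.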