Web scraping (selenium (i think)) - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: Web scraping (selenium (i think)) (/thread-15647.html)

Pages: 1 2
Web scraping (selenium (i think)) - Larz60+ - Jan-25-2019

When scraping a new site, I like to download each page so that I can work on it offline until I perfect the code. If I save, for example, browser.page_source, I get the page and all of the links, etc., which is helpful. But what I'd really like to have is what is stored when, from Firefox, you use 'Save Page As' in the File menu, which saves not only the page but also all of the supporting images, CSS files, JavaScript, etc., in a separate directory. I could write code to do this, but I'm not sure exactly what I need to download to be 'not too little' or 'not too much'.

With selenium, when the page is brought up using:

    caps = webdriver.DesiredCapabilities().FIREFOX
    caps["marionette"] = True
    browser.get(url)

the Firefox menu is not shown, so clicking on 'Save Page As' is not an option.

The question: Does anyone know how to do this? If not, does anyone know exactly what to download to be 'just enough'? I found a package, 'pywebcopy', which does a great job of downloading a page and its peripheral files, but all of the links are missing in the html.

RE: Web scraping (selenium (i think)) - metulburr - Jan-25-2019

Can you do save page as via keyboard shortcuts?
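As a starting point for the "what do I need to download?" question, the supporting files Firefox saves are just the URLs referenced by img/script/link tags in the page source. A minimal sketch with only the standard library (the class name and the sample HTML are illustrative, not from the thread):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class AssetCollector(HTMLParser):
    """Collect absolute URLs of supporting files (images, JS, CSS) from a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.assets.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.assets.append(urljoin(self.base_url, attrs["href"]))

# Demo on a small inline page; with selenium you would feed browser.page_source.
html = """<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="app.js"></script>
</head><body><img src="logo.png"></body></html>"""

collector = AssetCollector("https://example.com/page/")
collector.feed(html)
print(collector.assets)
```

Each collected URL could then be fetched and saved next to the html, which approximates what 'Save Page As' stores, though inline CSS url() references and dynamically loaded resources would need extra handling.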
    ActionChains(browser).key_down(Keys.COMMAND).send_keys("s").key_up(Keys.COMMAND).perform()

RE: Web scraping (selenium (i think)) - Larz60+ - Jan-25-2019

I'll give it a go.

RE: Web scraping (selenium (i think)) - metulburr - Jan-26-2019

Here seems to be a method (responses 2 and 3) for win and linux: https://stackoverflow.com/questions/10967408/save-a-web-page-with-python-selenium

The linux one works for me with selenium to save as in firefox, with a few modifications:

    sudo apt-get install xautomation

    from subprocess import Popen, PIPE

    save_sequence = b"""keydown Control_L
    key S
    keyup Control_L
    keydown Return
    """

    def keypress(sequence):
        p = Popen(['xte'], stdin=PIPE)
        p.communicate(input=sequence)

    keypress(save_sequence)

It downloads the css, js, and images into a directory:

    metulburr@ubuntu:~/Downloads$ cd Twitter_files/
    metulburr@ubuntu:~/Downloads/Twitter_files$ ls
    0.commons.en.e39ce78c2d3da.js   saved_resource(5).html
    8.pages_home.en.01b37cfab9b.js  saved_resource(6).html
    analytics.js                    saved_resource(7).html
    default_profile_bigger.png      saved_resource(8).html
    default_profile_normal.png      saved_resource(9).html
    delight_prompt.png              saved_resource.html
    init.en.244fa41b6a57.js         twitter_core.bundle.css
    Qtz8LqPx_bigger.jpg             twitter_more_1.bundle.css
    saved_resource(1).html          twitter_more_2.bundle.css
    saved_resource(2).html          twitter_profile_editing.bundle.css
    saved_resource(3).html          Vb8k670S_bigger.png
    saved_resource(4).html          y-kyowAV_bigger.jpg
    metulburr@ubuntu:~/Downloads/Twitter_files$

RE: Web scraping (selenium (i think)) - Larz60+ - Jan-26-2019

Thanks, will try this in the A.M. and let you know how I make out. I was also thinking that I could use pywebcopy to get all of the peripheral files, and just save browser.page_source to get the main page html with the links intact.

The method you show looks promising. Is xte a pipe or a filename?
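Since xte reads its keystroke script from stdin, the keypress helper above can be made a little more defensive with a timeout, so a stuck dialog can't hang the script forever. A sketch, assuming xautomation provides xte as in the post; the cmd parameter is an addition for illustration and testing, not part of the original:

```python
from subprocess import Popen, PIPE, TimeoutExpired

def keypress(sequence, cmd=("xte",), timeout=10):
    """Pipe an xte keystroke script to the program's stdin, with a timeout.

    cmd defaults to xte (from the xautomation package); it is
    parameterized here only so the helper can be exercised without X11.
    """
    p = Popen(list(cmd), stdin=PIPE)
    try:
        p.communicate(input=sequence, timeout=timeout)
    except TimeoutExpired:
        p.kill()         # don't leave the child running if it hangs
        p.communicate()  # reap the killed process
        raise
    return p.returncode
```

Returning the exit code also makes it possible to detect when xte itself fails, e.g. when no X display is available.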
RE: Web scraping (selenium (i think)) - metulburr - Jan-26-2019

(Jan-26-2019, 03:12 AM)Larz60+ Wrote: Is xte a pipe or a filename?

It's a program:

    sudo apt-get install xautomation

https://linux.die.net/man/1/xte

RE: Web scraping (selenium (i think)) - Larz60+ - Jan-26-2019

Trying to get your subprocess to work; when executing xte communicate, I'm getting the following error (executing keypress). Code:

    def save_page_as(self):
        save_sequence = b"""keydown Control_L
        key S
        keyup Control_L
        keydown Return
        """

        def keypress(save_sequence):
            p = Popen(['xte'], stdin=PIPE)
            p.communicate(input=save_sequence)

        keypress(save_sequence)

Missing communicate?

RE: Web scraping (selenium (i think)) - metulburr - Jan-26-2019

Triple-quoted strings keep the whitespace before each line, including the function's indentation, which is why the first line is the only one not in the error.

    >>> s = '''a
    ...     b
    ... c
    ... '''
    >>> s
    'a\n    b\nc\n'

RE: Web scraping (selenium (i think)) - Larz60+ - Jan-27-2019

Ok, now it executes but never stops. I got a local denial of service, could not kill it, and had to disconnect my network. I'm not knowledgeable about the xautomation process; guess it's time to learn.

RE: Web scraping (selenium (i think)) - metulburr - Jan-27-2019

You might need a keyup Return to stop it?
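Putting the two fixes from this exchange together, a hedged sketch of building the sequence inside a method without either pitfall: textwrap.dedent strips the function-body indentation that the triple-quoted string keeps, and xte's `key` command presses and releases in one step, so Return is not left held down (the function name is illustrative):

```python
from textwrap import dedent

def build_save_sequence():
    """Build an xte script for Ctrl+S followed by Enter.

    dedent() removes the leading indentation a triple-quoted string keeps
    inside a function, and `key Return` presses *and* releases the key,
    avoiding the never-ending keypress described above.
    """
    return dedent("""\
        keydown Control_L
        key S
        keyup Control_L
        key Return
        """).encode()
```

The resulting bytes can be fed to the keypress helper shown earlier in the thread.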