Web scraping (selenium (i think))

**Larz60+** · Jan-25-2019, 09:07 PM

When scraping a new site, I like to download each page so that I can work on it offline until I perfect the code.

If I save for example browser.page_source, I get the page and all of the links, etc. which is helpful.
But what I'd really like to have is what is stored when from firefox, you use the 'Save Page As' from file menu,
which not only saves the page, but all of the supporting images, css files, javascript, etc. in a separate directory.

I could write code to do this, but not sure exactly what I need to download to be 'not too little', or 'not too much'

With selenium, when the page is brought up using:

        
              caps = webdriver.DesiredCapabilities().FIREFOX
            caps["marionette"] = True
browser.get(url)

the firefox menu is not shown, so clicking on 'Save Page As' is not an option.

the question: Does anyone know how to do this?
If not, does anyone know exactly what to download to be 'just enough'?

I found a package 'pywebcopy' which does a great job of downloading a page, and it's peripheral files,
but all of the links are missing in the html.

***metulburr*** · Jan-25-2019, 11:39 PM

can you do save page as via keyboard shortcuts?

        
              ActionChains(browser).key_down(Keys.COMMAND).send_keys("s").key_up(Keys.COMMAND).perform()

**Larz60+** · Jan-25-2019, 11:47 PM

I'll give it a go

***metulburr*** · (This post was last modified: Jan-26-2019, 01:26 PM by metulburr.)

Here seems to be a method (responses 2 and 3) for win and linux
https://stackoverflow.com/questions/1096...n-selenium

The linux one works for me with selenium to save as in firefox with a few modifications

        
              sudo apt-get install xautomation

        
              from subprocess import Popen, PIPE
 
save_sequence = b"""keydown Control_L
key S
keyup Control_L
keydown Return
"""
 
def keypress(sequence):
    p = Popen(['xte'], stdin=PIPE)
    p.communicate(input=sequence)
 
keypress(save_sequence)

it downloads the css, js, images in a directory

        
          
          
              
              metulburr@ubuntu:~/Downloads$ cd Twitter_files/
metulburr@ubuntu:~/Downloads/Twitter_files$ ls
0.commons.en.e39ce78c2d3da.js     saved_resource(5).html
8.pages_home.en.01b37cfab9b.js  saved_resource(6).html
analytics.js                             saved_resource(7).html
default_profile_bigger.png               saved_resource(8).html
default_profile_normal.png               saved_resource(9).html
delight_prompt.png                       saved_resource.html
init.en.244fa41b6a57.js          twitter_core.bundle.css
Qtz8LqPx_bigger.jpg                      twitter_more_1.bundle.css
saved_resource(1).html                   twitter_more_2.bundle.css
saved_resource(2).html                   twitter_profile_editing.bundle.css
saved_resource(3).html                   Vb8k670S_bigger.png
saved_resource(4).html                   y-kyowAV_bigger.jpg
metulburr@ubuntu:~/Downloads/Twitter_files$ 

            

        
      

**Larz60+** · Jan-26-2019, 03:12 AM

Thanks, Will try thin in the A.M. and let you know how I make out.
I was also thinking that I could use pywebcopy to get all of the peripheral files, and
just save browser.page_source to get main page html with the links in tact.

The method you show looks promising, Is xte a pipe or a filename?

***metulburr*** · Jan-26-2019, 01:22 PM

(Jan-26-2019, 03:12 AM)Larz60+ Wrote: Is xte a pipe or a filename?

its a program

        
              sudo apt-get install xautomation

https://linux.die.net/man/1/xte

**Larz60+** · (This post was last modified: Jan-26-2019, 11:45 PM by Larz60+.)

trying to get your subprocess to work
when executing xte.communicate, getting following error:

Error:Unknown command '        key S'
Unknown command '        keyup Control_L'
Unknown command '        keydown Return'
Unknown command '       '

(executing keypress)

code:

        
              def save_page_as(self):
    save_sequence = b"""keydown Control_L
    key S
    keyup Control_L
    keydown Return
    """
    def keypress(save_sequence):
        p = Popen(['xte'], stdin=PIPE)
        p.communicate(input=save_sequence)
         
    keypress(save_sequence)

Missing communicate?

***metulburr*** · (This post was last modified: Jan-26-2019, 11:55 PM by metulburr.)

docstrings save the space before it including the indentation for the function. Which is why the first one is not in the error.

        
              >>> s = '''a
...     b
... c
... '''
>>> s
'a\n    b\nc\n'

**Larz60+** · (This post was last modified: Jan-27-2019, 12:22 AM by Larz60+.)

Ok,
Now it executes but never stops, got local denial of service,
could not kill and had to disconnect my network.

I'm not knowledgeable of the xautomation process.
Guess it's time to learn.

***metulburr*** · Jan-27-2019, 12:58 AM

you might need a keyup Return to stop it?

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Scraping div tags with selenium, need help	hfakoor2	1	1,940	Mar-12-2023, 08:31 AM Last Post: hfakoor2
	Web scraping cookie in URL blocks selenium	Alex06	2	3,329	Jan-10-2021, 01:43 PM Last Post: Alex06
	Web Page not opening while web scraping through python selenium	sumandas89	4	12,268	Nov-19-2018, 02:47 PM Last Post: snippsat
	web scraping with selenium and bs4	Prince_Bhatia	2	4,736	Sep-18-2018, 10:59 AM Last Post: Prince_Bhatia
	scraping javascript websites with selenium	DoctorEvil	1	4,099	Jun-08-2018, 06:40 PM Last Post: DoctorEvil
	Combining selenium and beautifulsoup for web scraping	sumandas89	3	13,376	Jan-30-2018, 02:14 PM Last Post: metulburr
	web scraping using selenium	sumandas89	3	4,548	Jan-05-2018, 01:45 PM Last Post: metulburr
	Error in Selenium: CRITICAL:root:Selenium module is not installed...Exiting program.	AcszE	1	4,512	Nov-03-2017, 08:41 PM Last Post: metulburr

Web scraping (selenium (i think))

User Panel Messages

Announcements