Python Forum

Full Version: Clicking Every Page and Attachment on Website
Hello,
I need to write code that navigates to a website, finds every page on the site, and clicks through to each one. I would also like it to open any file attachments and then close them. Pages are layered (i.e. clicking one page leads to more new pages to open).

Selenium seems like the obvious choice. I have code to log in and could go to pages manually via XPaths, but there are too many pages to hardcode each one. I have seen a few people discussing similar projects, but nothing seems to match my case.

xpath examples from the site:
//*[@id="site-subnav"]/div[2]/div/div[1]/nav/ul[2]/li[1]/div/div/div[2]/div/span
//*[@id="infobar-290_1"]/div/div[3]/div/a

pdf file xpath example: //*[@id="item-45"]/div/div/div/div[4]/section/div[2]/div[2]/div[3]/div[2]/div/div[2]/a

I can provide more info as needed, but at this point I just need to know where to start.
Yes, Selenium is fine for this.
A common mistake when you are new to this is trying to do too much at once.
Test one thing at a time and check that the parsing or the download works.
If you navigate to another page after clicking a link, the elements you found on the opening page are no longer valid, because that page's DOM is gone. They have to be located again after going back.
For PDF downloads you may need to set up Capabilities and ChromeOptions.
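
To illustrate the point about losing the opening page's DOM, here is a minimal sketch of one way around it: copy the href strings from each page before navigating, then visit them one by one with driver.get() instead of clicking. The names crawl, selenium_links and max_pages are made up for this example. It assumes the pages are reachable through ordinary <a href> links on the same host. Menu items that are <span> elements driven by JavaScript (like the first XPath in the question) would still need real clicks.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(start_url, get_links, max_pages=500):
    """Breadth-first visit of every same-host URL reachable from start_url.

    get_links(url) is expected to load the page and return its href strings.
    Returns the URLs in the order they were visited.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            if not href:
                continue
            # Resolve relative links and drop "#fragment" so a page is queued only once.
            absolute = urldefrag(urljoin(url, href)).url
            # Links to other hosts, and mailto:/javascript: links, have a different netloc.
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


def selenium_links(driver):
    """Build a get_links callable backed by an existing Selenium driver (hypothetical helper)."""
    from selenium.webdriver.common.by import By  # lazy import; Selenium 4 style locator

    def get_links(url):
        driver.get(url)
        anchors = driver.find_elements(By.CSS_SELECTOR, "a[href]")
        # Copy the strings now; the elements become stale after the next driver.get().
        return [a.get_attribute("href") for a in anchors]

    return get_links
```

With a logged-in driver this would be called as crawl(start_url, selenium_links(driver)). Keeping max_pages small while testing follows the "one thing at a time" advice.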

Basic setup that I use.
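from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

#--| Setup
options = Options()
#options.add_argument("--headless")
#options.add_argument("--window-size=1980,1020")
# Selenium 4 takes the chromedriver path through a Service object
service = Service(executable_path=r'C:\cmder\bin\chromedriver.exe')
driver = webdriver.Chrome(service=service, options=options)
#--| Parse or automation
url = "https://python-forum.io/"
driver.get(url)
# Selenium 4 replaces find_elements_by_css_selector with find_elements(By.CSS_SELECTOR, ...)
title = driver.find_elements(By.CSS_SELECTOR, "div.card-body.d-flex.p-0 > div > p")
print(title[0].text)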
Output:
Welcome, we are a dedicated Python forum. We encourage back and forth discussions based on the topic of the thread. ....
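For downloads you may need to add something like this before the driver is created.
# Chrome reads download settings from "prefs", not from command-line arguments
options.add_experimental_option("prefs", {
    "download.default_directory": "path/to/downloads/pics/folder",  # an absolute path is safest
    "download.prompt_for_download": False,  # save without asking where
    "plugins.always_open_pdf_externally": True,  # download PDFs instead of opening the viewer
})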
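
Once downloads go to a known folder, "open the attachment and close it" can be checked by waiting for the file to land there. Below is a rough sketch of that idea, not a definitive implementation. The function name wait_for_downloads and its parameters are made up for this example. It relies on Chrome giving in-progress downloads a ".crdownload" suffix.

```python
import os
import time


def wait_for_downloads(folder, timeout=30.0, poll=0.5):
    """Wait until folder holds at least one file and no partial Chrome downloads.

    Returns the sorted file names, or raises TimeoutError if the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = sorted(os.listdir(folder))
        # Chrome writes in-progress downloads with a ".crdownload" suffix.
        if names and not any(name.endswith(".crdownload") for name in names):
            return names
        if time.monotonic() >= deadline:
            raise TimeoutError(f"downloads in {folder!r} did not finish within {timeout}s")
        time.sleep(poll)
```

Call it right after driver.get() on a PDF link, with the same folder that was set as the download directory. Using an empty folder per run makes it easy to see which file belongs to which link.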