Posts: 16
Threads: 3
Joined: Mar 2019
I did a lot of googling on this topic and also tried a lot of solutions from Stack Overflow, but none of them worked. Currently I am using a paid proxy so that I can avoid being blocked during web scraping. My proxy needs authentication with a username and password along with the proxy port. Assume my proxy is user_name: "x", proxy_password: "abc", proxy_server: "abcdfr.com", proxy_port: 80. How do I use them in Selenium?
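For reference, selenium-wire (one common approach, not built into plain Selenium) takes those four values as a single URL of the form scheme://user:password@host:port. A minimal sketch assembling it from the placeholder values above:

```python
# Placeholder values from the question -- substitute your real proxy details.
proxy_user = "x"
proxy_password = "abc"
proxy_server = "abcdfr.com"
proxy_port = 80

# selenium-wire expects the credentials embedded in the proxy URL.
proxy_url = f"http://{proxy_user}:{proxy_password}@{proxy_server}:{proxy_port}"
print(proxy_url)  # http://x:abc@abcdfr.com:80

seleniumwire_options = {
    "proxy": {
        "http": proxy_url,
        "https": proxy_url,
        "no_proxy": "localhost,127.0.0.1",
    }
}
```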
Posts: 8,135
Threads: 159
Joined: Sep 2016
(Sep-10-2020, 01:56 PM)farhan275 Wrote: and also tried lot of solution from stack overflow
Maybe you should show what you have tried that did not work, so that people here don't waste time suggesting the same things.
Post some code that should work but did not, and what you get as an error.
Posts: 16
Threads: 3
Joined: Mar 2019
Sep-10-2020, 03:06 PM
(This post was last modified: Sep-10-2020, 03:06 PM by farhan275.)
Here is my full code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
from csv import writer
import time


def AddToCSV(row):
    # append one row to the output CSV; this can be used within your for loop
    with open("Output.csv", "a+", newline='') as output_file:
        csv_writer = writer(output_file)
        csv_writer.writerow(row)


# argument for incognito Chrome
option = Options()
option.add_argument("--incognito")
browser = webdriver.Chrome(options=option)

for page_num in range(1, 20):  # change the range as you want to
    url = "https://www.usine-digitale.fr/annuaire-start-up/?page={}".format(page_num)
    browser.get(url)
    print("page url : ", url)
    time.sleep(3)
    # Wait 20 seconds for page to load
    timeout = 20
    try:
        WebDriverWait(browser, timeout).until(
            EC.visibility_of_element_located((By.XPATH, "//div[@class='texteContenu3']"))
        )
    except TimeoutException:
        print("Timed out waiting for page to load")
        browser.quit()
    soup = BeautifulSoup(browser.page_source, "html.parser")
    product_items = soup.find_all("a", {"class": "contenu"})
    for item in product_items:
        item_url = item.get('href')
        print("https://www.usine-digitale.fr" + item_url)
        browser.get("https://www.usine-digitale.fr" + item_url)
        time.sleep(3)
        itm_soup = BeautifulSoup(browser.page_source, "html.parser")
        container = itm_soup.find_all("div", {"class": "contenuPage"})
        for contain in container:
            name_element = contain.find("h1", class_="titreFicheStartUp")
            name = name_element.get_text() if name_element else "No name found"
            description_element = contain.find("div", {"itemprop": "description"})
            description = description_element.get_text() if description_element else "No description found"
            product_element = contain.find("div", {"itemprop": "makesOffer"})
            product = product_element.get_text() if product_element else "No product found"
            creators_element = contain.find("div", {"itemprop": "founders"})
            creators = creators_element.get_text() if creators_element else "No creators found"
            domain_element = contain.find("a", {"itemprop": "sameAs"})
            domain = domain_element.get_text() if domain_element else "No domain found"
            telephone_element = contain.find("p", {"itemprop": "telephone"})
            telephone = telephone_element.get_text() if telephone_element else "No number found"
            email_element = contain.find("p", {"itemprop": "email"})
            email = email_element.get_text() if email_element else "No email found"
            time.sleep(3)
            print(name, description, product, creators, domain, telephone, email)
            row_list = [name, description, product, creators, domain, telephone, email]
            AddToCSV(row_list)

browser.quit()
Posts: 8,135
Threads: 159
Joined: Sep 2016
You said that you tried a lot of solutions but they didn't work.
I asked you to show what you have tried that did not work, so that people don't waste time suggesting the same things again. You don't show what you have tried.
Posts: 16
Threads: 3
Joined: Mar 2019
Sep-10-2020, 03:11 PM
(This post was last modified: Sep-10-2020, 03:11 PM by farhan275.)
I tried this:
from seleniumwire import webdriver
from selenium import webdriver

proxy = "username:password@ip:port"
options = {
    'proxy': {
        'http': proxy,
        'https': proxy,
        'no_proxy': 'localhost,127.0.0.1,dev_server:8080'
    }
}
driver = webdriver.Chrome(options=chrome_options, executable_path="path of chrome driver", seleniumwire_options=options)
Posts: 8,135
Threads: 159
Joined: Sep 2016
Try it without this line: from selenium import webdriver
Posts: 16
Threads: 3
Joined: Mar 2019
Posts: 8,135
Threads: 159
Joined: Sep 2016
Sep-10-2020, 03:39 PM
(This post was last modified: Sep-10-2020, 03:39 PM by buran.)
Looking at the docs, I think you need to have the https/http scheme in the proxy string:
from seleniumwire import webdriver

options = {
    'proxy': {
        'http': "http://username:password@ip:port",
        'https': "https://username:password@ip:port",
        'no_proxy': 'localhost,127.0.0.1,dev_server:8080'
    }
}
driver = webdriver.Chrome(options=chrome_options, executable_path="path of chrome driver", seleniumwire_options=options)
Note: you pass chrome_options, but it is not present in your code.
Also, post the full traceback in error tags if you get any errors.
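Putting the suggestion above into a self-contained form: the option-building part is plain Python and can be checked without a browser, while driver creation (commented out at the end) requires Chrome, a matching chromedriver, and selenium-wire installed. The credentials, host, and port are placeholders:

```python
def build_seleniumwire_options(user, password, host, port):
    # Credentials go inside the proxy URLs, per the selenium-wire docs.
    proxy = f"{user}:{password}@{host}:{port}"
    return {
        "proxy": {
            "http": f"http://{proxy}",
            "https": f"https://{proxy}",
            "no_proxy": "localhost,127.0.0.1,dev_server:8080",
        }
    }

seleniumwire_options = build_seleniumwire_options("username", "password", "1.2.3.4", 8080)
print(seleniumwire_options["proxy"]["http"])  # http://username:password@1.2.3.4:8080

# Wiring it up (requires selenium-wire, Chrome, and a matching chromedriver):
# from seleniumwire import webdriver
# from selenium.webdriver.chrome.options import Options
# chrome_options = Options()  # this was the chrome_options missing from the snippet
# driver = webdriver.Chrome(options=chrome_options,
#                           seleniumwire_options=seleniumwire_options)
```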
Posts: 16
Threads: 3
Joined: Mar 2019
Sep-10-2020, 07:38 PM
(This post was last modified: Sep-10-2020, 07:39 PM by farhan275.)
AttributeError: 'dict' object has no attribute 'to_capabilities'
Process finished with exit code 1
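That AttributeError usually means a plain dict reached Selenium's options= parameter, i.e. the seleniumwire options dict was passed as options= instead of seleniumwire_options=. This sketch reproduces the message without a browser (the dict content is a placeholder):

```python
# A seleniumwire-style options dict, as in the earlier snippets.
options = {"proxy": {"http": "http://user:pass@host:80"}}

# Selenium calls .to_capabilities() on whatever arrives via options=;
# a plain dict has no such method, hence the AttributeError.
try:
    options.to_capabilities()
except AttributeError as exc:
    print(exc)  # 'dict' object has no attribute 'to_capabilities'
```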
Posts: 8,135
Threads: 159
Joined: Sep 2016
Post the code that produces the error as well as the full traceback in error tags, not just the last line.
The code I suggested is exactly as in the docs: https://github.com/wkeeling/selenium-wire#proxies