Aug-23-2021, 06:24 AM
Hi all,
As a relative newbie to Python and web scraping, I've been trying to learn more about headers and proxies, specifically how to rotate them from an Excel list I've created. I've watched lots of different videos and read posts, but I'm a little stuck on how to apply them.
Starting off with the header components, I've an Excel file with a collection of various headers. The URL will have many pages, so the goal is to try and have a different header for each page.
I'm trying to open the Excel file and grab a header from the cell and use it to form the variable for headers.
Here's what I started with:
```python
import requests
import pandas as pd
from bs4 import BeautifulSoup
import openpyxl

url = 'mytesturl'

wb = openpyxl.load_workbook('RandomUserAgentList.xlsx')
ws = wb['Sheet1']

headers = []
for cell in ws['A']:
    random_header_variable = cell.value
    headers = "{'User-Agent': " + random_header_variable + "}"
    print(headers)
```
Output:
```
{'User-Agent': Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36}
{'User-Agent': Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36}
{'User-Agent': Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36 OPR/66.0.3515.72}
{'User-Agent': Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36}
{'User-Agent': Mozilla/5.0 (X11; CrOS aarch64 13421.99.0) AppleWebKit/537.36 (KHTML; like Gecko) Chrome/86.0.4240.198 Safari/537.36}
```
So that looks good. But then from there, I'm not sure how to incorporate it into requests for the page. When I've used:
```python
import requests
import pandas as pd
from bs4 import BeautifulSoup
import openpyxl

url = 'mytesturl'

wb = openpyxl.load_workbook('RandomUserAgentList.xlsx')
ws = wb['Sheet1']

headers = []
for cell in ws['A']:
    random_header_variable = cell.value
    headers = "{'User-Agent': " + random_header_variable + "}"
    r = requests.get(url, headers=headers)
```
Error:
```
Traceback (most recent call last):
  File "C:/Users/test_headers.py", line 15, in <module>
    r = requests.get(url, headers = headers)
  File "C:\Users\anaconda3\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\anaconda3\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\anaconda3\lib\site-packages\requests\sessions.py", line 528, in request
    prep = self.prepare_request(req)
  File "C:\Users\anaconda3\lib\site-packages\requests\sessions.py", line 456, in prepare_request
    p.prepare(
  File "C:\Users\anaconda3\lib\site-packages\requests\models.py", line 317, in prepare
    self.prepare_headers(headers)
  File "C:\Users\anaconda3\lib\site-packages\requests\models.py", line 449, in prepare_headers
    for header in headers.items():
AttributeError: 'str' object has no attribute 'items'
```
I'm presuming it's because the format is wrong? I also tried to format it:
```python
headers = f"{'User-Agent': {random_header_variable} }"
```
Error:
```
    headers = f"{'User-Agent': {random_header_variable} }"
ValueError: Invalid format specifier
```
Could someone please show me how to format this correctly? Thanks a lot.
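For reference, here is a minimal sketch of one way this might be approached. The traceback shows requests calling `headers.items()`, so it wants a dict, not a string that merely looks like one. The f-string attempt most likely fails because the outer braces are parsed as a replacement field, with everything after the colon treated as a format specifier. Doubling the braces would avoid that error, but the result would still be a string. The helper names (`build_headers`, `header_cycle`, `load_user_agents`, `fetch_pages`) and the `timeout` value are hypothetical and not from the original script. The file name, sheet name, and column are taken from the code above.

```python
from itertools import cycle


def build_headers(user_agent):
    """Return a real dict, which is what requests expects for headers."""
    return {"User-Agent": str(user_agent).strip()}


def header_cycle(user_agents):
    """Return an endless iterator of header dicts, skipping empty cells."""
    cleaned = [ua for ua in user_agents if ua and str(ua).strip()]
    if not cleaned:
        raise ValueError("no user agents found")
    return cycle([build_headers(ua) for ua in cleaned])


def load_user_agents(path="RandomUserAgentList.xlsx", sheet="Sheet1"):
    """Read column A of the workbook (assumed layout: one user agent per row)."""
    import openpyxl  # imported lazily so the helpers above work without it

    wb = openpyxl.load_workbook(path)
    return [cell.value for cell in wb[sheet]["A"]]


def fetch_pages(urls, user_agents):
    """Request each URL with the next header dict in the rotation."""
    import requests  # imported lazily for the same reason as openpyxl

    headers_iter = header_cycle(user_agents)
    for url in urls:
        # timeout=10 is an arbitrary choice for this sketch
        yield requests.get(url, headers=next(headers_iter), timeout=10)
```

Under these assumptions, something like `for r in fetch_pages(page_urls, load_user_agents()): ...` would send each page request with the next user agent from the spreadsheet, wrapping around when the list runs out. Using `random.choice` on the cleaned list instead of `cycle` would be an alternative if a random pick per page is preferred.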