Python Forum

Full Version: Using Excel Cell As A Variable In A Loop
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Hi all,

As a relative newbie to Python and webscrapping, I've been trying to learn more about headers and proxies- in terms of rotating them from a excel list I've created. I've watched lots of different vids and read posts, but I'm a little stuck in it's application.

Starting off with the header components, I've an Excel file with a collection of various headers. The URL will have many pages, so the goal is to try and have a different header for each page.

I'm trying to open the Excel file and grab a header from the cell and use it to form the variable for headers.

Here's what I started with:

import requests
import pandas as pd
from bs4 import BeautifulSoup
import openpyxl

url = 'mytesturl'

wb = openpyxl.load_workbook('RandomUserAgentList.xlsx')
ws = wb['Sheet1']
headers = []

for cell in ws['A']:
    random_header_variable = cell.value
    headers = "{'User-Agent': " + random_header_variable + "}"
    print(headers)
The output from this is:
Output:
{'User-Agent': Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36} {'User-Agent': Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36} {'User-Agent': Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36 OPR/66.0.3515.72} {'User-Agent': Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36} {'User-Agent': Mozilla/5.0 (X11; CrOS aarch64 13421.99.0) AppleWebKit/537.36 (KHTML; like Gecko) Chrome/86.0.4240.198 Safari/537.36}
So that looks good.

But then from there, I'm not sure how to incorporate it into requests for the page. When I've used:

import requests
import pandas as pd
from bs4 import BeautifulSoup
import openpyxl

url = 'mytesturl'

wb = openpyxl.load_workbook('RandomUserAgentList.xlsx')
ws = wb['Sheet1']
headers = []

for cell in ws['A']:
    random_header_variable = cell.value
    headers = "{'User-Agent': " + random_header_variable + "}"
    r = requests.get(url, headers = headers)
I get a whole series of errors:
Error:
Traceback (most recent call last): File "C:/Users/test_headers.py", line 15, in <module> r = requests.get(url, headers = headers) File "C:\Users\anaconda3\lib\site-packages\requests\api.py", line 76, in get return request('get', url, params=params, **kwargs) File "C:\Users\anaconda3\lib\site-packages\requests\api.py", line 61, in request return session.request(method=method, url=url, **kwargs) File "C:\Users\anaconda3\lib\site-packages\requests\sessions.py", line 528, in request prep = self.prepare_request(req) File "C:\Users\anaconda3\lib\site-packages\requests\sessions.py", line 456, in prepare_request p.prepare( File "C:\Users\anaconda3\lib\site-packages\requests\models.py", line 317, in prepare self.prepare_headers(headers) File "C:\Users\anaconda3\lib\site-packages\requests\models.py", line 449, in prepare_headers for header in headers.items(): AttributeError: 'str' object has no attribute 'items'
I'm presuming it's because the format is wrong?

I also tried to format it:

headers = f"{'User-Agent': {random_header_variable} }"
but then I get:
Error:
headers = f"{'User-Agent': {random_header_variable} }" ValueError: Invalid format specifier
Could someone please enlighten me how to format this correctly?

Thanks a lot.
Looks like it's saying that params can't be a str. You should probably build it as a dictionary instead.
Hi Bowlofred,

Thanks for your suggestion- I understand what you're saying, but to be honest, I don't know how to go about applying that- particularly since I'm importing part of that as a variable.



(Aug-23-2021, 06:40 AM)bowlofred Wrote: [ -> ]Looks like it's saying that params can't be a str. You should probably build it as a dictionary instead.
Try
headers = {'User-Agent': random_header_variable}
(Aug-23-2021, 10:15 AM)Yoriz Wrote: [ -> ]Try
headers = {'User-Agent': random_header_variable}

Thank you Yoriz.

I gave that a go and good news is that I'm not getting any error message...bad news is that nothing seems to happen- when I run it, it's doing something but never ends until I manually stop it! I'm guessing it could be stuck in a loop.
You can add a print statement in your loop to verify it's making progress.

...
r = requests.get(url, headers = headers)
print(f"Got {r.status} code from {r.url} lookup")
In your original code, you don't seem to be doing anything with the result of the get. How are you using it?
Thank you.

I was running small print statements as I was going to verify and everything looked good up until adding the last requests line.

So the last print statement where everything was working was:
or cell in ws['A']:
    random_header_variable = cell.value
    headers = "{'User-Agent': " + random_header_variable + "}"
    print(headers) 
That printed all the headers and it looked correct. As soon as I added:
r = requests.get(url, headers = headers)
(I didn't try a print statement at that point), I got the errors as posted above.

I think I may give up on this one! Maybe I might use a few full header variations in a list in the code rather than get the code to pick part of it from a spreadsheet.

Really appreciate your help and time though.

(Aug-24-2021, 02:18 AM)bowlofred Wrote: [ -> ]You can add a print statement in your loop to verify it's making progress.

...
r = requests.get(url, headers = headers)
print(f"Got {r.status} code from {r.url} lookup")
In your original code, you don't seem to be doing anything with the result of the get. How are you using it?
Using the data you posted.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://python-forum.io/'
agent_lst = [
    {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'},
    {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'},
    {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36 OPR/66.0.3515.72'},
    {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'},
    {'User-Agent': 'Mozilla/5.0 (X11; CrOS aarch64 13421.99.0) AppleWebKit/537.36 (KHTML; like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
]
for agent in agent_lst:
    response = requests.get(url, headers=agent)
    print(response.status_code)
    soup = BeautifulSoup(response.content, 'lxml')
    print(soup.select_one('head > title'))
Output:
200 <title>Python Forum</title> 200 <title>Python Forum</title> 200 <title>Python Forum</title> 200 <title>Python Forum</title> 200 <title>Python Forum</title>
You most be making a working dictionary in headers list,now you mess it up and make string.
from pprint import pprint

# Example data from Excel
lst = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
]

headers = []
d = {}
for agent in lst:
    d['User-Agent'] = agent
    headers.append(d)
pprint(headers)
Output:
[{'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) ' 'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 ' 'Safari/537.36'}, {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) ' 'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 ' 'Safari/537.36'}]
So now contain headers list working dictionary.
>>> headers[0]
{'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
               'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 '
               'Safari/537.36'}
>>> headers[0].get('User-Agent') # get a dictionary  method 
('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, '
 'like Gecko) Chrome/87.0.4280.88 Safari/537.36')