Oct-22-2023, 07:07 AM
Hello all,
I'm trying to use the data in a spreadsheet as two variables to iterate through in a test web scraper script using pandas, but I'm a little stumped as to how to feed two columns into two variables on each iteration. For example, the first iteration should use A1 and B1, the next A2 and B2, then A3 and B3, and so on.
Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import openpyxl

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
heading_type = []
heading = []
keyword1 = []
url1 = []
data = {'keyword': keyword1, 'Url': url1}

wb = openpyxl.load_workbook('D:/Share/Documents/importurl.xlsx')
ws = wb['Sheet1']

for cell in ws['A']:
    print(cell.value)
    url = cell.value
    url1.append(url)
    # r = requests.get(url, headers=headers)
    # soup = BeautifulSoup(r.text, features="html.parser")
    for cell in ws['B']:
        keyword = cell.value
        print(keyword)
        keyword1.append(keyword)

df = pd.DataFrame(data=data)
df.index += 1
df.to_excel(f"D:/Share/Documents/summary.xlsx")
I get the error:
Traceback (most recent call last):
  File "D:\Share\Documents\PycharmProjects\websitelearning\main.py", line 103, in <module>
    df = pd.DataFrame(data=data)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\me\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py", line 709, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\me\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\internals\construction.py", line 481, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\me\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\internals\construction.py", line 115, in arrays_to_mgr
    index = _extract_index(arrays)
            ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\me\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\internals\construction.py", line 655, in _extract_index
    raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length
I've attached the file so that hopefully this is clear, and I'm happy to clarify further if what I'm trying to achieve is still unclear. I guess I need a single loop that reads the contents of column A and column B together and then moves on to the next row, but I'm not sure how to do this. Thank you for your time.
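To make the single-loop idea concrete, here is a minimal sketch rather than a tested solution. It assumes each row holds a URL in column A and a keyword in column B. The helper name `collect_pairs` and the sample rows are hypothetical stand-ins so the snippet runs without the workbook.

```python
def collect_pairs(rows):
    """Walk the rows once, reading the column A and column B values together."""
    urls, keywords = [], []
    for url, keyword in rows:
        if url is None and keyword is None:
            continue  # skip rows where both cells are empty
        urls.append(url)
        keywords.append(keyword)
    # Both lists grow by one item per row, so they always have equal length.
    return {'keyword': keywords, 'Url': urls}


# Hypothetical stand-in for the worksheet: each tuple is one row (A, B).
sample_rows = [
    ('https://example.com/a', 'alpha'),
    ('https://example.com/b', 'beta'),
    ('https://example.com/c', 'gamma'),
]

data = collect_pairs(sample_rows)
print(len(data['keyword']) == len(data['Url']))
```

With the real file, I believe openpyxl can yield rows in this shape through `ws.iter_rows(min_col=1, max_col=2, values_only=True)`, and the returned dict could then be passed to `pd.DataFrame(data=data)` as in my script, since both lists would be the same length.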