Jul-09-2021, 07:10 AM
Hey guys,
I'm trying to learn more about web scraping, so I've set myself a challenge: scrape data off a few pages within the same website.
Each page has the same attributes (handy for web scraping each page!) but obviously the end part of the URL for each page is different. So I've gathered the different page URLs and exported them to a spreadsheet.
What I'm trying to do now (and failing miserably) is to tell Python to use a column in my Excel file, which contains each page URL, as the page to be scraped. Once it grabs a page URL, it should then parse the page with BeautifulSoup, extract certain elements, and export those onto another Excel spreadsheet.
It will then need to loop through each URL and do the same thing until it has gone through all the URLs on the spreadsheet.
The code I've got so far to open the spreadsheet and refer to the column is:
from bs4 import BeautifulSoup
import pandas as pd
import openpyxl
import requests

for page in current_url:
    book = openpyxl.load_workbook("url_list.xlsx")
    sheet = book['Sheet2']
    column_name = 'Full Page Url'
    for column_cell in sheet.iter_cols(1, sheet.max_column):  # iterate column cell
        if column_cell[0].value == column_name:  # check for your column
            j = 0
            for data in column_cell[1:]:  # iterate your column
                url_component = data.value
            break
    page = requests.get(url_component)
    soup = BeautifulSoup(page.text, 'html.parser')
    print(soup)

I've tried print(soup) there just to check that it's referencing a URL from the spreadsheet.
The result I get is:
Output:
Process finished with exit code 0
But there's no HTML data, so it doesn't appear to be working. If I run this:
from bs4 import BeautifulSoup
import pandas as pd
import openpyxl
import requests

book = openpyxl.load_workbook("url_list.xlsx")
sheet = book['Sheet2']
column_name = 'Full Page Url'
for column_cell in sheet.iter_cols(1, sheet.max_column):  # iterate column cell
    if column_cell[0].value == column_name:  # check for your column
        j = 0
        for data in column_cell[1:]:  # iterate your column
            url_component = data.value
        break

It's correctly giving me each URL (so it's reading and referencing the Excel file and column correctly). For example, the code above gives:
Output:
https://www.samplesite.com/360/
https://www.samplesite.com/3d-checker/
Could someone please help me understand where I'm going wrong? Thanking you.
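For reference, the intended flow can be sketched like this. This is a minimal sketch, not a definitive fix: it assumes the sheet name 'Sheet2' and the column header 'Full Page Url' from the post, and the helper names read_urls and scrape_all are made up for illustration. The idea is to collect all the URLs into a list first, and only then loop over that list making one request per URL, so the request/parse step sits inside the URL loop rather than running once with whatever url_component last held.

```python
from bs4 import BeautifulSoup
import openpyxl
import requests


def read_urls(path, sheet_name="Sheet2", column_name="Full Page Url"):
    """Collect every value under the given header column into a list."""
    book = openpyxl.load_workbook(path)
    sheet = book[sheet_name]
    urls = []
    for column_cell in sheet.iter_cols(1, sheet.max_column):
        if column_cell[0].value == column_name:   # found the right column
            for data in column_cell[1:]:          # every row below the header
                if data.value:
                    urls.append(data.value)
            break                                 # stop searching other columns
    return urls


def scrape_all(urls, out_path="scraped.xlsx"):
    """Fetch each URL, parse it, and write one row per page to a new workbook."""
    out = openpyxl.Workbook()
    ws = out.active
    ws.append(["url", "title"])                   # header row
    for url in urls:                              # one request per URL
        page = requests.get(url)
        soup = BeautifulSoup(page.text, "html.parser")
        title = soup.title.string if soup.title else ""
        ws.append([url, title])                   # extract whatever elements you need here
    out.save(out_path)
```

Usage would then be something like `scrape_all(read_urls("url_list.xlsx"))`. Note that in the sketch the break only ends the search across columns; the inner row loop is allowed to finish first, which matches the behaviour you saw where every URL was read.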