Python Forum
Using Excel Cell As A Variable In A Loop
#1
Hey guys,

I'm trying to learn more about web scraping, so I've set myself a challenge to scrape data off a few pages within the same website.

Each page has the same attributes (handy for scraping each page!), but obviously the end part of the URL for each page is different. So I've gathered the different page URLs and exported them to a spreadsheet.

What I'm trying to do now (and failing miserably) is to tell Python to use a column in my Excel file, which contains each page URL, as the page to be scraped. Once it grabs a page URL, it should parse the page with BeautifulSoup, extract certain elements, and export those onto another Excel spreadsheet.

This will then need to loop through each URL and do the same thing until it has gone through all the URLs on the spreadsheet.

The code I've got so far to open the spreadsheet and refer to the column is:

from bs4 import BeautifulSoup
import pandas as pd
import openpyxl
import requests 

for page in current_url:
    book = openpyxl.load_workbook("url_list.xlsx")
    sheet = book['Sheet2']
    column_name = 'Full Page Url'
    for column_cell in sheet.iter_cols(1, sheet.max_column):  # iterate column cell
        if column_cell[0].value == column_name:    # check for your column
            j = 0
            for data in column_cell[1:]:    # iterate your column
                url_component = data.value
                
            break

     page = requests.get(url_component)
     soup = BeautifulSoup(page.text, 'html.parser')
     print(soup)
I've tried print(soup) there just to check that it's referencing a URL from the spreadsheet.

The result I get is:
Output:
Process finished with exit code 0
But there's no HTML data, so it doesn't appear to be working.

If I run this:

from bs4 import BeautifulSoup
import pandas as pd
import openpyxl
import requests 

book = openpyxl.load_workbook("url_list.xlsx")
sheet = book['Sheet2']
column_name = 'Full Page Url'
for column_cell in sheet.iter_cols(1, sheet.max_column):  # iterate column cells
    if column_cell[0].value == column_name:    # check for your column
        for data in column_cell[1:]:    # iterate your column
            url_component = data.value
            print(url_component)
        break
This correctly gives me each URL (so it's reading and referencing the Excel file and column correctly). For example, the code above gives:

Output:
https://www.samplesite.com/360/
https://www.samplesite.com/3d-checker/
Could someone please help me understand where I'm going wrong?

Thanking you.
Reply
#2
Here's a better way: it doesn't require Excel or pandas, and it can be reused for any sites of the type you mention.

I have included a sample which scrapes the bird pages for the species listed (compare to your 'end part').
You can try it with your URLs.

The pages will be downloaded and placed in the directory 'renderhtml', ready to be parsed with BeautifulSoup.

Note that this class uses pathlib (a Python built-in) to create POSIX paths.

It uses the lxml parser and BeautifulSoup, which you may have to install (from the command line):
pip install lxml
pip install beautifulsoup4
Name this module:
RenderUrl.py
import os
from pathlib import Path
import requests


class RenderUrl:
    def __init__(self, baseurl=None):
        self.base_url = baseurl

        # Create new savepath if needed
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        self.savepath = Path('.') / 'renderhtml'
        self.savepath.mkdir(exist_ok=True)

        # temp storage for url suffixes
        self.suffixlist = []

    def url_emitter(self):
        # Yield the full URL for each stored suffix.
        for suffix in self.suffixlist:
            yield f"{self.base_url}{suffix}"

    def get_pages(self, suffixlist, cache=False):
        self.suffixlist = suffixlist
        for url in self.url_emitter():
            print(f"fetching: {url}")
            fname = (url.split('/')[-1]).replace('-','_')
            filename = self.savepath / f"{fname}.html"
            if cache and filename.exists():
                with filename.open('rb') as fp:
                    page = fp.read()
            else:
                response = requests.get(url)
                if response.status_code == 200:
                    page = response.content
                    with filename.open('wb') as fp:
                        fp.write(page)
Here's how to use it (put both files in the same directory):
TryRenderUrl.py
from RenderUrl import RenderUrl

def main():
    '''
        AudubonBirds:
    '''
    baseurl = "https://www.massaudubon.org/learn/nature-wildlife/birds/"
    birdlist = ['american-goldfinches','american-kestrels','american-robins','bald-eagles',
    'baltimore-orchard-o','baltimore-orchard-orioles', 'birds-of-prey']
    rurl = RenderUrl(baseurl)
    rurl.get_pages(birdlist, cache=True)

if __name__ == '__main__':
    main()
Then run: python TryRenderUrl.py
Output:
fetching: https://www.massaudubon.org/learn/nature-wildlife/birds/american-goldfinches
fetching: https://www.massaudubon.org/learn/nature-wildlife/birds/american-kestrels
fetching: https://www.massaudubon.org/learn/nature-wildlife/birds/american-robins
fetching: https://www.massaudubon.org/learn/nature-wildlife/birds/bald-eagles
fetching: https://www.massaudubon.org/learn/nature-wildlife/birds/baltimore-orchard-o
fetching: https://www.massaudubon.org/learn/nature-wildlife/birds/baltimore-orchard-orioles
fetching: https://www.massaudubon.org/learn/nature-wildlife/birds/birds-of-prey
Now look in the directory renderhtml; you will find:
Output:
american_goldfinches.html
american_kestrels.html
american_robins.html
bald_eagles.html
baltimore_orchard_orioles.html
birds_of_prey.html
ready to be parsed.
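As a sketch of that next step: the helper below walks the saved files and pulls out each page title with BeautifulSoup. The function name titles_in is illustrative, and it uses Python's built-in html.parser so only beautifulsoup4 is needed for this part.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def titles_in(folder):
    """Yield (filename, <title> text) for every saved page in `folder`."""
    for path in sorted(Path(folder).glob('*.html')):
        soup = BeautifulSoup(path.read_text(encoding='utf-8'), 'html.parser')
        yield path.name, soup.title.string if soup.title else None


for name, title in titles_in('renderhtml'):
    print(f"{name}: {title}")
```

The same loop works for any other element you want to extract; just swap soup.title for the soup.find(...) call you need.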

Now try with your data.
Reply
#3
Hi Larz60+,

A huge thank you for spending so much time compiling your reply and for including so much detail. I can't wait to give this a go. As soon as I do, I'll definitely be back to let you know.

Have a great day ahead mate.
Reply
#4
There are several problems and missing pieces in your first code.
Start with for page in current_url: there is no current_url defined to loop over.
You read from an Excel file that has the URL list, so it looks like the first loop is not needed at all.
Lines 18, 19, and 20 need to be inside the loop block.
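To make that concrete, here is one way to restructure the original header-matching approach: collect the column values first, then make one request per URL inside the loop. This is a sketch; the helper name urls_from_column is illustrative, and it assumes openpyxl is installed.

```python
import openpyxl


def urls_from_column(sheet, column_name):
    """Return the cell values beneath the header `column_name`, skipping empty cells."""
    for column in sheet.iter_cols(1, sheet.max_column):
        if column[0].value == column_name:          # found the header row cell
            return [cell.value for cell in column[1:] if cell.value]
    return []


# Usage against the original file (requests/BeautifulSoup as in your code):
# book = openpyxl.load_workbook("url_list.xlsx")
# for url in urls_from_column(book['Sheet2'], 'Full Page Url'):
#     page = requests.get(url)                      # one request per row
#     soup = BeautifulSoup(page.text, 'html.parser')
```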

Here is a basic test if you have URLs in an Excel file and iterate over column A.
import openpyxl

wb = openpyxl.load_workbook('url.xlsx')
ws = wb['url_info']
for cell in ws['A']:
    print(cell.value)
Output:
https://python-forum.io/
https://www.google.no/
https://edition.cnn.com/
So if you want to open the URLs in BeautifulSoup, it would be like this.
import openpyxl
from bs4 import BeautifulSoup
import requests

wb = openpyxl.load_workbook('url.xlsx')
ws = wb['url_info']
for cell in ws['A']:
    print(cell.value)
    response = requests.get(cell.value)
    soup = BeautifulSoup(response.content, 'lxml')
    print(soup.find('title'))
Output:
https://python-forum.io/
<title>Python Forum</title>
https://www.google.no/
<title>Google</title>
https://edition.cnn.com/
<title>CNN International - Breaking News, US News, World News and Video</title>
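The original goal also included exporting the extracted elements to another Excel spreadsheet. A minimal sketch of that last step with openpyxl (the function name save_rows and the column headers are illustrative):

```python
import openpyxl


def save_rows(rows, outfile):
    """Write (url, title) pairs to a new workbook, one pair per row."""
    wb = openpyxl.Workbook()
    ws = wb.active
    ws.append(['Url', 'Title'])        # header row
    for url, title in rows:
        ws.append([url, title])
    wb.save(outfile)
```

In the scraping loop above you would collect (cell.value, soup.find('title').text) pairs into a list and pass it to save_rows at the end.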
Reply
#5
Hi Larz60+,

I'm sorry for the very late reply.

Thank you for providing me with this cool sample code to extract the HTML locally for each page. I've tried to go through the code and I sort of understand it (I wouldn't be able to write it myself yet), and I've applied it to my test project and it works very well indeed.

I've copied the code into my notes to use if this scenario comes up again. It's nice having the local HTML to then play around with to extract other attributes, especially when practicing.

Thanks again mate and have a great week ahead.


(Jul-09-2021, 11:04 PM)Larz60+ Wrote: Here's a better way: it doesn't require Excel or pandas, and it can be reused for any sites of the type you mention. ...
Reply
#6
I usually cache the files during the initial development phase, but then remove the caching prior to releasing the code, unless the data is quite static.
It's OK to leave caching in place if you check the file's age; a simple method can do this:

    def determine_file_age(self, filename):
        last_mod_time = filename.stat().st_mtime
        localtime = time.time()
        fileage = float(localtime - last_mod_time) / 3600.0
        return fileage
This returns the number of hours since the file was last modified. You can include a default 'number of hours between changes' and update the files based on their age.
Remember to add import time at the top of the script.

For example, change the get_pages signature from: def get_pages(self, suffixlist, cache=False):
to: def get_pages(self, suffixlist, cache=False, update_every=48):
then call determine_file_age to see if an update should be forced: file_age = self.determine_file_age(filename)
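The same idea as a standalone sketch: file_age_hours is an illustrative free-function version of determine_file_age, and the commented cache test shows how it could plug into get_pages with the hypothetical update_every parameter.

```python
import time
from pathlib import Path


def file_age_hours(filename: Path) -> float:
    """Hours since `filename` was last modified."""
    return (time.time() - filename.stat().st_mtime) / 3600.0


# Inside get_pages the cache test could then become:
# if cache and filename.exists() and file_age_hours(filename) < update_every:
#     page = filename.read_bytes()      # still fresh enough, reuse the local copy
# else:
#     ...fetch with requests and overwrite the file as before...
```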
Reply
#7
Hi Snippsat,

A big thanks for your help and direction too. I spent time learning from and applying what you suggested, which included fixing up some problems with my initial code, such as defining the variable's value at the top, for starters.

I then used your code to test opening an Excel file and referencing a specific column, and it worked beautifully. After trying your example, I also tried my own URLs in Excel and it referenced those perfectly too.

One of the key takeaways I learnt from your code was how to reference a column in a loop.

For example:

for cell in ws['C']:
    url = cell.value
Thank you for teaching me how to do this- really appreciate your time.



(Jul-10-2021, 11:58 AM)snippsat Wrote: There are several problems and missing pieces in your first code. ...
Reply
#8
Very cool Larz60+!

That looks to be very useful. I didn't even think about caching. I'm going to add that to my notes. Thank you for the tip and for explaining it.



(Jul-18-2021, 08:21 AM)Larz60+ Wrote: I usually cache the files during the initial development phase, but then remove the caching prior to releasing the code, unless the data is quite static. ...
Reply


