Python Forum
Scraping a webpage with BS4
#1
Greetings,

I would like to build a tool to open a .csv link on a webpage. I have written the following code and confirmed that the page does contain a link to a CSV file, but I am not sure how to proceed. In pseudocode terms, I would like to assign the CSV file to a variable and open it. I would then like to run VBA code to grab the desired information from the file. Any ideas on the best libraries for this, that is, for running VB from a Python module?

import requests 

page = requests.get("http://webpage")

from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')

#print (soup)

for link in soup.find_all('a'):
    print(link.get('href'))
Reply
#2
Are you trying to download the URLs printed by the for loop (lines 11 and 12), which you say point to .csv files?
If so, this (untested) code should download and save the files:
for n, link in soup.find_all('a'):
    if 'href' in link.attrs:
        url = link.get('href')
        filename = url.split('/')[-1]
        response = requests.get(url)
        if response.status_code == 200:
            with open(filename, 'w') as fp:
                fp.write(response.text)
        else:
            print('Unable to download {}'.format(filename))
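One thing the loop above does not handle: hrefs on a page are often relative, and not every link points to a CSV. Here is a minimal sketch of filtering them first, using only the standard library. csv_urls is a made-up helper name and the example.com URLs are placeholders, not taken from the actual page:

```python
from urllib.parse import urljoin, urlparse


def csv_urls(hrefs, base_url):
    """Return absolute URLs for the hrefs whose path ends in .csv.

    hrefs    -- values as returned by link.get('href'); may contain None
    base_url -- URL of the page the links were scraped from
    """
    urls = []
    for href in hrefs:
        if not href:
            continue  # <a> tag without an href attribute
        absolute = urljoin(base_url, href)  # resolves relative links
        # Compare only the path, so a query string after the name is ignored.
        if urlparse(absolute).path.lower().endswith(".csv"):
            urls.append(absolute)
    return urls


# Placeholder data standing in for the hrefs printed in the first post.
example_hrefs = ["/data/report.csv", None, "about.html", "#top"]
print(csv_urls(example_hrefs, "http://example.com/page/"))
```

With the soup from the first post, this could be called as csv_urls((a.get('href') for a in soup.find_all('a')), "http://webpage"), and the download loop would then iterate over the returned list.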
Reply
#3
Thanks! I am totally new to Python, though I have some OOP experience. Can I pick your brain about the code above?

What does the [-1] refer to in the url.split line? Also, what does the 'w' represent in the open statement?

I don't totally follow the code. I received an error on the first line when trying to run the code. What does the 'n' refer to?

Initially, when I ran the code with the lines
for link in soup.find_all('a'):
    print(link.get('href'))
I received a collection of 'href' values. Do I need to copy and paste the desired value into a variable (perhaps 'n') in the code you've provided me with?
Reply
#4
Quote:what does the [-1] refer to in the url.split line
The [-1] indexes the last item of the list returned by split, so it gets the file name.
You have to be careful with this, though: a URL may carry a query string after the file name, like '?ei=wvJQXPryB6m2ggfQx76wAg&q', which I didn't handle here because it didn't look like it would be an issue.
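To make both points concrete, here is a small sketch of the split and of one way to drop a query string before taking the last item. filename_from_url is a hypothetical helper, and the URL is a made-up example that reuses the query string quoted above:

```python
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path segment of url, ignoring any query string."""
    path = urlparse(url).path   # e.g. '/files/data.csv', without '?...'
    return path.split('/')[-1]  # [-1] picks the last item of the list


url = "http://example.com/files/data.csv?ei=wvJQXPryB6m2ggfQx76wAg&q"

# Plain split keeps the query string attached to the name.
print(url.split('/')[-1])

# Parsing the URL first leaves only the file name.
print(filename_from_url(url))
```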
Quote:I don't totally follow the code. I received an error on the first line when trying to run the code. What does the 'n' refer to?
Please note that I stated the code was untested. If n were used, that line should read:
    for n, link in enumerate(soup.find_all('a')):
but since n is not needed, it should be removed, and the line should be:
    for link in soup.find_all('a'):
What enumerate does is pair each item with a running count, so each pass through the loop gets an (index, item) tuple, with the index starting at 0 by default.
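A quick illustration of enumerate on a plain list. The file names are invented for the example and stand in for the links on the page:

```python
links = ["a.csv", "b.csv", "c.csv"]  # placeholder values

# enumerate yields (index, item) tuples, counting from 0 by default.
pairs = list(enumerate(links))
print(pairs)

# The tuple is usually unpacked in the for statement; start=1 changes
# the first count.
for n, link in enumerate(links, start=1):
    print(n, link)
```

Iterating over soup.find_all('a') directly yields single tags rather than pairs, which is why unpacking into n, link without enumerate raises an error.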
Reply

