Scraping a webpage with BS4

SBF12345 · Jan-28-2019, 10:10 PM

Greetings,

I would like to build a tool that opens a .csv link on a webpage. I have written the following code and confirmed that the page does contain a link to a CSV file, but I am not sure how to proceed from here. In pseudocode terms, I would like to assign the CSV file to a variable and open it. I would then like to run VBA code to grab the desired information from the CSV file. Any ideas on the best libraries for this, that is, for running VB from a Python module?

import requests
from bs4 import BeautifulSoup

# fetch the page and parse its HTML
page = requests.get("http://webpage")
soup = BeautifulSoup(page.content, 'html.parser')

#print (soup)

# print the href of every anchor tag on the page
for link in soup.find_all('a'):
    print(link.get('href'))

**Larz60+** · (This post was last modified: Jan-29-2019, 04:20 AM by Larz60+.)

Are you trying to get the URLs from the for loop (the last two lines of your code), and are those the .csv files you mention?
If so, this (untested code) should download and save the files:

for n, link in soup.find_all('a'):
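    if 'href' in link.attrs:
        url = link.get('href')
        # the last path segment of the URL is used as the local file name
        filename = url.split('/')[-1]
        response = requests.get(url)
        if response.status_code == 200:
            # 'w' opens the file for writing (it is created or overwritten)
            with open(filename, 'w') as fp:
                fp.write(response.text)
        else:
            print('Unable to download {}'.format(filename))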
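
Once a file has been downloaded, its contents can be read in Python itself, without going through VBA. A minimal sketch using the standard-library csv module; the sample text and its column names are made up here and only stand in for whatever response.text actually contains:

```python
import csv
import io

# Made-up sample standing in for response.text from a successful download
csv_text = "name,value\nalpha,1\nbeta,2\n"

# DictReader uses the first row as column names and yields one dict per row
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

# pull one column out of the parsed rows
values = [int(row['value']) for row in rows]
print(values)
```

For a file already saved to disk, the same reader can be given the object returned by open(filename, newline='') instead of the StringIO wrapper.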

SBF12345 · Jan-29-2019, 07:18 PM

Thanks! As I am totally new to Python, but have some OOP experience, can I pick your brain regarding the above code?

What does the [-1] refer to in the url.split line? Also, what does the 'w' represent in the 'open' statement?

I don't totally follow the code. I received an error on the first line when trying to run it. What does the 'n' refer to?

Initially, when I ran the code with the lines

for link in soup.find_all('a'):
    print(link.get('href'))

I received a collection of 'href' values. Do I need to copy and paste the desired value into a variable (perhaps 'n') in the code you've provided me with?

**Larz60+** · (This post was last modified: Jan-30-2019, 12:48 AM by Larz60+.)

Quote: what does the [-1] refer to in the url.split line

The [-1] is an index to the last item of the list returned by split, so it gets the file name.
You have to be careful with this, though, and look for anything that follows the file name, such as a query string like '?ei=wvJQXPryB6m2ggfQx76wAg&q'. I didn't do that here, as it didn't appear that it would be an issue.
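
A rough sketch of one way to handle that case, using only the standard library. The helper name filename_from_url and the example.com URL are made up for illustration:

```python
import posixpath
from urllib.parse import urlsplit

def filename_from_url(url):
    """Return the last path segment of a URL, ignoring any query string."""
    path = urlsplit(url).path        # drops '?...' and '#...' parts
    return posixpath.basename(path)  # last segment after the final '/'

# hypothetical URL with a query string after the file name
url = 'http://example.com/data/report.csv?ei=abc&q'

naive = url.split('/')[-1]       # keeps the query string attached
clean = filename_from_url(url)   # just the file name

print(naive)
print(clean)
```

The plain split('/')[-1] is fine when the links are bare file paths; the helper only matters when something follows the file name.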

Quote: I don't totally follow the code. I received an error on the first line when trying to run the code. What does the 'n' refer to?

Please note that I stated the code was untested. If that line were used, it should read:

    for n, link in enumerate(soup.find_all('a')):

but since n is not needed, it should be removed, and the line should be:

    for link in soup.find_all('a'):

What enumerate does is pair each item with its position in the loop: it yields (index, item) tuples, so n would hold the current iteration count, starting at 0.
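
A small illustration with a plain list standing in for the result of soup.find_all('a'); the file names are made up:

```python
# hypothetical stand-in for the links found on the page
links = ['a.csv', 'b.csv', 'c.csv']

# enumerate yields (index, item) tuples, which is why two loop
# variables (n, link) are only valid when enumerate is used
pairs = list(enumerate(links))

for n, link in enumerate(links):
    print(n, link)
```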
