Python Forum

Full Version: Looping and naming multiple files
Hi :D

I'm writing to see if you guys can help me take my code to the next level.

As it is now it takes a list of urls from a CSV file list, it processes them and spits out the content into a CSV.

This is the code:

import requests
from bs4 import BeautifulSoup
import csv

filename = "siteslist.csv"
f = open(filename, "r")
url_list = f.read().split()
f.close()

for link in url_list:
    r = requests.get(link)
    r.encoding = 'utf-8'
    html_content = r.text
    #print(html_content) # this helped prove that the full content from all in list is listed
    f = csv.writer(open('output.csv', 'w'))
    f.writerow([html_content]) # these two lines add it to the CSV instead of terminal
I'd like to add a loop so that it creates not just one CSV but one for each of the URLs in the source list CSV.

The ideal naming convention would be something like the first 20 characters of the URL, or the whole URL from (link) if it's shorter than that.

I'm not asking you to write it for me, but if you can give me some pointers on how to approach it, and any links to similar code, that would be great!
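To illustrate the naming idea above, here is a rough sketch (the helper name `name_from_url` and the character-sanitising rule are just assumptions for illustration, not anything the thread settled on):

```python
import string

def name_from_url(link, max_len=20):
    # Hypothetical helper: drop the scheme, keep the first max_len
    # characters (or the whole thing if it's shorter), and replace any
    # character that isn't filename-safe with an underscore.
    stem = link.split('://', 1)[-1][:max_len]
    safe = set(string.ascii_letters + string.digits + '.-')
    cleaned = ''.join(c if c in safe else '_' for c in stem)
    return cleaned + '.csv'

print(name_from_url('https://www.rfc-editor.org/in-notes/rfc1000.txt'))
# -> www.rfc-editor.org_i.csv
```

The sanitising step matters because characters like `/` or `?` in a URL aren't valid in filenames on most systems.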

Thanks
You don't close the file, so it's likely to be missing data at the end. Use a with statement instead of a bare open. Note that csv.writer itself is not a context manager, so the with has to wrap the open file, not the writer:
    with open('output.csv', 'w', newline='') as csvfile:
        f = csv.writer(csvfile)
        f.writerow([html_content]) # these two lines add it to the CSV instead of terminal
That guarantees the file is closed automatically.
As for the loop, it's already reading all the input URLs; you just need to create a new filename on each iteration. If I knew what the individual site URLs looked like, I'd split off the filename at the end, split that name into its component parts, and use the same prefix, followed by _out, for each output file name.
Assuming your URLs look something like 'https://www.rfc-editor.org/in-notes/rfc1000.txt', the following code will do the trick. Since you gave no sample URLs, I have to assume the data is already comma separated.

If not, you will have to add code to make it so.
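One way to do that conversion (an illustration only — the helper name `text_to_csv` and the one-column-per-line split are my own assumptions; the right splitting rule depends on your data) is to write each line of the fetched page as a single-column CSV row:

```python
import csv

def text_to_csv(text, outfile_name):
    # Assumption: treat each line of the fetched page as one
    # single-column row. Change the split if the data has real
    # delimiters (e.g. line.split('\t') for tab-separated input).
    with open(outfile_name, 'w', newline='') as f:
        writer = csv.writer(f)
        for line in text.splitlines():
            writer.writerow([line])
```

csv.writer takes care of quoting, so a line that happens to contain a comma still round-trips as one field.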
import requests


class ScrapeFromList:
    def GetUrl(self, url):
        # Fetch one URL and return the page text, or None on a non-200 status
        response = None
        res = requests.get(url)
        if res.status_code == 200:
            response = res.text
        return response

    def get_list(self, url_list):
        for url in url_list:
            # Take the last path component (e.g. 'rfc1000.txt'), drop the
            # extension, and build an output name like 'rfc1000_out.csv'
            fnam = url[url.rfind('/') + 1:].split('.')
            outfile_name = '{}_out.csv'.format(fnam[0])
            response = self.GetUrl(url)
            if response is not None:
                with open(outfile_name, 'w') as f:
                    f.write(response)


def testit():
    sl = ScrapeFromList()
    my_url_list = [
        'https://www.rfc-editor.org/in-notes/rfc1000.txt',
        'https://www.rfc-editor.org/in-notes/rfc1007.txt'
    ]
    sl.get_list(my_url_list)


if __name__ == '__main__':
    testit()
Thank you Larz, I'll have a go at working on it this week :-)
If I had known this was homework, I wouldn't have shown all the code.
But you made an effort, and you may still need to do some work, so it's OK this time.