Mar-10-2020, 02:52 PM
Hello,
I'm working on an assignment to scrape the unique links from a website and convert any that aren't already absolute into absolute links. I've written some code to do that, but it returns 92 links when it should return 122. From what I can tell, the code is correct except that it's not returning the URLs that are already absolute links. I'm struggling to figure out why. Any ideas?
import bs4
import requests
from bs4 import BeautifulSoup, SoupStrainer
import csv
import re

url = "https://www.census.gov/programs-surveys/popest.html"
r = requests.get(url)
raw_html = r.text
soup = BeautifulSoup(raw_html, 'html.parser')
results = soup.find_all("a")
print('Number of links retrieved: ', len(results))
print(results)

total_urls = []
for link in results:
    link = link.get("href")
    if link == "#content":
        pass
    elif link is None:
        continue
    else:
        if re.match(r"https://", link):
            total_urls.append(link)

unique_urls = set(total_urls)
print("Total unique urls:", len(set(total_urls)))
print(unique_urls)

with open('final.csv', 'w') as csv_file:
    w = csv.writer(csv_file, lineterminator='\n')
    header = ['Urls']
    w.writerow(header)
    for link in unique_urls:
        w.writerow([link])
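One thing that may explain the gap: the `re.match(r"https://", link)` check keeps only hrefs that already start with `https://`, so relative links and `http://` links never reach `total_urls`, and nothing in the loop turns a relative href into an absolute one. Below is a minimal sketch of one way to do that step with `urllib.parse.urljoin` from the standard library. The helper name `to_absolute_unique` and the sample hrefs are hypothetical, made up for illustration and not taken from the census page, and whether this yields exactly 122 links depends on what the assignment counts as a link.

```python
from urllib.parse import urljoin


def to_absolute_unique(hrefs, base_url):
    """Resolve each href against base_url and return the set of unique absolute URLs.

    hrefs is an iterable of raw href values, e.g. what link.get("href") returns
    for each <a> tag; entries may be None.
    """
    unique = set()
    for href in hrefs:
        # Skip missing hrefs and in-page anchors such as "#content".
        if not href or href.startswith("#"):
            continue
        # urljoin leaves absolute URLs unchanged and resolves relative ones
        # against the page URL.
        unique.add(urljoin(base_url, href))
    return unique


# Hypothetical sample values, not scraped from the real page.
base = "https://www.census.gov/programs-surveys/popest.html"
sample_hrefs = [
    "#content",
    None,
    "/data/tables.html",
    "https://www.census.gov/data/tables.html",
    "http://example.com/page",
    "about.html",
]
links = to_absolute_unique(sample_hrefs, base)
print("Total unique urls:", len(links))
```

In the original script this could take the place of the `for link in results` loop by passing `(a.get("href") for a in results)` as `hrefs` and `url` as `base_url`. Because the results go into a set, a relative href and an absolute href that point to the same page collapse into one entry.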