Mar-11-2020, 06:03 AM
Start after building unique_links, then loop over unique_links.
Set a base URL and join the base together with each relative link.
Then it looks like this.
import requests
from bs4 import BeautifulSoup

url = "https://www.census.gov/programs-surveys/popest.html"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all("a")

# Collect hrefs, skipping missing ones and in-page '#content' anchors
unique_links = []
for link in results:
    href = link.get("href")
    if href is None or href.startswith('#content'):
        continue
    unique_links.append(href)

unique_links = set(unique_links)

# Turn relative links into absolute ones
base_url = 'https://www.census.gov'
absolute_urls = []
for link in unique_links:
    if link.startswith('https'):
        absolute_urls.append(link)
    elif link.startswith('/'):
        absolute_urls.append(f'{base_url}{link}')

You now have 120 absolute links if you count them; can you try to add the 2 tricky ones mentioned, so you get 122 in total?
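As a side note, the manual `startswith` checks above miss relative links that start with neither `https` nor `/`. The standard library's `urllib.parse.urljoin` handles all three cases in one call. A minimal sketch, using a few hypothetical sample hrefs (not taken from the actual Census page):

```python
from urllib.parse import urljoin

# The page the links were scraped from
base_url = 'https://www.census.gov/programs-surveys/popest.html'

# Hypothetical hrefs: already absolute, root-relative, and page-relative
links = [
    'https://www.census.gov/data.html',
    '/programs-surveys/popest/data.html',
    'popest/news.html',
]

# urljoin leaves absolute URLs alone, anchors '/...' at the site root,
# and resolves bare relative paths against the base page's directory
absolute_urls = [urljoin(base_url, link) for link in links]
print(absolute_urls)
```

That last "page-relative" case is exactly the kind of tricky link the plain `startswith('/')` branch would silently drop.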
Output:>>> len(absolute_urls)
120