Python Forum
Web Scrapping unique links
#4
Start after unique_links is built, then loop over unique_links.
Set a base URL and join it together with the relative links.
Then it looks like this.
import requests
from bs4 import BeautifulSoup

url = "https://www.census.gov/programs-surveys/popest.html"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all("a")
unique_links = []
for link in results:
    href = link.get("href")
    # Skip tags with no href and in-page anchors
    if href is None or href.startswith('#content'):
        continue
    unique_links.append(href)

# Deduplicate
unique_links = set(unique_links)
base_url = 'https://www.census.gov'
absolute_urls = []
for link in unique_links:
    if link.startswith('https'):
        absolute_urls.append(link)
    elif link.startswith('/'):
        absolute_urls.append(f'{base_url}{link}')
Counting now gives 120 absolute links; can you try to add the 2 tricky ones I mentioned, so you get 122 in total?
Output:
>>> len(absolute_urls)
120
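As a hint for the join step: instead of concatenating base_url by hand, the standard library's urllib.parse.urljoin handles site-relative and protocol-relative hrefs in one call. A minimal sketch, using made-up sample hrefs (not taken from the live page):

```python
from urllib.parse import urljoin

# The page the links were scraped from
base_url = 'https://www.census.gov/programs-surveys/popest.html'

# Hypothetical example hrefs covering the three common cases
hrefs = [
    'https://www.census.gov/data.html',       # already absolute: kept as-is
    '/programs-surveys/popest/data.html',     # site-relative: joined onto the domain
    '//www.census.gov/newsroom.html',         # protocol-relative: scheme is filled in
]

absolute_urls = [urljoin(base_url, h) for h in hrefs]
for u in absolute_urls:
    print(u)
```

The protocol-relative case (href starting with //) is one kind of link a plain startswith('/') check joins incorrectly, so urljoin is a safer default.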


Messages In This Thread
Web Scrapping unique links - by deadendstreet - Mar-10-2020, 02:52 PM
RE: Web Scrapping unique links - by snippsat - Mar-10-2020, 04:24 PM
RE: Web Scrapping unique links - by deadendstreet - Mar-10-2020, 08:06 PM
RE: Web Scrapping unique links - by snippsat - Mar-11-2020, 06:03 AM