Web Scraping unique links

Hello,

I'm working on an assignment to scrape unique links from a website and, if they aren't already absolute links, turn them into absolute links. I've written some code to do just that. It should be returning 122 links, but instead it's returning 92. From what I can tell, the code is correct EXCEPT it's not returning the urls that are already absolute links. I'm struggling to figure out why. Any ideas?


import requests
from bs4 import BeautifulSoup
import csv
import re

url = "https://www.census.gov/programs-surveys/popest.html"

r = requests.get(url)
raw_html = r.text
soup = BeautifulSoup(raw_html, 'html.parser')
results = soup.find_all("a")

print('Number of links retrieved: ', len(results))
print(results)

total_urls = []
total_urls = []
for link in results:
    link = link.get("href")
    if link == "#content":
        pass  # skip the skip-to-content anchor
    elif link is None:
        continue  # anchor tag without an href
    else:
        if re.match(r"https://", link):
            total_urls.append(link)
unique_urls = set(total_urls)
print("Total unique urls:", len(unique_urls))

print(unique_urls)

with open('final.csv', 'w') as csv_file:
    w = csv.writer(csv_file, lineterminator='\n')
    header = ['Urls']
    w.writerow(header)

    for link in unique_urls:
        w.writerow([link])
Change the code a little so None is handled first.
import requests
from bs4 import BeautifulSoup

url = "https://www.census.gov/programs-surveys/popest.html"
r = requests.get(url)
raw_html = r.text
soup = BeautifulSoup(raw_html, 'html.parser')
results = soup.find_all("a")
unique_links = []
for link in results:
    link = link.get("href")
    if link is None:
        pass
    elif link.startswith('#content'):
        pass
    else:
        print(link)
        unique_links.append(link)

unique_links = set(unique_links)
So if we count now:
Output:
>>> len(unique_links)
122
deadendstree Wrote: scrape unique links from a website and if they aren't already absolute links, turn them into absolute links.
If you look at the result you see urls that are relative and start with /; for those you need to add the base url.
Hint: there are 2 tricky ones that start with #.
Output:
{'#', '#uscb-nav-skip-header',
So when you loop through and make absolute urls, check if the link starts with #.
The base url for these is https://www.census.gov/programs-surveys/popest.html
So these two will be:
Output:
https://www.census.gov/programs-surveys/popest.html#uscb-nav-skip-local
https://www.census.gov/programs-surveys/popest.html#
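As a side note, the standard library's urllib.parse.urljoin handles all three cases (already-absolute urls, paths that start with /, and fragments that start with #) against a base url. A minimal sketch, assuming a few made-up example links in place of the real scraped set:

from urllib.parse import urljoin

# Hypothetical stand-ins for links scraped from the page.
links = ['https://www.census.gov/newsroom.html',
         '/programs-surveys/popest/data.html',
         '#uscb-nav-skip-local']

base_url = 'https://www.census.gov/programs-surveys/popest.html'
# urljoin leaves absolute urls untouched, resolves /-paths against
# the site root, and attaches #fragments to the full page url.
absolute = [urljoin(base_url, link) for link in links]
print(absolute)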
Thank you! I updated the code with your suggestions and added some additional code, because I still had some relative links plus the two you mentioned that start with #. It all looks clean, but when I write to a csv file it still has the relative links and the links that start with #, even though they are not showing up in the output in Jupyter.

Here's the code. I've posted all of it just in case.
import requests
from bs4 import BeautifulSoup
import csv

url = "https://www.census.gov/programs-surveys/popest.html"

r = requests.get(url)
raw_html = r.text
soup = BeautifulSoup(raw_html, 'html.parser')
results = soup.find_all("a")

print('Number of links retrieved: ', len(results))

unique_links = []
for link in results:
    link = link.get("href")
    if link is None:
        pass
    elif link.startswith('#content'):
        pass
    elif link.startswith('/'):
        unique_links.append(link)  # still appended as a relative link
    elif link.startswith('#'):
        unique_links.append(link)  # still appended as a bare fragment
    else:
        print(link)
        unique_links.append(link)
 
unique_links = set(unique_links)
print(len(unique_links))


with open('finals.csv', 'w') as csv_file:
    w = csv.writer(csv_file, lineterminator='\n')
    header = ['Urls']
    w.writerow(header)

    for link in unique_links:
        w.writerow([link])
Start after unique_links, then loop over unique_links.
Set a base url and join the base together with the relative links.
Then it looks like this.
import requests
from bs4 import BeautifulSoup

url = "https://www.census.gov/programs-surveys/popest.html"
r = requests.get(url)
raw_html = r.text
soup = BeautifulSoup(raw_html, 'html.parser')
results = soup.find_all("a")
unique_links = []
for link in results:
    link = link.get("href")
    if link is None:
        pass
    elif link.startswith('#content'):
        pass
    else:
        #print(link)
        unique_links.append(link)

unique_links = set(unique_links)
base_url = 'https://www.census.gov'
absolute_urls = []
for link in unique_links:
    if link.startswith('https'):
        absolute_urls.append(link)
    elif link.startswith('/'):
        absolute_urls.append(f'{base_url}{link}')
Now you have 120 absolute links if you count. Can you try to add the 2 tricky ones mentioned, so you get 122 total?
Output:
>>> len(absolute_urls)
120
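For reference, one possible way to extend the loop so the two fragment links also become absolute (a sketch only; base_page and the example set here are made up, in practice unique_links comes from the scrape above):

base_url = 'https://www.census.gov'
base_page = 'https://www.census.gov/programs-surveys/popest.html'

# Stand-in for the scraped unique_links set.
unique_links = {'https://www.census.gov/newsroom.html',
                '/programs-surveys/popest/data.html',
                '#', '#uscb-nav-skip-local'}

absolute_urls = []
for link in unique_links:
    if link.startswith('https'):
        absolute_urls.append(link)
    elif link.startswith('/'):
        absolute_urls.append(f'{base_url}{link}')
    elif link.startswith('#'):
        # the two tricky ones: fragments attach to the full page url
        absolute_urls.append(f'{base_page}{link}')

print(len(absolute_urls))  # 4 with this example set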