Python Forum
Saving links as text
#1
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

BASE_URL = "https://heppa.hippos.fi"
pages = set()  # every link path seen so far


def getLinks(pageUrl):
    # Fetch one page and parse it
    html = urlopen(BASE_URL + pageUrl)
    bsobj = BeautifulSoup(html, 'lxml')
    # Only <a> tags whose href starts with /heppa/ are matched,
    # so every matched tag is guaranteed to have an href
    for link in bsobj.find_all("a", href=re.compile("^(/heppa/)")):
        newPage = link.attrs['href']
        if newPage not in pages:
            print(newPage)
            pages.add(newPage)
            getLinks(newPage)  # recurse into the newly found page


getLinks("")


I'm new to Python and web scraping. I found this code somewhere. I'm trying to modify it so that the links are saved to a file, but I can't get it to work.

Please help me.
Thanks in advance.
#2
This code may have worked in the past (and still may), and saving the links should be simple. However,
the web page is almost entirely JavaScript, so to scrape it properly you should use Selenium.
There are two tutorials on this site that you should run through (it doesn't take long):
web scraping part1
web scraping part2
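As for the saving part on its own, here is a minimal sketch, assuming the links have already been collected into a set like pages in the first post. The function name save_links and the file name links.txt are made up for illustration; they are not part of the original script.

```python
from pathlib import Path


def save_links(links, path):
    """Write each link on its own line and return how many were written.

    `links` is any iterable of strings (for example the `pages` set).
    The links are sorted so the file content is the same on every run.
    """
    ordered = sorted(links)
    text = "".join(link + "\n" for link in ordered)
    Path(path).write_text(text, encoding="utf-8")
    return len(ordered)


# Hypothetical usage at the end of the original script:
# getLinks("")
# save_links(pages, "links.txt")
```

Writing the file once after the crawl keeps the file handling out of the recursive function. If the crawl is long and might be interrupted, opening the file in append mode and writing each link as it is found would be a reasonable alternative.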