Python Forum

Web Scraping
Hi folks,

This is my first time posting, so apologies if there are any errors. I currently have a file with a list of URLs, and I am trying to write a Python program that visits each URL, grabs the text from the HTML page, and saves it in a .txt file. I am currently using BeautifulSoup to scrape these sites, and many of them are throwing errors which I'm unsure how to solve. I am looking for a better way to do this; I have posted my code below.

from urllib.request import urlopen as uReq
from urllib.request import Request
from bs4 import BeautifulSoup as soup

#extracts the page contents using BeautifulSoup
def page_extract(url):
    req = Request(url,
                  headers={'User-Agent': 'Mozilla/5.0'})
    webpage = uReq(req, timeout=5).read()
    page_soup = soup(webpage, "lxml")
    return page_soup

#opens the file that contains the links
with open('links.txt', 'r') as file1:
    lines = file1.readlines()

#iterates through the list of URLs
for i, line in enumerate(lines):
    fileName = str(i) + ".txt"
    url = line.strip()  #readlines() keeps the trailing newline, which breaks urlopen
    print(i)
    try:
        #if the scraping succeeds, save the page text in a text file named after the index
        soup2 = page_extract(url)
        text = soup2.text
        with open("Politifact Files/" + fileName, "w") as f:  #"w" so re-runs overwrite instead of raising FileExistsError
            f.write(text)
        print(url)
    except Exception:
        #otherwise create an empty marker file in the folder of sites that threw an error
        with open("Politifact Files Not Completed/" + fileName, "w") as f:
            pass
        print("NOT DONE: " + url)

So I was able to find a solution to this problem by looking into Python libraries that can scrape the important text off a webpage. I came across one called 'Trafilatura', which accomplishes exactly this. The documentation for the library is at https://pypi.org/project/trafilatura/.
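A minimal sketch of how it can be used, assuming the same links.txt input and output folder as my original script (fetch_url and extract are the two entry points documented on that page; the rest of the naming here is just mine):

import trafilatura

with open('links.txt', 'r') as file1:
    lines = file1.readlines()

for i, line in enumerate(lines):
    url = line.strip()
    downloaded = trafilatura.fetch_url(url)  #returns the raw HTML, or None on failure
    text = trafilatura.extract(downloaded) if downloaded else None  #pulls out the main text
    if text:
        with open("Politifact Files/" + str(i) + ".txt", "w") as f:
            f.write(text)
    else:
        print("NOT DONE: " + url)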

Quote: many of them are throwing errors which I'm unsure how to solve
Please post the errors (in BBCode error tags), complete and unaltered (other than x-ing out personal data), as important data is included in the traceback.
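If you want the loop to keep running while still recording what went wrong, one option (a sketch using the standard traceback module, reusing the page_extract function and lines list from your post, and a hypothetical errors.log file) is to log the full traceback per URL:

import traceback

for i, line in enumerate(lines):
    url = line.strip()
    try:
        page_extract(url)
    except Exception:
        #append the URL and its full traceback to a log for later inspection
        with open("errors.log", "a") as log:
            log.write(url + "\n" + traceback.format_exc() + "\n")

That way you can paste the complete tracebacks here once the run finishes.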