Feb-18-2021, 03:03 PM
Hi folks,
This is my first time posting, so apologies if there are any errors. I currently have a file with a list of URLs, and I am trying to create a Python program which will go to each URL, grab the text from the HTML page, and save it in a .txt file. I am currently using BeautifulSoup to scrape these sites, and many of them are throwing errors which I'm unsure how to solve. I am looking for a better way to do this. I have posted my code below.
from urllib.request import urlopen as uReq, Request
from bs4 import BeautifulSoup as soup

# extracts page contents using BeautifulSoup
def page_extract(url):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    webpage = uReq(req, timeout=5).read()
    page_soup = soup(webpage, "lxml")
    return page_soup

# opens the file that contains the links
with open('links.txt', 'r') as file1:
    lines = file1.readlines()

# iterate through the list of URLs, using the index as the output file name
for i, line in enumerate(lines):
    fileName = str(i) + ".txt"
    url = line.strip()  # strip the trailing newline, or urlopen will reject the URL
    print(i)
    try:
        # if the scrape succeeds, save the page text in a text file
        # whose name is the index
        soup2 = page_extract(url)
        text = soup2.text
        with open("Politifact Files/" + fileName, "x") as f:
            f.write(str(text))
        print(url)
    except Exception:
        # otherwise record it in another folder, which collects all the
        # sites that threw an error
        with open("Politifact Files Not Completed/" + fileName, "x") as f:
            pass
        print("NOT DONE: " + url)
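Since some of the errors may come from the parsing step rather than the network, here is a minimal sketch of a fallback text extractor that uses only the standard library (no bs4 or lxml). This is purely an illustration, not part of the original code; the names TextExtractor and fetch_text are hypothetical, and it assumes a simple "visible text, minus script/style" definition of page text is good enough.

```python
# Sketch: stdlib-only page-text extraction as a fallback for BeautifulSoup.
# Assumptions: pages are UTF-8 (decoded with errors="replace"), and "text"
# means all character data outside <script> and <style> blocks.
from html.parser import HTMLParser
from urllib.request import Request, urlopen

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self._skip = 0      # depth of open script/style tags
        self._chunks = []   # collected text fragments

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)

def fetch_text(url, timeout=5):
    """Fetch a URL and return its visible text; network errors propagate."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urlopen(req, timeout=timeout).read().decode("utf-8", errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()
```

You could drop fetch_text in place of page_extract(url).text in the loop above; because it avoids a third-party parser, it sidesteps lxml installation or parser-compatibility issues, at the cost of cruder text extraction than BeautifulSoup's.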