Mar-11-2020, 05:10 PM
I haven't been able to work on this since last summer. I originally posted on this board as MidnightDreamer, but I no longer know where my credentials are.
Three main questions:
1) How do I set up this script so that it cycles through a list of links (i.e., the pages of the locally downloaded and stored website I want to parse)?
2) How do I set up this script so that it starts with one file in a directory (e.g. 12345test1.html), saves that file's data to a CSV, then moves on to the next file and appends its data as a new row in the very same CSV, until it runs out of files in the directory?
3) How do I properly write the scraped data from Python into a MySQL table as rows, rather than writing directly to CSV?
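For questions 1 and 2, this is roughly the shape I think the loop needs. It is only a sketch, not something I have run against the real site. It assumes every saved page keeps the opinion in an element with id="opinion", as in the script further down. The names parse_opinion, pages_from_directory, pages_from_urls and append_to_csv are made up for illustration and are not part of my existing code.

```python
import csv
from pathlib import Path
from urllib.request import urlopen


def parse_opinion(html_text, source):
    """Return one CSV row: [opinion text, page title, source]."""
    # Imported here so the loop below can also be used with another parser.
    from bs4 import BeautifulSoup  # third-party, same library as my script

    soup = BeautifulSoup(html_text, "html.parser")
    opinion = soup.find(id="opinion")  # assumption: the id used by the pages
    title = soup.find("title")
    return [
        opinion.get_text() if opinion else "",
        title.get_text() if title else "",
        source,
    ]


def pages_from_directory(html_dir):
    """Yield (html_text, source) for each .html file, sorted by file name."""
    for path in sorted(Path(html_dir).glob("*.html")):
        yield path.read_text(encoding="utf-8", errors="replace"), str(path)


def pages_from_urls(urls):
    """Yield (html_text, source) for each link in a list of URLs."""
    for url in urls:
        with urlopen(url) as response:
            yield response.read().decode("utf-8", errors="replace"), url


def append_to_csv(pages, csv_path, parse=parse_opinion):
    """Append one row per page to csv_path and return how many were written."""
    count = 0
    # "a" appends, so rows from earlier runs stay in the file.
    with open(csv_path, "a", newline="", encoding="utf-8") as fp:
        writer = csv.writer(fp, dialect="excel")
        for html_text, source in pages:
            writer.writerow(parse(html_text, source))
            count += 1
    return count
```

Calling append_to_csv(pages_from_directory("some_folder"), "opinions.csv") would cover question 2, and append_to_csv(pages_from_urls(list_of_links), "opinions.csv") would cover question 1. Both the folder name and the CSV name here are placeholders. Note that the files are sorted as text, so test10.html would come before test2.html.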
Below is the code that was working in July 2019; it covers some of the requirements of the script I started.
Any pointers would be extremely helpful!
Best Regards,
Brandon Kastning
# Justia Court Opinion Scraper
# Works - Scrapes opinion with HTML tags
# Works - Scrapes opinion with HTML tags stripped
# Works - Write to CSV with HTML tags
# Works - Write to CSV without HTML tags
# July 14, 2019
# localhost and law.justia.com are interchangeable!
import csv
from urllib.request import urlopen

import pymysql
from bs4 import BeautifulSoup

#html = urlopen("http://localhost/cases/federal/appellate-courts/F2/1/18/1506993/")
html = urlopen("http://localhost/cases/federal/appellate-courts/F2/999/663/308588/")
#html = urlopen("http://localhost/cases/federal/appellate-courts/F3/491/1/510017/")
#html = urlopen("http://localhost/cases/federal/us/385/206/case.html") <--- DOES NOT WORK with id="opinion"

# Name the parser explicitly so the result does not depend on what is installed
bsObj = BeautifulSoup(html.read(), "html.parser")
#bsObj.findAll(id="opinion")
allOpinion = bsObj.findAll(id="opinion")

# Want the TITLE of the Page in a Variable
url = "http://localhost/cases/federal/appellate-courts/F2/999/663/308588/"
allTitle = bsObj.findAll({"title"})
allURL = url
#print(allOpinion[0].get_text())
# ^ Will Strip HTML tags and only store plain-text

# Column 1 [ ] # / / of URL (third to last) (i.e /1/)
# Column 2 [ ] # / / of URL (second to last) (i.e /18)
# Column 3 [ ] # / / of URL (last) (i.e /1506993/)
# Column 4 [ allOpinion w/ HTML Tags ]
# Column 5 [ allOpinion w/ Stripped HTML Tags - Plaintext lump ]

# Store allOpinion to CSV File w/ Tags
db = pymysql.connect(host="localhost", user="brandon", password="_yLKVPSiTfEQowz_v745H5xKSUkFDUyvtyW_", db="JustiaPython", charset='utf8')
print(allOpinion)
print(allTitle)
print(allURL)

csvRow = [allOpinion, allTitle, allURL]
csvfile = "current_F2_opinion_with_tags_current.csv"
# newline="" keeps the csv module from writing blank lines between rows on Windows
with open(csvfile, "a", newline="") as fp:
    wr = csv.writer(fp, dialect='excel')
    wr.writerow(csvRow)
# wr.writerow(['1'])
# ^ Works with retaining all the HTML tags; NEXT - Store allOpinion to a CSV, then MySQL.

# Loop w/ Stripping HTML Tags for allOpinion and its CSV output
print(allOpinion[0].get_text(), url)

csvRow = [allOpinion[0].get_text(), allTitle[0].get_text(), allURL]
csvfile = "current_F2_opinion_without_tags_current.csv"
with open(csvfile, "a", newline="") as fp:
    wr = csv.writer(fp, dialect='excel')
    wr.writerow(csvRow)
# wr.writerow(['1'])
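For question 3, my rough idea is to replace the CSV write with a parameterized INSERT through the PyMySQL connection. This is a sketch only. The table name opinions and its four columns are hypothetical (I have not created such a table), and connect and save_opinion are illustrative names, not existing code.

```python
# Hypothetical table layout: opinions(url, title, opinion_html, opinion_text).
INSERT_SQL = (
    "INSERT INTO opinions (url, title, opinion_html, opinion_text) "
    "VALUES (%s, %s, %s, %s)"
)


def connect(password):
    """Open a connection to the JustiaPython database on localhost."""
    import pymysql  # third-party; imported lazily so the rest runs without it

    return pymysql.connect(
        host="localhost",
        user="brandon",
        password=password,
        database="JustiaPython",
        charset="utf8mb4",
    )


def save_opinion(connection, url, title, opinion_html, opinion_text):
    """Insert one scraped opinion as a row and commit it."""
    with connection.cursor() as cursor:
        # The driver fills in the %s placeholders and escapes the values,
        # so quotes inside the opinion text cannot break the statement.
        cursor.execute(INSERT_SQL, (url, title, opinion_html, opinion_text))
    connection.commit()
```

Inside the scraping loop this would be called once per page, for example save_opinion(db, allURL, allTitle[0].get_text(), str(allOpinion[0]), allOpinion[0].get_text()), using the variables from the script above.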
“And one of the elders saith unto me, Weep not: behold, the Lion of the tribe of Juda, the Root of David, hath prevailed to open the book,...” - Revelation 5:5 (KJV)
“And oppress not the widow, nor the fatherless, the stranger, nor the poor; and ...” - Zechariah 7:10 (KJV)
#LetHISPeopleGo