Jul-15-2019, 01:50 AM
# Justia Court Opinion Scraper
# Works - Scrapes opinion with HTML tags
# Works - Scrapes opinion with HTML tags stripped
# Works - Write to CSV with HTML tags
# Works - Write to CSV without HTML tags
# July 14, 2019
# localhost and law.justia.com are interchangeable!

import csv

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pymysql

#html = urlopen("http://localhost/cases/federal/appellate-courts/F2/1/18/1506993/")
html = urlopen("http://localhost/cases/federal/appellate-courts/F2/999/663/308588/")
#html = urlopen("http://localhost/cases/federal/appellate-courts/F3/491/1/510017/")
#html = urlopen("http://localhost/cases/federal/us/385/206/case.html")  # <--- DOES NOT WORK with id="opinion"

url = "http://localhost/cases/federal/appellate-courts/F2/999/663/308588/"

# Naming a parser explicitly avoids BeautifulSoup's "no parser specified" warning
bsObj = BeautifulSoup(html.read(), "html.parser")

allOpinion = bsObj.findAll(id="opinion")
allTitle = bsObj.findAll("title")  # the TITLE of the page in a variable
allURL = url

#print(allOpinion[0].get_text())
# ^ Will strip HTML tags and only store plain text

# Column 1 [ ]  # / / of URL (third to last) (i.e. /1/)
# Column 2 [ ]  # / / of URL (second to last) (i.e. /18/)
# Column 3 [ ]  # / / of URL (last) (i.e. /1506993/)
# Column 4 [ allOpinion w/ HTML tags ]
# Column 5 [ allOpinion w/ stripped HTML tags - plain-text lump ]

db = pymysql.connect(host="localhost", user="brandon",
                     password="_yLKVPSiTfEQowz_v745H5xKSUkFDUyvtyW_",
                     db="JustiaPython", charset="utf8")

print(allOpinion)
print(allTitle)
print(allURL)

# Store allOpinion to a CSV file w/ tags
csvRow = [allOpinion, allTitle, allURL]
csvfile = "current_F2_opinion_with_tags_current.csv"
with open(csvfile, "a") as fp:
    wr = csv.writer(fp, dialect="excel")
    wr.writerow(csvRow)

# ^ Works with retaining all the HTML tags; NEXT - store allOpinion to a CSV, then MySQL.
# Loop w/ stripping HTML tags for allOpinion and its CSV output
print(allOpinion[0].get_text(), url)

csvRow = [allOpinion[0].get_text(), allTitle[0].get_text(), allURL]
csvfile = "current_F2_opinion_without_tags_current.csv"
with open(csvfile, "a") as fp:
    wr = csv.writer(fp, dialect="excel")
    wr.writerow(csvRow)

I am trying to figure out a few things to make this a functional script. I would like to learn how to make pymysql work correctly so I can create a row with allTitle, allURL, and allOpinion in MariaDB and append results as they come in.
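A sketch of what the MariaDB side could look like, untested against your setup: opinion_row() packs the scraped pieces into one tuple, and save_opinion() inserts it with pymysql using the same connection settings as the script above. The table name "opinions" and its columns are my invention, so rename them to match whatever schema you create.

```python
INSERT_SQL = (
    "INSERT INTO opinions (title, url, opinion_html, opinion_text) "
    "VALUES (%s, %s, %s, %s)"
)

def opinion_row(allTitle, allOpinion, url):
    """Build the parameter tuple for INSERT_SQL from the BeautifulSoup results."""
    return (allTitle[0].get_text(),      # page title, tags stripped
            url,                         # full URL of the opinion
            str(allOpinion[0]),          # opinion with HTML tags
            allOpinion[0].get_text())    # opinion as a plain-text lump

def save_opinion(row):
    # Imported here so the helper above can be used without pymysql installed.
    import pymysql
    db = pymysql.connect(host="localhost", user="brandon",
                         password="_yLKVPSiTfEQowz_v745H5xKSUkFDUyvtyW_",
                         db="JustiaPython", charset="utf8")
    try:
        with db.cursor() as cur:
            # %s placeholders are escaped by pymysql, so raw HTML is safe to insert
            cur.execute(INSERT_SQL, row)
        db.commit()  # pymysql does not autocommit by default
    finally:
        db.close()
```

You would create the table once in MariaDB first, e.g. `CREATE TABLE opinions (id INT AUTO_INCREMENT PRIMARY KEY, title TEXT, url VARCHAR(255), opinion_html MEDIUMTEXT, opinion_text MEDIUMTEXT);` with MEDIUMTEXT because a full opinion can exceed TEXT's 64 KB limit.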
I am also trying to figure out how to store certain parts of the URL as variables, such as "999", "663", and "308588".
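For pulling "999", "663", and "308588" out of the URL, the standard-library string methods are enough: strip the trailing slash, split on "/", and keep the last three pieces. I'm calling them volume, page, and opinion_id below, but those names are guesses about what the path components mean.

```python
url = "http://localhost/cases/federal/appellate-courts/F2/999/663/308588/"

# Drop the trailing "/" first so the last split piece isn't an empty string,
# then keep the final three path components.
volume, page, opinion_id = url.rstrip("/").split("/")[-3:]

print(volume, page, opinion_id)  # -> 999 663 308588
```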
My long-term goal: I have a couple of folders of these opinions that I would like to scrape and store properly along with these variables. How can I go about calling html = urlopen() on a link list rather than a single URL? I am guessing that at the end of this script I will want to write a loop that moves on to the next court opinion.
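One common pattern is to keep the URLs in a plain-text file, one per line, and iterate over them. A sketch under those assumptions (the filename "links.txt" is made up, and the CSV row follows the five-column layout planned above):

```python
import csv

def read_links(path):
    """Return one URL per non-blank line of a text file."""
    with open(path) as fp:
        return [line.strip() for line in fp if line.strip()]

def scrape_all(urls, csvfile="current_F2_opinions.csv"):
    # Imports kept local so read_links() works even without bs4 installed.
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    with open(csvfile, "a", newline="") as fp:
        wr = csv.writer(fp, dialect="excel")
        for url in urls:
            bsObj = BeautifulSoup(urlopen(url).read(), "html.parser")
            opinion = bsObj.find(id="opinion")
            if opinion is None:
                # e.g. the /federal/us/... pages that lack id="opinion"
                print("no id='opinion' found on", url)
                continue
            volume, page, opinion_id = url.rstrip("/").split("/")[-3:]
            wr.writerow([volume, page, opinion_id,
                         str(opinion),           # with HTML tags
                         opinion.get_text()])    # tags stripped

# Usage: scrape_all(read_links("links.txt"))
```

The same loop body is also where a save_opinion()-style MariaDB insert would go once the table exists, so each opinion lands in both the CSV and the database in one pass.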
Thanks for any help!