Python Forum
Court Opinion Scraper in Python w/ BS4 (Currently exports to CSV) need help with SQL
#1
# Justia Court Opinion Scraper
# Works - Scrapes opinion with HTML tags
# Works - Scrapes opinion with HTML tags stripped
# Works - Write to CSV with HTML tags
# Works - Write to CSV without HTML tags
# July 14, 2019
# localhost and law.justia.com are interchangeable!

from urllib.request import urlopen
from bs4 import BeautifulSoup
#html = urlopen("http://localhost/cases/federal/appellate-courts/F2/1/18/1506993/")
html = urlopen("http://localhost/cases/federal/appellate-courts/F2/999/663/308588/")
#html = urlopen("http://localhost/cases/federal/appellate-courts/F3/491/1/510017/")
#html = urlopen("http://localhost/cases/federal/us/385/206/case.html") <--- DOES NOT WORK with id="opinion"
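# Name the parser explicitly; without it bs4 warns and picks whichever parser is installed
bsObj = BeautifulSoup(html.read(), "html.parser")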
#bsObj.findAll(id="opinion")
allOpinion = bsObj.findAll(id="opinion")

# Want the TITLE of the Page in a Variable

# BeautifulSoup is already imported above, and requests is never used in this script
import pymysql

url = "http://localhost/cases/federal/appellate-courts/F2/999/663/308588/"

allTitle = bsObj.findAll("title")

allURL = url

#print(allOpinion[0].get_text())
# ^ Will Strip HTML tags and only store plain-text

# Column 1 [ ]
# / / of URL (third to last) (i.e /1/)
# Column 2 [ ]
# / / of URL (second to last) (i.e /18)
# Column 3 [ ]
# / / of URL (last) (i.e /1506993/)

# Column 4 [ allOpinion w/ HTML Tags ]

# Column 5 [ allOpinion w/ Stripped HTML Tags - Plaintext lump ]

# Store allOpinion to CSV File w/ Tags

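import os
# Read the password from the environment instead of hard-coding it in the script.
# DB_PASSWORD is a placeholder variable name.
db = pymysql.connect(host="localhost",
                 user="brandon",
                 password=os.environ["DB_PASSWORD"],
                 db="JustiaPython",
                 charset='utf8')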



print(allOpinion)
print(allTitle)
print(allURL)

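import csv
# Index [0] and str() write the first match's HTML rather than the repr of the whole result list
csvRow = [str(allOpinion[0]), str(allTitle[0]), allURL]
csvfile = "current_F2_opinion_with_tags_current.csv"
# newline="" keeps the csv module from adding extra carriage returns on Windows
with open(csvfile, "a", newline="", encoding="utf-8") as fp: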
    wr = csv.writer(fp, dialect='excel')
    wr.writerow(csvRow)
#    wr.writerow(['1'])
# ^ Works with retaining all the HTML tags; NEXT - Store allOpinion to a CSV, then MySQL.


# Loop w/ Stripping HTML Tags for allOpinion and its CSV output

print(allOpinion[0].get_text(),url)

csvRow = [allOpinion[0].get_text(),allTitle[0].get_text(),allURL]
csvfile = "current_F2_opinion_without_tags_current.csv"
with open(csvfile, "a", newline="", encoding="utf-8") as fp:
    wr = csv.writer(fp, dialect='excel')
    wr.writerow(csvRow)
#    wr.writerow(['1'])
I am trying to figure out a few things to make this a functional script. I would like to learn how to make pymysql work correctly so that I can create a row in MariaDB containing allTitle, allURL, and allOpinion, and append a new row for each result.
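
One possible way to write such a row with PyMySQL is sketched below. The table name `opinions` and its columns `title`, `url`, and `opinion` are assumptions made for illustration and do not come from the script; the helper `save_opinion` is likewise hypothetical. The sketch uses a parameterized INSERT so that quotes inside the opinion text do not break the statement.

```python
# Assumed table (hypothetical names), created once in MariaDB, for example:
#   CREATE TABLE opinions (
#       id INT AUTO_INCREMENT PRIMARY KEY,
#       title TEXT,
#       url VARCHAR(512),
#       opinion LONGTEXT
#   );

INSERT_SQL = "INSERT INTO opinions (title, url, opinion) VALUES (%s, %s, %s)"


def save_opinion(db, title, url, opinion):
    """Insert one row through an open DB-API connection such as pymysql's."""
    with db.cursor() as cursor:
        # The %s placeholders are filled in by the driver, which escapes the values
        cursor.execute(INSERT_SQL, (title, url, opinion))
    # PyMySQL does not autocommit by default, so the row is only saved after commit()
    db.commit()
```

With the `db` connection from the script, a call might look like `save_opinion(db, allTitle[0].get_text(), allURL, str(allOpinion[0]))`. Each call adds one row, so running it once per opinion appends results in the same way the CSV version does.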

I am also trying to figure out how to store certain parts of the URL as variables, such as "999", "663", and "308588".
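
A minimal sketch of one way to pull those pieces out, using only the standard library. The function name `last_three_segments` and the variable names `volume`, `page`, and `case_id` are guesses at what the numbers mean and are not taken from the script.

```python
from urllib.parse import urlsplit


def last_three_segments(url):
    """Return the last three non-empty path segments of url as strings."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) < 3:
        raise ValueError(f"expected at least three path segments in {url!r}")
    return tuple(segments[-3:])


url = "http://localhost/cases/federal/appellate-courts/F2/999/663/308588/"
# Variable names are assumptions about what each number represents
volume, page, case_id = last_three_segments(url)
print(volume, page, case_id)  # prints: 999 663 308588
```

The values stay strings here; wrap them in `int()` if the database columns are numeric.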

My long-term goal: I have a couple of folders of these opinions that I would like to scrape and store properly with these variables. How can I run html = urlopen() over a list of links rather than a single URL? I am guessing that at the end of this script I will want a loop that moves on to the next court opinion.
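
One way to loop over a link list is sketched below, assuming the links are kept one per line in a text file. The file name `links.txt` and the helpers `read_links`, `download`, and `fetch_all` are hypothetical names introduced for this example.

```python
from urllib.request import urlopen


def read_links(path):
    """Return the non-blank, non-comment lines of a text file as a list of URLs."""
    with open(path, encoding="utf-8") as fp:
        lines = [line.strip() for line in fp]
    return [line for line in lines if line and not line.startswith("#")]


def download(url):
    """Fetch one URL and return the raw response bytes."""
    with urlopen(url) as response:
        return response.read()


def fetch_all(urls, fetch=download):
    """Yield (url, raw_html) for each URL, skipping any that fail to load."""
    for url in urls:
        try:
            yield url, fetch(url)
        except OSError as exc:  # URLError and HTTPError are subclasses of OSError
            print(f"skipping {url}: {exc}")


# Intended use (not run here because it needs the local server):
# for url, html in fetch_all(read_links("links.txt")):
#     bsObj = BeautifulSoup(html, "html.parser")
#     ... same extraction and CSV code as in the script ...
```

Skipping a failed URL instead of stopping means one missing page does not end a long run; the printed message shows which links need a second look.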

Thanks for any help!
#2
I haven't been able to work on this since last summer. I originally posted on the board here as MidnightDreamer and don't know where my credentials are.

Three main questions:

1) How do I set up this script so that it cycles through a list of links (i.e., for the locally downloaded and stored website I am looking to parse)?

2) How do I set up this script so that it starts with one file in a directory, e.g. 12345test1.html, saves that one to CSV, then moves to the next file and appends it to the very same CSV as a new row, until it runs out of files in the directory?

3) How do I properly write the scraped data into MySQL as database table entries, rather than writing directly to CSV?
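
For question 2, a rough sketch of a directory loop that appends one CSV row per file. The function names `extract_with_bs4` and `directory_to_csv`, and the choice to add the file name as the last column, are assumptions for illustration; the extraction mirrors the id="opinion" and title lookups from the script.

```python
import csv
from pathlib import Path


def extract_with_bs4(html_text):
    """Return [opinion_text, title_text] for one page, or empty strings if missing."""
    # Imported here so the directory loop itself only needs the standard library
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html_text, "html.parser")
    opinion = soup.find(id="opinion")
    title = soup.find("title")
    return [
        opinion.get_text() if opinion else "",
        title.get_text() if title else "",
    ]


def directory_to_csv(folder, csv_path, extract=extract_with_bs4):
    """Append one CSV row per .html file in folder; return the number of rows written."""
    files = sorted(Path(folder).glob("*.html"))  # sorted for a repeatable order
    with open(csv_path, "a", newline="", encoding="utf-8") as fp:
        writer = csv.writer(fp, dialect="excel")
        for path in files:
            html_text = path.read_text(encoding="utf-8", errors="replace")
            # The file name goes in the last column so each row can be traced back
            writer.writerow(extract(html_text) + [path.name])
    return len(files)
```

Because the CSV is opened once in append mode, every file in the folder lands on its own row of the same file, and running the function again adds further rows rather than overwriting.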

The code in post #1 is what was working in July 2019 for some of the requirements of the script I started.

Any pointers would be extremely helpful!

Best Regards,

Brandon Kastning
#3
Your original post is here: https://python-forum.io/Thread-Court-Opi...p-with-SQL
#4
I merged both threads
#5
(Mar-11-2020, 08:43 PM)buran Wrote: I merged both threads

buran,

Thank you very much!