Python Forum

Need Pointers/Advice for Cleaning up BS4 XPATH Data
No syntax errors. Script runs!

I am having problems with BS4 & XPATH. Column by column:

Column 1 (courtlistener_case_name) -- BS4 tag: works perfectly.
Column 2 (courtlistener_jurisdiction) -- XPATH: works perfectly.
Column 3 (courtlistener_filed) -- formatting is a mess: stray spaces around the value. How do I filter these so the column holds only the date?
Column 4 (courtlistener_precedential_status) -- only shows the header label, not the actual value.
Column 5 (courtlistener_docket_number) -- also has stray spaces; the docket number is not the only thing in the column.
Column 6 (courtlistener_pdf_opinion_url_storage) -- getting a Null value instead of the PDF URL.
Column 7 (courtlistener_pdf_opinion_url_gov) -- getting a Null value instead of the PDF URL.

If you comment out the active URL line and un-comment another, you can switch jurisdictions; the XPaths I am scraping seem to interchange properly between the different jurisdictions, so the pages must share a template and my XPATH scraping of that template is where the problem lies.

Any pointers / advice would be helpful!
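
For reference, here is a minimal standalone sketch of the two techniques I suspect I need -- stripping whitespace and reading an <a> tag's @href instead of its text -- run against a made-up HTML fragment (the real CourtListener markup may differ):

from lxml import etree

sample = ("<article><p><span>Filed:</span><span>  April 12th, 2018\n  </span></p>"
          "<ul><li><a href='/pdf/opinion.pdf'>Download PDF</a></li></ul></article>")
dom = etree.HTML(sample)

filed = dom.xpath('//article/p/span[2]/text()')[0].strip()   # strip() drops the stray spaces
pdf_href = dom.xpath('//article//li[1]/a/@href')[0]          # @href returns the URL, .text the label

print(filed)     # April 12th, 2018
print(pdf_href)  # /pdf/opinion.pdf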

Dragon Breath Augmentation [A01].py:

# Source 1: https://itsmycode.com/python-urllib-error-httperror-http-error-403-forbidden/ | User-Agent Spoofing
# Source 2: https://www.quora.com/How-do-I-extract-text-from-multiple-web-site-URLs-at-once | BS4 Multiple URLS Example

# Target URL for Scraping: https://www.courtlistener.com/opinion/4902955/state-v-schierman/
# Beginning URL: https://www.courtlistener.com
# End URL: column: courtlistener_absolute_url
# courtlistener_absolute_url = /opinion/4902955/state-v-schierman/

# Thanks to irc.libera.chat #python - nedbat (Debugging) :)
# And to Repiphany and Falc who also offered to assist with full tracebacks!
# And jinsun__ who keyed me into upgrading to 3.10/3.11 for major syntax error updates + other good things
# 03/08/2022 @ Approx. 02:30

# Database Name: EXODUS_CL_FULL_ENOUGH_WASH_SUPREME
# MariaDB Table Name: 4706734_Dropped_Columns_5
# Database Collation: utf8mb4_unicode_ci
# Database Schema Details:
# id / INT / 11 / AUTO_INCREMENT / PRIMARY_KEY
# courtlistener_case_name / TEXT / NULL
# courtlistener_jurisdiction / TEXT / NULL
# courtlistener_filed / TEXT / NULL
# courtlistener_precedential_status / TEXT / NULL
# courtlistener_docket_number / TEXT / NULL
# exodus_courtlistener_dropped_columns_entry_timestamp / DATETIME / CURRENT_TIMESTAMP

# 5 Mapped HTML Tags | MariaDB Column Name -> HTML Tag:
# courtlistener_case_name = htmlParse.find_all("h2")[0].get_text()
# courtlistener_jurisdiction = htmlParse.find_all("h3")[0].get_text()
# courtlistener_filed = htmlParse.find_all("p")[0].get_text()
# courtlistener_precedential_status = htmlParse.find_all("p")[1].get_text()
# courtlistener_docket_number = htmlParse.find_all("p")[2].get_text()

# 2 Additional HTML Tags | MariaDB Column Name -> HTML Tag:
# courtlistener_pdf_opinion_url_storage = htmlParse.find_all("")[0].get_text()
# courtlistener_pdf_opinion_url_gov = htmlParse.find_all("")[0].get_text()

# Example - courtlistener_pdf_opinion_url_storage = https://storage.courtlistener.com/pdf/2018/04/12/state_v._schierman_1.pdf
# Example - courtlistener_pdf_opinion_url_gov = http://www.courts.wa.gov/opinions/pdf/846146.pdf

# Content Needed for New Columns on DragonBreath [F01]
input("Dragon Breath Augmentation: Dropped Columns [A01] is activated. Now firing up all Systems... | Press ENTER to Continue...")
print("Now importing Python Module Libraries Required for Dragon Breath Augmentation: Dropped Columns [A01] to run successfully...")
import urllib.request
import pymysql
import pymysql.cursors
from bs4 import BeautifulSoup
import re

# Added from Source 2 - Read URLs from Text File
#list_open = open("json.courtlistener.exodus.opinion.for.current.dataset.urls.txt")
#read_list = list_open.read()
#line_in_list = read_list.split("\n")

# Setup a function iteration for the User-Agent with Requests
#for url in line_in_list:
# soup = BeautifulSoup(urllib2.urlopen(url).read(),'html')
#    from urllib.request import Request, urlopen
#    req = Request(read_list, headers={'User-Agent': 'Mozilla/5.0'})
#    html = urlopen(req).read()

# User-Agent Spoofing - urllib Request
from urllib.request import Request, urlopen
req = Request('https://www.courtlistener.com/opinion/4902955/state-v-schierman/', headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req, timeout=10).read()
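
# Optional hardening (a sketch): catch the 403s that Source 1 above is about,
# instead of letting the script die mid-scrape.
# from urllib.error import HTTPError, URLError
# try:
#     html = urlopen(req, timeout=10).read()
# except (HTTPError, URLError) as err:
#     print(f"Fetch failed: {err}")
#     raise SystemExit(1)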

# Import Additional Python3 Required Module Libraries (BeautifulSoup is already imported above)
from lxml import etree
import requests

# Assign a Python Variable to Beautiful Soup 4's HTML Parser
htmlParse = BeautifulSoup(html, 'html.parser')

# Select HTML Element for Data Parsing Column #1 [courtlistener_case_name]

pvar_bs4_tag_courtlistener_case_name = htmlParse.find_all("h2")[0].get_text()
print(htmlParse.find_all("h2")[0].get_text())
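
# If this column ever picks up padding too, BeautifulSoup can strip it at the
# source: get_text(strip=True) trims surrounding whitespace.
# pvar_bs4_tag_courtlistener_case_name = htmlParse.find_all("h2")[0].get_text(strip=True)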
#input1 = input("Prep Parsing Column 1 - courtlistener_case_name - Type CONTINUE to Continue: ")
#print(input1)

#input("Column 1 of 7 ... courtlistener_case_name ... Parsed Correctly? Press ENTER to Continue...")

# Select HTML Element for Data Parsing Column #2 [courtlistener_jurisdiction]
#htmlParse.find_all("h3")[2].get_text()
# XPATH : /html/body/div[1]/div[1]/article/h3

# WASH APPELLATE COURT (URL):
URL = "https://www.courtlistener.com/opinion/4340668/state-of-washington-v-jose-jesus-mancilla/"

# WASH SUPREME COURT (URL):
#URL = "https://www.courtlistener.com/opinion/4902955/state-v-schierman/"

# 9TH CIRCUIT COURT (URL):
#URL = "https://www.courtlistener.com/opinion/525248/united-states-v-irma-nuno-para-united-states-of-america-v-jesus/"

# US SCOTUS (URL):
#URL = "https://www.courtlistener.com/opinion/92810/late-corp-of-church-of-jesus-christ-of-latter-day-saints-v-united-states/"

# First Instance of XPATH Request / Parse Requirements & Variables (Thanks to chevignon93 on reddit.com!)

# Select HTML Element for Data Parsing Column #2 (Cont.) [courtlistener_jurisdiction]
HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
           'Accept-Language': 'en-US, en;q=0.5'}

webpage = requests.get(URL, headers=HEADERS)
xsoup = BeautifulSoup(webpage.content, "html5lib")
dom = etree.HTML(str(xsoup))
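
# A sketch of a less brittle alternative to these absolute /html/body/... paths:
# anchor the lookup on the <article> element, so a change in the surrounding
# template does not shift every index. The structure inside <article> is an
# assumption here, not verified on every jurisdiction's page.
# jurisdiction_nodes = dom.xpath('//article/h3')
# print(jurisdiction_nodes[0].text if jurisdiction_nodes else None)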

pvar_dom_xpath_courtlistener_jurisdiction = dom.xpath('/html/body/div[1]/div[1]/article/h3')[0].text
print(dom.xpath('/html/body/div[1]/div[1]/article/h3')[0].text)
#input2 = input("Prep Parsing Column 2 - courtlistener_jurisdiction - Type CONTINUE to Continue: ")
#print(input2)
#input("Column 2 of 7 ... courtlistener_jurisdiction ... Parsed Correctly? Press ENTER to Continue...")

#input("STOP: RESULTS, CORRECT? ENTER FOR (Y)es")
#print("Now Stopping Dragon Breath Augment : [A01]...")
#exit()

# Select HTML Element for Data Parsing Column #3 [courtlistener_filed]
#htmlParse.find_all("p")[0].get_text()
# XPATH : /html/body/div[1]/div[1]/article/p[1]/span[2]

pvar_dom_xpath_courtlistener_filed = dom.xpath('/html/body/div[1]/div[1]/article/p[1]/span[2]')[0].text
print(dom.xpath('/html/body/div[1]/div[1]/article/p[1]/span[2]')[0].text)
#input3 = input("Prep Parsing Column 3 - courtlistener_filed - Type CONTINUE to Continue: ")
#print(input3)
#input("Column 3 of 7 ... courtlistener_filed ... Parsed Correctly? Press ENTER to Continue...")
#input("Column 3 of 7 : courtlistener_filed ... Parsed Correctly? Press ENTER to Continue...")

# Select HTML Element for Data Parsing Column #4 [courtlistener_precedential_status]
#htmlParse.find_all("p")[1].get_text()

# XPATH: /html/body/div[1]/div[1]/article/p[2]/span[2] (span[1] holds the label; span[2] the value)

pvar_dom_xpath_courtlistener_precedential_status = dom.xpath('/html/body/div[1]/div[1]/article/p[2]/span[2]')[0].text
print(dom.xpath('/html/body/div[1]/div[1]/article/p[2]/span[2]')[0].text)
#input4 = input("Prep Parsing Column 4 - courtlistener_precedential_status - Type CONTINUE to Continue: ")
#print(input4)
#input("Column 4 of 7 ... courtlistener_precedential_status ... Parsed Correctly? Press ENTER to Continue...")

# Select HTML Element for Data Parsing Column #5 [courtlistener_docket_number]
#htmlParse.find_all("p")[2].get_text()
# XPATH: /html/body/div[1]/div[1]/article/p[4]/span[2] (span[1] holds the label; span[2] the value)
pvar_dom_xpath_courtlistener_docket_number = dom.xpath('/html/body/div[1]/div[1]/article/p[4]/span[2]')[0].text
print(dom.xpath('/html/body/div[1]/div[1]/article/p[4]/span[2]')[0].text)
#input5 = input("Prep Parsing Column 5 - courtlistener_docket_number - Type CONTINUE to Continue: ")
#print(input5)
#input("Column 5 of 7 ... courtlistener_precedential_status ... Parsed Correctly? Press ENTER to Continue...")

# Select HTML Button Element for Download Opinion PDF Copy - [STORAGE] - Column #6 [courtlistener_pdf_opinion_url_storage]

#bs4_courtlistener_pdf_opinion_url_storage = /html/body/div[1]/div[1]/article/div[2]/ul/li[1]/a
pvar_dom_xpath_courtlistener_pdf_opinion_url_storage = dom.xpath('/html/body/div[1]/div[1]/article/div[2]/ul/li[1]/a')[0].text
#print(pvar_dom_xpath_courtlistener_pdf_opinion_url_storage)
print(dom.xpath('/html/body/div[1]/div[1]/article/div[2]/ul/li[1]/a')[0].text)
#input6 = input("Prep Parsing Column 6 - courtlistener_pdf_opinion_url_storage - Type CONTINUE to Continue: ")
#print(input6)
#input("Column 6 of 7 ... courtlistener_pdf_opinion_url_storage ... Parsed Correctly? Press ENTER to Continue...")

# Select HTML Button Element for Download Opinion PDF Copy - [GOV] - Column #7 [courtlistener_pdf_opinion_url_gov]
pvar_dom_xpath_courtlistener_pdf_opinion_url_gov = dom.xpath('/html/body/div[1]/div[1]/article/div[2]/ul/li[3]/a')[0].text
#print(pvar_dom_xpath_courtlistener_pdf_opinion_url_gov)
print(dom.xpath('/html/body/div[1]/div[1]/article/div[2]/ul/li[3]/a')[0].text)
#input7 = input("Prep Parsing Column 7 - courtlistener_pdf_opinion_url_gov - Type CONTINUE to Continue: ")
#print(input7)
#input("Column 7 of 7 ... courtlistener_pdf_opinion_url_gov ... Parsed Correctly? Press ENTER to Continue...")

# BREAK | CHECK OUTPUT BEFORE FINALIZING MARIADB COMMITS:
input("Dragon Breath Augmentation: [A01] - Part A: Completed | Press ENTER to exit Python3")

# Connection to MariaDB 10.5.x with a Database selected using PyMySQL
connection = pymysql.connect(host='localhost',
                 user='username',
                 password='password',
                 db='EXODUS_CL_FULL_ENOUGH_WASH_SUPREME',
                 charset='utf8mb4',
                 cursorclass=pymysql.cursors.DictCursor)

# Assign the variables that get passed to MariaDB. find_all("h2")[0] grabs the
# first <h2> tag and get_text() strips the markup, leaving only the text; the
# remaining columns reuse the lxml XPath lookups verified above.
bs4_courtlistener_case_name = htmlParse.find_all("h2")[0].get_text()
bs4_courtlistener_jurisdiction = dom.xpath('/html/body/div[1]/div[1]/article/h3')[0].text
bs4_courtlistener_filed = dom.xpath('/html/body/div[1]/div[1]/article/p[1]/span[2]')[0].text
bs4_courtlistener_precedential_status = dom.xpath('/html/body/div[1]/div[1]/article/p[2]/span[2]')[0].text
bs4_courtlistener_docket_number = dom.xpath('/html/body/div[1]/div[1]/article/p[4]/span[2]')[0].text
bs4_courtlistener_pdf_opinion_url_storage = dom.xpath('/html/body/div[1]/div[1]/article/div[2]/ul/li[1]/a')[0].text
bs4_courtlistener_pdf_opinion_url_gov = dom.xpath('/html/body/div[1]/div[1]/article/div[2]/ul/li[3]/a')[0].text
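
# Defensive pass before the INSERT (a sketch): strip template whitespace from
# every scraped string so the TEXT columns don't inherit the page's padding.
# None values pass through untouched.
(bs4_courtlistener_case_name,
 bs4_courtlistener_jurisdiction,
 bs4_courtlistener_filed,
 bs4_courtlistener_precedential_status,
 bs4_courtlistener_docket_number) = (
    value.strip() if isinstance(value, str) else value
    for value in (bs4_courtlistener_case_name,
                  bs4_courtlistener_jurisdiction,
                  bs4_courtlistener_filed,
                  bs4_courtlistener_precedential_status,
                  bs4_courtlistener_docket_number))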

#bs4_courtlistener_jurisdiction = htmlParse.find_all("h3")[0].get_text()
#bs4_courtlistener_filed = htmlParse.find_all("p")[0].get_text()
#bs4_courtlistener_precedential_status = htmlParse.find_all("p")[1].get_text()
#bs4_courtlistener_docket_number = htmlParse.find_all("p")[2].get_text()

#bs4_courtlistener_pdf_opinion_url_storage = "/html/body/div[1]/div[1]/article/div[2]/ul/li[1]/a"
#bs4_courtlistener_pdf_opinion_url_gov = "/html/body/div[1]/div[1]/article/div[2]/ul/li[3]/a"

#h1_text = htmlParse.find_all("h1")[0].get_text()
#p_text = htmlParse.find_all("p")[0].get_text()

try:
    with connection.cursor() as cursor:
        sql = "INSERT INTO `4706734_Dropped_Columns_5` (`courtlistener_case_name`, `courtlistener_jurisdiction`, `courtlistener_filed`, `courtlistener_precedential_status`, `courtlistener_docket_number`) VALUES (%s, %s, %s, %s, %s)"
        cursor.execute(sql, (bs4_courtlistener_case_name, bs4_courtlistener_jurisdiction, bs4_courtlistener_filed, bs4_courtlistener_precedential_status, bs4_courtlistener_docket_number))

    connection.commit()
finally:
    connection.close()
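
# Sketch for the multi-URL pass the commented-out Source 2 block near the top
# was heading toward: read the URL list from the text file already named there,
# scrape each page, and batch the INSERTs. scrape_one() is a hypothetical
# helper that would wrap the parse/clean steps from this script.
# with open("json.courtlistener.exodus.opinion.for.current.dataset.urls.txt") as fh:
#     urls = [u.strip() for u in fh if u.strip()]
# rows = [scrape_one(u) for u in urls]
# with connection.cursor() as cursor:
#     cursor.executemany(sql, rows)
# connection.commit()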

# Alert Brandon | Payload Delivered Successfully!
print("Dragon Breath Augmentation: [A01] Executed && Payload Delivered Successfully!")
input("Dragon Breath Augmentation: [A01] is Finished. Press ENTER to Exit Python3")
Thank you everyone for this forum!

Best Regards,

Brandon Kastning