Mar-08-2022, 12:28 PM
No syntax errors. Script runs!
I am having problems with BS4 & XPath. Results per column:
Column 1 (courtlistener_case_name), BS4 tag lookup: works perfectly.
Column 2 (courtlistener_jurisdiction), XPath: works perfectly.
Column 3 (courtlistener_filed): the formatting is crazy; lots of stray spaces (how do I filter these so only the date remains?).
Column 4 (courtlistener_precedential_status): only shows the header, not the actual value.
Column 5 (courtlistener_docket_number): also has stray spaces; the docket number is not the only thing in the column.
Column 6 (courtlistener_pdf_opinion_url_storage): getting a null value instead of the PDF URL.
Column 7 (courtlistener_pdf_opinion_url_gov): getting a null value instead of the PDF URL.
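To show what I mean for columns 3 and 5, here is a minimal sketch of the whitespace cleanup I am after (the raw string is a made-up placeholder, not my actual scraped output):

```python
# Sketch: collapsing the stray whitespace around a scraped value.
# str.split() with no argument splits on any run of whitespace
# (spaces, newlines, tabs), so joining the pieces back with a single
# space normalises the formatting.
raw_filed = "\n      Filed:\n          April 12, 2018\n  "
clean = " ".join(raw_filed.split())
date_only = clean.replace("Filed:", "").strip()
print(clean)      # Filed: April 12, 2018
print(date_only)  # April 12, 2018
```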
If you comment out the URL line and un-comment another, you can switch jurisdictions; the XPaths I am scraping seem to interchange properly between the different jurisdictions, so the problem must be in the XPath scraping of the template system.
Any pointers / advice would be helpful!
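One idea I have been toying with (untested; the markup below is only my guess at the template structure): anchor the XPath on the label text instead of positional indexes, so the same expression survives jurisdictions whose templates order the <p> tags differently:

```python
from lxml import etree

# Hypothetical fragment approximating the opinion page's metadata markup.
html = ('<article><p><span>Filed:</span> <span>April 12, 2018</span></p>'
        '<p><span>Docket Number:</span> <span>84614-6</span></p></article>')
dom = etree.HTML(html)

# Find the <span> whose text is the label, then take its next <span> sibling;
# no positional p[1]/p[2] indexes involved.
value = dom.xpath('//span[normalize-space()="Filed:"]/following-sibling::span[1]')[0]
print(value.text)  # April 12, 2018
```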
Dragon Breath Augmentation [A01].py:
# Source 1: https://itsmycode.com/python-urllib-error-httperror-http-error-403-forbidden/ | User-Agent Spoofing
# Source 2: https://www.quora.com/How-do-I-extract-text-from-multiple-web-site-URLs-at-once | BS4 Multiple URLs Example
# Target URL for Scraping: https://www.courtlistener.com/opinion/4902955/state-v-schierman/
# Beginning URL: https://www.courtlistener.com
# End URL column: courtlistener_absolute_url
# courtlistener_absolute_url = /opinion/4902955/state-v-schierman/
# Thanks to irc.libera.chat #python - nedbat (Debugging) :)
# And to Repiphany and Falc who also offered to assist with full tracebacks!
# And jinsun__ who keyed me into upgrading to 3.10/3.11 for major syntax error updates + other good things
# 03/08/2022 @ Approx. 02:30
#
# Database Name: EXODUS_CL_FULL_ENOUGH_WASH_SUPREME
# MariaDB Table Name: 4706734_Dropped_Columns_5
# Database Collation: utf8mb4_unicode_ci
# Database Schema Details:
#   id / INT / 11 / AUTO_INCREMENT / PRIMARY_KEY
#   courtlistener_case_name / TEXT / NULL
#   courtlistener_jurisdiction / TEXT / NULL
#   courtlistener_filed / TEXT / NULL
#   courtlistener_precedential_status / TEXT / NULL
#   courtlistener_docket_number / TEXT / NULL
#   exodus_courtlistener_dropped_columns_entry_timestamp / DATETIME / CURRENT_TIMESTAMP
#
# 5 Mapped HTML Tags | MariaDB Column Name -> HTML Tag:
#   courtlistener_case_name = htmlParse.find_all("h2")[0].get_text()
#   courtlistener_jurisdiction = htmlParse.find_all("h3")[0].get_text()
#   courtlistener_filed = htmlParse.find_all("p")[0].get_text()
#   courtlistener_precedential_status = htmlParse.find_all("p")[1].get_text()
#   courtlistener_docket_number = htmlParse.find_all("p")[2].get_text()
# 2 Additional HTML Tags | MariaDB Column Name -> HTML Tag (tag names still TBD):
#   courtlistener_pdf_opinion_url_storage = htmlParse.find_all("")[0].get_text()
#   courtlistener_pdf_opinion_url_gov = htmlParse.find_all("")[0].get_text()
# Example - courtlistener_pdf_opinion_url_storage = https://storage.courtlistener.com/pdf/2018/04/12/state_v._schierman_1.pdf
# Example - courtlistener_pdf_opinion_url_gov = http://www.courts.wa.gov/opinions/pdf/846146.pdf
# Content Needed for New Columns on DragonBreath [F01]

input("Dragon Breath Augmentation: Dropped Columns [A01] is activated. Now firing up all Systems... | Press ENTER to Continue...")
print("Now importing Python Module Libraries Required for Dragon Breath Augmentation: Dropped Columns [A01] to run successfully...")

import urllib.request
import pymysql
import pymysql.cursors
from bs4 import BeautifulSoup
from lxml import etree
import requests
import re

# Added from Source 2 - Read URLs from a Text File
#list_open = open("json.courtlistener.exodus.opinion.for.current.dataset.urls.txt")
#read_list = list_open.read()
#line_in_list = read_list.split("\n")

# Set up an iteration over the URL list with a spoofed User-Agent
#for url in line_in_list:
#    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
#    html = urlopen(req).read()
#    soup = BeautifulSoup(html, 'html.parser')

# User-Agent Spoofing - urllib Request
from urllib.request import Request, urlopen

req = Request('https://www.courtlistener.com/opinion/4902955/state-v-schierman/',
              headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req, timeout=10).read()

# Hand the fetched page to Beautiful Soup 4's HTML parser
htmlParse = BeautifulSoup(html, 'html.parser')

# Column #1 [courtlistener_case_name] - plain BS4 tag lookup
pvar_bs4_tag_courtlistener_case_name = htmlParse.find_all("h2")[0].get_text()
print(pvar_bs4_tag_courtlistener_case_name)
#input("Column 1 of 7 ... courtlistener_case_name ... Parsed Correctly? Press ENTER to Continue...")

# Column #2 [courtlistener_jurisdiction] - XPath: /html/body/div[1]/div[1]/article/h3
# WASH APPELLATE COURT (URL):
URL = "https://www.courtlistener.com/opinion/4340668/state-of-washington-v-jose-jesus-mancilla/"
# WASH SUPREME COURT (URL):
#URL = "https://www.courtlistener.com/opinion/4902955/state-v-schierman/"
# 9TH CIRCUIT COURT (URL):
#URL = "https://www.courtlistener.com/opinion/525248/united-states-v-irma-nuno-para-united-states-of-america-v-jesus/"
# US SCOTUS (URL):
#URL = "https://www.courtlistener.com/opinion/92810/late-corp-of-church-of-jesus-christ-of-latter-day-saints-v-united-states/"

# First Instance of XPath Request / Parse Requirements & Variables (Thanks to chevignon93 on reddit.com!)
HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
           'Accept-Language': 'en-US, en;q=0.5'}
webpage = requests.get(URL, headers=HEADERS)
xsoup = BeautifulSoup(webpage.content, "html5lib")
dom = etree.HTML(str(xsoup))

pvar_dom_xpath_courtlistener_jurisdiction = dom.xpath('/html/body/div[1]/div[1]/article/h3')[0].text
print(pvar_dom_xpath_courtlistener_jurisdiction)
#input("Column 2 of 7 ... courtlistener_jurisdiction ... Parsed Correctly? Press ENTER to Continue...")

# Column #3 [courtlistener_filed] - XPath: /html/body/div[1]/div[1]/article/p[1]/span[2]
pvar_dom_xpath_courtlistener_filed = dom.xpath('/html/body/div[1]/div[1]/article/p[1]/span[2]')[0].text
print(pvar_dom_xpath_courtlistener_filed)
#input("Column 3 of 7 ... courtlistener_filed ... Parsed Correctly? Press ENTER to Continue...")

# Column #4 [courtlistener_precedential_status] - XPath: /html/body/div[1]/div[1]/article/p[2]/span[2]
pvar_dom_xpath_courtlistener_precedential_status = dom.xpath('/html/body/div[1]/div[1]/article/p[2]/span[2]')[0].text
print(pvar_dom_xpath_courtlistener_precedential_status)
#input("Column 4 of 7 ... courtlistener_precedential_status ... Parsed Correctly? Press ENTER to Continue...")

# Column #5 [courtlistener_docket_number] - XPath: /html/body/div[1]/div[1]/article/p[4]/span[2]
pvar_dom_xpath_courtlistener_docket_number = dom.xpath('/html/body/div[1]/div[1]/article/p[4]/span[2]')[0].text
print(pvar_dom_xpath_courtlistener_docket_number)
#input("Column 5 of 7 ... courtlistener_docket_number ... Parsed Correctly? Press ENTER to Continue...")

# Column #6 [courtlistener_pdf_opinion_url_storage] - Download Opinion PDF button [STORAGE]
# XPath: /html/body/div[1]/div[1]/article/div[2]/ul/li[1]/a
pvar_dom_xpath_courtlistener_pdf_opinion_url_storage = dom.xpath('/html/body/div[1]/div[1]/article/div[2]/ul/li[1]/a')[0].text
print(pvar_dom_xpath_courtlistener_pdf_opinion_url_storage)
#input("Column 6 of 7 ... courtlistener_pdf_opinion_url_storage ... Parsed Correctly? Press ENTER to Continue...")

# Column #7 [courtlistener_pdf_opinion_url_gov] - Download Opinion PDF button [GOV]
# XPath: /html/body/div[1]/div[1]/article/div[2]/ul/li[3]/a
pvar_dom_xpath_courtlistener_pdf_opinion_url_gov = dom.xpath('/html/body/div[1]/div[1]/article/div[2]/ul/li[3]/a')[0].text
print(pvar_dom_xpath_courtlistener_pdf_opinion_url_gov)
#input("Column 7 of 7 ... courtlistener_pdf_opinion_url_gov ... Parsed Correctly? Press ENTER to Continue...")

# BREAK | CHECK OUTPUT BEFORE FINALIZING MARIADB COMMITS:
input("Dragon Breath Augmentation: [A01] - Part A: Completed | Press ENTER to Continue...")

# Connection to MariaDB 10.5.x with a Database selected, using PyMySQL
connection = pymysql.connect(host='localhost',
                             user='username',
                             password='password',
                             db='EXODUS_CL_FULL_ENOUGH_WASH_SUPREME',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)

# Gather the parsed values for MariaDB storage: BS4's find_all("...")[0].get_text()
# strips the tags and leaves only the text; the XPath lookups return the .text
# of the first matching element.
bs4_courtlistener_case_name = htmlParse.find_all("h2")[0].get_text()
bs4_courtlistener_jurisdiction = dom.xpath('/html/body/div[1]/div[1]/article/h3')[0].text
bs4_courtlistener_filed = dom.xpath('/html/body/div[1]/div[1]/article/p[1]/span[2]')[0].text
bs4_courtlistener_precedential_status = dom.xpath('/html/body/div[1]/div[1]/article/p[2]/span[1]')[0].text
bs4_courtlistener_docket_number = dom.xpath('/html/body/div[1]/div[1]/article/p[4]/span[2]')[0].text
bs4_courtlistener_pdf_opinion_url_storage = dom.xpath('/html/body/div[1]/div[1]/article/div[2]/ul/li[1]/a')[0].text
bs4_courtlistener_pdf_opinion_url_gov = dom.xpath('/html/body/div[1]/div[1]/article/div[2]/ul/li[3]/a')[0].text
# Earlier BS4-only attempts:
#bs4_courtlistener_jurisdiction = htmlParse.find_all("h3")[0].get_text()
#bs4_courtlistener_filed = htmlParse.find_all("p")[0].get_text()
#bs4_courtlistener_precedential_status = htmlParse.find_all("p")[1].get_text()
#bs4_courtlistener_docket_number = htmlParse.find_all("p")[2].get_text()
#bs4_courtlistener_pdf_opinion_url_storage = "/html/body/div[1]/div[1]/article/div[2]/ul/li[1]/a"
#bs4_courtlistener_pdf_opinion_url_gov = "/html/body/div[1]/div[1]/article/div[2]/ul/li[3]/a"

try:
    with connection.cursor() as cursor:
        sql = ("INSERT INTO `4706734_Dropped_Columns_5` "
               "(`courtlistener_case_name`, `courtlistener_jurisdiction`, `courtlistener_filed`, "
               "`courtlistener_precedential_status`, `courtlistener_docket_number`) "
               "VALUES (%s, %s, %s, %s, %s)")
        cursor.execute(sql, (bs4_courtlistener_case_name,
                             bs4_courtlistener_jurisdiction,
                             bs4_courtlistener_filed,
                             bs4_courtlistener_precedential_status,
                             bs4_courtlistener_docket_number))
    connection.commit()
finally:
    connection.close()

# Alert Brandon | Payload Delivered Successfully!
print("Dragon Breath Augmentation: [A01] Executed && Payload Delivered Successfully!")
input("Dragon Breath Augmentation: [A01] is Finished. Press ENTER to Exit Python3")

Thank you everyone for this forum!
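Regarding the null PDF URLs in columns 6 and 7, my current suspicion (the sketch below uses made-up markup approximating the download button): lxml's .text only returns the text before an element's first child, so it comes back None when the <a> starts with an icon tag, and the URL itself lives in the href attribute anyway, not in the link text:

```python
from lxml import etree

# Assumed structure of the "Download PDF" button: the <a> begins with an
# icon element, so element.text (text before the first child) is None.
html = ('<ul><li><a href="https://www.courts.wa.gov/opinions/pdf/846146.pdf">'
        '<i class="fa fa-download"></i> Download PDF</a></li></ul>')
dom = etree.HTML(html)
link = dom.xpath('//ul/li[1]/a')[0]
print(link.text)         # None
print(link.get('href'))  # https://www.courts.wa.gov/opinions/pdf/846146.pdf
```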
Best Regards,
Brandon Kastning