Python Forum

How to create Conditionals for XPATH/BS4 tag scrapes?

I am working on a program that parses JSON files, requests a remote URL, scrapes the page with an lxml XPath query, stores the results in Python variables, and then inserts those values into MariaDB using PyMySQL. The pages in my dataset yield either 5 columns (no PDF links on the page) or 7 columns (two PDF links on the page). How do I make this conditional, so that when the XPath fails or finds nothing, a different block of code is used instead?

Here are the code blocks I am working with right now:

Any pointers would be greatly appreciated. Thank you everyone for this Python Forum! :)

# How to make Conditional? If XPATH exists -> use block of code for payload here...
# vs. If XPATH doesn't exist -> use block of code for payload here...
# ... using same db & table

# XPATH Scraping pdf URL (not on all urls; hit / miss)

https://www.courtlistener.com/opinion/4631414/mcdonough-v-smith/ (has it)
https://www.courtlistener.com/opinion/141474/urban-v-hurley/ (does not)
# Assign Table4 Column 6 / 12 - MariaDB Column Name: courtlistener_pdf_opinion_url_storage

print("Dragon Breath [F.03] - [Table 4/6] - Table4_RemoteServer - [CL Jurisdiction Dataset] - Now Assigning Table4 Column Python Variable 6 out of 12...")
pvar_dom_xpath_courtlistener_pdf_opinion_url_storage = dom.xpath('/html/body/div[1]/div[1]/article/div[2]/ul/li[1]/a/@href')
# Print the stored result instead of evaluating the same XPath a second time
print(pvar_dom_xpath_courtlistener_pdf_opinion_url_storage)
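One way to express the "exists / doesn't exist" check: for a query like the one above, lxml's `xpath()` returns a list, and that list is simply empty when nothing matches, so no exception is raised at this step. A minimal sketch, assuming a hypothetical helper named `first_or_none` (not part of lxml) and using plain lists to stand in for the two `dom.xpath(...)` results; the href value is made up for illustration:

```python
def first_or_none(matches):
    """Return the first XPath match as a stripped string, or None if there is none.

    `matches` is whatever dom.xpath(...) returned: a list that is empty
    when the page has no such node.
    """
    if not matches:          # empty list -> the XPath found nothing
        return None
    return str(matches[0]).strip()


# Stand-ins for dom.xpath('.../li[1]/a/@href') on the two example pages
# (hypothetical sample data, not scraped values).
page_with_pdf = ["  /pdf/example-opinion.pdf "]
page_without_pdf = []

pdf_url = first_or_none(page_with_pdf)
missing = first_or_none(page_without_pdf)

if pdf_url is not None:
    print("PDF link found:", pdf_url)
if missing is None:
    print("No PDF link on this page")
```

The result can then drive an ordinary `if`/`else` that picks which payload to send.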
# MariaDB Python Variable Payload

print("Dragon Breath [F.03] - [Table 4/6] - Table4_RemoteServer - [CL Jurisdiction Dataset] - Now Injecting XPATH Python Variable Payload to MariaDB...")

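import pymysql
import pymysql.cursors
# `password` and `database` are the current keyword names;
# `passwd` and `db` are deprecated aliases in recent PyMySQL releases.
connection = pymysql.connect(host='localhost',
  user="brandon",
  password="password",
  database="EXODUS_CL_DragonBreath_F03_ICEDRAGON3"
)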
print("PyMySQL Connected Successfully!")
# MariaDB Python Variable Payload Part 2 [5 Columns] (Xpath for pdf storage & pdf gov doesn't exist):

##### JURISDICTION: U.S. FEDERAL SUPREME COURT OF THE UNITED STATES #####
(5 Columns) w/o pdf storage & pdf gov

with connection:
    with connection.cursor() as cursor:
        sql = "INSERT INTO `Current_JSON_Courtlistener_Dataset_Exodus_Table4_RemoteServer` (`courtlistener_case_name`, `courtlistener_jurisdiction`, `courtlistener_filed`, `courtlistener_precedential_status`, `courtlistener_docket_number`) VALUES (%s, %s, %s, %s, %s)"
        cursor.execute(sql, (bs4_xpath_courtlistener_case_name, bs4_xpath_courtlistener_jurisdiction, bs4_xpath_courtlistener_filed, bs4_xpath_courtlistener_precedential_status, bs4_xpath_courtlistener_docket_number))
        connection.commit()


##### JURISDICTION: U.S. FEDERAL SUPREME COURT OF THE UNITED STATES #####
(7 Columns) w/ pdf storage & pdf gov

with connection:
    with connection.cursor() as cursor:
        sql = "INSERT INTO `Current_JSON_Courtlistener_Dataset_Exodus_Table4_RemoteServer` (`courtlistener_case_name`, `courtlistener_jurisdiction`, `courtlistener_filed`, `courtlistener_precedential_status`, `courtlistener_docket_number`, `courtlistener_pdf_opinion_url_storage`, `courtlistener_pdf_opinion_url_gov`) VALUES (%s, %s, %s, %s, %s, %s, %s)"
        cursor.execute(sql, (bs4_xpath_courtlistener_case_name, bs4_xpath_courtlistener_jurisdiction, bs4_xpath_courtlistener_filed, bs4_xpath_courtlistener_precedential_status, bs4_xpath_courtlistener_docket_number, bs4_xpath_courtlistener_pdf_opinion_url_storage, bs4_xpath_courtlistener_pdf_opinion_url_gov))
        connection.commit()
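Rather than keeping two separate INSERT blocks, one possibility is to build the statement from whichever values were actually scraped. The sketch below is only an illustration: `build_insert` is a hypothetical helper, and the sample values are made up. It skips any column whose value is `None`, so the same call yields either the 5-column or the 7-column form of the statement.

```python
def build_insert(table, row):
    """Build a parameterized INSERT from the non-None entries of `row`.

    Column names are taken from the dict keys, so they must come from
    your own code, never from scraped or user-supplied text.
    """
    present = {col: val for col, val in row.items() if val is not None}
    columns = ", ".join(f"`{col}`" for col in present)
    placeholders = ", ".join(["%s"] * len(present))
    sql = f"INSERT INTO `{table}` ({columns}) VALUES ({placeholders})"
    return sql, tuple(present.values())


# Hypothetical sample row for a page without PDF links.
row = {
    "courtlistener_case_name": "Urban v. Hurley",
    "courtlistener_docket_number": "(sample)",
    "courtlistener_pdf_opinion_url_storage": None,   # not on this page
    "courtlistener_pdf_opinion_url_gov": None,       # not on this page
}
sql, params = build_insert(
    "Current_JSON_Courtlistener_Dataset_Exodus_Table4_RemoteServer", row
)
print(sql)
print(params)
# With an open PyMySQL cursor: cursor.execute(sql, params)
```

An alternative is to keep only the 7-column statement and pass `None` for the missing links; PyMySQL sends `None` as SQL `NULL`, provided the two PDF columns allow NULL.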
Thank you in advance for any pointers!

Best Regards,

Brandon Kastning
Hey everyone! I was able to resolve this with help from dba in #python on irc.libera.chat!

Working code solution was the following:

# Assign Table4 Column 5 / 12 - MariaDB Column Name: courtlistener_docket_number
print("Dragon Breath [F.03] - [Table 4/7] - Table4_RemoteServer - [CL Jurisdiction Dataset] - Now Assigning Table4 Column Python Variable 5 out of 12...")
pvar_dom_xpath_courtlistener_docket_number = dom.xpath('/html/body/div[1]/div[1]/article/p[4]/span[2]')[0].text.strip()
if len(pvar_dom_xpath_courtlistener_docket_number) !=0:
    print(pvar_dom_xpath_courtlistener_docket_number)
else:
    pvar_dom_xpath_courtlistener_docket_number = "(NULL)"
print(dom.xpath('/html/body/div[1]/div[1]/article/p[4]/span[2]')[0].text.strip())
AND

# Assign Table4 Column 6 / 12 - MariaDB Column Name: courtlistener_pdf_opinion_url_storage
print("Dragon Breath [F.03] - [Table 4/7] - Table4_RemoteServer - [CL Jurisdiction Dataset] - Now Assigning Table4 Column Python Variable 6 out of 12...")
pvar_dom_xpath_courtlistener_pdf_opinion_url_storage = dom.xpath('/html/body/div[1]/div[1]/article/div[2]/ul/li[1]/a/@href')
if len(pvar_dom_xpath_courtlistener_pdf_opinion_url_storage) !=0:
    print(pvar_dom_xpath_courtlistener_pdf_opinion_url_storage)
else:
    pvar_dom_xpath_courtlistener_pdf_opinion_url_storage = "(NULL)"
print(dom.xpath('/html/body/div[1]/div[1]/article/div[2]/ul/li[1]/a/@href'))
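Indexing an XPath result with `[0]` raises `IndexError` when nothing matched, so if the "on error, use this code instead" behaviour from the original question is preferred over a length check, a `try`/`except` is one option. A rough sketch, assuming a hypothetical helper `first_text`; the `SimpleNamespace` objects only stand in for lxml elements (just their `.text` attribute is used), and the docket value is a made-up sample:

```python
from types import SimpleNamespace


def first_text(nodes, default=None):
    """Return the stripped text of the first matched element, or `default`."""
    try:
        return nodes[0].text.strip()
    except (IndexError, AttributeError):
        # IndexError: the XPath matched nothing.
        # AttributeError: the element exists but its .text is None.
        return default


# Stand-ins for lxml elements; only .text is used here.
found = [SimpleNamespace(text="  00-000  ")]
nothing = []
empty_tag = [SimpleNamespace(text=None)]

print(first_text(found))
print(first_text(nothing))
print(first_text(empty_tag, default=""))
```

The `default` argument can be whatever placeholder the table expects when a value is missing.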
Thank you everyone for this Forum!

(Apr-12-2022, 06:02 AM) BrandonKastning wrote: How to create Conditionals for XPATH/BS4 tag scrapes?
