Nov-20-2021, 05:46 PM
(This post was last modified: Nov-20-2021, 05:46 PM by BrandonKastning. Edit Reason: add a new picture)
Python Obstacles | Karate | HTML/Scrape Specific Tag and Store it in MariaDB
I figure the best way to learn is to make a thread for every obstacle I run into. These are possibly basics for others, but difficult for someone like me.
I will be using Beautiful Soup 4 to tackle this. I will scrape basic HTML first and then, once successful, move up to a dataset of SGML files as the difficulty increases.
There are two types of data inserts to MariaDB that I want to learn (a rough sketch of both follows this list):
a) A single-column insert (one target HTML tag to read) plus a carry-over insert into a specific table and column.
b) Row-by-row inserts into a new, unique table (one table representing the file), with each row representing one line of the scraped file or one desired value.
& Both a and b with full tag extraction and carry-over INSERTs (i.e. the actual tags with their contents inside), for database fetching and rebuilding of HTML pages.
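A rough sketch of what I think patterns (a) and (b) will look like with pymysql once I get there. This is not working code yet; the table and column names below are placeholders I made up:

# Rough sketch of the two insert patterns (placeholder table/column names), using pymysql.
import pymysql

connection = pymysql.connect(host='localhost', user='brandon', password='password',
                             db='Battle_Python1', charset='utf8mb4')
try:
    with connection.cursor() as cursor:
        # (a) single-column insert: one scraped value into one specific table/column
        cursor.execute("INSERT INTO Some_Table (some_column) VALUES (%s)",
                       ("one scraped HTML value",))

        # (b) row-by-row inserts into a new table representing one file,
        #     one row per scraped line/value
        scraped_lines = ["line 1 of the file", "line 2 of the file"]
        cursor.executemany("INSERT INTO Some_File_Table (line_text) VALUES (%s)",
                           [(line,) for line in scraped_lines])
    connection.commit()
finally:
    connection.close()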
I am a disabled American constitutional law student... most of my threads will be related to law, court opinions, court rules, etc. (.gov / American government URLs) or third-party legal resources.
Thank you all for having me!
Best Regards,
Brandon
Progress so far:
01_Karate:
Source Blogs/Tutorials:
https://www.geeksforgeeks.org/beautifuls...from-html/
Goal: Scrape specific HTML
Target URL: https://law.justia.com/constitution/us/preamble.html
Target Paragraph: "Preamble"
Target Datastore: MariaDB
As a regular user (Linux/BSD; in my case: brandon):
# pip install bs4
# pip install urllib
(However, the second command doesn't work for me on Debian 9.13 Stretch with Python 3.9.9; urllib is part of the Python 3 standard library, so it does not need to be installed with pip.)
And then the code so far:
# importing modules
import urllib.request
from bs4 import BeautifulSoup

# providing url
#url = "https://www.geeksforgeeks.org/how-to-automate-an-excel-sheet-in-python/?ref=feed"
url = "https://law.justia.com/constitution/us/preamble.html"

# opening the url for reading
html = urllib.request.urlopen(url)

# parsing the html file
htmlParse = BeautifulSoup(html, 'html.parser')

# getting all the paragraphs
for para in htmlParse.find_all("p"):
    print(para.get_text())

So far the results are:
This is pulling all the paragraph tags using htmlParse.find_all("p"). I am guessing there is an htmlParse.find("p") as well, hopefully with the ability to select the 1st paragraph, 2nd paragraph, etc., using [1], [2], [3]-style element selectors.
And no pass-through of data to a datastore yet!
Update with Partial Success:
Source Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Updated Karate.py:
# importing modules
import urllib.request
from bs4 import BeautifulSoup

# providing url
#url = "https://www.geeksforgeeks.org/how-to-automate-an-excel-sheet-in-python/?ref=feed"
url = "https://law.justia.com/constitution/us/preamble.html"

# opening the url for reading
html = urllib.request.urlopen(url)

# parsing the html file
htmlParse = BeautifulSoup(html, 'html.parser')

# getting all the paragraphs
#for para in htmlParse.find_all("p"):
#    print(para.get_text())
for para in htmlParse.p:
    print(para)

Now it's extracting the 1st paragraph (the one I wanted): the Preamble of the U.S. Federal Constitution of September 17, 1787 (the same Constitution that is the Supreme Law of the Land that my countrymen and women have deviated from in error).
I would not know how to extract paragraph #2 using this code. I just know that changing the code to htmlParse.p now picks up the 1st paragraph. I would like to know how to individually select paragraph #2 so I better understand what I am doing.
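From the Beautiful Soup docs, I believe the way to pick individual paragraphs is to index the list that find_all() returns (and soup.p is shorthand for soup.find("p"), the first match). A small sketch of what that should look like on the same page as above:

# Sketch: selecting individual paragraphs by index on the Preamble page.
import urllib.request
from bs4 import BeautifulSoup

url = "https://law.justia.com/constitution/us/preamble.html"
html = urllib.request.urlopen(url)
htmlParse = BeautifulSoup(html, 'html.parser')

paragraphs = htmlParse.find_all("p")   # all <p> tags, in document order

first = htmlParse.find("p")            # same tag as htmlParse.p and paragraphs[0]
print(first.get_text())

if len(paragraphs) > 1:                # paragraph #2, if the page has one
    print(paragraphs[1].get_text())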
We are now here:
brandon@FireDragon:~/Python/01_Karate$ python3 karate.py
We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America.
brandon@FireDragon:~/Python/01_Karate$
Update/Progress:
Sources (Tutorials/Blogs/etc):
https://stackoverflow.com/questions/4954...into-mysql
Install pymysql:
brandon@FireDragon:~/Python/01_Karate$ pip install pymysql
Defaulting to user installation because normal site-packages is not writeable
Collecting pymysql
  Downloading PyMySQL-1.0.2-py3-none-any.whl (43 kB)
     |████████████████████████████████| 43 kB 251 kB/s
Installing collected packages: pymysql
Successfully installed pymysql-1.0.2
WARNING: You are using pip version 21.2.4; however, version 21.3.1 is available.
You should consider upgrading via the '/usr/local/bin/python3.9 -m pip install --upgrade pip' command.
brandon@FireDragon:~/Python/01_Karate$

Create a MariaDB 10.4.22 (my version, anyhow) database called Battle_Python1. (I couldn't get the following code to work; if you know why, please let me know.) I am currently using HeidiSQL Portable 9.5.0.5196 in Wine32 on Debian 9.13 Stretch, which is a very nice, free, lightweight GUI for managing MySQL/MariaDB; it doesn't crash, and I have put it through heavy testing so far in my pursuit of big data creation and management skills.
CREATE TABLE `Karate` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `bs4_paragraph_1` text COLLATE utf8mb4_unicode_ci NULL,
  `bs4_python_timestamp` timestamp COLLATE utf8_unicode_ci CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8_unicode_ci AUTO_INCREMENT=1;

Screenshots of my database as built in HeidiSQL are below (I took a standard CREATE TABLE example and changed the values to match the ones I have set up). I should eventually hone my manual SQL query skills. I ran the query on Karate2 and it didn't do anything; I am not sure why it failed.
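My best guess (not verified) at why the statement fails: a TIMESTAMP column cannot take a COLLATE clause and needs DEFAULT in front of CURRENT_TIMESTAMP, and the table-level collation utf8_unicode_ci belongs to the utf8 character set rather than utf8mb4. A corrected sketch would look something like this:

CREATE TABLE `Karate` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `bs4_paragraph_1` text COLLATE utf8mb4_unicode_ci NULL,
  `bs4_python_timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci AUTO_INCREMENT=1;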
#1:
#2:
#3 (I added a new column and changed the name of another; updated after posting the above two links):
Now to update our Python script, Karate.py:
# importing modules
import urllib.request
import pymysql
from bs4 import BeautifulSoup

# providing url
#url = "https://www.geeksforgeeks.org/how-to-automate-an-excel-sheet-in-python/?ref=feed"
url = "https://law.justia.com/constitution/us/preamble.html"

# opening the url for reading
html = urllib.request.urlopen(url)

# parsing the html file
htmlParse = BeautifulSoup(html, 'html.parser')

# getting all the paragraphs
#for para in htmlParse.find_all("p"):
#    print(para.get_text())
for para in htmlParse.p:
    print(para)

# Connection to database
connection = pymysql.connect(host='localhost',
                             user='brandon',
                             password='password',
                             db='Battle_Python1',
                             # charset='latin1',
                             # I use utf8mb4_unicode_ci when creating a new MariaDB database;
                             # I have read that it allows special characters and full-text search combined
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)

# Checking Code / Error Free
print ("The code is Error free to this line!")

Continuing on...
Let's run the update of Karate.py:
brandon@FireDragon:~/Python/01_Karate$ python3 karate.py
We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America.
The code is Error free to this line!
brandon@FireDragon:~/Python/01_Karate$
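Next step (not there yet): actually passing the scraped paragraph into MariaDB. A minimal sketch of what I think the INSERT will look like, assuming the (corrected) Karate table above exists in Battle_Python1:

# Sketch: the missing step, inserting the scraped Preamble into the Karate table.
# Assumes the Karate table exists in Battle_Python1.
import urllib.request
import pymysql
from bs4 import BeautifulSoup

url = "https://law.justia.com/constitution/us/preamble.html"
htmlParse = BeautifulSoup(urllib.request.urlopen(url), 'html.parser')
paragraph_text = htmlParse.p.get_text(strip=True)   # first <p>: the Preamble

connection = pymysql.connect(host='localhost', user='brandon', password='password',
                             db='Battle_Python1', charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
try:
    with connection.cursor() as cursor:
        # parameterized query, so quotes in the scraped text are handled safely
        cursor.execute(
            "INSERT INTO Karate (bs4_paragraph_1) VALUES (%s)",
            (paragraph_text,)
        )
        print("Inserted row id:", cursor.lastrowid)
    connection.commit()   # pymysql does not autocommit by default
finally:
    connection.close()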
“And one of the elders saith unto me, Weep not: behold, the Lion of the tribe of Juda, the Root of David, hath prevailed to open the book,...” - Revelation 5:5 (KJV)
“And oppress not the widow, nor the fatherless, the stranger, nor the poor; and ...” - Zechariah 7:10 (KJV)
#LetHISPeopleGo