Python Forum

Full Version: Python Obstacles | Kung-Fu | Full File HTML Document Scrape and Store it in MariaDB
Python Obstacles | Kung-Fu | Full File HTML Document Scrape and Store it in MariaDB Two Different Ways

A.) Entire HTML Documents w/ Tags -> MariaDB -> as 1 Column Insert

1 table can store multiple entries of single column inserts as a collection

B.) Entire HTML Documents w/ Tags -> MariaDB -> Row by Row Inserts

table = file_name

the table populates row by row by line by line of the read and parsed HTML Document / File


I am not sure where to begin here; just like last time, I will start the thread with the objectives and then use the internet as my source of learning (and anyone who would like to assist me and others, feel free to bring the python!)

Thank you everyone for making this Journey possible and enjoyable!

Best Regards,

Brandon Kastning
Disabled American Constitutional Pre-Law Student
Sources (Tutorials, Blogs, etc)

https://cmdlinetips.com/2018/01/how-to-r...in-python/

https://www.guru99.com/accessing-interne...ython.html

Database 1/2: [Kung-Fu_A]

CREATE TABLE `KungFuA` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `full_html_document_with_tags` text COLLATE utf8mb4_unicode_ci NULL,
  `python_entry_timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
  AUTO_INCREMENT=1;
Python Script 1/2: [Kung-Fu_A]

# Finalized on Python-Forum.io
# Disabled American Constitutional Pre-Law Student: BrandonKastning
# Date: 11/22/2021
# Script: Kung-Fu_A_FullDocument_1_Column_Insert.py
# Purpose: Building Block for Python + MariaDB 10.4.x
# Thread URL with Sources of Learning (Cited on Board)
# https://python-forum.io/thread-35607.html

# Import Python 3.9.9 compatible OpenSource Libraries
import urllib.request 
import pymysql
import pymysql.cursors
  
# Website HTTP URL to Grab Data From
url = "https://law.justia.com/constitution/us/preamble.html" 

# Assign a Python Variable to urllib.request URL to work with
html = urllib.request.urlopen(url)
  

# Connect to MariaDB 10.4.x with a Database selected using pymysql
connection = pymysql.connect(host='localhost',
                 user='brandon',
                 password='password',
                 db='Battle_Python1',
                 charset='utf8mb4',
                 cursorclass=pymysql.cursors.DictCursor)

# Copy Remote HTML Document into Variable

all_of_it = html.read()

print ("Remote HTML Document stored into Memory and Preparing to pass to MariaDB for INSERT")

# Assign a Python Variable to All Lines of File Read

full_html_document_with_tags = all_of_it

# INSERT Python Variable "all_of_it" into MariaDB

try:
    with connection.cursor() as cursor:
        sql = "INSERT INTO `KungFuA` (`full_html_document_with_tags`) VALUES (%s)"
        # The trailing comma makes the parameters a one-item tuple
        cursor.execute(sql, (full_html_document_with_tags,))
    connection.commit()
finally:
    connection.close()

# Checking Code / Error Free
print ("The code is Error free to this line!")
Run Successfully 1/2 - KungFu_A:

 
brandon@FireDragon:~/Python/02_Kung-Fu$ python Kung-Fu_A_FullDocument_1_Column_Insert.py
Remote HTML Document stored into Memory and Preparing to pass to MariaDB for INSERT
The code is Error free to this line!
brandon@FireDragon:~/Python/02_Kung-Fu$
Screenshot Evidence of Successful Run 1/2 - KungFu-A:

[Image: 1-2021-11-22-01-51-19.png]

[Image: 2-2021-11-22-01-51-46.png]

[Image: 3-2021-11-22-01-52-03.png]

Now that Part A works, Part B is going to be far trickier! The script first has to create a table named after the target file/URL (or after a more efficient / functional naming convention, depending on the data project). It then has to read the target file/URL line by line and insert one row into that newly created MariaDB table for each line Python reads.
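The plan for Part B could be sketched roughly like this. It is only a sketch under assumptions: the helper names (table_name_from_url, create_table_sql, insert_lines) and the column names (line_number, line_text) are made up for illustration, and the cursor is assumed to be a pymysql cursor like the one in the Kung-Fu_A script.

```python
import re
from urllib.parse import urlparse


def table_name_from_url(url):
    """Build a safe table name from the last part of a URL (hypothetical helper)."""
    path = urlparse(url).path
    stem = path.rstrip("/").rsplit("/", 1)[-1] or "index"
    # Keep only ASCII letters, digits and underscores so the name is safe
    # to place between backticks in a SQL statement.
    name = re.sub(r"[^0-9A-Za-z_]", "_", stem)
    return name[:64]  # MariaDB table names are limited to 64 characters


def create_table_sql(table_name):
    """Return a CREATE TABLE statement with one row per line (assumed layout)."""
    return (
        f"CREATE TABLE IF NOT EXISTS `{table_name}` ("
        " `id` INT NOT NULL AUTO_INCREMENT,"
        " `line_number` INT NOT NULL,"
        " `line_text` TEXT NULL,"
        " PRIMARY KEY (`id`)"
        ") ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci"
    )


def insert_lines(cursor, table_name, lines):
    """Insert one row per line through a DB-API cursor; return the row count."""
    sql = f"INSERT INTO `{table_name}` (`line_number`, `line_text`) VALUES (%s, %s)"
    rows = list(enumerate(lines, 1))  # [(1, first_line), (2, second_line), ...]
    cursor.executemany(sql, rows)
    return len(rows)


# Intended use with the connection from Kung-Fu_A (not run here, needs MariaDB):
# name = table_name_from_url(url)
# with connection.cursor() as cursor:
#     cursor.execute(create_table_sql(name))
#     insert_lines(cursor, name, all_of_it.decode("utf-8").splitlines())
# connection.commit()
```

The table name is cleaned up before it goes into the SQL text because identifiers cannot be passed as %s parameters; only the row values are passed that way.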
Kung-Fu_B: Obstacles:

One of the ways I am trying to solve this objective & script is to:

#1) Find out how to count the lines of a stored Python variable

#2) Find out how to create a list from that variable for the script to use (splitting the variable into x separate values, one per line). [There is probably an easier way to solve this; however, this is where my brain is currently at.]

#3) After that, passing the data to MariaDB should be possible line by line, insert by insert.
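Steps #1 and #2 can be tried without a database at all. A minimal sketch, assuming the page is UTF-8 and using a short made-up byte string in place of the real html.read() result:

```python
# Stand-in for the bytes that html.read() returns in the Kung-Fu_A script.
all_of_it = (
    b"<html>\n"
    b"<head><title>Preamble</title></head>\n"
    b"<body>\n"
    b"<p>We the People</p>\n"
    b"</body>\n"
    b"</html>\n"
)

# Decode the bytes to text (UTF-8 is an assumption; check the page's charset).
text = all_of_it.decode("utf-8")

# splitlines() cuts at line endings such as \n and \r\n; the tags are untouched.
lines = text.splitlines()

# Step 1: the number of lines is simply the length of the list.
line_count = len(lines)
print(line_count)

# Step 2: the list can now be walked line by line, with a running number.
for number, line in enumerate(lines, 1):
    print(number, line)
```

Each item of lines is then one candidate row for step #3.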

I want to use my previous example from Kung-Fu_A (bringing the entire HTML remote document into local Python Memory and stored into Variable "all_of_it"). Then bring it down into lines and process each line as an individual insert!

Stumped so far! If anyone has a tip or a way to key me in, please do! Thank you everyone for this forum! I will continue scouring the net for solutions!

Best Regards,

Brandon Kastning
Disabled American Constitutional Pre-Law Student
(Nov-22-2021, 07:46 PM)BrandonKastning Wrote: [ -> ]I want to use my previous example from Kung-Fu_A (bringing the entire HTML remote document into local Python Memory and stored into Variable "all_of_it"). Then bring it down into lines and process each line as an individual insert!
The paragraph text on the website is not well formatted, so one way is to treat each paragraph as a line and go with that.

The other way is to try splitting the text of each paragraph into lines.
A quick look:
import requests
from bs4 import BeautifulSoup

url = "https://law.justia.com/constitution/us/preamble.html"
html = requests.get(url)
soup = BeautifulSoup(html.content, 'lxml')
all_p = soup.select('p')
>>> print(all_p[0].text)
We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America.
>>> 
>>> print(all_p[2].text)
Although the preamble is not a source of power for any department of the Federal Government,1 the Supreme Court has often referred to it as evidence of the origin, scope, and purpose of the Constitution.2 “Its true office,” wrote Joseph Story in his Commentaries, “is to expound the nature and extent and application of the powers actually conferred by the Constitution, and not substantively to create them. For example, the preamble declares one object to be, ‘provide for the common defense.’ No one can doubt that this does not enlarge the powers of Congress to pass any measures which they deem useful for the common defence. But suppose the terms of a given power admit of two constructions, the one more restrictive, the other more liberal, and each of them is consistent with the words, but is, and ought to be, governed by the intent of the power; if one could promote and the other defeat the common defence, ought not the former, upon the soundest principles of interpretation, to be adopted?”3
So for the first paragraph there is really no good way to split it up, other than maybe splitting it in 3 based on length.
The second one could be split at each period (.).
>>> par_2 = all_p[2].text.split('.')
>>> for index, line in enumerate(par_2, 1):
...     print(f"{line} <line{index}>\n")
Output:
Although the preamble is not a source of power for any department of the Federal Government,1 the Supreme Court has often referred to it as evidence of the origin, scope, and purpose of the Constitution <line1>

2 “Its true office,” wrote Joseph Story in his Commentaries, “is to expound the nature and extent and application of the powers actually conferred by the Constitution, and not substantively to create them <line2>

For example, the preamble declares one object to be, ‘provide for the common defense <line3>

’ No one can doubt that this does not enlarge the powers of Congress to pass any measures which they deem useful for the common defence <line4>

But suppose the terms of a given power admit of two constructions, the one more restrictive, the other more liberal, and each of them is consistent with the words, but is, and ought to be, governed by the intent of the power; if one could promote and the other defeat the common defence, ought not the former, upon the soundest principles of interpretation, to be adopted?”3 <line5>
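Note that split('.') throws the periods away. If they should stay attached to each piece, one possible variation is a regular expression instead; this is only a sketch, and split_keep_periods is a made-up helper name, not part of the code above:

```python
import re


def split_keep_periods(text):
    """Split text after each period, keeping the period on its piece."""
    # Each match is a run of non-period characters plus the period that
    # ends it, if there is one (the last piece may have none).
    pieces = re.findall(r"[^.]+\.?", text)
    # Trim surrounding whitespace and drop pieces that are empty after trimming.
    return [piece.strip() for piece in pieces if piece.strip()]


sample = "First sentence. Second sentence.3 Is this the last one?"
for index, line in enumerate(split_keep_periods(sample), 1):
    print(f"{line} <line{index}>")
```

Footnote numbers that sit directly after a period (like the "2" in the paragraph above) still end up at the start of the next piece, just as with split('.').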
snippsat,

I am doing my best to grasp this new code you shared. I appreciate this; do you know of a way to read each line "as is" from the read file somehow? With tags included? The splitting of paragraph text by specific tag (which is what you are demonstrating in this code, I think) is very useful.

If you know how to do a line-by-line read with the tags left in, please share (I don't know how Python interprets* that, if that's the correct phrase for what I am trying to convey).

I appreciate your contribution and encouragement to learn a complicated programming language (at least for me it is).

Best Regards,

~ Brandon
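On reading each line "as is" with the tags included: the object returned by urllib.request.urlopen(url) can be looped over directly, and each pass hands back one raw line of the document as bytes. A minimal sketch, assuming UTF-8 text; iter_lines_with_tags is a made-up helper name, and io.BytesIO is used here as an offline stand-in for the real response:

```python
import io


def iter_lines_with_tags(stream, encoding="utf-8"):
    """Yield each line of a binary stream as text, tags included."""
    for raw_line in stream:
        # Remove only the line ending; leading spaces and tags stay as they are.
        yield raw_line.decode(encoding).rstrip("\r\n")


# io.BytesIO stands in for the response of urllib.request.urlopen(url),
# which can be looped over line by line in the same way.
sample = io.BytesIO(b"<html>\n  <p>We the People</p>\n</html>\n")

for line in iter_lines_with_tags(sample):
    print(repr(line))
```

Nothing is parsed here, so every tag arrives exactly as it was written in the file.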

Coming back around to this obstacle: Entire HTML Documents w/ Tags -> MariaDB -> Row by Row Inserts

table = file_name

the table populates row by row by line by line of the read and parsed HTML Document / File!



Sources - Blogs/Tutorials:


https://stackoverflow.com/questions/4533...-text-file

Similar to what snippsat shared with me: splitting text. This Stack Overflow thread mentions a function to use in place of file.read(), namely file.readlines().
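The difference between the two calls can be seen in a small sketch. io.StringIO stands in for a file opened with open(), and the sample text is made up:

```python
import io

# Stand-in for the contents of a saved HTML file.
sample = "<html>\n<body>\n<p>We the People</p>\n</body>\n</html>\n"

# read() returns the whole document as one single string.
whole = io.StringIO(sample).read()

# readlines() returns a list with one string per line.
lines = io.StringIO(sample).readlines()

print(type(whole).__name__, len(whole))
print(type(lines).__name__, len(lines))

# readlines() keeps the "\n" at the end of each line, so strip it
# before storing the lines anywhere.
clean = [line.rstrip("\n") for line in lines]
print(clean)
```

With a real file the same calls apply to the object returned by open(); the list from readlines() is already in the one-item-per-line shape that row-by-row inserts need.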

This is my starting point going after this goal again!

Thank you everyone again for this forum! Merry Christmas and Happy New Years!

Best Regards,

Brandon Kastning