Python Obstacles | Kung-Fu | Full File HTML Document Scrape and Store it in MariaDB
#1
Python Obstacles | Kung-Fu | Full File HTML Document Scrape and Store it in MariaDB Two Different Ways

A.) Entire HTML Documents w/ Tags -> MariaDB -> as 1 Column Insert

1 table can store multiple entries of single column inserts as a collection

B.) Entire HTML Documents w/ Tags -> MariaDB -> Row by Row Inserts

table = file_name

the table is populated row by row, one row per line of the read and parsed HTML Document / File


I am not sure where to begin here; just like last time, I will start the thread with the objectives and then use the internet as my source of learning (and anyone who would like to assist me and others, feel free to bring the python!)

Thank you everyone for making this Journey possible and enjoyable!

Best Regards,

Brandon Kastning
Disabled American Constitutional Pre-Law Student
“And one of the elders saith unto me, Weep not: behold, the Lion of the tribe of Juda, the Root of David, hath prevailed to open the book,...” - Revelation 5:5 (KJV)

“And oppress not the widow, nor the fatherless, the stranger, nor the poor; and ...” - Zechariah 7:10 (KJV)

#LetHISPeopleGo

#2
Sources (Tutorials, Blogs, etc)

https://cmdlinetips.com/2018/01/how-to-r...in-python/

https://www.guru99.com/accessing-interne...ython.html

Database 1/2: [Kung-Fu_A]

CREATE TABLE `KungFuA` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `full_html_document_with_tags` text COLLATE utf8mb4_unicode_ci NULL,
  `python_entry_timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
  AUTO_INCREMENT=1;
Python Script 1/2: [Kung-Fu_A]

# Finalized on Python-Forum.io
# Disabled American Constitutional Pre-Law Student: BrandonKastning
# Date: 11/22/2021
# Script: Kung-Fu_A_FullDocument_1_Column_Insert.py
# Purpose: Building Block for Python + MariaDB 10.4.x
# Thread URL with Sources of Learning (Cited on Board)
# https://python-forum.io/thread-35607.html

# Import Python 3.9.9 compatible OpenSource Libraries
import urllib.request 
import pymysql
import pymysql.cursors
  
# Website HTTP URL to Grab Data From
url = "https://law.justia.com/constitution/us/preamble.html" 

# Assign a Python Variable to urllib.request URL to work with
html = urllib.request.urlopen(url)
  

# Connect to MariaDB 10.4.x with a Database selected using pymysql
connection = pymysql.connect(host='localhost',
                 user='brandon',
                 password='password',
                 database='Battle_Python1',
                 charset='utf8mb4',
                 cursorclass=pymysql.cursors.DictCursor)

# Copy Remote HTML Document into Variable

all_of_it = html.read()

print ("Remote HTML Document stored into Memory and Preparing to pass to MariaDB for INSERT")

# Assign a Python Variable to All Lines of File Read

full_html_document_with_tags = all_of_it

# INSERT the document (held in "full_html_document_with_tags") into MariaDB

try:
    with connection.cursor() as cursor:
        sql = "INSERT INTO `KungFuA` (`full_html_document_with_tags`) VALUES (%s)"
        # Parameters go in a tuple; the trailing comma makes a one-element tuple
        cursor.execute(sql, (full_html_document_with_tags,))
    connection.commit()
finally:
    connection.close()

# Checking Code / Error Free
print ("The code is Error free to this line!")
Run Successfully 1/2 - KungFu_A:

 
brandon@FireDragon:~/Python/02_Kung-Fu$ python Kung-Fu_A_FullDocument_1_Column_Insert.py
Remote HTML Document stored into Memory and Preparing to pass to MariaDB for INSERT
The code is Error free to this line!
brandon@FireDragon:~/Python/02_Kung-Fu$
Screenshot Evidence of Successful Run 1/2 - KungFu-A:

[Image: 1-2021-11-22-01-51-19.png]

[Image: 2-2021-11-22-01-51-46.png]

[Image: 3-2021-11-22-01-52-03.png]

Now that Part A works, Part B is going to be far trickier! The script must first create a table named after the target file/URL (or following a more efficient / functional naming convention, depending on the data project), then read the target file/URL line by line and insert one row into that newly created table for each line Python reads.
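A rough sketch of the "table = file_name" step, in case it helps anyone following along. MariaDB does not let a table name be passed as a `%s` parameter, so the name has to be cleaned up before it goes into the SQL string. The function names, the fallback name `page`, and the columns `line_number` / `line_with_tags` are my own assumptions for illustration, not something from the script above.

```python
import re
from urllib.parse import urlparse


def table_name_from_url(url: str) -> str:
    """Derive a table name from the last path segment of a URL (hypothetical helper)."""
    path = urlparse(url).path                    # e.g. "/constitution/us/preamble.html"
    stem = path.rstrip("/").rsplit("/", 1)[-1]   # e.g. "preamble.html"
    if "." in stem:
        stem = stem.rsplit(".", 1)[0]            # drop the file extension
    # Keep only letters, digits and underscores so the name is safe inside backticks
    name = re.sub(r"[^0-9A-Za-z_]", "_", stem) or "page"
    return name[:64]                             # MariaDB table names are limited to 64 characters


def create_table_sql(table: str) -> str:
    """Build a CREATE TABLE statement with one row per line (assumed column layout)."""
    return (
        f"CREATE TABLE IF NOT EXISTS `{table}` (\n"
        "  `line_number` INT NOT NULL,\n"
        "  `line_with_tags` TEXT NULL,\n"
        "  PRIMARY KEY (`line_number`)\n"
        ") ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci"
    )


table = table_name_from_url("https://law.justia.com/constitution/us/preamble.html")
print(table)
print(create_table_sql(table))
```

The resulting string could then be passed to `cursor.execute()` at the start of the script, before any rows are inserted.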
#3
Kung-Fu_B: Obstacles:

One of the ways I am trying to solve this objective & script is to:

#1) Find out how to count the lines of a stored Python variable.

#2) Create a list from that variable for the Python script to use (splitting the variable into x variables, one per line). [There is probably an easier way to solve this; however, this is where my brain is currently configured.]

#3) Then pass the data line by line, insert by insert.

I want to use my previous example from Kung-Fu_A (bringing the entire HTML remote document into local Python Memory and stored into Variable "all_of_it"). Then bring it down into lines and process each line as an individual insert!
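For steps #1 and #2 above, one possible approach is `str.splitlines()`, which turns a single string into a list with one item per line; `len()` of that list is the line count. This is only a sketch: the helper name `split_into_lines` and the small `sample` document are made up for illustration, and I am assuming the page is UTF-8 (undecodable bytes are replaced rather than raising an error).

```python
def split_into_lines(all_of_it: bytes, encoding: str = "utf-8") -> list[str]:
    """Decode the bytes returned by html.read() and split them into lines."""
    text = all_of_it.decode(encoding, errors="replace")
    return text.splitlines()          # line endings are removed, tags are kept


# Stand-in for the bytes that urllib.request.urlopen(url).read() would return
sample = (
    b"<html>\n"
    b"<head><title>Preamble</title></head>\n"
    b"<body>\n"
    b"<p>We the People</p>\n"
    b"</body>\n"
    b"</html>\n"
)

lines = split_into_lines(sample)
print("Line count:", len(lines))      # step #1: count the lines
for number, line in enumerate(lines, 1):
    print(number, line)               # step #2: each list item is one line
```

Each item in `lines` is then ready to be passed to its own INSERT, which is step #3.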

Stumped so far! If anyone has a tip or a way to key me in, please do! Thank you everyone for this forum! I will continue scouring the net for solutions!

Best Regards,

Brandon Kastning
Disabled American Constitutional Pre-Law Student
#4
(Nov-22-2021, 07:46 PM)BrandonKastning Wrote: I want to use my previous example from Kung-Fu_A (bringing the entire HTML remote document into local Python Memory and stored into Variable "all_of_it"). Then bring it down into lines and process each line as an individual insert!
The text paragraphs on the website are not well formatted, so one way is to treat each paragraph as a line and go with that.

The other way is to try splitting the text of each paragraph into lines.
Quick look.
import requests
from bs4 import BeautifulSoup

url = "https://law.justia.com/constitution/us/preamble.html"
html = requests.get(url)
soup = BeautifulSoup(html.content, 'lxml')
all_p = soup.select('p')
>>> print(all_p[0].text)
We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America.
>>> 
>>> print(all_p[2].text)
Although the preamble is not a source of power for any department of the Federal Government,1 the Supreme Court has often referred to it as evidence of the origin, scope, and purpose of the Constitution.2 “Its true office,” wrote Joseph Story in his Commentaries, “is to expound the nature and extent and application of the powers actually conferred by the Constitution, and not substantively to create them. For example, the preamble declares one object to be, ‘provide for the common defense.’ No one can doubt that this does not enlarge the powers of Congress to pass any measures which they deem useful for the common defence. But suppose the terms of a given power admit of two constructions, the one more restrictive, the other more liberal, and each of them is consistent with the words, but is, and ought to be, governed by the intent of the power; if one could promote and the other defeat the common defence, ought not the former, upon the soundest principles of interpretation, to be adopted?”3
So for the first paragraph there is really no good way to split it up, other than maybe splitting it in 3 based on length.
The second one could be split at each period (.).
>>> par_2 = all_p[2].text.split('.')
>>> for index, line in enumerate(par_2, 1):
...     print(f"{line} <line{index}>\n")
Output:
Although the preamble is not a source of power for any department of the Federal Government,1 the Supreme Court has often referred to it as evidence of the origin, scope, and purpose of the Constitution <line1>

2 “Its true office,” wrote Joseph Story in his Commentaries, “is to expound the nature and extent and application of the powers actually conferred by the Constitution, and not substantively to create them <line2>

 For example, the preamble declares one object to be, ‘provide for the common defense <line3>

’ No one can doubt that this does not enlarge the powers of Congress to pass any measures which they deem useful for the common defence <line4>

 But suppose the terms of a given power admit of two constructions, the one more restrictive, the other more liberal, and each of them is consistent with the words, but is, and ought to be, governed by the intent of the power; if one could promote and the other defeat the common defence, ought not the former, upon the soundest principles of interpretation, to be adopted?”3 <line5>
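For the first paragraph, the "split in 3 based on length" idea could be sketched with the standard library's `textwrap.wrap`, which breaks text at spaces so no word is cut in half. This is just one way to do it: a width of one third of the text length gives roughly three pieces, and occasionally a short fourth one, because breaks only happen at word boundaries.

```python
import math
import textwrap

# Text of the first paragraph, as printed by all_p[0].text
preamble = (
    "We the People of the United States, in Order to form a more perfect Union, "
    "establish Justice, insure domestic Tranquility, provide for the common defence, "
    "promote the general Welfare, and secure the Blessings of Liberty to ourselves "
    "and our Posterity, do ordain and establish this Constitution for the "
    "United States of America."
)

# Aim for about three pieces by limiting each piece to a third of the total length
width = math.ceil(len(preamble) / 3)
chunks = textwrap.wrap(preamble, width=width)

for index, chunk in enumerate(chunks, 1):
    print(f"{chunk} <line{index}>\n")
```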
#5
snippsat,

I am doing my best to grasp this new code you shared, and I appreciate it. Do you know of a way to read each line "as is" from the file that was read, with tags included? I think what you are demonstrating in this code is splitting the paragraph text of a specific tag, which is very useful.

Do you know how to do a line-by-line read with tags? (I don't know how Python interprets* that, if that's the correct phrase for what I am trying to convey.)
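One possible way to keep the tags is to skip the HTML parser entirely and split the raw document text on line breaks; Python treats it as ordinary text, so every tag stays in place. The sketch below pairs each line with its line number and hands the pairs to `cursor.executemany()`, which pymysql provides for running one INSERT per row. The helper names, the table name passed in, and the columns `line_number` / `line_with_tags` are assumptions for illustration, and the table is assumed to already exist.

```python
def numbered_rows(raw_html: bytes, encoding: str = "utf-8") -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for the raw document, tags included."""
    text = raw_html.decode(encoding, errors="replace")
    return [(number, line) for number, line in enumerate(text.splitlines(), 1)]


def insert_rows(connection, table: str, rows) -> None:
    """Insert one database row per line (table and column names are hypothetical)."""
    sql = f"INSERT INTO `{table}` (`line_number`, `line_with_tags`) VALUES (%s, %s)"
    with connection.cursor() as cursor:
        cursor.executemany(sql, rows)   # runs the INSERT once for every (number, line) pair
    connection.commit()


# Stand-in for html.read(); no parser is involved, so the tags survive untouched
sample = b"<p>\nWe the People\n</p>\n"
rows = numbered_rows(sample)
for row in rows:
    print(row)
```

With a real connection from `pymysql.connect(...)`, the call would look like `insert_rows(connection, "preamble", rows)`.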

I appreciate your contribution and encouragement to learn a complicated programming language (at least for me it is).

Best Regards,

~ Brandon

#6
Coming back around to this obstacle: Entire HTML Documents w/ Tags -> MariaDB -> Row by Row Inserts

table = file_name

the table is populated row by row, one row per line of the read and parsed HTML Document / File!



Sources - Blogs/Tutorials:


https://stackoverflow.com/questions/4533...-text-file

Similar to what snippsat shared with me about splitting text, this Stack Overflow question shows code that uses file.readlines() instead of file.read().
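To show the difference between the two calls, here is a small sketch that uses `io.BytesIO` as a stand-in for an opened file. The response object returned by `urllib.request.urlopen()` is also file-like, so the same two methods should be available on it. The sample document is made up for illustration.

```python
import io

# Made-up document standing in for a file or a urlopen() response
document = b"<html>\n<body>\n<p>We the People</p>\n</body>\n</html>\n"

# read() returns everything as one bytes object
whole = io.BytesIO(document).read()
print(type(whole), len(whole))

# readlines() returns a list with one bytes item per line, line endings included
lines = io.BytesIO(document).readlines()
print(len(lines), lines[0])

# Decode each line and strip the line ending before storing it
stripped = [line.decode("utf-8").rstrip("\r\n") for line in lines]
for number, line in enumerate(stripped, 1):
    print(number, line)
```

Note that `readlines()` keeps the trailing newline on each item, which is why the last step strips it.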

This is my starting point going after this goal again!

Thank you everyone again for this forum! Merry Christmas and Happy New Years!

Best Regards,

Brandon Kastning