Posts: 89
Threads: 26
Joined: Mar 2020
Python Obstacles | American Kenpo | Wiki Scrape URL/Table and Store it in MariaDB
I have several unresolved/completed threads here on this great forum and don't want to make a habit of it. However, what I am working on requires me to keep changing obstacles while leveling up at data aggregation and delivery, free for anyone to use.
I am currently creating tables in a database by hand, which is taking an enormous amount of time and energy. My source for every entry is Wikipedia. That made me pause and decide it is not worth reinventing the wheel by typing or copy-pasting each entry over and over until I am blue in the face and on the floor dead of old age.
Now that I have Python working on this computer, I want to level up to American Kenpo: the ability to use a very lightweight Python script to populate the columns of my database tables.
I am not sure of the best approach, but a book I bought says BeautifulSoup4 (bs4) is the answer, so I will proceed with that notion and build on my success so far in these obstacle threads.
Thank you again everyone and God Bless all Those who are True!
Best Regards,
Brandon Kastning
Dec-27-2021, 11:05 PM
(This post was last modified: Dec-27-2021, 11:05 PM by BrandonKastning. Edit Reason: forgot some important information)
Sources - Blogs/Tutorials:
https://www.greycampus.com/blog/data-sci...ith-python
I am struggling with this tutorial and with making it work for the table I have selected to scrape into MariaDB.
My Chosen URL:
https://en.wikipedia.org/wiki/List_of_counties_in_Washington
My broken code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

base_site = 'https://en.wikipedia.org/wiki/List_of_counties_in_Washington'
r = requests.get(base_site)
html = r.content
soup = BeautifulSoup(html, 'lxml')

County_table = soup.find('table', {'class': 'sortable wikitable'})

links = County_table.find('a')
Counties = []
for links in links:
    Counties.append(link.get('title'))
print(Counties)

table = soup.find_all('table')
Getting errors:
I am guessing I changed the wrong names when trying to use my table (which has the same name, class "sortable wikitable").
brandon@FireDragon:~/Python/03_American-Kenpo$ python3 wiki_greycampus.py
Traceback (most recent call last):
  File "wiki_greycampus.py", line 14, in <module>
    links = County_table.find('a')
AttributeError: 'NoneType' object has no attribute 'find'
brandon@FireDragon:~/Python/03_American-Kenpo$
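For what it's worth, the class-name guess is almost certainly the culprit: passing `{'class': 'sortable wikitable'}` matches the class attribute as an exact string, so word order matters, and when nothing matches, `find` returns `None`, which is where the `AttributeError` comes from. A CSS selector matches both classes regardless of order. Here is a small sketch against a trimmed stand-in for the Wikipedia markup (the two sample rows are illustrative; real use would fetch the page with `requests.get(base_site).content` first):

```python
from bs4 import BeautifulSoup

def county_titles(html):
    """Return link titles from the first table carrying both classes."""
    soup = BeautifulSoup(html, 'html.parser')
    # The CSS selector matches tables that have BOTH classes, in any order,
    # unlike an exact-string class match
    table = soup.select_one('table.wikitable.sortable')
    if table is None:
        return []
    # find_all (not find) to walk every <a> in the table; also note the
    # loop variable must differ from the list being iterated
    return [a['title'] for a in table.find_all('a') if a.get('title')]

# Trimmed stand-in for the real page markup
sample = """
<table class="wikitable sortable">
  <tr><td><a href="/wiki/Adams_County" title="Adams County, Washington">Adams County</a></td></tr>
  <tr><td><a href="/wiki/Asotin_County" title="Asotin County, Washington">Asotin County</a></td></tr>
</table>
"""
print(county_titles(sample))
```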
My goal with this wiki table is to pull the following columns (and to understand how to choose each column and scrape it into a specific database and table column): basically an easy-to-edit "wiki trigger finger" that populates one or several MariaDB table columns, either one at a time or all at once.
The ones I am interested in:
- County
- FIPS code
- County seat
- Est
- Population (2020)
& possibly the "Map" picture (however, I feel as if that's another learning curve).
Continuing on the same course, with a different starting point:
Sources: Blogs/Tutorials:
https://adesquared.wordpress.com/2013/06...dia-table/
My edit:
from bs4 import BeautifulSoup
import urllib2

wiki = "https://en.wikipedia.org/wiki/List_of_counties_in_Washington"
header = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(wiki, headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

county = ""
fips_code = ""
county_seat = ""
est = ""
formed_from = ""
etymology = ""
population = ""
land_area = ""
map = ""

table = soup.find("table", {"class": "wikitable sortable"})
f = open('output.csv', 'w')

for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 9:
        county = cells[0].find(text=True)
        fips_code = cells[1].findAll(text=True)
        county_seat = cells[2].findAll(text=True)
        est = cells[3].findAll(text=True)
        formed_from = cells[4].findAll(text=True)
        etymology = cells[5].findAll(text=True)
        population = cells[6].findAll(text=True)
        land_area = cells[7].findAll(text=True)
        map = cells[8].findAll(text=True)

        for x in range(len(county)):
            county_list = county[x].split(",")
            for i in range(len(county_list)):
                if (len(county_list[i]) > 2) and (len(county_list[i]) <= 5):
                    write_to_file = area + "," + county_list[i].lstrip('\n').strip() + "," + county_seat + "," + est + "," + formed_from + "," + etymology + $
                    print(write_to_file)
                    f.write(write_to_file)

f.close()
I am getting Python 3 errors (which appear to be Python 2 code inside this nice-looking solution), and I am unsure how to resolve them at this point. The code looks amazingly close to perfect otherwise; the ability to select columns by index [0] through [8] is brilliant and exactly what I was looking for.
Just not 100% boxset perfect yet.
Errors:
brandon@FireDragon:~/Python/03_American-Kenpo$ python3 wiki_adesquared_modified.py
Traceback (most recent call last):
  File "wiki_adesquared_modified.py", line 2, in <module>
    import urllib2
ImportError: No module named 'urllib2'
brandon@FireDragon:~/Python/03_American-Kenpo$
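The import failure is a pure Python 2 vs 3 issue: `urllib2` was folded into `urllib.request` (and `urllib.error`) in Python 3, and `print` became a function. A minimal sketch of the Python 3 equivalents for the top of that script, stopping short of the actual network call:

```python
from urllib.request import Request, urlopen  # Python 3 home of urllib2's API

wiki = 'https://en.wikipedia.org/wiki/List_of_counties_in_Washington'
header = {'User-Agent': 'Mozilla/5.0'}

# In Python 2 this was urllib2.Request(wiki, headers=header);
# building the Request does not touch the network yet
req = Request(wiki, headers=header)
print(req.full_url)
print(req.get_header('User-agent'))

# Fetching and parsing would then proceed exactly as before:
#   page = urlopen(req)
#   soup = BeautifulSoup(page, 'lxml')
```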
New Direction: Using only Pandas to pull the entire Table in
Source: Blogs/Tutorials:
https://alanhylands.com/how-to-web-scrap...up-pandas/
Apparently pandas can suck a wiki table up easily and create what I believe is called a "pandas DataFrame".
This is what I learned so far:
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_counties_in_Washington'
tables = pd.read_html(url)
print(tables[1])
I do not know why only tables[1] works here, because the demo used tables[3] and that worked for its page. I suspect it depends on how many tables appear on the article before the one you want, since pd.read_html returns every table on the page in document order, but I am not certain why tables[1] is the right index for my Counties in Washington State page when tables[3] was right for the demo article.
With that said, this is great! Now, how do I break the data up and use it to populate a MariaDB database, and eventually do so quickly?
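Breaking a DataFrame up is mostly column selection: indexing with a list of labels returns just those columns, and `itertuples` yields one tuple per row, which is the shape an SQL INSERT wants. A sketch using a tiny stand-in frame (the real one comes from `pd.read_html(url)`, and the real column labels, e.g. `County seat[11]`, must be matched exactly as they appear):

```python
import pandas as pd

# Toy stand-in for tables[1]; values mirror the first two counties
df = pd.DataFrame({
    'County': ['Adams County', 'Asotin County'],
    'FIPS code': [1, 3],
    'Population (2020)': [20613, 22285],
    'Map': [float('nan')] * 2,
})

# Keep only the columns of interest, in the order you want them stored
wanted = df[['County', 'FIPS code', 'Population (2020)']]

# One tuple per row: ready to feed to cursor.execute / executemany later
rows = list(wanted.itertuples(index=False, name=None))
print(rows)
```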
Output:
brandon@FireDragon:~/Python/03_American-Kenpo$ python3 wiki_justpandas2.py
County FIPS code County seat[11] ... Population (2020) Land area[11] Map
0 Adams County 1 Ritzville ... 20613 1,925 sq mi (4,986 km2) NaN
1 Asotin County 3 Asotin ... 22285 636 sq mi (1,647 km2) NaN
2 Benton County 5 Prosser ... 206873 1,700 sq mi (4,403 km2) NaN
3 Chelan County 7 Wenatchee ... 79074 2,920 sq mi (7,563 km2) NaN
4 Clallam County 9 Port Angeles ... 77155 1,738 sq mi (4,501 km2) NaN
5 Clark County 11 Vancouver ... 503311 629 sq mi (1,629 km2) NaN
6 Columbia County 13 Dayton ... 3952 869 sq mi (2,251 km2) NaN
7 Cowlitz County 15 Kelso ... 110730 1,139 sq mi (2,950 km2) NaN
8 Douglas County 17 Waterville ... 42938 1,819 sq mi (4,711 km2) NaN
9 Ferry County 19 Republic ... 7178 2,204 sq mi (5,708 km2) NaN
10 Franklin County 21 Pasco ... 96749 1,242 sq mi (3,217 km2) NaN
11 Garfield County 23 Pomeroy ... 2286 710 sq mi (1,839 km2) NaN
12 Grant County 25 Ephrata ... 99123 2,680 sq mi (6,941 km2) NaN
13 Grays Harbor County 27 Montesano ... 75636 1,902 sq mi (4,926 km2) NaN
14 Island County 29 Coupeville ... 86857 209 sq mi (541 km2) NaN
15 Jefferson County 31 Port Townsend ... 32977 1,804 sq mi (4,672 km2) NaN
16 King County 33 Seattle ... 2269675 2,115 sq mi (5,478 km2) NaN
17 Kitsap County 35 Port Orchard ... 275611 395 sq mi (1,023 km2) NaN
18 Kittitas County 37 Ellensburg ... 44337 2,297 sq mi (5,949 km2) NaN
19 Klickitat County 39 Goldendale ... 22735 1,872 sq mi (4,848 km2) NaN
20 Lewis County 41 Chehalis ... 82149 2,403 sq mi (6,224 km2) NaN
21 Lincoln County 43 Davenport ... 10876 2,311 sq mi (5,985 km2) NaN
22 Mason County 45 Shelton ... 65726 959 sq mi (2,484 km2) NaN
23 Okanogan County 47 Okanogan ... 42104 5,268 sq mi (13,644 km2) NaN
24 Pacific County 49 South Bend ... 23365 933 sq mi (2,416 km2) NaN
25 Pend Oreille County 51 Newport ... 13401 1,400 sq mi (3,626 km2) NaN
26 Pierce County 53 Tacoma ... 921130 1,670 sq mi (4,325 km2) NaN
27 San Juan County 55 Friday Harbor ... 17788 174 sq mi (451 km2) NaN
28 Skagit County 57 Mount Vernon ... 129523 1,731 sq mi (4,483 km2) NaN
29 Skamania County 59 Stevenson ... 12036 1,656 sq mi (4,289 km2) NaN
30 Snohomish County 61 Everett ... 827957 2,087 sq mi (5,405 km2) NaN
31 Spokane County 63 Spokane ... 539339 1,764 sq mi (4,569 km2) NaN
32 Stevens County 65 Colville ... 46445 2,478 sq mi (6,418 km2) NaN
33 Thurston County 67 Olympia ... 294793 722 sq mi (1,870 km2) NaN
34 Wahkiakum County 69 Cathlamet ... 4422 264 sq mi (684 km2) NaN
35 Walla Walla County 71 Walla Walla ... 62584 1,270 sq mi (3,289 km2) NaN
36 Whatcom County 73 Bellingham ... 226847 2,107 sq mi (5,457 km2) NaN
37 Whitman County 75 Colfax ... 47973 2,159 sq mi (5,592 km2) NaN
38 Yakima County 77 Yakima ... 256728 4,296 sq mi (11,127 km2) NaN
39 Washington (state) 53 Olympia (state capital) ... 7705281 66,544 sq mi (172,350 km2) NaN
[40 rows x 9 columns]
brandon@FireDragon:~/Python/03_American-Kenpo$ nano -c wiki_justpandas2.py
How to:
a) Use this existing pandas DataFrame of my target Wikipedia table, "Washington State Counties," and save it to CSV
b) Copy all 39 county rows for each column and write them to MySQL/MariaDB
Thanks for this forum everyone!
Dec-28-2021, 12:59 AM
(This post was last modified: Dec-28-2021, 01:00 AM by BrandonKastning.)
Solution, Part A: copy the wiki table into a pandas DataFrame, then have it write a CSV file!
Source Credits: (Based my code and understanding on):
https://alanhylands.com/how-to-web-scrape-wikipedia-python-urllib-beautiful-soup-pandas/
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_counties_in_Washington'
tables = pd.read_html(url)
print(tables[1])

df = tables[1]
df.to_csv('wiki_pandas_df_output_csv.csv', sep='\t', encoding='utf-8', index=False)
This has successfully written "wiki_pandas_df_output_csv.csv", which opens in LibreOffice Calc on Debian Linux no problem! I also now see that row 1 holds the column names. This is useful, and I would also like to know how to have df.to_csv omit the column names entirely when creating the CSV.
This is good progress! Pandas is quite nice.
I just found out that the import dialog that pops up when opening the .csv file in LibreOffice Calc lets you start from a specific row number; I changed it to 2 and that trimmed the column names off. I suppose the long way would be re-saving the file after opening it from row 2, if you want to quickly create a list of several columns for MySQL transfer (my end goal).
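As it happens, no LibreOffice workaround is needed: `to_csv` accepts `header=False`, which omits the column-name row at write time. A minimal sketch with a one-row stand-in frame:

```python
import pandas as pd

# One-row stand-in for the scraped DataFrame
df = pd.DataFrame({'County': ['Adams County'], 'FIPS code': [1]})

# header=False drops the column-name row entirely; index=False drops
# the row-number column, matching the earlier to_csv call
df.to_csv('no_header.csv', sep='\t', index=False, header=False)

print(open('no_header.csv').read())
```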
Working on moving the wiki table from CSV into MariaDB, I have written some (still broken) code. I am generating errors and am not quite sure how to fix them.
Source: Blogs/Tutorials:
https://stackoverflow.com/questions/10154633/load-csv-data-into-mysql-in-python
This is what I have come up with so far:
Created a database called: "Eggnest" and a Table called "csv_wiki_counties_States_WA":
This is the CSV open in LibreOffice Calc trimming row 1 by opening starting with row 2:
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_counties_in_Washington'
tables = pd.read_html(url)
print(tables[1])

df = tables[1]
df.to_csv('wiki_pandas_df_output_csv_to_sql2.csv', sep='\t', encoding='utf-8', index=False)

import csv
import MySQLdb

mydb = MySQLdb.connect(host='localhost',
                       user="brandon",
                       passwd="password",
                       db="Eggnest")
mycursor = mydb.cursor()

dataframe = csv.reader(open('wiki_pandas_df_output_csv_to_sql2.csv'))
print(dataframe)

for row in dataframe:
    mycursor.execute('INSERT INTO csv_wiki_counties_States_WA (american_county_name, american_county_fips_code, american_county_est, american_county_formed_from, american_county_etymology, american_county_population_census_2020, american_county_land_area") VALUES("%s", "%s", "%s", "%s", "%s", "%s", "%s", "%s")', row)

mydb.commit()
cursor.close()
I am getting the following output with zero successful INSERTs:
brandon@FireDragon:~/Python/03_American-Kenpo$ python3 wiki_pandas_df_csv_2_sql2.py
County FIPS code County seat[11] ... Population (2020) Land area[11] Map
0 Adams County 1 Ritzville ... 20613 1,925 sq mi (4,986 km2) NaN
1 Asotin County 3 Asotin ... 22285 636 sq mi (1,647 km2) NaN
2 Benton County 5 Prosser ... 206873 1,700 sq mi (4,403 km2) NaN
3 Chelan County 7 Wenatchee ... 79074 2,920 sq mi (7,563 km2) NaN
4 Clallam County 9 Port Angeles ... 77155 1,738 sq mi (4,501 km2) NaN
5 Clark County 11 Vancouver ... 503311 629 sq mi (1,629 km2) NaN
6 Columbia County 13 Dayton ... 3952 869 sq mi (2,251 km2) NaN
7 Cowlitz County 15 Kelso ... 110730 1,139 sq mi (2,950 km2) NaN
8 Douglas County 17 Waterville ... 42938 1,819 sq mi (4,711 km2) NaN
9 Ferry County 19 Republic ... 7178 2,204 sq mi (5,708 km2) NaN
10 Franklin County 21 Pasco ... 96749 1,242 sq mi (3,217 km2) NaN
11 Garfield County 23 Pomeroy ... 2286 710 sq mi (1,839 km2) NaN
12 Grant County 25 Ephrata ... 99123 2,680 sq mi (6,941 km2) NaN
13 Grays Harbor County 27 Montesano ... 75636 1,902 sq mi (4,926 km2) NaN
14 Island County 29 Coupeville ... 86857 209 sq mi (541 km2) NaN
15 Jefferson County 31 Port Townsend ... 32977 1,804 sq mi (4,672 km2) NaN
16 King County 33 Seattle ... 2269675 2,115 sq mi (5,478 km2) NaN
17 Kitsap County 35 Port Orchard ... 275611 395 sq mi (1,023 km2) NaN
18 Kittitas County 37 Ellensburg ... 44337 2,297 sq mi (5,949 km2) NaN
19 Klickitat County 39 Goldendale ... 22735 1,872 sq mi (4,848 km2) NaN
20 Lewis County 41 Chehalis ... 82149 2,403 sq mi (6,224 km2) NaN
21 Lincoln County 43 Davenport ... 10876 2,311 sq mi (5,985 km2) NaN
22 Mason County 45 Shelton ... 65726 959 sq mi (2,484 km2) NaN
23 Okanogan County 47 Okanogan ... 42104 5,268 sq mi (13,644 km2) NaN
24 Pacific County 49 South Bend ... 23365 933 sq mi (2,416 km2) NaN
25 Pend Oreille County 51 Newport ... 13401 1,400 sq mi (3,626 km2) NaN
26 Pierce County 53 Tacoma ... 921130 1,670 sq mi (4,325 km2) NaN
27 San Juan County 55 Friday Harbor ... 17788 174 sq mi (451 km2) NaN
28 Skagit County 57 Mount Vernon ... 129523 1,731 sq mi (4,483 km2) NaN
29 Skamania County 59 Stevenson ... 12036 1,656 sq mi (4,289 km2) NaN
30 Snohomish County 61 Everett ... 827957 2,087 sq mi (5,405 km2) NaN
31 Spokane County 63 Spokane ... 539339 1,764 sq mi (4,569 km2) NaN
32 Stevens County 65 Colville ... 46445 2,478 sq mi (6,418 km2) NaN
33 Thurston County 67 Olympia ... 294793 722 sq mi (1,870 km2) NaN
34 Wahkiakum County 69 Cathlamet ... 4422 264 sq mi (684 km2) NaN
35 Walla Walla County 71 Walla Walla ... 62584 1,270 sq mi (3,289 km2) NaN
36 Whatcom County 73 Bellingham ... 226847 2,107 sq mi (5,457 km2) NaN
37 Whitman County 75 Colfax ... 47973 2,159 sq mi (5,592 km2) NaN
38 Yakima County 77 Yakima ... 256728 4,296 sq mi (11,127 km2) NaN
39 Washington (state) 53 Olympia (state capital) ... 7705281 66,544 sq mi (172,350 km2) NaN
[40 rows x 9 columns]
<_csv.reader object at 0x7fd3f8c396d8>
Traceback (most recent call last):
  File "wiki_pandas_df_csv_2_sql2.py", line 25, in <module>
    mycursor.execute('INSERT INTO csv_wiki_counties_States_WA (american_county_name, american_county_fips_code, american_county_est, american_county_formed_from, american_county_etymology, american_county_population_census_2020, american_county_land_area") VALUES("%s", "%s", "%s", "%s", "%s", "%s", "%s", "%s")', row)
  File "/usr/lib/python3/dist-packages/MySQLdb/cursors.py", line 210, in execute
    query = query % args
TypeError: not enough arguments for format string
brandon@FireDragon:~/Python/03_American-Kenpo$
If anyone has an idea why this doesn't work! I would really like to know why! Thank you! :)
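Two things appear to be going wrong here (my reading of the traceback, not tested against your database): first, the CSV was written with `sep='\t'`, but `csv.reader` defaults to commas, so each row comes back as a single field; second, the INSERT lists 7 columns (with a stray quote after `american_county_land_area`) but carries 8 `%s` placeholders, and MySQLdb wants bare, unquoted `%s` markers with the values passed as a separate argument so the driver does the quoting. A sketch of the corrected query construction, parsing one illustrative sample row (the `execute` call itself is left as a comment, since it needs a live connection):

```python
import csv
import io

# The 7 columns of the target MariaDB table
columns = ('american_county_name', 'american_county_fips_code',
           'american_county_est', 'american_county_formed_from',
           'american_county_etymology',
           'american_county_population_census_2020',
           'american_county_land_area')

# One bare %s per column; MySQLdb quotes and escapes the values itself
sql = 'INSERT INTO csv_wiki_counties_States_WA ({}) VALUES ({})'.format(
    ', '.join(columns), ', '.join(['%s'] * len(columns)))
print(sql)

# The CSV was written with sep='\t', so the reader must use delimiter='\t';
# with the default comma delimiter each row comes back as ONE field, which
# is why 8 placeholders saw "not enough arguments for format string"
sample_line = 'Adams County\t1\tRitzville\t1883\tWhitman\tJohn Adams\t20613\t1,925 sq mi\tNaN\n'
row = next(csv.reader(io.StringIO(sample_line), delimiter='\t'))
print(len(row))  # the CSV carries 9 fields per row

# With a live connection, skip the header row and insert the 7 mapped fields:
#   reader = csv.reader(open('wiki_pandas_df_output_csv_to_sql2.csv', newline=''), delimiter='\t')
#   next(reader)
#   for row in reader:
#       mycursor.execute(sql, row[:len(columns)])
#   mydb.commit()
```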
Dec-29-2021, 12:38 AM
(This post was last modified: Dec-29-2021, 12:38 AM by BrandonKastning. Edit Reason: formatting & solution update)
Approach using Pandas to read a CSV into a dataframe and then INSERT it into MariaDB
Source - Blogs/Tutorials:
https://stackoverflow.com/questions/3814...sqlalchemy
Code Modified:
import pandas as pd
import mysql.connector
from sqlalchemy import create_engine

# The engine line was missing from the snippet as originally posted;
# credentials reconstructed from the earlier MySQLdb.connect call
engine = create_engine('mysql+mysqlconnector://brandon:password@localhost/Eggnest')

myd = pd.read_csv('wiki_wash_counties_with_headers.csv')
myd.to_sql(name='csv1', con=engine, if_exists='replace', index=False)
Error Output:
brandon@FireDragon:~/Python/03_American-Kenpo$ python3 pandascsv2sql1.py
Traceback (most recent call last):
  File "pandascsv2sql1.py", line 2, in <module>
    import mysql.connector
  File "/home/brandon/.local/lib/python3.5/site-packages/mysql/connector/__init__.py", line 54, in <module>
    from .connection import MySQLConnection
  File "/home/brandon/.local/lib/python3.5/site-packages/mysql/connector/connection.py", line 451
    f"This connection is using {tls_version} which is now "
                                                          ^
SyntaxError: invalid syntax
Another great looking solution with broken code! :)
SOLVED:
The error is resolved by running the following as the user who executes the Python script (installing as root is not enough; it must also be installed for the lower-privileged user). The underlying issue is that the current mysql-connector release uses f-strings, which Python 3.5 does not support, while the -rf package is an older build:

pip3 install mysql-connector-python-rf
Source for Fix:
https://stackoverflow.com/questions/3287...amed-mysql
New Execute & Output:
brandon@FireDragon:~/Python/03_American-Kenpo$ python3 pandascsv2sql1.py
brandon@FireDragon:~/Python/03_American-Kenpo$
It's a little messy with extra columns (not sure how I managed that), but this code works!
Note: I edited the header columns (the names on row 1) to match the existing table I had created. However, this code creates a brand-new table with one column for each column in the CSV file.
Also, the cell formatting in LibreOffice Calc did not hold for the american_county_fips column, which contains numbers that are supposed to read 001 for 1 and 010 for 10. Even after adding leading zeros and saving the CSV before running the code, MariaDB shows them as plain numeric values. I haven't figured this out yet, as I just got this far!
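The vanishing leading zeros are a type problem: `read_html` parses the FIPS column as integers, and integers have no leading zeros; the CSV round-trip through Calc then re-parses them the same way. One fix is to force the column to zero-padded strings before writing, and to store it as a CHAR/VARCHAR rather than a numeric type in MariaDB. A small sketch (column name stands in for whichever label the scraped frame actually uses):

```python
import pandas as pd

# Stand-in for the scraped frame; the real one comes from pd.read_html(url)
df = pd.DataFrame({'County': ['Adams County', 'Franklin County'],
                   'FIPS code': [1, 21]})

# Cast to string and left-pad to 3 digits, so 1 -> '001' and 21 -> '021';
# as text, the values survive to_csv and to_sql without re-parsing
df['FIPS code'] = df['FIPS code'].astype(str).str.zfill(3)
print(df['FIPS code'].tolist())
```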
Thank you again everyone for this forum!
Best Regards,
Brandon Kastning
"In The Trenches, Taking Grenades!"