Posts: 89
Threads: 26
Joined: Mar 2020
Python Obstacles | American Kenpo | Wiki Scrape URL/Table and Store it in MariaDB
I have several unresolved/completed threads here on this great forum and don't want to make a habit of it. However, what I am working on requires me to keep changing obstacles while leveling up at data aggregation and delivery, free for anyone to use.
I am currently creating tables in a database by hand, which is taking an enormous amount of time and energy. My source for every entry is Wikipedia. That made me pause and decide it is not worth reinventing the wheel by typing or copy-pasting each entry over and over until I am blue in the face and on the floor dead of old age.
Now that I have Python working on this computer, I want to level up to American Kenpo: the ability to use a very lightweight Python script to populate the columns of my database tables.
I am not sure of the best approach, but a book I bought says BeautifulSoup4 (bs4) is the answer, so I will proceed with that notion and build on my success so far in these obstacle threads.
Thank you again everyone and God Bless all Those who are True!
Best Regards,
Brandon Kastning
Dec-27-2021, 11:05 PM
(This post was last modified: Dec-27-2021, 11:05 PM by BrandonKastning. Edit Reason: forgot some important information)
Sources - Blogs/Tutorials:
https://www.greycampus.com/blog/data-sci...ith-python
I am struggling with this tutorial and with making it work for the table I have selected to scrape into MariaDB.
My Chosen URL:
https://en.wikipedia.org/wiki/List_of_counties_in_Washington
My broken code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

base_site = 'https://en.wikipedia.org/wiki/List_of_counties_in_Washington'
r = requests.get(base_site)
html = r.content
soup = BeautifulSoup(html, 'lxml')

County_table = soup.find('table', {'class': 'sortable wikitable'})

links = County_table.find('a')
Counties = []
for links in links:
    Counties.append(link.get('title'))
print(Counties)

table = soup.find_all('table')
Getting errors:
I am guessing I changed the wrong names when trying to use my table (which has the same name, class "sortable wikitable").
brandon@FireDragon:~/Python/03_American-Kenpo$ python3 wiki_greycampus.py
Traceback (most recent call last):
  File "wiki_greycampus.py", line 14, in <module>
    links = County_table.find('a')
AttributeError: 'NoneType' object has no attribute 'find'
brandon@FireDragon:~/Python/03_American-Kenpo$
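For what it's worth, the class-name guess is almost certainly the culprit: passing `{'class': 'sortable wikitable'}` matches the class attribute as an exact string, so word order matters, and when nothing matches, `find` returns `None`, which is where the `AttributeError` comes from. A CSS selector matches both classes regardless of order. Here is a small sketch against a trimmed stand-in for the Wikipedia markup (the two sample rows are illustrative; real use would fetch the page with `requests.get(base_site).content` first):

```python
from bs4 import BeautifulSoup

def county_titles(html):
    """Return link titles from the first table carrying both classes."""
    soup = BeautifulSoup(html, 'html.parser')
    # The CSS selector matches tables that have BOTH classes, in any order,
    # unlike an exact-string class match
    table = soup.select_one('table.wikitable.sortable')
    if table is None:
        return []
    # find_all (not find) to walk every <a> in the table; also note the
    # loop variable must differ from the list being iterated
    return [a['title'] for a in table.find_all('a') if a.get('title')]

# Trimmed stand-in for the real page markup
sample = """
<table class="wikitable sortable">
  <tr><td><a href="/wiki/Adams_County" title="Adams County, Washington">Adams County</a></td></tr>
  <tr><td><a href="/wiki/Asotin_County" title="Asotin County, Washington">Asotin County</a></td></tr>
</table>
"""
print(county_titles(sample))
```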
My goal with this wiki table is to pull the following columns (and to understand how to choose each column and scrape it into a specific database and table column): basically an easy-to-edit "wiki trigger finger" that populates one or several MariaDB table columns, either one at a time or all at once.
The ones I am interested in:
- County
- FIPS code
- County seat
- Est
- Population (2020)
& possibly the "Map" picture (however, I feel as if that's another learning curve).
Continuing on the same course, with a different starting point:
Sources: Blogs/Tutorials:
https://adesquared.wordpress.com/2013/06...dia-table/
My edit:
from bs4 import BeautifulSoup
import urllib2

wiki = "https://en.wikipedia.org/wiki/List_of_counties_in_Washington"
header = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(wiki, headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

county = ""
fips_code = ""
county_seat = ""
est = ""
formed_from = ""
etymology = ""
population = ""
land_area = ""
map = ""

table = soup.find("table", {"class": "wikitable sortable"})
f = open('output.csv', 'w')

for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 9:
        county = cells[0].find(text=True)
        fips_code = cells[1].findAll(text=True)
        county_seat = cells[2].findAll(text=True)
        est = cells[3].findAll(text=True)
        formed_from = cells[4].findAll(text=True)
        etymology = cells[5].findAll(text=True)
        population = cells[6].findAll(text=True)
        land_area = cells[7].findAll(text=True)
        map = cells[8].findAll(text=True)

        for x in range(len(county)):
            county_list = county[x].split(",")
            for i in range(len(county_list)):
                if (len(county_list[i]) > 2) and (len(county_list[i]) <= 5):
                    write_to_file = area + "," + county_list[i].lstrip('\n').strip() + "," + county_seat + "," + est + "," + formed_from + "," + etymology + $
                    print(write_to_file)
                    f.write(write_to_file)

f.close()
I am getting Python 3 errors (which appear to be Python 2 code inside this nice-looking solution), and I am unsure how to resolve them at this point. The code looks amazingly close to perfect otherwise; the ability to select columns by index [0] through [8] is brilliant and exactly what I was looking for.
Just not 100% boxset perfect yet.
Errors:
brandon@FireDragon:~/Python/03_American-Kenpo$ python3 wiki_adesquared_modified.py
Traceback (most recent call last):
  File "wiki_adesquared_modified.py", line 2, in <module>
    import urllib2
ImportError: No module named 'urllib2'
brandon@FireDragon:~/Python/03_American-Kenpo$
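The import failure is a pure Python 2 vs 3 issue: `urllib2` was folded into `urllib.request` (and `urllib.error`) in Python 3, and `print` became a function. A minimal sketch of the Python 3 equivalents for the top of that script, stopping short of the actual network call:

```python
from urllib.request import Request, urlopen  # Python 3 home of urllib2's API

wiki = 'https://en.wikipedia.org/wiki/List_of_counties_in_Washington'
header = {'User-Agent': 'Mozilla/5.0'}

# In Python 2 this was urllib2.Request(wiki, headers=header);
# building the Request does not touch the network yet
req = Request(wiki, headers=header)
print(req.full_url)
print(req.get_header('User-agent'))

# Fetching and parsing would then proceed exactly as before:
#   page = urlopen(req)
#   soup = BeautifulSoup(page, 'lxml')
```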
New Direction: Using only Pandas to pull the entire Table in
Source: Blogs/Tutorials:
https://alanhylands.com/how-to-web-scrap...up-pandas/
Apparently pandas can suck a wiki table up easily and create what I believe is called a "pandas DataFrame".
This is what I learned so far:
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_counties_in_Washington'
tables = pd.read_html(url)
print(tables[1])
I do not know why only tables[1] works here, because the demo used tables[3] and that worked for its page. I suspect it depends on how many tables appear on the article before the one you want, since pd.read_html returns every table on the page in document order, but I am not certain why tables[1] is the right index for my Counties in Washington State page when tables[3] was right for the demo article.
With that said, this is great! Now, how do I break the data up and use it to populate a MariaDB database, and eventually do so quickly?
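Breaking a DataFrame up is mostly column selection: indexing with a list of labels returns just those columns, and `itertuples` yields one tuple per row, which is the shape an SQL INSERT wants. A sketch using a tiny stand-in frame (the real one comes from `pd.read_html(url)`, and the real column labels, e.g. `County seat[11]`, must be matched exactly as they appear):

```python
import pandas as pd

# Toy stand-in for tables[1]; values mirror the first two counties
df = pd.DataFrame({
    'County': ['Adams County', 'Asotin County'],
    'FIPS code': [1, 3],
    'Population (2020)': [20613, 22285],
    'Map': [float('nan')] * 2,
})

# Keep only the columns of interest, in the order you want them stored
wanted = df[['County', 'FIPS code', 'Population (2020)']]

# One tuple per row: ready to feed to cursor.execute / executemany later
rows = list(wanted.itertuples(index=False, name=None))
print(rows)
```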
Output:
brandon@FireDragon:~/Python/03_American-Kenpo$ python3 wiki_justpandas2.py
County FIPS code County seat[11] ... Population (2020) Land area[11] Map
0 Adams County 1 Ritzville ... 20613 1,925 sq mi (4,986 km2) NaN
1 Asotin County 3 Asotin ... 22285 636 sq mi (1,647 km2) NaN
2 Benton County 5 Prosser ... 206873 1,700 sq mi (4,403 km2) NaN
3 Chelan County 7 Wenatchee ... 79074 2,920 sq mi (7,563 km2) NaN
4 Clallam County 9 Port Angeles ... 77155 1,738 sq mi (4,501 km2) NaN
5 Clark County 11 Vancouver ... 503311 629 sq mi (1,629 km2) NaN
6 Columbia County 13 Dayton ... 3952 869 sq mi (2,251 km2) NaN
7 Cowlitz County 15 Kelso ... 110730 1,139 sq mi (2,950 km2) NaN
8 Douglas County 17 Waterville ... 42938 1,819 sq mi (4,711 km2) NaN
9 Ferry County 19 Republic ... 7178 2,204 sq mi (5,708 km2) NaN
10 Franklin County 21 Pasco ... 96749 1,242 sq mi (3,217 km2) NaN
11 Garfield County 23 Pomeroy ... 2286 710 sq mi (1,839 km2) NaN
12 Grant County 25 Ephrata ... 99123 2,680 sq mi (6,941 km2) NaN
13 Grays Harbor County 27 Montesano ... 75636 1,902 sq mi (4,926 km2) NaN
14 Island County 29 Coupeville ... 86857 209 sq mi (541 km2) NaN
15 Jefferson County 31 Port Townsend ... 32977 1,804 sq mi (4,672 km2) NaN
16 King County 33 Seattle ... 2269675 2,115 sq mi (5,478 km2) NaN
17 Kitsap County 35 Port Orchard ... 275611 395 sq mi (1,023 km2) NaN
18 Kittitas County 37 Ellensburg ... 44337 2,297 sq mi (5,949 km2) NaN
19 Klickitat County 39 Goldendale ... 22735 1,872 sq mi (4,848 km2) NaN
20 Lewis County 41 Chehalis ... 82149 2,403 sq mi (6,224 km2) NaN
21 Lincoln County 43 Davenport ... 10876 2,311 sq mi (5,985 km2) NaN
22 Mason County 45 Shelton ... 65726 959 sq mi (2,484 km2) NaN
23 Okanogan County 47 Okanogan ... 42104 5,268 sq mi (13,644 km2) NaN
24 Pacific County 49 South Bend ... 23365 933 sq mi (2,416 km2) NaN
25 Pend Oreille County 51 Newport ... 13401 1,400 sq mi (3,626 km2) NaN
26 Pierce County 53 Tacoma ... 921130 1,670 sq mi (4,325 km2) NaN
27 San Juan County 55 Friday Harbor ... 17788 174 sq mi (451 km2) NaN
28 Skagit County 57 Mount Vernon ... 129523 1,731 sq mi (4,483 km2) NaN
29 Skamania County 59 Stevenson ... 12036 1,656 sq mi (4,289 km2) NaN
30 Snohomish County 61 Everett ... 827957 2,087 sq mi (5,405 km2) NaN
31 Spokane County 63 Spokane ... 539339 1,764 sq mi (4,569 km2) NaN
32 Stevens County 65 Colville ... 46445 2,478 sq mi (6,418 km2) NaN
33 Thurston County 67 Olympia ... 294793 722 sq mi (1,870 km2) NaN
34 Wahkiakum County 69 Cathlamet ... 4422 264 sq mi (684 km2) NaN
35 Walla Walla County 71 Walla Walla ... 62584 1,270 sq mi (3,289 km2) NaN
36 Whatcom County 73 Bellingham ... 226847 2,107 sq mi (5,457 km2) NaN
37 Whitman County 75 Colfax ... 47973 2,159 sq mi (5,592 km2) NaN
38 Yakima County 77 Yakima ... 256728 4,296 sq mi (11,127 km2) NaN
39 Washington (state) 53 Olympia (state capital) ... 7705281 66,544 sq mi (172,350 km2) NaN
[40 rows x 9 columns]
brandon@FireDragon:~/Python/03_American-Kenpo$ nano -c wiki_justpandas2.py
How to:
a) Use this existing pandas DataFrame of my target Wikipedia table, "Washington State Counties," and save it to CSV
b) Copy all 39 county rows for each column and write them to MySQL/MariaDB
Thanks for this forum everyone!
Dec-28-2021, 12:59 AM
(This post was last modified: Dec-28-2021, 01:00 AM by BrandonKastning.)
Solution, Part A: copy the wiki table into a pandas DataFrame, then have it write a CSV file!
Source Credits: (Based my code and understanding on):
https://alanhylands.com/how-to-web-scrape-wikipedia-python-urllib-beautiful-soup-pandas/
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_counties_in_Washington'
tables = pd.read_html(url)
print(tables[1])

df = tables[1]
df.to_csv('wiki_pandas_df_output_csv.csv', sep='\t', encoding='utf-8', index=False)
This has successfully written "wiki_pandas_df_output_csv.csv", which opens in LibreOffice Calc on Debian Linux no problem! I also now see that row 1 holds the column names. This is useful, and I would also like to know how to have df.to_csv omit the column names entirely when creating the CSV.
This is good progress! Pandas is quite nice.
I just found out that the import dialog that pops up when opening the .csv file in LibreOffice Calc lets you start from a specific row number; I changed it to 2 and that trimmed the column names off. I suppose the long way would be re-saving the file after opening it from row 2, if you want to quickly create a list of several columns for MySQL transfer (my end goal).
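As it happens, no LibreOffice workaround is needed: `to_csv` accepts `header=False`, which omits the column-name row at write time. A minimal sketch with a one-row stand-in frame:

```python
import pandas as pd

# One-row stand-in for the scraped DataFrame
df = pd.DataFrame({'County': ['Adams County'], 'FIPS code': [1]})

# header=False drops the column-name row entirely; index=False drops
# the row-number column, matching the earlier to_csv call
df.to_csv('no_header.csv', sep='\t', index=False, header=False)

print(open('no_header.csv').read())
```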
Working on moving the wiki table from CSV into MariaDB, I have written some (still broken) code. I am generating errors and am not quite sure how to fix them.
Source: Blogs/Tutorials:
https://stackoverflow.com/questions/10154633/load-csv-data-into-mysql-in-python
This is what I have come up with so far:
Created a database called: "Eggnest" and a Table called "csv_wiki_counties_States_WA":
This is the CSV open in LibreOffice Calc trimming row 1 by opening starting with row 2:
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_counties_in_Washington'
tables = pd.read_html(url)
print(tables[1])

df = tables[1]
df.to_csv('wiki_pandas_df_output_csv_to_sql2.csv', sep='\t', encoding='utf-8', index=False)

import csv
import MySQLdb

mydb = MySQLdb.connect(host='localhost',
                       user="brandon",
                       passwd="password",
                       db="Eggnest")
mycursor = mydb.cursor()

dataframe = csv.reader(open('wiki_pandas_df_output_csv_to_sql2.csv'))
print(dataframe)

for row in dataframe:
    mycursor.execute('INSERT INTO csv_wiki_counties_States_WA (american_county_name, american_county_fips_code, american_county_est, american_county_formed_from, american_county_etymology, american_county_population_census_2020, american_county_land_area") VALUES("%s", "%s", "%s", "%s", "%s", "%s", "%s", "%s")', row)

mydb.commit()
cursor.close()
I am getting the following output with zero successful INSERTs:
brandon@FireDragon:~/Python/03_American-Kenpo$ python3 wiki_pandas_df_csv_2_sql2.py
County FIPS code County seat[11] ... Population (2020) Land area[11] Map
0 Adams County 1 Ritzville ... 20613 1,925 sq mi (4,986 km2) NaN
1 Asotin County 3 Asotin ... 22285 636 sq mi (1,647 km2) NaN
2 Benton County 5 Prosser ... 206873 1,700 sq mi (4,403 km2) NaN
3 Chelan County 7 Wenatchee ... 79074 2,920 sq mi (7,563 km2) NaN
4 Clallam County 9 Port Angeles ... 77155 1,738 sq mi (4,501 km2) NaN
5 Clark County 11 Vancouver ... 503311 629 sq mi (1,629 km2) NaN
6 Columbia County 13 Dayton ... 3952 869 sq mi (2,251 km2) NaN
7 Cowlitz County 15 Kelso ... 110730 1,139 sq mi (2,950 km2) NaN
8 Douglas County 17 Waterville ... 42938 1,819 sq mi (4,711 km2) NaN
9 Ferry County 19 Republic ... 7178 2,204 sq mi (5,708 km2) NaN
10 Franklin County 21 Pasco ... 96749 1,242 sq mi (3,217 km2) NaN
11 Garfield County 23 Pomeroy ... 2286 710 sq mi (1,839 km2) NaN
12 Grant County 25 Ephrata ... 99123 2,680 sq mi (6,941 km2) NaN
13 Grays Harbor County 27 Montesano ... 75636 1,902 sq mi (4,926 km2) NaN
14 Island County 29 Coupeville ... 86857 209 sq mi (541 km2) NaN
15 Jefferson County 31 Port Townsend ... 32977 1,804 sq mi (4,672 km2) NaN
16 King County 33 Seattle ... 2269675 2,115 sq mi (5,478 km2) NaN
17 Kitsap County 35 Port Orchard ... 275611 395 sq mi (1,023 km2) NaN
18 Kittitas County 37 Ellensburg ... 44337 2,297 sq mi (5,949 km2) NaN
19 Klickitat County 39 Goldendale ... 22735 1,872 sq mi (4,848 km2) NaN
20 Lewis County 41 Chehalis ... 82149 2,403 sq mi (6,224 km2) NaN
21 Lincoln County 43 Davenport ... 10876 2,311 sq mi (5,985 km2) NaN
22 Mason County 45 Shelton ... 65726 959 sq mi (2,484 km2) NaN
23 Okanogan County 47 Okanogan ... 42104 5,268 sq mi (13,644 km2) NaN
24 Pacific County 49 South Bend ... 23365 933 sq mi (2,416 km2) NaN
25 Pend Oreille County 51 Newport ... 13401 1,400 sq mi (3,626 km2) NaN
26 Pierce County 53 Tacoma ... 921130 1,670 sq mi (4,325 km2) NaN
27 San Juan County 55 Friday Harbor ... 17788 174 sq mi (451 km2) NaN
28 Skagit County 57 Mount Vernon ... 129523 1,731 sq mi (4,483 km2) NaN
29 Skamania County 59 Stevenson ... 12036 1,656 sq mi (4,289 km2) NaN
30 Snohomish County 61 Everett ... 827957 2,087 sq mi (5,405 km2) NaN
31 Spokane County 63 Spokane ... 539339 1,764 sq mi (4,569 km2) NaN
32 Stevens County 65 Colville ... 46445 2,478 sq mi (6,418 km2) NaN
33 Thurston County 67 Olympia ... 294793 722 sq mi (1,870 km2) NaN
34 Wahkiakum County 69 Cathlamet ... 4422 264 sq mi (684 km2) NaN
35 Walla Walla County 71 Walla Walla ... 62584 1,270 sq mi (3,289 km2) NaN
36 Whatcom County 73 Bellingham ... 226847 2,107 sq mi (5,457 km2) NaN
37 Whitman County 75 Colfax ... 47973 2,159 sq mi (5,592 km2) NaN
38 Yakima County 77 Yakima ... 256728 4,296 sq mi (11,127 km2) NaN
39 Washington (state) 53 Olympia (state capital) ... 7705281 66,544 sq mi (172,350 km2) NaN
[40 rows x 9 columns]
<_csv.reader object at 0x7fd3f8c396d8>
Traceback (most recent call last):
  File "wiki_pandas_df_csv_2_sql2.py", line 25, in <module>
    mycursor.execute('INSERT INTO csv_wiki_counties_States_WA (american_county_name, american_county_fips_code, american_county_est, american_county_formed_from, american_county_etymology, american_county_population_census_2020, american_county_land_area") VALUES("%s", "%s", "%s", "%s", "%s", "%s", "%s", "%s")', row)
  File "/usr/lib/python3/dist-packages/MySQLdb/cursors.py", line 210, in execute
    query = query % args
TypeError: not enough arguments for format string
brandon@FireDragon:~/Python/03_American-Kenpo$
If anyone has an idea why this doesn't work! I would really like to know why! Thank you! :)
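Two things appear to be going wrong here (my reading of the traceback, not tested against your database): first, the CSV was written with `sep='\t'`, but `csv.reader` defaults to commas, so each row comes back as a single field; second, the INSERT lists 7 columns (with a stray quote after `american_county_land_area`) but carries 8 `%s` placeholders, and MySQLdb wants bare, unquoted `%s` markers with the values passed as a separate argument so the driver does the quoting. A sketch of the corrected query construction, parsing one illustrative sample row (the `execute` call itself is left as a comment, since it needs a live connection):

```python
import csv
import io

# The 7 columns of the target MariaDB table
columns = ('american_county_name', 'american_county_fips_code',
           'american_county_est', 'american_county_formed_from',
           'american_county_etymology',
           'american_county_population_census_2020',
           'american_county_land_area')

# One bare %s per column; MySQLdb quotes and escapes the values itself
sql = 'INSERT INTO csv_wiki_counties_States_WA ({}) VALUES ({})'.format(
    ', '.join(columns), ', '.join(['%s'] * len(columns)))
print(sql)

# The CSV was written with sep='\t', so the reader must use delimiter='\t';
# with the default comma delimiter each row comes back as ONE field, which
# is why 8 placeholders saw "not enough arguments for format string"
sample_line = 'Adams County\t1\tRitzville\t1883\tWhitman\tJohn Adams\t20613\t1,925 sq mi\tNaN\n'
row = next(csv.reader(io.StringIO(sample_line), delimiter='\t'))
print(len(row))  # the CSV carries 9 fields per row

# With a live connection, skip the header row and insert the 7 mapped fields:
#   reader = csv.reader(open('wiki_pandas_df_output_csv_to_sql2.csv', newline=''), delimiter='\t')
#   next(reader)
#   for row in reader:
#       mycursor.execute(sql, row[:len(columns)])
#   mydb.commit()
```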
Dec-29-2021, 12:38 AM
(This post was last modified: Dec-29-2021, 12:38 AM by BrandonKastning. Edit Reason: formatting & solution update)
Approach using Pandas to read a CSV into a dataframe and then INSERT it into MariaDB
Source - Blogs/Tutorials:
https://stackoverflow.com/questions/3814...sqlalchemy
Code Modified:
import pandas as pd
import mysql.connector
from sqlalchemy import create_engine

# The engine line was missing from the snippet as originally posted;
# credentials reconstructed from the earlier MySQLdb.connect call
engine = create_engine('mysql+mysqlconnector://brandon:password@localhost/Eggnest')

myd = pd.read_csv('wiki_wash_counties_with_headers.csv')
myd.to_sql(name='csv1', con=engine, if_exists='replace', index=False)
Error Output:
brandon@FireDragon:~/Python/03_American-Kenpo$ python3 pandascsv2sql1.py
Traceback (most recent call last):
  File "pandascsv2sql1.py", line 2, in <module>
    import mysql.connector
  File "/home/brandon/.local/lib/python3.5/site-packages/mysql/connector/__init__.py", line 54, in <module>
    from .connection import MySQLConnection
  File "/home/brandon/.local/lib/python3.5/site-packages/mysql/connector/connection.py", line 451
    f"This connection is using {tls_version} which is now "
                                                          ^
SyntaxError: invalid syntax
Another great looking solution with broken code! :)
SOLVED:
The error is resolved by running the following as the user who executes the Python script (installing as root is not enough; it must also be installed for the lower-privileged user). The underlying issue is that the current mysql-connector release uses f-strings, which Python 3.5 does not support, while the -rf package is an older build:

pip3 install mysql-connector-python-rf
Source for Fix:
https://stackoverflow.com/questions/3287...amed-mysql
New Execute & Output:
brandon@FireDragon:~/Python/03_American-Kenpo$ python3 pandascsv2sql1.py
brandon@FireDragon:~/Python/03_American-Kenpo$
It's a little messy with extra columns (not sure how I managed that), but this code works!
Note: I edited the header columns (the names on row 1) to match the existing table I had created. However, this code creates a brand-new table with one column for each column in the CSV file.
Also, the cell formatting in LibreOffice Calc did not hold for the american_county_fips column, which contains numbers that are supposed to read 001 for 1 and 010 for 10. Even after adding leading zeros and saving the CSV before running the code, MariaDB shows them as plain numeric values. I haven't figured this out yet, as I just got this far!
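The vanishing leading zeros are a type problem: `read_html` parses the FIPS column as integers, and integers have no leading zeros; the CSV round-trip through Calc then re-parses them the same way. One fix is to force the column to zero-padded strings before writing, and to store it as a CHAR/VARCHAR rather than a numeric type in MariaDB. A small sketch (column name stands in for whichever label the scraped frame actually uses):

```python
import pandas as pd

# Stand-in for the scraped frame; the real one comes from pd.read_html(url)
df = pd.DataFrame({'County': ['Adams County', 'Franklin County'],
                   'FIPS code': [1, 21]})

# Cast to string and left-pad to 3 digits, so 1 -> '001' and 21 -> '021';
# as text, the values survive to_csv and to_sql without re-parsing
df['FIPS code'] = df['FIPS code'].astype(str).str.zfill(3)
print(df['FIPS code'].tolist())
```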
Thank you again everyone for this forum!
Best Regards,
Brandon Kastning
"In The Trenches, Taking Grenades!"