Scraping Columns with Pandas (Column Entries w/ more than 1 word writes two columns)

BrandonKastning · Jan-09-2022, 05:47 AM

Scraping Columns with Pandas (Column Entries w/ more than 1 word writes two columns)

My code is the following:

1

2

3

4

5

6

7

8

9

10

import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_counties_in_Alabama"
tables = pd.read_html(url) 
tables[1]
  
print(tables[1])
  
df = tables[1]
  
df.to_csv('AL_Counties.csv', sep='\t', encoding='utf-8', index=False)

Using Pandas,

Once the file is written; which works great. Any entries in any column; for instance as shown in the following screen shots in Libre Calc is writing the names into multiple columns. This makes my obstacles difficult. How do I remedy this?

Screenshot #1:

Screenshot #2:

Thank you everyone for this forum!

***snippsat*** · Jan-09-2022, 01:54 PM

(Jan-09-2022, 05:47 AM)BrandonKastning Wrote: for instance as shown in the following screen shots in Libre Calc

Don't take the detour trough df.to_csv.
Pandas can save to .odf format,need odfpy and openpyxl installed.
Then you get same look as DataFrame show.

        
              pip install odfpy openpyxl
# Or if use Anaconda
conda install odfpy openpyxl

Can also delete Map column.

1

2

3

4

5

6

7

8

9

10

import pandas as pd
 
url = "https://en.wikipedia.org/wiki/List_of_counties_in_Alabama"
tables = pd.read_html(url)  
df = tables[1]
df = df.drop('Map', axis=1)
 
df.to_excel("wiki.ods", index=False, engine="odf") 
# To Excel would be
df.to_excel('wiki.xlsx', index=False)

BrandonKastning · Jan-09-2022, 09:35 PM

snippsat,

Thank you! This worked wonderfully! Much better formatting!

Much obliged! :)

Best Regards,

Brandon Kastning

(Jan-09-2022, 01:54 PM)snippsat Wrote:
(Jan-09-2022, 05:47 AM)BrandonKastning Wrote: for instance as shown in the following screen shots in Libre Calc
Don't take the detour trough df.to_csv.
Pandas can save to .odf format,need odfpy and openpyxl installed.
Then you get same look as DataFrame show.

1
2
3

pip install odfpy openpyxl
# Or if use Anaconda
conda install odfpy openpyxl

Can also delete Map column.

1
2
3
4
5
6
7
8
9
10

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_counties_in_Alabama"
tables = pd.read_html(url)
df = tables[1]
df = df.drop('Map', axis=1)

df.to_excel("wiki.ods", index=False, engine="odf")
# To Excel would be
df.to_excel('wiki.xlsx', index=False)

BrandonKastning · Jan-09-2022, 09:52 PM

Should I open a new thread?

How to download the map images and store them (Either DB or Local) ?

***snippsat*** · Jan-10-2022, 06:52 PM

(Jan-09-2022, 09:52 PM)BrandonKastning Wrote: Should I open a new thread?

It's part of same task,so no problem.

(Jan-09-2022, 09:52 PM)BrandonKastning Wrote: How to download the map images and store them (Either DB or Local) ?

You have to give it try,now need to use more common scraping tool.
Here a demo how to start.

1

2

3

4

5

6

7

8

9

10

11

12

13

import requests
from bs4 import BeautifulSoup
 
url = "https://en.wikipedia.org/wiki/List_of_counties_in_Alabama"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.find('h1').text)
 
# First image
img = soup.find_all('a', class_="image")
img_link= img[0].find('img').get('src')
img_link = img_link.replace('//', 'http://')
print(img_link)

Output:List of counties in Alabama
http://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Map_of_Alabama_highlighting_Autauga_County.svg/75px-Map_of_Alabama_highlighting_Autauga_County.svg.png

BrandonKastning

snippsat,

Thank you for this! This is a great point for me to start regarding images and I believe since you wrote the comment "First Image" that I will need to learn loops. I will be coming back to this.

In the meantime, I ran into some troubles with wikipedia panda table scraping. I changed from Counties to "Municipalities" and regardless of the tables[0], tables[1], tables[2] result in all the wrong data displayed on the wikipedia article.

Code is as follows:

1

2

3

4

5

6

7

import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_municipalities_in_Alabama"
tables = pd.read_html(url) 
 
df = tables[1]
 
df.to_excel("AL_Alabama_Municipalities.ods", index=False, engine="odf")

Thank you again for this forum! How do I determine the tables[#]? Is it a guessing game or is is there an attribute or property within the browser code that could aid me in finding the correct tables[#]?

Best Regards,

Brandon Kastning

(Jan-10-2022, 06:52 PM)snippsat Wrote:
(Jan-09-2022, 09:52 PM)BrandonKastning Wrote: Should I open a new thread?
It's part of same task,so no problem.

(Jan-09-2022, 09:52 PM)BrandonKastning Wrote: How to download the map images and store them (Either DB or Local) ?
You have to give it try,now need to use more common scraping tool.
Here a demo how to start.

1
2
3
4
5
6
7
8
9
10
11
12
13

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_counties_in_Alabama"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.find('h1').text)

# First image
img = soup.find_all('a', class_="image")
img_link= img[0].find('img').get('src')
img_link = img_link.replace('//', 'http://')
print(img_link)
Output:List of counties in Alabama
http://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Map_of_Alabama_highlighting_Autauga_County.svg/75px-Map_of_Alabama_highlighting_Autauga_County.svg.png

***snippsat*** · (This post was last modified: Jan-13-2022, 02:46 PM by snippsat.)

(Jan-13-2022, 04:56 AM)BrandonKastning Wrote: Thank you again for this forum! How do I determine the tables[#]? Is it a guessing game or is is there an attribute or property within the browser code that could aid me in finding the correct tables[#]?

A web site can have many tables,so have to look at site(count) or test out like tables[0], tables[1],tables[6].... and see if get wanted result.

There is match in pandas.read_html that can use string or regex to match something i table wanted.
Example Timeline of programming languages ,let say we want Python table we can match name Guido van Rossum.

1

2

3

4

import pandas as pd
  
df = pd.read_html('https://en.wikipedia.org/wiki/Timeline_of_programming_languages', match='Guido van Rossum')
df[0].head(13)

Output:	Year	Name	Chief developer, company	Predecessor(s)
0	1990	Sather	Steve Omohundro	Eiffel
1	1990	AMOS BASIC	François Lionet and Constantin Sotiropoulos	STOS BASIC
2	1990	AMPL	Robert Fourer, David Gay and Brian Kernighan a...	NaN
3	1990	Object Oberon	H Mössenböck, J Templ, R Griesemer	Oberon
4	1990	J	Kenneth E. Iverson, Roger Hui at Iverson Software	APL, FP
5	1990	Haskell	NaN	Miranda
6	1990	EuLisp	NaN	Common Lisp, Scheme
7	1990	Z Shell (zsh)	Paul Falstad at Princeton University	ksh
8	1991	GNU E	David J. DeWitt, Michael J. Carey	C++
9	1991	Oberon-2	Hanspeter Mössenböck, Wirth	Object Oberon
10	1991	Oz	Gert Smolka and his students	Prolog
11	1991	Q	Albert Gräf	NaN
12	1991	Python	Guido van Rossum	ABC, C

So if a match it will always be df[0].
Without match it would be table 9:

        
              df[9].head(13)

BrandonKastning

snippsat,

Thank you for this new knowledge and sharing this code. Wonderful example!

Without applying your teaching to my code (yet); I found a work around that managed to pull the data in.

I disabled:

        
              index=False

Flag/Parameter.

Then I had to do the following using Libre Calc

Once disabled; Python picked up the correct table with tables[0] (Strange; I am still unsure what index=False truly does at this time).

The output against the .ods it generated an additional row of header names. One that is different from another.

I tried removing the duplicate row w/ 3 column headings that were under a single header name. Libre Calc gave me an error; so I decided to try copy and paste and the following worked great for a CSV save in 7 steps.

Default Settings on Save As Dialogs were used and worked fine!

[Image: Screenshot-from-2022-01-13-14-27-01.png]

then I use step2.py as the Payload after Manually Naming the Column Headers in Libre Calc!

step1.py code (to generate the .ods):

1

2

3

4

5

6

7

8

9

import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_municipalities_in_Alabama"
tables = pd.read_html(url) 
 
df = tables[0]
#df = df.drop('Map', axis=1)
 
#df.to_excel("AL_Alabama_Cities.ods", index=False, engine="odf")
df.to_excel("AL_Alabama_Cities.ods", engine="odf")

step2.py code:

1

2

3

4

5

6

7

8

9

import pandas as pd
import mysql.connector
from sqlalchemy import create_engine
 
myd = pd.read_csv('AL_Alabama_Cities.CSV.csv')
 
engine = create_engine('mysql+mysqlconnector://brandon:password@127.0.0.1/Exodus_J3x_Dev_Bronson')
 
myd.to_sql(name='AL_Cities_CSV', con=engine, if_exists='replace', index=False)

Thank you again snippsat and everyone for this forum time/expertise!

Best Regards,

Brandon Kastning

(Jan-13-2022, 02:46 PM)snippsat Wrote:
(Jan-13-2022, 04:56 AM)BrandonKastning Wrote: Thank you again for this forum! How do I determine the tables[#]? Is it a guessing game or is is there an attribute or property within the browser code that could aid me in finding the correct tables[#]?
A web site can have many tables,so have to look at site(count) or test out like tables[0], tables[1],tables[6].... and see if get wanted result.

There is match in pandas.read_html that can use string or regex to match something i table wanted.
Example Timeline of programming languages ,let say we want Python table we can match name Guido van Rossum.

1
2
3
4

import pandas as pd

df = pd.read_html('https://en.wikipedia.org/wiki/Timeline_of_programming_languages', match='Guido van Rossum')
df[0].head(13)
Output:	Year	Name	Chief developer, company	Predecessor(s)
0	1990	Sather	Steve Omohundro	Eiffel
1	1990	AMOS BASIC	François Lionet and Constantin Sotiropoulos	STOS BASIC
2	1990	AMPL	Robert Fourer, David Gay and Brian Kernighan a...	NaN
3	1990	Object Oberon	H Mössenböck, J Templ, R Griesemer	Oberon
4	1990	J	Kenneth E. Iverson, Roger Hui at Iverson Software	APL, FP
5	1990	Haskell	NaN	Miranda
6	1990	EuLisp	NaN	Common Lisp, Scheme
7	1990	Z Shell (zsh)	Paul Falstad at Princeton University	ksh
8	1991	GNU E	David J. DeWitt, Michael J. Carey	C++
9	1991	Oberon-2	Hanspeter Mössenböck, Wirth	Object Oberon
10	1991	Oz	Gert Smolka and his students	Prolog
11	1991	Q	Albert Gräf	NaN
12	1991	Python	Guido van Rossum	ABC, C
So if a match it will always be df[0].
Without match it would be table 9:

1

df[9].head(13)

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Scraping Wikipedia Article (Name in 1 column & URL in 2nd column) ->CSV! Anyone?	BrandonKastning	4	3,097	Jan-27-2022, 04:36 AM Last Post: Larz60+
	Python3 + BeautifulSoup4 + lxml (HTML -> CSV) - How to write 3 Columns to MariaDB?	BrandonKastning	21	11,296	Mar-23-2020, 05:51 PM Last Post: ndc85430
	Display blog posts in two columns	saladgg	3	4,280	Dec-28-2018, 05:17 AM Last Post: saladgg

Scraping Columns with Pandas (Column Entries w/ more than 1 word writes two columns)

User Panel Messages

Announcements