Posts: 3
Threads: 1
Joined: Jun 2019
Hi,
I’m a newbie in programming and web scraping.
I got this assignment:
Quote:wikipedia web site: link
From the link above, transform the table "Sovereign states and dependencies by population" into a pandas DataFrame with the following columns (choose the corresponding data type and be careful with the right index!):
• Rank: (Index) - int
• Country name: - object
• Population - int
• Date - Datetime
• % of world population - int
What I’ve done so far:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
t = requests.get(url_cntr)
html_content = t.text
html_soup = BeautifulSoup(html_content, 'html.parser')
html_soup.text
sover = []
len(sover)
The output of len(sover) is: 0
import requests
url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
t = requests.get(url_cntr)
html_content = t.text
html_soup = BeautifulSoup(html_content, 'html.parser')
sover = []
sov_tables = html_soup.find_all('table', class_='jquery-tablesorter')
for table in sov_tables[0]:
    headers = []
    rows = table.find_all('tr')
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)
    for row in rows[1:]:
        values = []
        for col in row.find_all(['th', 'td']):
            values.append(col.text)
        if values:
            cntr_dict = {headers[i]: values[i] for i in range(len(values))}
            cntr.append(cntr_dict)
I got this error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-25-b2259dadb770> in <module>
----> 1 for table in sov_tables[0]:
2 headers = []
3 rows = table.find_all('tr')
4 for header in table.find('tr').find_all('th'):
5 headers.append(header.text)
IndexError: list index out of range
What am I doing wrong?
Thanks in advance for your help!
Posts: 7,320
Threads: 123
Joined: Sep 2016
Jun-27-2019, 12:47 PM
(This post was last modified: Jun-27-2019, 12:47 PM by snippsat.)
You are a little on the wrong track when you start to parse the table manually.
The jquery-tablesorter class is added by JavaScript in the browser; requests only sees the raw HTML, where the table's class is wikitable sortable, so find_all() returns an empty list and sov_tables[0] raises the IndexError.
Change your first code to this:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
t = requests.get(url_cntr)
html_content = t.content
html_soup = BeautifulSoup(html_content, 'html.parser')
sov_tables = html_soup.find('table', class_="wikitable sortable")
At this point you have the table you need in sov_tables.
Now you want to use pandas with pd.read_html(); in this case you could also have used that method alone to get the table and skipped all the rest.
To bring sov_tables into pandas, it needs to be a file object or a string, so you can use str():
df = pd.read_html(str(sov_tables))
df = df[0]
df
It also gets easier if you use Jupyter Notebook; then the table looks nice too.
![[Image: VM444n.jpg]](https://imagizer.imageshack.com/v2/xq90/924/VM444n.jpg)
At this point you can continue with the task; you need to convert the columns to the right formats (int, datetime, etc.).
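As a side note, pd.read_html() can do the whole fetch on its own, as mentioned above. A minimal offline sketch (the inline HTML string stands in for the Wikipedia page so the snippet runs without a network connection; with the real page you would pass url_cntr instead):

```python
from io import StringIO

import pandas as pd

# pd.read_html() parses every <table> it finds and returns a list of
# DataFrames; it accepts a URL, a file-like object, or an HTML string.
html = """
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Country</th><th>Population</th></tr>
  <tr><td>1</td><td>China</td><td>1400000000</td></tr>
  <tr><td>2</td><td>India</td><td>1360000000</td></tr>
</table>
"""
df = pd.read_html(StringIO(html))[0]
print(df.dtypes)  # Population parses straight to int64
```

Numeric cells are converted automatically, which is why the manual header/row loop from the first attempt is unnecessary.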
Posts: 3
Threads: 1
Joined: Jun 2019
Jun-28-2019, 03:02 PM
(This post was last modified: Jun-28-2019, 03:02 PM by flow50.)
Thanks for your help!
I've 'cleaned' the table a little, but I'm struggling with the datatype conversion:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
t = requests.get(url_cntr)
html_content = t.content
html_soup = BeautifulSoup(html_content, 'html.parser')
sov_tables = html_soup.find('table', class_="wikitable sortable")
df0 = pd.read_html(str(sov_tables))
df0 = df0[0]
df0.head()
df1 = df0.drop("Source", axis=1)
df1.head()
df2 = df1.rename(columns={'Country(or dependent territory)': 'Country name', '% of worldpopulation': '% of world population'})
df2.dtypes
When I want to change object to int:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
t = requests.get(url_cntr)
html_content = t.content
html_soup = BeautifulSoup(html_content, 'html.parser')
sov_tables = html_soup.find('table', class_="wikitable sortable")
df0 = pd.read_html(str(sov_tables))
df0 = df0[0]
df0.head()
df1 = df0.drop("Source", axis=1)
df1.head()
df2 = df1.rename(columns={'Country(or dependent territory)': 'Country name', '% of worldpopulation': '% of world population'})
df3 = df2.astype({"Rank": int, "% of world population": int})
I get this:
Quote:ValueError: invalid literal for int() with base 10: '–'
When I want to change object to datetime:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
t = requests.get(url_cntr)
html_content = t.content
html_soup = BeautifulSoup(html_content, 'html.parser')
sov_tables = html_soup.find('table', class_="wikitable sortable")
df0 = pd.read_html(str(sov_tables))
df0 = df0[0]
df0.head()
df1 = df0.drop("Source", axis=1)
df1.head()
df2 = df1.rename(columns={'Country(or dependent territory)': 'Country name', '% of worldpopulation': '% of world population'})
df3['Date'] = df2.to_datetime(df2['Date'], format='%d/%m/%Y')
I get this:
Quote:AttributeError: 'DataFrame' object has no attribute 'to_datetime'
I really don't know what's wrong this time.
Posts: 7,320
Threads: 123
Joined: Sep 2016
Jun-28-2019, 05:22 PM
(This post was last modified: Jun-28-2019, 06:00 PM by snippsat.)
If you look at the Rank column (df2['Rank']), you see that not all countries get a rank number; some only show '–'.
This is correct, as you see the same on the website.
Change the pd.read_html() line to this:
df0 = pd.read_html(str(sov_tables), na_values='–')
Then do df2.dtypes:
Output: Rank float64
Country name object
Population int64
Date object
% of world population object
dtype: object
No int because of the NaN values.
If you need int, you must first drop the NaN with dropna().
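The float64 comes from NumPy itself: its integer arrays have no slot for a missing value, so a single NaN upcasts the whole column. A minimal sketch of the two ways out (the nullable 'Int64' dtype is available in pandas 0.24+):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4])
print(s.dtype)  # float64 -- the NaN forced the upcast

# Option 1: drop the missing rows first, then cast down to plain int
clean = s.dropna().astype(int)

# Option 2: pandas' nullable integer dtype keeps NaN and ints together
nullable = s.astype('Int64')
print(nullable.dtype)  # Int64
```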
df2['Rank'] = df2['Rank'].astype('int')
For Date, do this:
df2['Date'] = pd.to_datetime(df2['Date'])
Output: Rank int32
Country name object
Population int64
Date datetime64[ns]
% of world population object
One to go.
Posts: 3
Threads: 1
Joined: Jun 2019
Hi, thanks again for your help.
I've 'cleaned' the data.
import pandas as pd
import requests
from bs4 import BeautifulSoup
url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
t = requests.get(url_cntr)
html_content = t.content
html_soup = BeautifulSoup(html_content, 'html.parser')
sov_tables = html_soup.find('table', class_="wikitable sortable")
df0 = pd.read_html(str(sov_tables), na_values='–')
df0 = df0[0]
df0.head()
df1 = df0.drop("Source", axis=1)
df1.head()
df2 = df1.rename(columns={'Country(or dependent territory)': 'Country name', '% of worldpopulation': '% of world population'})
df2['Rank'] = df2['Rank'].dropna()
df2['Rank'] = df2['Rank'].fillna(0)
df2['Rank'] = df2['Rank'].astype(int)
df2['Date'] = pd.to_datetime(df2['Date'])
df2['% of world population'] = df2['% of world population'].str.rstrip('%')
df2['% of world population'] = df2['% of world population'].astype(float)
df2['% of world population'] = df2['% of world population'].astype(int)
df2['Country name'] = df2['Country name'].replace({'\[Note 2]':'','\[Note 3]':'','\[Note 4]':'', '\[Note 5]':'', '\[Note 6]':'', '\[Note 7]':'', '\[Note 8]':'', '\[Note 9]':'', '\[Note 10]':'', '\[Note 11]':'', '\[Note 12]':'', '\[Note 13]':'','\[Note 14]':'', '\[Note 15]':'', '\[Note 16]':'', '\[Note 17]':'', '\[Note 18]':'', '\[Note 19]':'', '\[Note 20]':'', '\[Note 21]':'', '\[Note 22]':''}, regex = True)
df2.tail()
However, I have 2 more questions:
#1 How can I change (in 'Rank'):
*index 221: 0 to the number 191?
*indexes 223, 224, 225: 0 to the number 192?
etc.
screenshot:
#2 It's about line no. 23.
Is there any shorter/more elegant way? I went to the original website, checked and counted all the '[Note X]' markers, and put all of them into the code manually.
Posts: 7,320
Threads: 123
Joined: Sep 2016
(Jul-01-2019, 12:16 PM)flow50 Wrote: #1 How can I change (in 'Rank'):
*index 221: 0 to the number 191?
*indexes 223, 224, 225: 0 to the number 192?
Try using df.ffill():
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan, 6],
'B': [200, np.nan, np.nan, 201, 202, 203]})
df
A B
0 1.0 200.0
1 2.0 NaN
2 NaN NaN
3 4.0 201.0
4 NaN 202.0
5 6.0 203.0
df.ffill()
A B
0 1.0 200.0
1 2.0 200.0
2 2.0 200.0
3 4.0 201.0
4 4.0 202.0
5 6.0 203.0
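Applied to the thread's actual columns, both open questions fit in a few lines. A minimal sketch on a stand-in frame (the data below only mimics the scraped table's shape, with NaN where the site shows '–' and the footnote markers still attached):

```python
import numpy as np
import pandas as pd

# Stand-in for the scraped df2: tied countries share one rank, which
# arrives as NaN after na_values='–'
df2 = pd.DataFrame({
    'Rank': [190.0, 191.0, np.nan, 192.0, np.nan, np.nan],
    'Country name': ['A[Note 2]', 'B', 'C[Note 13]', 'D', 'E', 'F'],
})

# #1: forward-fill carries the last seen rank into the NaN rows
#     (instead of fillna(0)), then the column can safely become int
df2['Rank'] = df2['Rank'].ffill().astype(int)

# #2: one regex with \d+ replaces the long hand-written mapping
df2['Country name'] = df2['Country name'].str.replace(
    r'\[Note \d+\]', '', regex=True)
```

The regex matches any footnote number, so it keeps working if Wikipedia adds or renumbers notes.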