Python Forum
table from wikipedia
#1
Hi,
I’m a newbie in programming and web scraping.

I got this assignment:

Quote: Wikipedia website: link

From the link above, transform the table "Sovereign states and dependencies by population" into a pandas DataFrame with the following columns (choose the corresponding data types and be careful with the right index!)

• Rank: (Index) - int
• Country name: - object
• Population - int
• Date - Datetime

• % of world population - int

What I’ve done so far:

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
t = requests.get(url_cntr)
html_content = t.text
html_soup = BeautifulSoup(html_content, 'html.parser')
sover = []
len(sover)
Output: 0

url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
t = requests.get(url_cntr)
html_content = t.text
html_soup = BeautifulSoup(html_content, 'html.parser')
sover = []
sov_tables = html_soup.find_all('table', class_='jquery-tablesorter')
for table in sov_tables[0]:
    headers = []
    rows = table.find_all('tr')
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)
    for row in rows[1:]:
        values = []
        for col in row.find_all(['th', 'td']):
            values.append(col.text)
        if values:
            cntr_dict = {headers[i]: values[i] for i in range(len(values))}
            sover.append(cntr_dict)
I got this error:
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-25-b2259dadb770> in <module>
----> 1 for table in sov_tables[0]:
      2     headers = []
      3     rows = table.find_all('tr')
      4     for header in table.find('tr').find_all('th'):
      5         headers.append(header.text)

IndexError: list index out of range
What am I doing wrong?

Thanks in advance for your help!
#2
You went a little off track when you started to parse the table manually. The class jquery-tablesorter is only added by JavaScript in the browser; the raw HTML that requests downloads carries class="wikitable sortable", which is why find_all() returned an empty list and sov_tables[0] raised an IndexError.
Change your first code to this:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
t = requests.get(url_cntr)
html_content = t.content
html_soup = BeautifulSoup(html_content, 'html.parser')
sov_tables = html_soup.find('table', class_="wikitable sortable")
At this point you have the table you need in sov_tables.
Now you want to use pandas with pd.read_html(); in this case you could also have used only that method to get the table and skipped the manual parsing altogether.

To bring sov_tables into pandas, it needs to be a file object or a string, so you can use str():
df = pd.read_html(str(sov_tables))
df = df[0]
df
It also gets easier if you use Jupyter Notebook; the table renders nicely there.
[Image: VM444n.jpg]
At this point you can continue with the task: converting the columns to the right data types (int, datetime, etc.).
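As a minimal sketch of that idea (with a small hypothetical table standing in for the Wikipedia markup), pd.read_html() parses <table> HTML directly and returns a list of DataFrames:

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row table standing in for the Wikipedia page.
html = """
<table>
  <tr><th>Rank</th><th>Country</th><th>Population</th></tr>
  <tr><td>1</td><td>China</td><td>1400000000</td></tr>
  <tr><td>2</td><td>India</td><td>1370000000</td></tr>
</table>
"""

# read_html() returns a list with one DataFrame per <table> found.
df = pd.read_html(StringIO(html))[0]
print(df)
```

The <th> cells in the first row become the column headers automatically.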
#3
Thanks for your help!
I've cleaned up the table a little, but I'm struggling with the data type conversion:

import pandas as pd
import requests
from bs4 import BeautifulSoup
 
url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
t = requests.get(url_cntr)
html_content = t.content
html_soup = BeautifulSoup(html_content, 'html.parser')
sov_tables = html_soup.find('table', class_="wikitable sortable")
df0 = pd.read_html(str(sov_tables))
df0 = df0[0]
df0.head()
df1 = df0.drop("Source", axis=1)
df1.head()
df2 = df1.rename(columns={'Country(or dependent territory)': 'Country name', '% of worldpopulation': '% of world population'})
df2.dtypes
When I want to change object to int:
import pandas as pd
import requests
from bs4 import BeautifulSoup
 
url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
t = requests.get(url_cntr)
html_content = t.content
html_soup = BeautifulSoup(html_content, 'html.parser')
sov_tables = html_soup.find('table', class_="wikitable sortable")
df0 = pd.read_html(str(sov_tables))
df0 = df0[0]
df0.head()
df1 = df0.drop("Source", axis=1)
df1.head()
df2 = df1.rename(columns={'Country(or dependent territory)': 'Country name', '% of worldpopulation': '% of world population'})

df3 = df2.astype({"Rank": int, "% of world population": int})
I get this:
Quote:ValueError: invalid literal for int() with base 10: '–'



When I want to change object to datetime:

import pandas as pd
import requests
from bs4 import BeautifulSoup
 
url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
t = requests.get(url_cntr)
html_content = t.content
html_soup = BeautifulSoup(html_content, 'html.parser')
sov_tables = html_soup.find('table', class_="wikitable sortable")
df0 = pd.read_html(str(sov_tables))
df0 = df0[0]
df0.head()
df1 = df0.drop("Source", axis=1)
df1.head()
df2 = df1.rename(columns={'Country(or dependent territory)': 'Country name', '% of worldpopulation': '% of world population'})

df3['Date'] = df2.to_datetime(df2['Date'], format= '%d/%m/%Y')
I get this:
Quote:AttributeError: 'DataFrame' object has no attribute 'to_datetime'

I really don't know what's wrong this time.
#4
If you look at the Rank column, df2['Rank'], you'll see that not all countries get a rank number; some show – instead.
This is correct, as you see the same on the website.
Change the pd.read_html() line to this:
df0 = pd.read_html(str(sov_tables), na_values='–')
Then run df2.dtypes:
Output:
Rank                     float64
Country name              object
Population                 int64
Date                      object
% of world population     object
dtype: object
It's not int because of the NaN values.
If you need int, you must first drop the NaN rows with dropna().
df2['Rank'] = df2['Rank'].astype('int')
For Date do this.
df2['Date'] = pd.to_datetime(df2['Date'])
Output:
Rank                       int32
Country name              object
Population                 int64
Date              datetime64[ns]
% of world population     object
One to go.
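The pd.to_datetime() conversion can be sketched in isolation with a couple of hypothetical dates in the style the Wikipedia table uses:

```python
import pandas as pd

# Hypothetical dates in the 'day month-name year' style of the table.
dates = pd.Series(['1 Jul 2019', '26 Jun 2019'])

# pd.to_datetime() infers the format and returns datetime64[ns].
parsed = pd.to_datetime(dates)
print(parsed.dtype)
```

Because the format is inferred, no explicit format= string is needed here.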
#5
Hi, thanks again for your help.

I've 'cleaned' the data.

import pandas as pd
import requests
from bs4 import BeautifulSoup

url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
t = requests.get(url_cntr)
html_content = t.content
html_soup = BeautifulSoup(html_content, 'html.parser')
sov_tables = html_soup.find('table', class_="wikitable sortable")
df0 = pd.read_html(str(sov_tables), na_values='–')
df0 = df0[0]
df0.head()
df1 = df0.drop("Source", axis=1)
df1.head()
df2 = df1.rename(columns={'Country(or dependent territory)': 'Country name', '% of worldpopulation': '% of world population'})
df2['Rank'] = df2['Rank'].dropna()
df2['Rank'] = df2['Rank'].fillna(0)
df2['Rank'] = df2['Rank'].astype(int)
df2['Date'] = pd.to_datetime(df2['Date'])
df2['% of world population'] = df2['% of world population'].str.rstrip('%') 
df2['% of world population'] = df2['% of world population'].astype(float)
df2['% of world population'] = df2['% of world population'].astype(int)
df2['Country name'] = df2['Country name'].replace({'\[Note 2]':'','\[Note 3]':'','\[Note 4]':'', '\[Note 5]':'', '\[Note 6]':'', '\[Note 7]':'', '\[Note 8]':'', '\[Note 9]':'', '\[Note 10]':'', '\[Note 11]':'', '\[Note 12]':'', '\[Note 13]':'','\[Note 14]':'', '\[Note 15]':'', '\[Note 16]':'', '\[Note 17]':'', '\[Note 18]':'', '\[Note 19]':'', '\[Note 20]':'', '\[Note 21]':'', '\[Note 22]':''}, regex = True)
df2.tail()
However, I have 2 more questions:

#1 How can I change (in 'Rank'):
*index 221: 0 to the number 191?
*indexes 223, 224, 225: 0 to the number 192?
etc.

screenshot:
[Image: jqg1sau.png]

#2 It's about line no. 23 (the 'Country name' replace).
Is there a shorter/more elegant way? I went to the original website, checked and counted all the '[Note X]' markers, and put each of them into the code by hand.
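A sketch of a shorter alternative: a single regular expression with \d+ matches every '[Note <number>]' footnote, so they don't have to be listed one by one (the sample names below are hypothetical):

```python
import pandas as pd

# Hypothetical country names carrying Wikipedia footnote markers.
names = pd.Series(['China[Note 2]', 'India', 'Monaco[Note 21]'])

# One pattern with \d+ covers every numbered footnote at once.
cleaned = names.str.replace(r'\[Note \d+\]', '', regex=True)
print(cleaned.tolist())
```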
#6
(Jul-01-2019, 12:16 PM)flow50 Wrote: #1 How can I change (in 'Rank'):
*index 221: 0 to the number 191?
*indexes 223, 224, 225: 0 to the number 192?
Try using df.ffill().
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan, 6],
                   'B': [200, np.nan, np.nan, 201, 202, 203]})

df

     A      B
0  1.0  200.0
1  2.0    NaN
2  NaN    NaN
3  4.0  201.0
4  NaN  202.0
5  6.0  203.0

df.ffill()

     A      B
0  1.0  200.0
1  2.0  200.0
2  2.0  200.0
3  4.0  201.0
4  4.0  202.0
5  6.0  203.0
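Applied to the Rank column from the earlier post, there is one wrinkle: that code already turned the NaN values into 0 with fillna(0), so those zeros have to become NaN again before ffill() can do its work. A sketch with hypothetical rank values, assuming forward-fill from the previous ranked row is the desired behavior:

```python
import numpy as np
import pandas as pd

# Hypothetical Rank column where unranked rows were filled with 0.
rank = pd.Series([190, 191, 0, 192, 0, 0])

# Turn the zeros back into NaN, forward-fill from the previous
# ranked row, then cast back to int.
rank = rank.replace(0, np.nan).ffill().astype(int)
print(rank.tolist())
```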

