Posts: 3
Threads: 1
Joined: Jun 2019
Hi,
I’m a newbie in programming and web scraping.
I got this assignment:
Quote:wikipedia web site: link
From the link above, transform the table "Sovereign states and dependencies by population" into a pandas DataFrame with the following columns (choose the corresponding data type and be careful with the right index!):
• Rank: (Index) - int
• Country name: - object
• Population - int
• Date - Datetime
• % of world population - int
What I’ve done so far:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
t = requests.get(url_cntr)
html_content = t.text
html_soup = BeautifulSoup(html_content, 'html.parser')
html_soup.text
sover = []
len(sover)
The output of len(sover) is: 0
import requests
url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
t = requests.get(url_cntr)
html_content = t.text
html_soup = BeautifulSoup(html_content, 'html.parser')
sover = []
sov_tables = html_soup.find_all('table', class_='jquery-tablesorter')
for table in sov_tables[0]:
    headers = []
    rows = table.find_all('tr')
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)
    for row in rows[1:]:
        values = []
        for col in row.find_all(['th', 'td']):
            values.append(col.text)
        if values:
            cntr_dict = {headers[i]: values[i] for i in range(len(values))}
            cntr.append(cntr_dict)
I got this error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-25-b2259dadb770> in <module>
----> 1 for table in sov_tables[0]:
2 headers = []
3 rows = table.find_all('tr')
4 for header in table.find('tr').find_all('th'):
5 headers.append(header.text)
IndexError: list index out of range
What am I doing wrong?
Thanks in advance for your help!
Posts: 7,320
Threads: 123
Joined: Sep 2016
Jun-27-2019, 12:47 PM
(This post was last modified: Jun-27-2019, 12:47 PM by snippsat.)
You are a little on the wrong track when you start to parse the table manually.
The jquery-tablesorter class is added by JavaScript in the browser; requests only sees the raw HTML, where the table's class is wikitable sortable, so find_all() returns an empty list and sov_tables[0] raises the IndexError.
Change your first code to this:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
t = requests.get(url_cntr)
html_content = t.content
html_soup = BeautifulSoup(html_content, 'html.parser')
sov_tables = html_soup.find('table', class_="wikitable sortable")
At this point you have the table you need in sov_tables.
Now you want to use pandas with pd.read_html(); in this case you could also have used that method alone to get the table and skipped all the rest.
To bring sov_tables into pandas, it needs to be a file object or a string, so you can use str():
df = pd.read_html(str(sov_tables))
df = df[0]
df
It also gets easier if you use Jupyter Notebook; then the table looks nice too.
![[Image: VM444n.jpg]](https://imagizer.imageshack.com/v2/xq90/924/VM444n.jpg)
At this point you can continue with the task; you need to convert the columns to the right formats (int, datetime, etc.).
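As a side note, pd.read_html() can do the whole fetch on its own, as mentioned above. A minimal offline sketch (the inline HTML string stands in for the Wikipedia page so the snippet runs without a network connection; with the real page you would pass url_cntr instead):

```python
from io import StringIO

import pandas as pd

# pd.read_html() parses every <table> it finds and returns a list of
# DataFrames; it accepts a URL, a file-like object, or an HTML string.
html = """
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Country</th><th>Population</th></tr>
  <tr><td>1</td><td>China</td><td>1400000000</td></tr>
  <tr><td>2</td><td>India</td><td>1360000000</td></tr>
</table>
"""
df = pd.read_html(StringIO(html))[0]
print(df.dtypes)  # Population parses straight to int64
```

Numeric cells are converted automatically, which is why the manual header/row loop from the first attempt is unnecessary.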
Posts: 3
Threads: 1
Joined: Jun 2019
Jun-28-2019, 03:02 PM
(This post was last modified: Jun-28-2019, 03:02 PM by flow50.)
Thanks for your help!
I've 'cleaned' the table a little, but I'm struggling with the datatype conversion:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
t = requests.get(url_cntr)
html_content = t.content
html_soup = BeautifulSoup(html_content, 'html.parser')
sov_tables = html_soup.find('table', class_="wikitable sortable")
df0 = pd.read_html(str(sov_tables))
df0 = df0[0]
df0.head()
df1 = df0.drop("Source", axis=1)
df1.head()
df2 = df1.rename(columns={'Country(or dependent territory)': 'Country name', '% of worldpopulation': '% of world population'})
df2.dtypes
When I want to change object to int:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
t = requests.get(url_cntr)
html_content = t.content
html_soup = BeautifulSoup(html_content, 'html.parser')
sov_tables = html_soup.find('table', class_="wikitable sortable")
df0 = pd.read_html(str(sov_tables))
df0 = df0[0]
df0.head()
df1 = df0.drop("Source", axis=1)
df1.head()
df2 = df1.rename(columns={'Country(or dependent territory)': 'Country name', '% of worldpopulation': '% of world population'})
df3 = df2.astype({"Rank": int, "% of world population": int})
I get this:
Quote:ValueError: invalid literal for int() with base 10: '–'
When I want to change object to datetime:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
t = requests.get(url_cntr)
html_content = t.content
html_soup = BeautifulSoup(html_content, 'html.parser')
sov_tables = html_soup.find('table', class_="wikitable sortable")
df0 = pd.read_html(str(sov_tables))
df0 = df0[0]
df0.head()
df1 = df0.drop("Source", axis=1)
df1.head()
df2 = df1.rename(columns={'Country(or dependent territory)': 'Country name', '% of worldpopulation': '% of world population'})
df3['Date'] = df2.to_datetime(df2['Date'], format='%d/%m/%Y')
I get this:
Quote:AttributeError: 'DataFrame' object has no attribute 'to_datetime'
I really don't know what's wrong this time.
Posts: 7,320
Threads: 123
Joined: Sep 2016
Jun-28-2019, 05:22 PM
(This post was last modified: Jun-28-2019, 06:00 PM by snippsat.)
If you look at the Rank column (df2['Rank']), you see that not all countries get a rank number; some only show '–'.
This is correct, as you see the same on the website.
Change the pd.read_html() line to this:
df0 = pd.read_html(str(sov_tables), na_values='–')
Then do df2.dtypes:
Output: Rank float64
Country name object
Population int64
Date object
% of world population object
dtype: object
No int because of the NaN values.
If you need int, you must first drop the NaN with dropna().
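The float64 comes from NumPy itself: its integer arrays have no slot for a missing value, so a single NaN upcasts the whole column. A minimal sketch of the two ways out (the nullable 'Int64' dtype is available in pandas 0.24+):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4])
print(s.dtype)  # float64 -- the NaN forced the upcast

# Option 1: drop the missing rows first, then cast down to plain int
clean = s.dropna().astype(int)

# Option 2: pandas' nullable integer dtype keeps NaN and ints together
nullable = s.astype('Int64')
print(nullable.dtype)  # Int64
```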
df2['Rank'] = df2['Rank'].astype('int')
For Date, do this:
df2['Date'] = pd.to_datetime(df2['Date'])
Output: Rank int32
Country name object
Population int64
Date datetime64[ns]
% of world population object
One to go.
Posts: 3
Threads: 1
Joined: Jun 2019
Hi, thanks again for your help.
I've 'cleaned' the data.
import pandas as pd
import requests
from bs4 import BeautifulSoup
url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
t = requests.get(url_cntr)
html_content = t.content
html_soup = BeautifulSoup(html_content, 'html.parser')
sov_tables = html_soup.find('table', class_="wikitable sortable")
df0 = pd.read_html(str(sov_tables), na_values='–')
df0 = df0[0]
df0.head()
df1 = df0.drop("Source", axis=1)
df1.head()
df2 = df1.rename(columns={'Country(or dependent territory)': 'Country name', '% of worldpopulation': '% of world population'})
df2['Rank'] = df2['Rank'].dropna()
df2['Rank'] = df2['Rank'].fillna(0)
df2['Rank'] = df2['Rank'].astype(int)
df2['Date'] = pd.to_datetime(df2['Date'])
df2['% of world population'] = df2['% of world population'].str.rstrip('%')
df2['% of world population'] = df2['% of world population'].astype(float)
df2['% of world population'] = df2['% of world population'].astype(int)
df2['Country name'] = df2['Country name'].replace({'\[Note 2]':'','\[Note 3]':'','\[Note 4]':'', '\[Note 5]':'', '\[Note 6]':'', '\[Note 7]':'', '\[Note 8]':'', '\[Note 9]':'', '\[Note 10]':'', '\[Note 11]':'', '\[Note 12]':'', '\[Note 13]':'','\[Note 14]':'', '\[Note 15]':'', '\[Note 16]':'', '\[Note 17]':'', '\[Note 18]':'', '\[Note 19]':'', '\[Note 20]':'', '\[Note 21]':'', '\[Note 22]':''}, regex = True)
df2.tail()
However, I have 2 more questions:
#1 How can I change (in 'Rank'):
*index 221: 0 to the number 191?
*indexes 223, 224, 225: 0 to the number 192?
etc.
screenshot:
#2 It's about line no. 23.
Is there any shorter/more elegant way? I went to the original website, checked and counted all the '[Note X]' markers, and put all of them into the code manually.
Posts: 7,320
Threads: 123
Joined: Sep 2016
(Jul-01-2019, 12:16 PM)flow50 Wrote: #1 How can I change (in 'Rank'):
*index 221: 0 to the number 191?
*indexes 223, 224, 225: 0 to the number 192?
Try using df.ffill():
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan, 6],
'B': [200, np.nan, np.nan, 201, 202, 203]})
df
A B
0 1.0 200.0
1 2.0 NaN
2 NaN NaN
3 4.0 201.0
4 NaN 202.0
5 6.0 203.0
df.ffill()
A B
0 1.0 200.0
1 2.0 200.0
2 2.0 200.0
3 4.0 201.0
4 4.0 202.0
5 6.0 203.0
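Applied to the thread's actual columns, both open questions fit in a few lines. A minimal sketch on a stand-in frame (the data below only mimics the scraped table's shape, with NaN where the site shows '–' and the footnote markers still attached):

```python
import numpy as np
import pandas as pd

# Stand-in for the scraped df2: tied countries share one rank, which
# arrives as NaN after na_values='–'
df2 = pd.DataFrame({
    'Rank': [190.0, 191.0, np.nan, 192.0, np.nan, np.nan],
    'Country name': ['A[Note 2]', 'B', 'C[Note 13]', 'D', 'E', 'F'],
})

# #1: forward-fill carries the last seen rank into the NaN rows
#     (instead of fillna(0)), then the column can safely become int
df2['Rank'] = df2['Rank'].ffill().astype(int)

# #2: one regex with \d+ replaces the long hand-written mapping
df2['Country name'] = df2['Country name'].str.replace(
    r'\[Note \d+\]', '', regex=True)
```

The regex matches any footnote number, so it keeps working if Wikipedia adds or renumbers notes.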