Thread: table from wikipedia (/thread-19406.html), Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development
table from wikipedia - flow50 - Jun-27-2019

Hi, I’m a newbie in programming and web scraping. I got this assignment:

Quote: wikipedia web site: link

What I’ve done so far:

    import numpy as np
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    import requests
    url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
    t = requests.get(url_cntr)
    t.text
    html_content = t.text
    html_soup = BeautifulSoup(html_content, 'html.parser')
    html_soup.text
    sover = []
    len(sover)

The output is: 0

    import requests
    url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
    t = requests.get(url_cntr)
    t.text
    html_content = t.text
    html_soup = BeautifulSoup(html_content, 'html.parser')
    sover = []
    sov_tables = html_soup.find_all('table', class_='jquery-tablesorter')
    for table in sov_tables[0]:
        headers = []
        rows = table.find_all('tr')
        for header in table.find('tr').find_all('th'):
            headers.append(header.text)
        for row in rows[1:]:
            values = []
            for col in row.find_all(['th', 'td']):
                values.append(col.text)
            if values:
                cntr_dict = {headers[i]: values[i] for i in range(len(values))}
                cntr.append(cntr_dict)

I got this error:

    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-25-b2259dadb770> in <module>
    ----> 1 for table in sov_tables[0]:
          2     headers = []
          3     rows = table.find_all('tr')
          4     for header in table.find('tr').find_all('th'):
          5         headers.append(header.text)

    IndexError: list index out of range

What am I doing wrong? Thanks in advance for your help!

RE: table from wikipedia - snippsat - Jun-27-2019

You are on the wrong track when you start parsing the table manually. Change your first code to this:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
    t = requests.get(url_cntr)
    html_content = t.content
    html_soup = BeautifulSoup(html_content, 'html.parser')
    sov_tables = html_soup.find('table', class_="wikitable sortable")

At this point you have the table you need in sov_tables. Now you want to use pandas with pd.read_html(). In this case you could also have used only this method to get the table and dropped everything else. To bring sov_tables into pandas, it needs to be a file object or a string, so you can use str():

    df = pd.read_html(str(sov_tables))
    df = df[0]
    df

It also gets easier if you use Jupyter Notebook; then the table looks nice as well.

At this point you can continue with the task: you need to convert the columns to the right types (int, datetime, etc.).

RE: table from wikipedia - flow50 - Jun-28-2019

Thanks for your help! I've cleaned up the table a little, but I'm struggling with the datatype conversion:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
    t = requests.get(url_cntr)
    html_content = t.content
    html_soup = BeautifulSoup(html_content, 'html.parser')
    sov_tables = html_soup.find('table', class_="wikitable sortable")
    df0 = pd.read_html(str(sov_tables))
    df0 = df0[0]
    df0.head()
    df1 = df0.drop("Source", axis=1)
    df1.head()
    df2 = df1.rename(columns={'Country(or dependent territory)': 'Country name', '% of worldpopulation': '% of world population'})
    df2.dtypes

When I want to change object to int (same script as above, up to and including the rename line, followed by this line):

    df3 = df2.astype({"Rank": int, "% of world population": int})

I get this:

Quote: ValueError: invalid literal for int() with base 10: '–'

When I want to change object to datetime (again the same script up to the rename line, followed by this line):

    df3['Date'] = df2.to_datetime(df2['Date'], format= '%d/%m/%Y')

I get this:

Quote: AttributeError: 'DataFrame' object has no attribute 'to_datetime'

I really don't know what's wrong this time.

RE: table from wikipedia - snippsat - Jun-28-2019

If you look at the Rank column, df2['Rank'], you will see that not every country gets a rank number; some have – instead. This is correct, as you see the same on the website. Change line 10 (the pd.read_html call) to this:

    df0 = pd.read_html(str(sov_tables), na_values='–')

Then check df2.dtypes. Rank is still not int because of the NaN values. If you need int, you must first drop the NaN values with dropna():

    df2['Rank'] = df2['Rank'].astype('int')

For Date, do this:

    df2['Date'] = pd.to_datetime(df2['Date'])

One to go.
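
The two conversions discussed above can be tried on a small in-memory frame without downloading the page. This is only a sketch: the sample values are made up, and the nullable Int64 dtype is an alternative that is not mentioned in the thread. It is used here because it lets a column hold integers and missing values at the same time, so no rows have to be dropped.

```python
import pandas as pd

# Hypothetical miniature of the scraped table; the values are made up.
sample = pd.DataFrame({
    "Rank": ["1", "2", "–"],
    "Date": ["2019-06-28", "2019-06-28", "2018-07-01"],
})

# Anything that is not a number (here the en dash) becomes NaN, then the
# nullable Int64 dtype keeps the remaining ranks as integers.
sample["Rank"] = pd.to_numeric(sample["Rank"], errors="coerce").astype("Int64")

# to_datetime is a module-level pandas function, not a DataFrame method,
# which is why df2.to_datetime(...) raised AttributeError.
sample["Date"] = pd.to_datetime(sample["Date"])

print(sample.dtypes)
```

If a plain int column is required instead, the rows with missing ranks have to be removed (or filled) before calling astype(int).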

RE: table from wikipedia - flow50 - Jul-01-2019

Hi, thanks again for your help. I've 'cleaned' the data.

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    url_cntr = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
    t = requests.get(url_cntr)
    html_content = t.content
    html_soup = BeautifulSoup(html_content, 'html.parser')
    sov_tables = html_soup.find('table', class_="wikitable sortable")
    df0 = pd.read_html(str(sov_tables), na_values='–')
    df0 = df0[0]
    df0.head()
    df1 = df0.drop("Source", axis=1)
    df1.head()
    df2 = df1.rename(columns={'Country(or dependent territory)': 'Country name', '% of worldpopulation': '% of world population'})
    df2['Rank'] = df2['Rank'].dropna()
    df2['Rank'] = df2['Rank'].fillna(0)
    df2['Rank'] = df2['Rank'].astype(int)
    df2['Date'] = pd.to_datetime(df2['Date'])
    df2['% of world population'] = df2['% of world population'].str.rstrip('%')
    df2['% of world population'] = df2['% of world population'].astype(float)
    df2['% of world population'] = df2['% of world population'].astype(int)
    df2['Country name'] = df2['Country name'].replace({'\[Note 2]':'','\[Note 3]':'','\[Note 4]':'', '\[Note 5]':'', '\[Note 6]':'', '\[Note 7]':'', '\[Note 8]':'', '\[Note 9]':'', '\[Note 10]':'', '\[Note 11]':'', '\[Note 12]':'', '\[Note 13]':'','\[Note 14]':'', '\[Note 15]':'', '\[Note 16]':'', '\[Note 17]':'', '\[Note 18]':'', '\[Note 19]':'', '\[Note 20]':'', '\[Note 21]':'', '\[Note 22]':''}, regex = True)
    df2.tail()

However, I have 2 more questions:

#1 How can I change (in 'Rank'):
* index 221: 0 to the number 191?
* indexes 223, 224, 225: 0 to the number 192?
etc.

#2 It's about line no. 23. Is there any shorter/more elegant way? I went to the original website, checked/counted all the '[NoteX]' and manually put all of them in the code.
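
For question #2, one possible shortening is a single regular expression that matches any "[Note N]" marker, so the notes do not have to be counted by hand. This is a sketch on made-up country names, not something taken from the replies in the thread; it assumes every marker has the form "[Note " followed by digits and "]".

```python
import pandas as pd

# Hypothetical country names with footnote markers, made up for illustration.
names = pd.Series(["China[Note 2]", "India", "France[Note 10]"])

# One pattern covers every marker: a literal "[Note ", one or more digits,
# and a literal "]". The raw string keeps the backslashes intact.
cleaned = names.str.replace(r"\[Note \d+\]", "", regex=True)

print(cleaned.tolist())  # ['China', 'India', 'France']
```

Applied to the thread's frame, the same call would be made on df2['Country name'] in place of the long dictionary on line 23.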

RE: table from wikipedia - snippsat - Jul-01-2019

(Jul-01-2019, 12:16 PM)flow50 Wrote: #1 How can I change (in 'Rank'):

Try using df.ffill().

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan, 6],
                       'B': [200, np.nan, np.nan, 201, 202, 203]})

    df
         A      B
    0  1.0  200.0
    1  2.0    NaN
    2  NaN    NaN
    3  4.0  201.0
    4  NaN  202.0
    5  6.0  203.0

    df.ffill()
         A      B
    0  1.0  200.0
    1  2.0  200.0
    2  2.0  200.0
    3  4.0  201.0
    4  4.0  202.0
    5  6.0  203.0
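
Applied to the Rank question, the same idea might look like the sketch below. The numbers are made up to mirror the pattern described in question #1, and it assumes the missing ranks are still NaN. If they were already turned into 0 with fillna(0), they would first have to be changed back, as the commented line shows.

```python
import numpy as np
import pandas as pd

# Hypothetical tail of a Rank column; NaN marks rows that had "–" on the page.
rank = pd.Series([190, 191, np.nan, 192, np.nan, np.nan])

# If the gaps were already replaced by 0, turn them back into NaN first:
# rank = rank.replace(0, np.nan)

# Forward-fill copies the last valid rank downward; after that no NaN is
# left, so the column can be converted to int.
filled = rank.ffill().astype(int)

print(filled.tolist())  # [190, 191, 191, 192, 192, 192]
```

Note that ffill() cannot fill a gap in the very first row, because there is no earlier value to copy; in that case astype(int) would still fail.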