Python Forum

Full Version: fetching, parsing data from Wikipedia
Hello dear Python experts, good day.



This scraper fetches Wikipedia pages.

It is a nice little scraper - it ...:

import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from urllib.request import urlopen
url = 'https://en.wikipedia.org/wiki/List_of_cities_by_sunshine_duration'
html = urlopen(url) 
soup = BeautifulSoup(html, 'html.parser')
These few lines fetch the page, but I guess that I need more.

I am going to add a call to

find_all('table')  # scans the entire document for every <table> tag
and the following

tables = soup.find_all('table')
Can I do this like so?
srry - BROY

But this makes no sense to me. It has nothing to do with the question.

I guess this is a form of spam.
(May-05-2021, 06:12 PM) apollo wrote:
and the following
tables = soup.find_all('table')
can i do this like so?
Yes, but you need to find the right table. Here are a couple of ways, including a better way with Pandas.
Here is a Notebook; see that it gets the table, and you can start to work with it right away as it's now a DataFrame.
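
As a rough idea of what the Pandas route can look like, here is a minimal sketch. It is not the Notebook itself; the small `html` string is a stand-in I made up from the first cells of the Africa table so the example runs offline, and `match="Sunshine hours"` is just one way to pick a table by its caption text. `pd.read_html` needs an HTML parser such as lxml installed.

```python
from io import StringIO

import pandas as pd

# Stand-in for the downloaded page (assumption: in real use this would be
# the text of the Wikipedia page, e.g. requests.get(url).text).
html = """
<table class="wikitable">
<caption>Sunshine hours for selected cities in Africa</caption>
<tbody>
<tr><th>Country</th><th>City</th><th>Jan</th><th>Feb</th><th>Mar</th></tr>
<tr><td>Ivory Coast</td><td>Gagnoa</td><td>183.0</td><td>180.0</td><td>196.0</td></tr>
</tbody>
</table>
"""

# read_html returns a list of DataFrames, one per matching <table>.
# match= keeps only tables whose text contains the given string.
tables = pd.read_html(StringIO(html), match="Sunshine hours")
df = tables[0]

print(df)
print(df["Jan"].mean())
```

The month columns should come back as numbers, so things like `mean()` or sorting work without extra cleanup.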

Here is a standard way.
As you see, it gets the table, but it still needs a lot of work to extract the data if you want to do something useful with it.
Do not use urllib; use Requests instead.
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_cities_by_sunshine_duration'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.find('title').text)
print(soup.select_one('#mw-content-text > div.mw-parser-output > table:nth-child(9)'))
Output:
List of cities by sunshine duration - Wikipedia
<table class="wikitable plainrowheaders sortable" style="text-align:right;">
<caption>Sunshine hours for selected cities in Africa </caption>
<tbody><tr style="vertical-align:top">
<th>Country </th>
<th>City </th>
<th>Jan </th>
<th>Feb </th>
<th>Mar </th>
<th>Apr </th>
<th>May </th>
<th>Jun </th>
<th>Jul </th>
<th>Aug </th>
<th>Sep </th>
<th>Oct </th>
<th>Nov </th>
<th>Dec </th>
<th>Year </th>
<th>Ref. </th></tr>
<tr>
<td style="text-align:left;"><a href="/wiki/Ivory_Coast" title="Ivory Coast">Ivory Coast</a> </td>
<td style="text-align:left;"><a href="/wiki/Gagnoa" title="Gagnoa">Gagnoa</a> </td>
<td style="background: #D5D500; color:#000000;;">183.0 </td>
<td style="background: #D4D400; color:#000000;;">180.0 </td>
<td style="background: #D8D800; color:#000000;;">196.0 </td>
..... etc.
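
To give an idea of the extra work needed with plain BeautifulSoup, here is a minimal sketch that turns a table like the one printed above into lists of cell text. The `html` string is a shortened stand-in for the real page (an assumption so it runs without a network call), and it uses the built-in `html.parser` so nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table printed above (first cells only).
html = """
<table class="wikitable plainrowheaders sortable">
<caption>Sunshine hours for selected cities in Africa </caption>
<tbody>
<tr><th>Country </th><th>City </th><th>Jan </th></tr>
<tr><td><a href="/wiki/Ivory_Coast">Ivory Coast</a> </td>
<td><a href="/wiki/Gagnoa">Gagnoa</a> </td>
<td>183.0 </td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

# Collect the stripped text of every header/data cell, row by row.
rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

header, data = rows[0], rows[1:]
print(header)
print(data)
```

Note that every value is still a string here, so numbers such as "183.0" would need converting with `float()` before doing any maths with them.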