fetching, parsing data from Wikipedia - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: fetching, parsing data from Wikipedia (/thread-33560.html)
fetching, parsing data from Wikipedia - apollo - May-05-2021

hello dear python-experts, good day. This scraper fetches Wikipedia pages - it is a nice little scraper:

import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from urllib.request import urlopen

url = 'https://en.wikipedia.org/wiki/List_of_cities_by_sunshine_duration'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

These few lines fetch the data, but I guess I need more. I am going to add find_all('table'), which helps me scan the entire document for every <table> tag:

tables = soup.find_all('table')

Can I do it like this?

RE: fetching, parsing and writing into CSV - but only 1 percent of the whole dataset - apollo - May-06-2021

Sorry, BROY, but this makes no sense to me. It has nothing to do with the question; I guess this is a form of spam.

RE: fetching, parsing data from Wikipedia - snippsat - May-06-2021

(May-05-2021, 06:12 PM)apollo Wrote: and the following

Yes, but you need to find the right table. Here are a couple of ways, and a better way with Pandas: a Notebook that gets the table and can be worked with right away, as it's then already a DataFrame. Here is the standard way - as you see, you get the table, but it still needs a lot of work before the data is useful. Do not use urllib.

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_cities_by_sunshine_duration'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.find('title').text)
print(soup.select_one('#mw-content-text > div.mw-parser-output > table:nth-child(9)'))
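A minimal sketch of the Pandas approach snippsat mentions: pandas.read_html returns every table on a page as a DataFrame. For illustration (and so it runs offline) this parses a small inline HTML table standing in for the Wikipedia sunshine-duration table; against the live page you would pass the page's HTML (e.g. response.text from requests) instead. The city names and numbers here are made-up sample values, not the real dataset.

```python
from io import StringIO

import pandas as pd

# Inline stand-in for the Wikipedia table; with the live page you would
# wrap response.text in StringIO the same way.
html = """
<table>
  <tr><th>City</th><th>Jan</th><th>Year</th></tr>
  <tr><td>Cairo</td><td>213</td><td>3542</td></tr>
  <tr><td>London</td><td>62</td><td>1633</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found
tables = pd.read_html(StringIO(html))
df = tables[0]

print(df)
# The numeric columns are already parsed, so analysis works immediately:
print(df.loc[df['Year'].idxmax(), 'City'])  # Cairo
```

This is why snippsat calls it the better way: the <th> row becomes the header and numeric cells become real numbers, so there is no per-cell extraction work as with the BeautifulSoup route.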
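One caveat with snippsat's standard way: a positional selector like table:nth-child(9) breaks as soon as the page layout shifts. A more robust sketch is to loop over find_all('table') and match the table's caption text. Again this uses a small inline HTML snippet so it runs offline; the caption phrase 'Sunshine' is an assumption for illustration, and on the live page you would pass response.content from requests.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the Wikipedia page, with a decoy table first
html = """
<html><body>
<table><caption>Some other table</caption></table>
<table class="wikitable">
  <caption>Sunshine duration by city</caption>
  <tr><th>City</th><th>Year</th></tr>
  <tr><td>Cairo</td><td>3542</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Pick the table whose caption mentions the data we want, instead of
# relying on a brittle positional selector
target = None
for table in soup.find_all('table'):
    caption = table.find('caption')
    if caption and 'Sunshine' in caption.get_text():
        target = table
        break

print(target.get('class'))  # ['wikitable']

# Flatten each row into a list of cell texts
rows = [[cell.get_text() for cell in tr.find_all(['th', 'td'])]
        for tr in target.find_all('tr')]
print(rows)  # [['City', 'Year'], ['Cairo', '3542']]
```

Matching on the caption (or on class='wikitable' plus a known header) survives Wikipedia edits that insert or remove elements before the table, which is exactly what invalidates nth-child(9).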