Python Forum

WebScrape: Tabular Input, CSV Output
Patience for a noob please, thank you.

I'm trying to scrape this information: https://drive.google.com/file/d/0Bw37apQ...sp=sharing

from this page: http://www.cis.unimelb.edu.au/people/

I started off with this:


import bs4 as bs
import urllib.request as ur

# Download the raw HTML of the staff page.
sauce = ur.urlopen('http://www.cis.unimelb.edu.au/people/').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

# Print every <table> element found in the static HTML.
for table in soup.find_all("table"):
    print(table)
But that's it. I'm stuck, because I want the information laid out the same way as it is shown in the CSV.

I want my end result to be(in CSV):

Title,Given name,Family name,Position,Profile,Email
Dr,Roberto,Amadini,Research Fellow,Profile,[email protected]
Mr,Steven,Baker,Research Fellow,Profile,[email protected]
Mr,Daniel,Beck,Research Fellow,,[email protected]
Dr,Michelle,Blom,Research Fellow,Profile,[email protected]
It looks like you will need Selenium to get the information you want, and possibly PhantomJS to avoid the overhead of a full browser. There is an example in snippsat's tutorials, part 2.
As for the output: once you are able to parse the HTML source, writing it to a file takes only a couple of lines, with or without the built-in csv module.
And a final remark: for your own sake, use the Requests module.
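To illustrate the "couple of lines" for the output step, here is a minimal sketch using the built-in csv module. It assumes the table has already been parsed into lists of strings; the helper name write_rows and the placeholder row values are made up for illustration and do not come from the page.

```python
import csv
import io

# Column names taken from the desired CSV shown in the question.
HEADER = ["Title", "Given name", "Family name", "Position", "Profile", "Email"]

def write_rows(rows, out):
    """Write the header and the parsed rows to an open text file object."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)

# Placeholder rows (hypothetical values standing in for parsed table cells).
rows = [
    ["Dr", "Jane", "Doe", "Research Fellow", "Profile", "jane@example.org"],
    ["Mr", "John", "Roe", "Research Fellow", "", "john@example.org"],
]

# Write to an in-memory buffer here; a real file object works the same way.
buffer = io.StringIO()
write_rows(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, open the file with newline="" as the csv documentation recommends, for example: with open("people.csv", "w", newline="") as f: write_rows(rows, f). The file name is only an example.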
As mentioned, you need another tool to get the emails,
which are generated by JavaScript in the browser (DOM).
You can use PhantomJS and pass the page source to BS.
E.g.:
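from bs4 import BeautifulSoup
from selenium import webdriver

# PhantomJS is a headless browser, so the page's JavaScript runs before we read the source.
# Note: webdriver.PhantomJS needs Selenium 3.x or older; it was removed in Selenium 4.
driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)
url = 'http://www.cis.unimelb.edu.au/people/'
driver.get(url)
# Hand the rendered HTML to BeautifulSoup.
soup = BeautifulSoup(driver.page_source, 'lxml')
print(soup.find('table'))
driver.quit()  # shut down the PhantomJS process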
I get this output, which looks better in a Pen.
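Once the rendered source is in BeautifulSoup, the table can be turned into plain rows that are ready for the csv module. This is a rough sketch, assuming a simple table with one <tr> per person. The sample HTML and the helper name table_to_rows are invented for illustration, and the real page's markup may well differ. It uses the built-in 'html.parser' so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real table on the page may be structured differently.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Given name</th><th>Family name</th></tr>
  <tr><td>Dr</td><td>Jane</td><td>Doe</td></tr>
  <tr><td>Mr</td><td>John</td><td>Roe</td></tr>
</table>
"""

def table_to_rows(table):
    """Return one list of cell texts per <tr> in the given table element."""
    rows = []
    for tr in table.find_all("tr"):
        # Header and data cells are treated alike; surrounding whitespace is stripped.
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = table_to_rows(soup.find("table"))
for row in rows:
    print(row)
```

With the Selenium example, the same helper would be called on the table found in driver.page_source instead of SAMPLE_HTML.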