Python Forum
WebScrape: Tabular Input, CSV Output
#1
Patience for a noob please, thank you.

I'm trying to scrape this information: https://drive.google.com/file/d/0Bw37apQ...sp=sharing

from this page: http://www.cis.unimelb.edu.au/people/

I started off with this:


import bs4 as bs
import urllib.request as ur

sauce = ur.urlopen('http://www.cis.unimelb.edu.au/people/').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

for tables in soup.find_all("table"):
    print(tables)
But that's it. I'm stuck only because I want to get the information in the same layout as it is shown in the CSV.

I want my end result to be (in CSV):

Title,Given name,Family name,Position,Profile,Email
Dr,Roberto,Amadini,Research Fellow,Profile,[email protected]
Mr,Steven,Baker,Research Fellow,Profile,[email protected]
Mr,Daniel,Beck,Research Fellow,,[email protected]
Dr,Michelle,Blom,Research Fellow,Profile,[email protected]
#2
It looks like you will need Selenium to get the information you want, possibly with PhantomJS to avoid the overhead of a full browser. There is an example in snippsat's tutorials, part 2.
As for the output: once you are able to parse the HTML source, it's just a matter of a couple of lines, with or without the built-in csv module, to write it to a file.
And a final remark: for your own sake, use the Requests module.
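To make the CSV step concrete, here is a minimal sketch using only the standard library. It assumes the table has already been parsed into a list of rows; the `rows_to_csv` helper is a hypothetical name, and the sample rows are taken from the desired output in the first post (the Email column is left out because the addresses are obscured in this thread).

```python
import csv
import io


def rows_to_csv(header, rows):
    """Return the header and rows as CSV text (hypothetical helper)."""
    buffer = io.StringIO(newline="")
    # lineterminator="\n" keeps the output identical across platforms.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Sample data copied from the desired output in the first post.
header = ["Title", "Given name", "Family name", "Position", "Profile"]
rows = [
    ["Dr", "Roberto", "Amadini", "Research Fellow", "Profile"],
    ["Mr", "Daniel", "Beck", "Research Fellow", ""],
]

csv_text = rows_to_csv(header, rows)
print(csv_text)
```

To write straight to disk instead, the same `csv.writer` calls should work on a file opened with `open("people.csv", "w", newline="")`, where the filename is only an example.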
#3
As mentioned, you need another tool to get the emails,
which are produced by JavaScript executed in the browser (DOM).
You can use PhantomJS and pass the page source to BeautifulSoup.
For example:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)
url = 'http://www.cis.unimelb.edu.au/people/'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')
print(soup.find('table'))
I get this output, which looks better in a Pen.
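Once the rendered table is available, it still has to be turned into rows before it can go to CSV. The sketch below shows one way to do that. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra installs, and the `TableRows` class and `sample_html` markup are hypothetical stand-ins, since the real page's markup may differ. With BeautifulSoup the equivalent would be looping over `soup.find_all('tr')` and reading each cell with `get_text(strip=True)`.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical markup standing in for driver.page_source.
sample_html = """
<table>
  <tr><th>Title</th><th>Given name</th><th>Family name</th><th>Profile</th></tr>
  <tr><td>Dr</td><td>Roberto</td><td>Amadini</td><td><a href="#">Profile</a></td></tr>
  <tr><td>Mr</td><td>Daniel</td><td>Beck</td><td></td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
parser.close()

for row in parser.rows:
    print(row)
```

The first row in `parser.rows` is the header and the rest are data rows, so the list can be handed to `csv.writer(...).writerows(...)` as it is.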