Python Forum
Get data from a webpage
#1
My girlfriend has been given the task of getting all the data from a webpage. The web page belongs to the adult education centre where she works. To get to the webpage, you must first log in. The url is a .asp file.

She has to put the data in an Excel sheet. The entries are student names, numbers, ID card numbers, telephone numbers, courses, books, etc. There are thousands of entries. The HR students alone take up 70 pages of entries. This all shows up on the webpage as a table. It is possible to copy and paste.

I can handle Python openpyxl reasonably well these days and I have heard of web-scraping, which I believe Python can do.

I don't know what .asp is.

Could you please give me some tips, pointers, about how to get the data with Python? What should I look at or learn?

Can I automate this task?
#2
You can start here: https://python-forum.io/Thread-Web-Scraping-part-1
then here: https://python-forum.io/Thread-Web-scraping-part-2
#3
You can look at Web-Scraping part-1 and part-2.
(Mar-02-2019, 01:14 AM)Pedroski55 Wrote: To get to the webpage, you must first log in
There are many threads about logging in to web pages if you search this forum.
In general, you must inspect (with the Chrome/Firefox developer tools) what happens when you log in.
For example, if I do it with this site:
import requests
from bs4 import BeautifulSoup
 
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}
 
params = {
    "username": "your_username",
    "password": "xxxxxxx",
    "remember": "yes",
    "submit": "Login",
    "action": "do_login",
}
 
with requests.Session() as s:
    # send the form fields in the POST body (data=), not in the URL query string (params=)
    s.post('https://python-forum.io/member.php?action=login', headers=headers, data=params)
    # logged in! session cookies saved for future requests
    response = s.get('https://python-forum.io/index.php')
    # cookies sent automatically!
    soup = BeautifulSoup(response.content, 'lxml')
    welcome = soup.find('span', class_="welcome").text
    print(welcome)
Output:
Welcome back, snippsat. You last visited: Today, 05:44 PM Log Out
Another way is to use Selenium; search this forum for Selenium log in.
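For the requests approach, it helps to know which fields the login form expects before you post to it. Here is a minimal sketch that lists a form's action and input names with BeautifulSoup. The sample HTML is made up for illustration (it only mirrors the field names used above), and form_fields is a hypothetical helper name, not part of any library.

```python
from bs4 import BeautifulSoup


def form_fields(html):
    """Return (action, {input name: default value}) for the first form in html."""
    soup = BeautifulSoup(html, "html.parser")
    form = soup.find("form")
    if form is None:
        return None, {}
    fields = {}
    for inp in form.find_all("input"):
        name = inp.get("name")
        if name:  # inputs without a name are not submitted
            fields[name] = inp.get("value", "")
    return form.get("action"), fields


# Hypothetical login form, only for illustration.
sample = """
<form action="member.php" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="hidden" name="action" value="do_login">
  <input type="submit" name="submit" value="Login">
</form>
"""

action, fields = form_fields(sample)
print(action)
print(fields)
```

On a real site you would fetch the login page with the same Session first and pass response.content to form_fields. Hidden inputs are worth checking because some sites reject the login if they are missing.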
#4
Thanks a lot. I am a slow learner, so this will take a while.

Each page displays as a table. Whether or not it is really a table, I can't say right now.

After a quick first read, I'm thinking I need to find <table> </table> and get everything in there for each page.

Does that sound like a reasonable approach?

I can practice on my own little web page first!

First success! Thanks a lot!

>>> import requests
>>> from bs4 import BeautifulSoup
>>> url = 'http://www.mylittlewebpage.com/18BE/18BEsWeek2.html'
>>> url_get = requests.get(url)
>>> soup = BeautifulSoup(url_get.content, 'html.parser')
>>> print(soup.find('table').text)

A. data projector B. flipchart C. personal statement D. reimburse E. travel expenses

>>>
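Printing soup.find('table').text flattens the whole table into one string. For an Excel sheet it is probably more useful to keep rows and columns apart. Here is a minimal sketch, assuming the data sits in ordinary tr/td tags inside the first table on the page; the sample HTML and the table_rows helper are made up for illustration.

```python
from bs4 import BeautifulSoup


def table_rows(html):
    """Return the first table in html as a list of rows, each a list of cell texts."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        # header cells (th) and data cells (td) are treated alike
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


# Hypothetical table, only for illustration.
sample_html = """
<table>
  <tr><th>Name</th><th>Student number</th><th>Course</th></tr>
  <tr><td>Student A</td><td>1001</td><td>HR</td></tr>
  <tr><td>Student B</td><td>1002</td><td>HR</td></tr>
</table>
"""

rows = table_rows(sample_html)
for row in rows:
    print(row)
```

Each row is a plain list, so it can be handed to an openpyxl worksheet with ws.append(row). If the real page nests tables inside tables, this sketch would need adjusting.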

Now I've got to figure out how to log in to the other page!

So, the login page has a number which must be entered. How to do that from Python?

I logged in manually. The page source looks like this:

<html>
<head>
<title>管理中心</title>
<meta http-equiv=Content-Type content=text/html;charset=gb2312>
</head>
<frameset rows="64,*" frameborder="NO" border="0" framespacing="0">
<frame src="admin_top.asp" noresize="noresize" frameborder="NO" name="topFrame" scrolling="no" marginwidth="0" marginheight="0" target="main" />
<frameset cols="200,*" id="frame">
<frame src="left.asp" name="leftFrame" noresize="noresize" marginwidth="0" marginheight="0" frameborder="0" scrolling="no" target="main" />
<frame src="right.asp" name="main" marginwidth="0" marginheight="0" frameborder="0" scrolling="auto" target="_self" />
<frame src="UntitledFrame-1"><frame src="UntitledFrame-2"></frameset>
</frameset>
<noframes>
<body></body>
</noframes>
</html>

I can't see any table there!
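The source above is a frameset: it only names other pages (admin_top.asp, left.asp, right.asp) that the browser loads into frames, so the table would be in one of those pages rather than here. Here is a minimal sketch that collects the frame URLs so each can be requested separately. BASE_URL is a placeholder (the real address is not shown in this thread), and frame_urls is a hypothetical helper name.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def frame_urls(html, base_url):
    """Map each named frame to the absolute URL of the page it loads."""
    soup = BeautifulSoup(html, "html.parser")
    urls = {}
    for frame in soup.find_all("frame"):
        name = frame.get("name")
        src = frame.get("src")
        if name and src:
            # src is relative to the page that contains the frameset
            urls[name] = urljoin(base_url, src)
    return urls


# Trimmed copy of the frameset from the page source above.
frameset_html = """
<frameset rows="64,*">
<frame src="admin_top.asp" name="topFrame" />
<frameset cols="200,*" id="frame">
<frame src="left.asp" name="leftFrame" />
<frame src="right.asp" name="main" />
</frameset>
</frameset>
"""

BASE_URL = "http://example.com/admin/index.asp"  # placeholder, not the real site
urls = frame_urls(frameset_html, BASE_URL)
for name, url in urls.items():
    print(name, url)
```

With a logged-in requests.Session, requesting the "main" URL is a reasonable first guess, since the other frames target it, but that is an assumption to check in the browser. The page declares charset gb2312, so pass response.content (bytes) to BeautifulSoup and let it pick up the declared encoding.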
