Python Forum
Get data from a webpage - Printable Version




Get data from a webpage - Pedroski55 - Mar-02-2019

My girlfriend has been given the task of getting all the data from a webpage. The webpage belongs to the adult education centre where she works. To get to the webpage, you must first log in. The URL points to a .asp file.

She has to put the data in an Excel sheet. The entries are student names, numbers, ID card numbers, telephone numbers, courses, books, etc. There are thousands of entries. The HR students alone take up 70 pages. This all shows up on the webpage as a table. It is possible to copy and paste.

I can handle Python openpyxl reasonably well these days and I have heard of web-scraping, which I believe Python can do.

I don't know what .asp is.

Could you please give me some tips or pointers on how to get the data with Python? What should I look at or learn?

Can I automate this task?


RE: Get data from a webpage - Larz60+ - Mar-02-2019

You can start here: https://python-forum.io/Thread-Web-Scraping-part-1
then here: https://python-forum.io/Thread-Web-scraping-part-2


RE: Get data from a webpage - snippsat - Mar-02-2019

You can look at Web-Scraping part-1 and part-2.
(Mar-02-2019, 01:14 AM)Pedroski55 Wrote: To get to the webpage, you must first log in
There are many threads about logging in to web pages if you search this forum.
In general you must inspect the login request (Chrome/Firefox developer tools) and see what is sent when you log in.
For example, this is how I do it with this site:
import requests
from bs4 import BeautifulSoup
 
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}
 
params = {
    "username": "your_username",
    "password": "xxxxxxx",
    "remember": "yes",
    "submit": "Login",
    "action": "do_login",
}
 
with requests.Session() as s:
    # Login form fields go in the POST body (data=), not in the URL query string (params=)
    s.post('https://python-forum.io/member.php?action=login', headers=headers, data=params)
    # logged in! session cookies saved for future requests
    response = s.get('https://python-forum.io/index.php')
    # cookies sent automatically!
    soup = BeautifulSoup(response.content, 'lxml')
    welcome = soup.find('span', class_="welcome").text
    print(welcome)
Output:
Welcome back, snippsat. You last visited: Today, 05:44 PM Log Out
Another way is to use Selenium; search the forum for Selenium log in.


RE: Get data from a webpage - Pedroski55 - Mar-02-2019

Thanks a lot. I am a slow learner, so this will take a while.

Each page displays as a table. Whether or not it is really a table, I can't say right now.

After a quick first read, I'm thinking I need to find <table> </table> and get everything in there for each page.

Does that sound like a reasonable approach?

I can practice on my own little web page first!

First success! Thanks a lot!

>>> import requests
>>> from bs4 import BeautifulSoup
>>> url = 'http://www.mylittlewebpage.com/18BE/18BEsWeek2.html'
>>> url_get = requests.get(url)
>>> soup = BeautifulSoup(url_get.content, 'html.parser')
>>> print(soup.find('table').text)

A. data projector B. flipchart C. personal statement D. reimburse E. travel expenses

>>>
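
Calling `.text` on the whole table, as in the session above, joins every cell into one string, which is awkward to split into Excel columns. Here is a minimal sketch that keeps each row as a list of cell strings instead. It assumes the real pages use ordinary `<tr>`, `<th>` and `<td>` tags; the sample HTML and the function name `table_rows` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample; the columns on the real page are not known here.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Number</th></tr>
  <tr><td>Student A</td><td>1001</td></tr>
  <tr><td>Student B</td><td>1002</td></tr>
</table>
"""


def table_rows(html):
    """Return the first <table> in html as a list of rows (lists of cell text)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        # Header cells (<th>) and data cells (<td>) are treated the same way.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


if __name__ == "__main__":
    for row in table_rows(SAMPLE_HTML):
        print(row)
```

Each row is a plain list, so it could be handed to openpyxl one row at a time with `ws.append(row)`. If the real page nests tables inside tables, the inner rows would be picked up as well and the loop would need adjusting.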

Now I have to figure out how to log in to the other page!

So, the login page has a number which must be entered. How do I do that from Python?
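
One way to find out what that number field is called is to list the fields of the login form. This is a rough sketch, assuming the login page contains an ordinary HTML `<form>`. The sample form, its `login.asp` action and its field names (`username`, `password`, `checkcode`) are invented, since the real login page is not shown in the thread.

```python
from bs4 import BeautifulSoup

# Invented login form; the real field names must be read from the real page.
SAMPLE_LOGIN_HTML = """
<form action="login.asp" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="text" name="checkcode">
  <input type="submit" value="Login">
</form>
"""


def form_field_names(html):
    """Return the name of every named field in the first <form> of html."""
    soup = BeautifulSoup(html, "html.parser")
    form = soup.find("form")
    if form is None:
        return []
    fields = form.find_all(["input", "select", "textarea"])
    # Fields without a name attribute are not submitted, so skip them.
    return [field["name"] for field in fields if field.get("name")]


if __name__ == "__main__":
    print(form_field_names(SAMPLE_LOGIN_HTML))
```

Whatever names come out can go into the login dictionary that is posted with `s.post`, alongside the username and password. If the number is a CAPTCHA image that changes on every visit, a fixed value in that dictionary will not work, and typing it by hand or driving a browser with Selenium may be the simpler route.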

I logged in manually. The page source looks like this:

<html>
<head>
<title>管理中心</title>
<meta http-equiv=Content-Type content=text/html;charset=gb2312>
</head>
<frameset rows="64,*" frameborder="NO" border="0" framespacing="0">
<frame src="admin_top.asp" noresize="noresize" frameborder="NO" name="topFrame" scrolling="no" marginwidth="0" marginheight="0" target="main" />
<frameset cols="200,*" id="frame">
<frame src="left.asp" name="leftFrame" noresize="noresize" marginwidth="0" marginheight="0" frameborder="0" scrolling="no" target="main" />
<frame src="right.asp" name="main" marginwidth="0" marginheight="0" frameborder="0" scrolling="auto" target="_self" />
<frame src="UntitledFrame-1"><frame src="UntitledFrame-2"></frameset>
</frameset>
<noframes>
<body></body>
</noframes>
</html>

I can't see any table there!
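
The source above is only a frameset: each `<frame>` loads a separate page, so the table is probably in one of those pages (perhaps `right.asp`, the frame named `main`) and not in this file. Below is a small sketch, using only the standard library, that lists the frame URLs to request next. The base URL `http://example.com/admin/` is a placeholder, since the real address is not given in the thread, and the frameset is a shortened copy of the one pasted above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Shortened copy of the frameset from the post above.
FRAMESET_HTML = """
<frameset rows="64,*">
<frame src="admin_top.asp" name="topFrame" />
<frameset cols="200,*" id="frame">
<frame src="left.asp" name="leftFrame" />
<frame src="right.asp" name="main" />
</frameset>
</frameset>
"""

# Placeholder address; replace with the real URL of the frameset page.
BASE_URL = "http://example.com/admin/"


class FrameCollector(HTMLParser):
    """Collect the (name, src) pair of every <frame> tag that has a src."""

    def __init__(self):
        super().__init__()
        self.frames = []

    def handle_starttag(self, tag, attrs):
        if tag == "frame":
            attributes = dict(attrs)
            if attributes.get("src"):
                self.frames.append((attributes.get("name"), attributes["src"]))


def frame_urls(html, base_url):
    """Return (frame name, absolute URL) pairs for the frames in html."""
    parser = FrameCollector()
    parser.feed(html)
    return [(name, urljoin(base_url, src)) for name, src in parser.frames]


if __name__ == "__main__":
    for name, url in frame_urls(FRAMESET_HTML, BASE_URL):
        print(name, url)
```

Requesting the `main` URL with the same logged-in `requests.Session` should return the page that holds the table, if the site works the way the frameset suggests. The same list could also be built with BeautifulSoup via `soup.find_all('frame')`.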