Python Forum

Full Version: Extract data from a webpage
Hello all;

I would like to ask the Python community for some advice concerning a script I'm trying to develop.

I live near a lake in Italy, and from time to time the water level comes very close to my house, so I'm looking for a way to pick up a value from a webpage reporting the lake level and send myself a notification when this level reaches a high value.

The webpage giving the value is this one: https:www.astrogeo.va.it/idro/idro.php

The value I want to retrieve is the one after "Stazione di Leggiuno", for example today: 194.12, as indicated on the website.

Using examples found on the web, I used requests and BeautifulSoup to retrieve this info:

#!/usr/bin/python
import requests
from bs4 import BeautifulSoup

# using the requests module, we use the "get" function
result = requests.get("https:www.astrogeo.va.it/idro/idro.php")

print(result.status_code)

# let us store the page content of the website
# from requests to a variable

src = result.content
print(src)
So I receive the result on my screen, with the data I want to import:
Quote:document.getElementById("Livello").innerHTML="<strong>Stazione di Leggiuno: "+data.legb.livello[data.legb.livello.length-1]+"<font color='#417FDA'>

I was thinking that the value I wanted to extract would be just after the "Stazione di Leggiuno" label, but instead I got this "+data.legb.livello" and cannot recover the result displayed on the webpage (in this case 194.12).

Has anyone in the Python community faced this problem? How is it possible to retrieve the numerical value, if at all?

Many thanks in advance for your help !
I get an invalid URL when I attempt to request the page.
Please advise of the correct URL.
Hello;

I can confirm this URL: https://www.astrogeo.va.it/idro/idro.php

Strangely, it does not work when I try to open it with Safari, but it opens in Firefox without any problem.
I think you're going to need Selenium. Here's some starter code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time


class WaterLevel:
    def __init__(self):
        self.analyze_page()
    
    def start_browser(self):
        # Marionette (geckodriver) is the default for current Firefox and Selenium,
        # so no explicit capabilities need to be passed
        self.browser = webdriver.Firefox()
        
    def stop_browser(self):
        # quit() ends the whole driver session, not just the current window
        self.browser.quit()

    def analyze_page(self):
        self.start_browser()
        url =  'https://www.astrogeo.va.it/idro/idro.php'
        self.browser.get(url)
        time.sleep(2)
        element = self.browser.find_element(By.XPATH, '/html/body/div[1]/div[4]/div[1]/div[2]/div/div/div/table[1]/tbody/tr[2]/td[1]/div/i')
        print(element.text)

        self.stop_browser()

if __name__ == '__main__':
    WaterLevel()
which produces the following output:
Output:
(12-11-2019, ore 11.30)
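Note that the text scraped here is the time of the reading, not the level itself. If you want it as a real date, here is a minimal sketch for parsing it. It assumes the page always uses the "(dd-mm-yyyy, ore hh.mm)" layout seen in the output above, and the function name parse_page_timestamp is made up for illustration.

```python
from datetime import datetime


def parse_page_timestamp(text):
    """Parse a string such as '(12-11-2019, ore 11.30)' into a datetime.

    Assumes day-month-year order and a dot between hours and minutes,
    as in the single sample output shown in this thread.
    """
    return datetime.strptime(text.strip(), "(%d-%m-%Y, ore %H.%M)")


# Sample string copied from the Selenium output above
print(parse_page_timestamp("(12-11-2019, ore 11.30)"))
```

If the page ever changes this layout, strptime raises ValueError, which is probably what you want so the script does not silently report a wrong date.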
Looking closer at the page, it gets its values back as JSON data.
So you can use this JSON data directly and drop Selenium in this case.
Example:
import requests
from datetime import datetime

url = 'https://www.astrogeo.va.it/data/idro/maggiore_inst.json'
response = requests.get(url)
livello = response.json()
livello_val = livello['legb']['livello_last']
livello_last = livello['legb']['livello_last_time']
livello_last = datetime.fromtimestamp(livello_last)
print(f'<{livello_val}> at date {livello_last}')
Output:
<194.13> at date 2019-11-12 18:10:00
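For the notification part of the original question, a rough sketch of a threshold check could look like the following. It relies only on the two JSON keys used above. The threshold of 195.0, the names is_level_high and check_lake, and the print used in place of a real notification are all assumptions for illustration. It also swaps requests for the standard library's urllib.request so it can run from a scheduled job without extra packages.

```python
import json
from datetime import datetime
from urllib.request import urlopen

URL = 'https://www.astrogeo.va.it/data/idro/maggiore_inst.json'
THRESHOLD = 195.0  # hypothetical alert level, choose your own


def is_level_high(data, threshold=THRESHOLD):
    """Return True when the latest Leggiuno level is at or above threshold."""
    return float(data['legb']['livello_last']) >= threshold


def check_lake(url=URL, threshold=THRESHOLD):
    """Download the JSON and print a warning if the level is high."""
    with urlopen(url, timeout=10) as response:
        data = json.load(response)
    high = is_level_high(data, threshold)
    if high:
        when = datetime.fromtimestamp(data['legb']['livello_last_time'])
        # Placeholder: replace print with email, SMS or a push notification
        print(f"Warning: level {data['legb']['livello_last']} at {when}")
    return high


# Offline check with the level value from the output above (no network access)
sample = {'legb': {'livello_last': 194.13}}
print(is_level_high(sample))          # False, 194.13 is below 195.0
print(is_level_high(sample, 194.0))   # True
```

Calling check_lake() from cron or a similar scheduler every few minutes would then be enough to get a warning; how the warning is delivered is left open here.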
The page loads for me without any problems.

Did you manage to solve the problem?