Python Forum

Full Version: Get html body of URL
Hi,
I have the following issue: I would like to get the HTML body of a web page. I am a beginner with Python, so to be clear, I need the same output as I get in the Google Chrome console with the jQuery command: $("body").html()

I tried:
import requests
url = "myurl"
r = requests.get(url)
r.text
but it gave me something different. Thanks for your help!
You did use an actual URL, not "myurl", correct?
You should use meaningful names, not single letters.
You should also check the status code to make sure you actually downloaded the page:
import requests


url = "https://google.com"
response = requests.get(url)
if response.status_code == 200:
    print(response.text)
else:
    print(f"Could not find url: {url}")
Your original code will print nothing (unless you run it in interactive mode, i.e. line by line), because r.text on its own is never printed.
In any case, the source you get may be different if the page uses JavaScript to populate its content.
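For a page whose content is already in the downloaded HTML, a rough equivalent of $("body").html() can be sketched with BeautifulSoup. This is only a minimal sketch: the helper name body_inner_html is made up for illustration, and it assumes the beautifulsoup4 package is installed. It sees the static source only, so anything the page adds later with JavaScript will be missing.

```python
from bs4 import BeautifulSoup


def body_inner_html(html):
    """Return the markup inside <body>, or None if there is no <body> tag."""
    # "html.parser" is Python's built-in parser, so no extra parser install is needed
    soup = BeautifulSoup(html, "html.parser")
    if soup.body is None:
        return None
    # decode_contents() renders the tag's children without the tag itself
    return soup.body.decode_contents()


sample = "<html><head><title>t</title></head><body><p>Hello</p></body></html>"
print(body_inner_html(sample))  # prints <p>Hello</p>
```

With the code from the first post, you would pass r.text to this helper instead of the sample string.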
Hi both, thanks for your replies!

@Larz60+ I checked it, and the status code is really 200.

import requests
url = "https://www.sreality.cz/hledani/pronajem/byty/praha?velikost=1%2B1,2%2Bkk&stavba=cihlova&patro-od=2&patro-do=100&razeni=nejlevnejsi"

r = requests.get(url)

r.status_code   #yes, status code == 200
r.text
@buran - I am not sure what you mean. How can I get the HTML body if the page uses JS?
You need to check the status code in the code itself; why have a computer otherwise!
Include an if/else statement as I showed in post 2.
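As an alternative to a hand-written if/else, Requests can do the check itself: raise_for_status() raises requests.HTTPError for 4xx and 5xx responses. A minimal sketch, where the function name fetch_html and the 10-second timeout are just choices for this example:

```python
import requests


def fetch_html(url, timeout=10):
    """Download url and return its text, raising on HTTP error statuses."""
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError if the status code is 4xx or 5xx
    response.raise_for_status()
    return response.text


# Usage (needs network access):
# try:
#     print(fetch_html("https://google.com"))
# except requests.HTTPError as error:
#     print(f"Download failed: {error}")
```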
(Aug-03-2020, 09:00 AM) rama27 wrote: @buran - I am not sure what you mean. How can I get the HTML body if the page uses JS?
The page does use JavaScript.
One way is to use a tool like Selenium.
The other option is to examine the requests the page makes. For example, there is this link:
https://www.sreality.cz/api/cs/v2/estate...6450616863
It returns JSON for the first 20 properties, but you need to research further what information it contains and how to retrieve the next batch.
For example, the next page is at this address:
https://www.sreality.cz/api/cs/v2/estate...6451160636
The idea in this case is to replicate the requests made by the page and parse the JSON you get.
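A first step for that research could look like the sketch below. The helper names fetch_json and describe_payload are made up for illustration, and the sketch assumes the endpoint returns a JSON object (a dict at the top level). The links above are truncated, so the full request URL has to be copied from the browser's Network tab; any query parameters passed in params are likewise whatever the page itself sends.

```python
import requests


def fetch_json(url, params=None, timeout=10):
    """GET a JSON endpoint and return the decoded payload."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()
    return response.json()


def describe_payload(payload):
    """Map each top-level key of a JSON object to the type name of its value."""
    return {key: type(value).__name__ for key, value in payload.items()}


# Usage (needs network access and the full URL from the Network tab):
# payload = fetch_json(api_url)
# print(describe_payload(payload))
```

Printing the top-level keys and their types is a quick way to see where the list of properties lives before writing any real parsing code.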
Here is an example with Selenium. It's not the easiest page to start with if you are new to this.
If you can find the info in the JSON response, as @buran shows,
then that is a fine and fast way, as it only requires Requests: make a get call and read the response with .json().
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

#--| Setup
options = Options()
#options.add_argument("--headless")
#options.add_argument("--window-size=1980,1020")
# Selenium 4 passes the driver path through Service (Selenium 3 used executable_path=)
service = Service(r'C:\cmder\bin\chromedriver.exe')
browser = webdriver.Chrome(service=service, options=options)
#--| Parse or automation
url = "https://www.sreality.cz/hledani/pronajem/byty/praha?velikost=1%2B1"
browser.get(url)
# Give the page's JavaScript time to render the listings
time.sleep(3)
# Use BeautifulSoup
soup = BeautifulSoup(browser.page_source, 'lxml')
title = soup.find('h1', class_="page-title list-title ng-binding")
print(title.text)
print('-' * 40)
# Use Selenium
# Selenium 4 uses find_elements(By.XPATH, ...) (Selenium 3 had find_elements_by_xpath)
info = browser.find_elements(By.XPATH, "//div[@class='dir-property-list']//div[1]//div[1]//div[1]")
print(info[0].text)
Output:
Byty 1+1 k pronájmu Praha
----------------------------------------
Pronájem bytu 1+kk 35 m² Praha 5 - Smíchov 14 000 Kč za měsíc
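To get the closest match to what $("body").html() prints in the Chrome console, you can also ask the browser itself for the rendered body. A rough sketch under a few assumptions: rendered_body_html is a made-up helper name, browser is an already-started Selenium WebDriver like the one created above, and the fixed pause mirrors the time.sleep(3) used there (an explicit wait with Selenium's WebDriverWait would be more robust).

```python
import time


def rendered_body_html(browser, url, settle_seconds=3.0):
    """Load url in an already-started WebDriver and return the rendered <body> markup."""
    browser.get(url)
    # Fixed pause so the page's JavaScript can populate the DOM
    time.sleep(settle_seconds)
    # Runs in the page, like typing document.body.innerHTML in the console
    return browser.execute_script("return document.body.innerHTML")


# Usage with the browser and url from the example above:
# html = rendered_body_html(browser, url)
# print(html[:500])
```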