Get html body of URL
Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html), Thread: /thread-28765.html

Get html body of URL - rama27 - Aug-02-2020

Hi, I have the following issue. I would like to get the HTML body of a webpage. I am a beginner with Python, so to be clear: I need the same output as I get in the Google Chrome console using the jQuery command:

    $("body").html()

I tried:

    import requests

    url = "myurl"
    r = requests.get(url)
    r.text

but it gave me something different. Thanks for the help!

RE: Get html body of URL - Larz60+ - Aug-02-2020

You did use an actual URL, not "myurl", correct? You should use meaningful names, not single letters. You should also check the status code to make sure you actually downloaded the page:

    import requests

    url = "https://google.com"
    response = requests.get(url)
    # Only print the page source if the request succeeded
    if response.status_code == 200:
        print(response.text)
    else:
        print(f"Could not find url: {url}")

RE: Get html body of URL - buran - Aug-03-2020

This code will give you nothing (unless you run it in interactive mode, i.e. line by line). In any case, the source you get may be different if the page uses JavaScript to populate its content.

RE: Get html body of URL - rama27 - Aug-03-2020

Hi both, thanks for your replies!

@Larz60+ I checked it, and the status code really is 200.

    import requests

    url = "https://www.sreality.cz/hledani/pronajem/byty/praha?velikost=1%2B1,2%2Bkk&stavba=cihlova&patro-od=2&patro-do=100&razeni=nejlevnejsi"
    r = requests.get(url)
    r.status_code  # yes, status code == 200
    r.text

@buran - I am not sure what you mean. How can I get the HTML body if the page uses JS?

RE: Get html body of URL - Larz60+ - Aug-03-2020

You need to check the status code in the code itself. Why have a computer otherwise! Include an if/else statement as I showed in post 2.

RE: Get html body of URL - buran - Aug-03-2020

(Aug-03-2020, 09:00 AM) rama27 Wrote: @buran - I am not sure, how do you mean it. How can I get the HTML body, if the page uses js?

It does use JavaScript. One way is to use a tool like Selenium. The other option is to examine the requests being made by the page. For example, there is one link:

    https://www.sreality.cz/api/cs/v2/estates?building_type_search=2&category_main_cb=1&category_sub_cb=3%7C4&category_type_cb=2&floor_number=2%7C100&locality_region_id=10&per_page=20&sort=1&tms=1596450616863

It returns JSON for the first 20 properties, but you need to research more carefully what information it contains and how to retrieve the next batch. For example, the page number appears in this address:

    https://www.sreality.cz/api/cs/v2/estates?category_main_cb=1&category_sub_cb=3&category_type_cb=2&locality_region_id=10&page=2&per_page=20&tms=1596451160636

The idea in this case is to replicate the requests made by the page and parse the JSON you get.

RE: Get html body of URL - snippsat - Aug-03-2020

Here is an example with Selenium. It's not the easiest page to start with if you are new to this. If you can find the info in the JSON returned, as @buran shows, then that is a fine and fast way, as it only requires Requests with a get call and calling .json() on the response.

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.keys import Keys
    from bs4 import BeautifulSoup
    import time

    #--| Setup
    options = Options()
    #options.add_argument("--headless")
    #options.add_argument("--window-size=1980,1020")
    # Selenium 3 style, as posted; Selenium 4 passes the driver path via Service(...)
    browser = webdriver.Chrome(executable_path=r'C:\cmder\bin\chromedriver.exe', options=options)

    #--| Parse or automation
    url = "https://www.sreality.cz/hledani/pronajem/byty/praha?velikost=1%2B1"
    browser.get(url)
    time.sleep(3)  # give the page's JavaScript time to render

    # Use BeautifulSoup
    soup = BeautifulSoup(browser.page_source, 'lxml')
    title = soup.find('h1', class_="page-title list-title ng-binding")
    print(title.text)
    print('-' * 40)

    # Use Selenium
    # Selenium 3 style, as posted; Selenium 4 uses find_elements(By.XPATH, ...)
    info = browser.find_elements_by_xpath("//div[@class='dir-property-list']//div[1]//div[1]//div[1]")
    print(info[0].text)
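
The approach buran describes (replicating the page's own API requests and asking for later batches through the page parameter) might be sketched as follows. This is only a minimal sketch under stated assumptions, not a confirmed client: the endpoint and query parameters are copied from the second URL quoted in the thread, the tms value is left out on the assumption that it is an optional timestamp, and the helper names build_estates_url and fetch_page are made up for illustration. The thread does not show the structure of the returned JSON, so the sketch stops at returning the decoded response.

```python
from urllib.parse import urlencode

# Endpoint as quoted in the thread
API_URL = "https://www.sreality.cz/api/cs/v2/estates"


def build_estates_url(page=1, per_page=20):
    """Return the API URL for one page of results (hypothetical helper)."""
    # Query parameters copied from the URL in the thread; their exact
    # meaning is not documented there, and "tms" is omitted on the
    # assumption that it is an optional timestamp.
    params = {
        "category_main_cb": 1,
        "category_sub_cb": 3,
        "category_type_cb": 2,
        "locality_region_id": 10,
        "page": page,
        "per_page": per_page,
    }
    return f"{API_URL}?{urlencode(params)}"


def fetch_page(page=1, per_page=20):
    """Download one page and return the decoded JSON (needs network access)."""
    # Imported here so the URL helper above works without Requests installed.
    import requests

    response = requests.get(build_estates_url(page, per_page), timeout=10)
    response.raise_for_status()  # raise an error instead of parsing a failed reply
    return response.json()
```

For example, fetch_page(2) would request the same batch as the second link above, provided the API still accepts these parameters.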