Scraping problems. Pls help with a correct request query. - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: Scraping problems. Pls help with a correct request query. (/thread-21422.html)
Scraping problems. Pls help with a correct request query. - gtlhbkkj - Sep-29-2019

Please help me formulate the correct request query. Thank you.

Here is the website: http://www.eatonpowersource.com/cross-reference/

One needs to enter a search parameter in the form:

[Image: pic1.png]

This is how the website looks:

[Image: pic2.png]

In Mozilla, the network analysis of the search form shows two request variants, POST and GET.

Method: POST
Link: http://www.eatonpowersource.com/cross-reference/results/
Parameters:
Criteria.SiteSearchTerm
Criteria.CurrentPageNumber=1
Criteria.FilterOptions.SortBy=CompetitorPartNumber
Criteria.FilterOptions.SortOrder=Asc
Criteria.CompetitorPartNumber=0330D0
Criteria.FilterOptions.PageSize=25

Method: GET
Link: http://www.eatonpowersource.com/cross-reference/json/criteriaresults/?Criteria.SiteSearchTerm=&Criteria.CurrentPageNumber=1&Criteria.FilterOptions.SortBy=CompetitorPartNumber&Criteria.FilterOptions.SortOrder=Asc&Criteria.CompetitorPartNumber=0330D0&Criteria.FilterOptions.PageSize=25&_=1569680055925
Parameters:
Criteria.SiteSearchTerm=
Criteria.CurrentPageNumber=1
Criteria.FilterOptions.SortBy=CompetitorPartNumber
Criteria.FilterOptions.SortOrder=Asc
Criteria.CompetitorPartNumber=0330D0
Criteria.FilterOptions.PageSize=25
_=1569680055925

If I send the request with the GET method and parameters, I get the following:

[Image: pic3.png]

This is not what is needed. If I send the request with the POST method and parameters, I get either 404 (page not found) or error 500.
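For reference, the captured GET link is simply the JSON endpoint plus a URL-encoded parameter string, so it can be rebuilt from a plain dict with the standard library (all endpoint and parameter values below are taken verbatim from the request captured above; the `_` parameter is likely a cache-busting timestamp added by the page's JavaScript):

```python
from urllib.parse import urlencode

# Parameters exactly as captured in the browser's network analysis
params = {
    "Criteria.SiteSearchTerm": "",
    "Criteria.CurrentPageNumber": "1",
    "Criteria.FilterOptions.SortBy": "CompetitorPartNumber",
    "Criteria.FilterOptions.SortOrder": "Asc",
    "Criteria.CompetitorPartNumber": "0330D0",
    "Criteria.FilterOptions.PageSize": "25",
    "_": "1569680055925",  # likely a cache-busting timestamp
}

base = "http://www.eatonpowersource.com/cross-reference/json/criteriaresults/"
url = base + "?" + urlencode(params)
print(url)
```

Rebuilding the query string this way makes it easy to swap in a different `Criteria.CompetitorPartNumber` or page number without hand-editing the long URL.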
import requests
from bs4 import BeautifulSoup


# function for the web query, recording the response into a file
def fg_list_bot(_name_element, _output_file):
    _data = {
        "Criteria.SiteSearchTerm": "",
        "Criteria.CurrentPageNumber": "1",
        "Criteria.FilterOptions.SortBy": "CompetitorPartNumber",
        "Criteria.FilterOptions.SortOrder": "Asc",
        "Criteria.CompetitorPartNumber": _name_element,
        "Criteria.FilterOptions.PageSize": "25",
        "_": "1569680055925",
    }
    # note: the original passed the literal string "_Url" here instead of the variable
    r = requests.post(_url, data=_data)
    with open(_output_file, "w") as f:
        f.write(r.text)
    print(r.status_code)


_url = "http://www.eatonpowersource.com/cross-reference/results/"
_name_element = "0330D0"  # text of the search request
_output_file = "Eaton_Vickers.html"
fg_list_bot(_name_element, _output_file)

RE: Scraping problems. Pls help with a correct request query. - perfringo - Sep-30-2019

I am not a web scraper myself, but isn't it an option to go directly to the required page URL: http://www.eatonpowersource.com/cross-reference/#/p:1/sb:0330D0/o:Asc/ps:25

RE: Scraping problems. Pls help with a correct request query. - Larz60+ - Sep-30-2019

You should use selenium for this.
You will also have to install the proper driver (chromedriver or geckodriver); I use Firefox, so geckodriver. The following code is almost correct, you can finish it:

import os
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
from pathlib import Path

import PrettifyPage


class FetchEaton:
    def __init__(self):
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        url = 'http://www.eatonpowersource.com/cross-reference/'
        self.pp = PrettifyPage.PrettifyPage()
        self.get_xref_data(url)

    def get_xref_data(self, url):
        browser = self.start_browser()
        browser.get(url)
        time.sleep(1)
        # fill in the search box and click the search button
        searchbox = browser.find_element(By.XPATH, '//*[@id="Criteria_CompetitorPartNumber"]')
        searchbox.clear()
        searchbox.send_keys('0330D0')
        browser.find_element(By.XPATH, '/html/body/div[5]/section/div/section/div[1]/div[1]/span/aside/form/div[1]/div/span/button/i').click()
        time.sleep(2)
        # walk the results table row by row, cell by cell
        table = browser.find_element(By.XPATH, '/html/body/div[5]/section/div/section/div[1]/div[2]/div[3]/div/div[1]/div/div/table')
        all_rows = table.find_elements(By.TAG_NAME, 'tr')
        for row in all_rows:
            cells = row.find_elements(By.TAG_NAME, 'td')
            for cell in cells:
                print(cell.text)
        self.stop_browser(browser)

    def start_browser(self):
        caps = webdriver.DesiredCapabilities.FIREFOX
        caps["marionette"] = True
        return webdriver.Firefox(capabilities=caps)

    def stop_browser(self, browser):
        browser.close()


if __name__ == '__main__':
    FetchEaton()
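Once the rendered page has been captured (for example via Selenium's `page_source`), the table cells can also be extracted without XPath. A hedged sketch using only the stdlib `html.parser`, run here against stand-in HTML whose structure mimics the results grid (the real page's markup, and the cross-reference values shown, are placeholders, not actual site data):

```python
from html.parser import HTMLParser


class TableTextExtractor(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # list of rows, each a list of cell strings
        self._in_td = False  # are we currently inside a <td>?

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and data.strip():
            self.rows[-1].append(data.strip())


# Stand-in HTML; on the live site this would be browser.page_source
sample = """
<table>
  <tr><td>0330D0</td><td>Eaton/Vickers</td><td>cross-ref-1</td></tr>
  <tr><td>0330D0</td><td>Eaton/Vickers</td><td>cross-ref-2</td></tr>
</table>
"""
parser = TableTextExtractor()
parser.feed(sample)
for row in parser.rows:
    print(row)
```

This keeps the post-processing independent of the brittle absolute XPaths above: if the page layout shifts, only the element that locates the table needs updating, not the cell-walking logic.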
RE: Scraping problems. Pls help with a correct request query. - gtlhbkkj - Sep-30-2019

Hi Larz60+, thank you very much! You are the best! Thanks to your response I now know that there is Selenium for Python, and I have already started investigating it.

I had simultaneously posted this question to another forum, and the response there was the following: "use this link with the GET method" www.eatonpowersource.com/cross-reference/?sitesearchterm=0330D0

This is a very easy and short solution. It completely answers my question; it is exactly what I need. But my problem now is that I do not understand how he arrived at that link: I examined the HTML text of the website and did not find the word "sitesearchterm" at all. No explanation from him followed. May I kindly ask you to explain where this string comes from? Thank you.

RE: Scraping problems. Pls help with a correct request query. - gtlhbkkj - Oct-01-2019

(Sep-30-2019, 08:14 AM)Larz60+ Wrote: You should use selenium for this....

Hi Larz60+, thank you again for the sample code. Today I installed Selenium and tried your code. Everything works fine, even better than expected. Please explain what the following lines in your code mean:

import PrettifyPage
self.pp = PrettifyPage.PrettifyPage()

Google knows nothing about PrettifyPage; the only match it found was this forum. What is it? How should I install it, or what can I replace it with? Python returns an error. Thank you.

RE: Scraping problems. Pls help with a correct request query. - Larz60+ - Oct-01-2019

Sorry, I forgot to post it. This is a module that lets you use a variable indent with HTML, and adds line feeds where appropriate.
It makes HTML much easier to read. Here's the script:

from bs4 import BeautifulSoup
import requests
import pathlib


class PrettifyPage:
    def __init__(self):
        pass

    def prettify(self, soup, indent):
        pretty_soup = str()
        previous_indent = 0
        for line in soup.prettify().split("\n"):
            current_indent = str(line).find("<")
            if current_indent == -1 or current_indent > previous_indent + 2:
                current_indent = previous_indent + 1
            previous_indent = current_indent
            pretty_soup += self.write_new_line(line, current_indent, indent)
        return pretty_soup

    def write_new_line(self, line, current_indent, desired_indent):
        new_line = ""
        spaces_to_add = (current_indent * desired_indent) - current_indent
        if spaces_to_add > 0:
            for i in range(spaces_to_add):
                new_line += " "
        new_line += str(line) + "\n"
        return new_line


if __name__ == '__main__':
    pp = PrettifyPage()

RE: Scraping problems. Pls help with a correct request query. - gtlhbkkj - Oct-01-2019

(Oct-01-2019, 07:17 PM)Larz60+ Wrote: Sorry, I forgot to post...

Thank you.
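On the earlier open question of where `sitesearchterm` came from: parameter names like this are usually not visible in a page's static HTML; they are typically discovered by watching the requests the page's JavaScript issues in the browser's network tab, which is exactly the kind of analysis shown in the first post. Once known, the suggested link decomposes like any other query string; a small stdlib sketch (the scheme prefix is added here since the forum reply omitted it):

```python
from urllib.parse import urlsplit, parse_qs

url = "http://www.eatonpowersource.com/cross-reference/?sitesearchterm=0330D0"
parts = urlsplit(url)            # splits into scheme, netloc, path, query, fragment
query = parse_qs(parts.query)    # query parameters as a dict of lists

print(parts.path)   # /cross-reference/
print(query)        # {'sitesearchterm': ['0330D0']}
```

Swapping a different part number into `query` and re-encoding it with `urllib.parse.urlencode` gives the same kind of one-line GET solution for any search term.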