Python Forum
Scraping problems. Pls help with a correct request query.
#1
Please help me formulate the correct request query. Thank you.
Here is the website:
http://www.eatonpowersource.com/cross-reference/
One needs to enter a search parameter into the form:
[Image: pic1.png]
This is what the website looks like:
[Image: pic2.png]
In Firefox's network analysis there are two request variants: POST and GET.

Method: POST
Link: http://www.eatonpowersource.com/cross-re...e/results/
Parameters
Criteria.SiteSearchTerm=
Criteria.CurrentPageNumber=1
Criteria.FilterOptions.SortBy=CompetitorPartNumber
Criteria.FilterOptions.SortOrder=Asc
Criteria.CompetitorPartNumber=0330D0
Criteria.FilterOptions.PageSize=25

and
Method: GET
Link:
http://www.eatonpowersource.com/cross-re...9680055925

Parameters
Criteria.SiteSearchTerm=
Criteria.CurrentPageNumber=1
Criteria.FilterOptions.SortBy=CompetitorPartNumber
Criteria.FilterOptions.SortOrder=Asc
Criteria.CompetitorPartNumber=0330D0
Criteria.FilterOptions.PageSize=25
_=1569680055925

If I send the request with the GET method and parameters, I get the following:
[Image: pic3.png]
This is not what is needed.
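Here is a sketch of how I send the GET variant with requests (the results URL is assumed from the POST variant, since the captured GET link above is truncated):

import requests

# sketch of the GET attempt, with the parameters captured in Firefox
params = {"Criteria.SiteSearchTerm": "",
          "Criteria.CurrentPageNumber": "1",
          "Criteria.FilterOptions.SortBy": "CompetitorPartNumber",
          "Criteria.FilterOptions.SortOrder": "Asc",
          "Criteria.CompetitorPartNumber": "0330D0",
          "Criteria.FilterOptions.PageSize": "25",
          "_": "1569680055925"}  # "_" is a cache-busting timestamp added by the browser
r = requests.get("http://www.eatonpowersource.com/cross-reference/results/", params=params)
print(r.status_code)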

If I send the request with the POST method and parameters, I get
either 404 (page not found)
or error 500.
import requests


# function for the web request; writes the response into a file
def fg_list_bot(_url, _name_element, _output_file):
    _data = {"Criteria.SiteSearchTerm": "",
             "Criteria.CurrentPageNumber": "1",
             "Criteria.FilterOptions.SortBy": "CompetitorPartNumber",
             "Criteria.FilterOptions.SortOrder": "Asc",
             "Criteria.CompetitorPartNumber": _name_element,
             "Criteria.FilterOptions.PageSize": "25",
             "_": "1569680055925"}  # cache-buster from the captured request

    r = requests.post(_url, data=_data)
    with open(_output_file, "w") as f:
        f.write(r.text)
    print(r.status_code)


_url = "http://www.eatonpowersource.com/cross-reference/results/"
_name_element = "0330D0"      # text of the search request
_output_file = "Eaton_Vickers.html"
fg_list_bot(_url, _name_element, _output_file)
Reply
#2
I am not a web scraper myself, but isn't it an option to go directly to the required page URL:

http://www.eatonpowersource.com/cross-re...:Asc/ps:25
Reply
#3
You should use Selenium for this.
You will also have to install the proper driver (chromedriver or geckodriver).
(I use Firefox, so geckodriver.)

The following code is almost correct; you can finish it:
from selenium import webdriver
from selenium.webdriver.common.by import By
import PrettifyPage
import os
import time


class FetchEaton:
    def __init__(self):
        # run from the script's own directory
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        url = 'http://www.eatonpowersource.com/cross-reference/'
        self.pp = PrettifyPage.PrettifyPage()
        self.get_xref_data(url)

    def get_xref_data(self, url):
        browser = self.start_browser()
        browser.get(url)
        time.sleep(1)
        # fill in the part-number search box and submit the form
        searchbox = browser.find_element(By.XPATH, '//*[@id="Criteria_CompetitorPartNumber"]')
        searchbox.clear()
        searchbox.send_keys('0330D0')
        browser.find_element(By.XPATH, '/html/body/div[5]/section/div/section/div[1]/div[1]/span/aside/form/div[1]/div/span/button/i').click()
        time.sleep(2)
        # walk the results table row by row and print each cell
        table = browser.find_element(By.XPATH, '/html/body/div[5]/section/div/section/div[1]/div[2]/div[3]/div/div[1]/div/div/table')
        all_rows = table.find_elements(By.TAG_NAME, 'tr')
        for row in all_rows:
            for cell in row.find_elements(By.TAG_NAME, 'td'):
                print(cell.text)
        self.stop_browser(browser)

    def start_browser(self):
        # Selenium 3-style capabilities; newer versions use Options instead
        caps = webdriver.DesiredCapabilities().FIREFOX
        caps["marionette"] = True
        return webdriver.Firefox(capabilities=caps)

    def stop_browser(self, browser):
        browser.close()


if __name__ == '__main__':
    FetchEaton()
Output:
Filtration Hydac 0330D003BHHC Filtration V0334B2H03
Filtration Hydac 0330D003BNHC Filtration V0332B2C03
Filtration Hydac 0330D005BHHC Filtration V0334B2H10
Filtration Hydac 0330D005BNHC Filtration V0332B2C05
Filtration Hydac 0330D010BHHC Filtration V0334B2H10
Filtration Hydac 0330D010BNHC Filtration V0332B2C10
Reply
#4
Hi Larz60+,
thank you very much! You are the best!
Thanks to your response I now know that there is Selenium for Python, and I have already started to investigate what it is.

I simultaneously posted this question on another forum, and the response was the following:
"use this link with GET method"
www.eatonpowersource.com/cross-reference/?sitesearchterm=0330D0
This is a very easy and short solution. It completely answers my question; it is exactly what I need.
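In requests form, that suggestion looks roughly like this (my sketch of the suggested GET call, writing the page to the same output file as before):

import requests

# sketch: fetch the cross-reference page directly, passing the search
# term as the sitesearchterm query parameter the other forum suggested
url = "http://www.eatonpowersource.com/cross-reference/"
r = requests.get(url, params={"sitesearchterm": "0330D0"})
print(r.status_code)
with open("Eaton_Vickers.html", "w") as f:
    f.write(r.text)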
But my problem now is that I do not understand how he came up with the link above.

I have investigated the HTML text of the website and have not found the word "sitesearchterm" at all. No explanation from him followed.
May I kindly ask you to explain where this string comes from?
Thank you
Reply
#5
(Sep-30-2019, 08:14 AM)Larz60+ Wrote: You should use Selenium for this....
Hi Larz60+,
thank you again for the sample code.
Today I have installed selenium and tried your code.
Everything works fine, even better than expected.
Please explain to me what the following in your code means:

import PrettifyPage
self.pp = PrettifyPage.PrettifyPage()
Google knows nothing about PrettifyPage; the single match it found was this forum.
What is it? How shall I install it, or what can I replace it with?
Python returns an error.
Thank you
Reply
#6
Sorry, I forgot to post it.
It is a module that lets you use a variable indent with HTML and adds line feeds where appropriate.
It makes HTML much easier to read.
Here's the script:
class PrettifyPage:
    def __init__(self):
        pass

    def prettify(self, soup, indent):
        # re-indent soup.prettify() output using the desired indent width
        pretty_soup = str()
        previous_indent = 0
        for line in soup.prettify().split("\n"):
            current_indent = str(line).find("<")
            if current_indent == -1 or current_indent > previous_indent + 2:
                current_indent = previous_indent + 1
            previous_indent = current_indent
            pretty_soup += self.write_new_line(line, current_indent, indent)
        return pretty_soup

    def write_new_line(self, line, current_indent, desired_indent):
        # pad the line out to current_indent * desired_indent spaces
        new_line = ""
        spaces_to_add = (current_indent * desired_indent) - current_indent
        if spaces_to_add > 0:
            new_line += " " * spaces_to_add
        new_line += str(line) + "\n"
        return new_line


if __name__ == '__main__':
    pp = PrettifyPage()
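Usage is simple; for example (a quick sketch, assuming the script above is saved as PrettifyPage.py, prettifying an inline snippet with an indent of 2):

from bs4 import BeautifulSoup
import PrettifyPage

# build a soup from a small inline snippet and re-indent it
pp = PrettifyPage.PrettifyPage()
soup = BeautifulSoup("<html><body><p>hello</p></body></html>", "html.parser")
print(pp.prettify(soup, indent=2))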
Reply
#7
(Oct-01-2019, 07:17 PM)Larz60+ Wrote: Sorry, I forgot to post it.
Thank you
Reply


