Scraping problems. Pls help with a correct request query. - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: Scraping problems. Pls help with a correct request query. (/thread-21422.html)
Scraping problems. Pls help with a correct request query. - gtlhbkkj - Sep-29-2019

Please help me formulate the correct request query. Thank you.

Here is the website: http://www.eatonpowersource.com/cross-reference/

One needs to enter a search parameter in the form:

[Image: pic1.png]

This is how the website looks:

[Image: pic2.png]

In Mozilla, the network analysis of the search form shows two request variants, POST and GET.

Method: POST
Link: http://www.eatonpowersource.com/cross-reference/results/
Parameters:
Criteria.SiteSearchTerm
Criteria.CurrentPageNumber=1
Criteria.FilterOptions.SortBy=CompetitorPartNumber
Criteria.FilterOptions.SortOrder=Asc
Criteria.CompetitorPartNumber=0330D0
Criteria.FilterOptions.PageSize=25

Method: GET
Link: http://www.eatonpowersource.com/cross-reference/json/criteriaresults/?Criteria.SiteSearchTerm=&Criteria.CurrentPageNumber=1&Criteria.FilterOptions.SortBy=CompetitorPartNumber&Criteria.FilterOptions.SortOrder=Asc&Criteria.CompetitorPartNumber=0330D0&Criteria.FilterOptions.PageSize=25&_=1569680055925
Parameters:
Criteria.SiteSearchTerm=
Criteria.CurrentPageNumber=1
Criteria.FilterOptions.SortBy=CompetitorPartNumber
Criteria.FilterOptions.SortOrder=Asc
Criteria.CompetitorPartNumber=0330D0
Criteria.FilterOptions.PageSize=25
_=1569680055925

If I send the request with the GET method and parameters, I get the following:

[Image: pic3.png]

This is not what is needed. If I send the request with the POST method and parameters, I get either 404 (page not found) or error 500.
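For reference, the captured GET link is simply the JSON endpoint plus a URL-encoded parameter string, so it can be rebuilt from a plain dict with the standard library (all endpoint and parameter values below are taken verbatim from the request captured above; the `_` parameter is likely a cache-busting timestamp added by the page's JavaScript):

```python
from urllib.parse import urlencode

# Parameters exactly as captured in the browser's network analysis
params = {
    "Criteria.SiteSearchTerm": "",
    "Criteria.CurrentPageNumber": "1",
    "Criteria.FilterOptions.SortBy": "CompetitorPartNumber",
    "Criteria.FilterOptions.SortOrder": "Asc",
    "Criteria.CompetitorPartNumber": "0330D0",
    "Criteria.FilterOptions.PageSize": "25",
    "_": "1569680055925",  # likely a cache-busting timestamp
}

base = "http://www.eatonpowersource.com/cross-reference/json/criteriaresults/"
url = base + "?" + urlencode(params)
print(url)
```

Rebuilding the query string this way makes it easy to swap in a different `Criteria.CompetitorPartNumber` or page number without hand-editing the long URL.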
import requests
from bs4 import BeautifulSoup


# function for the web query, recording the response into a file
def fg_list_bot(_name_element, _output_file):
    _data = {
        "Criteria.SiteSearchTerm": "",
        "Criteria.CurrentPageNumber": "1",
        "Criteria.FilterOptions.SortBy": "CompetitorPartNumber",
        "Criteria.FilterOptions.SortOrder": "Asc",
        "Criteria.CompetitorPartNumber": _name_element,
        "Criteria.FilterOptions.PageSize": "25",
        "_": "1569680055925",
    }
    # note: the original passed the literal string "_Url" here instead of the variable
    r = requests.post(_url, data=_data)
    with open(_output_file, "w") as f:
        f.write(r.text)
    print(r.status_code)


_url = "http://www.eatonpowersource.com/cross-reference/results/"
_name_element = "0330D0"  # text of the search request
_output_file = "Eaton_Vickers.html"
fg_list_bot(_name_element, _output_file)

RE: Scraping problems. Pls help with a correct request query. - perfringo - Sep-30-2019

I am not a web scraper myself, but isn't it an option to go directly to the required page URL: http://www.eatonpowersource.com/cross-reference/#/p:1/sb:0330D0/o:Asc/ps:25

RE: Scraping problems. Pls help with a correct request query. - Larz60+ - Sep-30-2019

You should use selenium for this.
You will also have to install the proper driver (chromedriver or geckodriver); I use Firefox, so geckodriver. The following code is almost correct, you can finish it:

import os
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
from pathlib import Path

import PrettifyPage


class FetchEaton:
    def __init__(self):
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        url = 'http://www.eatonpowersource.com/cross-reference/'
        self.pp = PrettifyPage.PrettifyPage()
        self.get_xref_data(url)

    def get_xref_data(self, url):
        browser = self.start_browser()
        browser.get(url)
        time.sleep(1)
        # fill in the search box and click the search button
        searchbox = browser.find_element(By.XPATH, '//*[@id="Criteria_CompetitorPartNumber"]')
        searchbox.clear()
        searchbox.send_keys('0330D0')
        browser.find_element(By.XPATH, '/html/body/div[5]/section/div/section/div[1]/div[1]/span/aside/form/div[1]/div/span/button/i').click()
        time.sleep(2)
        # walk the results table row by row, cell by cell
        table = browser.find_element(By.XPATH, '/html/body/div[5]/section/div/section/div[1]/div[2]/div[3]/div/div[1]/div/div/table')
        all_rows = table.find_elements(By.TAG_NAME, 'tr')
        for row in all_rows:
            cells = row.find_elements(By.TAG_NAME, 'td')
            for cell in cells:
                print(cell.text)
        self.stop_browser(browser)

    def start_browser(self):
        caps = webdriver.DesiredCapabilities.FIREFOX
        caps["marionette"] = True
        return webdriver.Firefox(capabilities=caps)

    def stop_browser(self, browser):
        browser.close()


if __name__ == '__main__':
    FetchEaton()
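Once the rendered page has been captured (for example via Selenium's `page_source`), the table cells can also be extracted without XPath. A hedged sketch using only the stdlib `html.parser`, run here against stand-in HTML whose structure mimics the results grid (the real page's markup, and the cross-reference values shown, are placeholders, not actual site data):

```python
from html.parser import HTMLParser


class TableTextExtractor(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # list of rows, each a list of cell strings
        self._in_td = False  # are we currently inside a <td>?

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and data.strip():
            self.rows[-1].append(data.strip())


# Stand-in HTML; on the live site this would be browser.page_source
sample = """
<table>
  <tr><td>0330D0</td><td>Eaton/Vickers</td><td>cross-ref-1</td></tr>
  <tr><td>0330D0</td><td>Eaton/Vickers</td><td>cross-ref-2</td></tr>
</table>
"""
parser = TableTextExtractor()
parser.feed(sample)
for row in parser.rows:
    print(row)
```

This keeps the post-processing independent of the brittle absolute XPaths above: if the page layout shifts, only the element that locates the table needs updating, not the cell-walking logic.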
RE: Scraping problems. Pls help with a correct request query. - gtlhbkkj - Sep-30-2019

Hi Larz60+, thank you very much! You are the best! Thanks to your response I now know that there is Selenium for Python, and I have already started investigating it.

I had simultaneously posted this question to another forum, and the response there was the following: "use this link with the GET method" www.eatonpowersource.com/cross-reference/?sitesearchterm=0330D0

This is a very easy and short solution. It completely answers my question; it is exactly what I need. But my problem now is that I do not understand how he arrived at that link: I examined the HTML text of the website and did not find the word "sitesearchterm" at all. No explanation from him followed. May I kindly ask you to explain where this string comes from? Thank you.

RE: Scraping problems. Pls help with a correct request query. - gtlhbkkj - Oct-01-2019

(Sep-30-2019, 08:14 AM)Larz60+ Wrote: You should use selenium for this....

Hi Larz60+, thank you again for the sample code. Today I installed Selenium and tried your code. Everything works fine, even better than expected. Please explain what the following lines in your code mean:

import PrettifyPage
self.pp = PrettifyPage.PrettifyPage()

Google knows nothing about PrettifyPage; the only match it found was this forum. What is it? How should I install it, or what can I replace it with? Python returns an error. Thank you.

RE: Scraping problems. Pls help with a correct request query. - Larz60+ - Oct-01-2019

Sorry, I forgot to post it. This is a module that lets you use a variable indent with HTML, and adds line feeds where appropriate.
It makes HTML much easier to read. Here's the script:

from bs4 import BeautifulSoup
import requests
import pathlib


class PrettifyPage:
    def __init__(self):
        pass

    def prettify(self, soup, indent):
        pretty_soup = str()
        previous_indent = 0
        for line in soup.prettify().split("\n"):
            current_indent = str(line).find("<")
            if current_indent == -1 or current_indent > previous_indent + 2:
                current_indent = previous_indent + 1
            previous_indent = current_indent
            pretty_soup += self.write_new_line(line, current_indent, indent)
        return pretty_soup

    def write_new_line(self, line, current_indent, desired_indent):
        new_line = ""
        spaces_to_add = (current_indent * desired_indent) - current_indent
        if spaces_to_add > 0:
            for i in range(spaces_to_add):
                new_line += " "
        new_line += str(line) + "\n"
        return new_line


if __name__ == '__main__':
    pp = PrettifyPage()

RE: Scraping problems. Pls help with a correct request query. - gtlhbkkj - Oct-01-2019

(Oct-01-2019, 07:17 PM)Larz60+ Wrote: Sorry, I forgot to post...

Thank you.
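On the earlier open question of where `sitesearchterm` came from: parameter names like this are usually not visible in a page's static HTML; they are typically discovered by watching the requests the page's JavaScript issues in the browser's network tab, which is exactly the kind of analysis shown in the first post. Once known, the suggested link decomposes like any other query string; a small stdlib sketch (the scheme prefix is added here since the forum reply omitted it):

```python
from urllib.parse import urlsplit, parse_qs

url = "http://www.eatonpowersource.com/cross-reference/?sitesearchterm=0330D0"
parts = urlsplit(url)            # splits into scheme, netloc, path, query, fragment
query = parse_qs(parts.query)    # query parameters as a dict of lists

print(parts.path)   # /cross-reference/
print(query)        # {'sitesearchterm': ['0330D0']}
```

Swapping a different part number into `query` and re-encoding it with `urllib.parse.urlencode` gives the same kind of one-line GET solution for any search term.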