Python Forum

Full Version: Python - Scrapy Login
Hello guys... I need your help. I was messing with Scrapy earlier, but for some reason my script doesn't work:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders.init import InitSpider

class StrongbotSpider(InitSpider):
    name = 'StrongBot'
    login_url = 'https://www.tatechnix.de/tatechnix/gx/login.php'
    start_urls = ['https://www.tatechnix.de/tatechnix/gx/product_info.php?info=p44235_ta-technix-sport-suspension-kit-opel-astra-h-caravan-2-0t-1-7-1-9cdti--without-level-control-type-a-h-30-30mm.html']

    def init_request(self):
        # InitSpider hook: runs before the start URLs, so the login page is fetched first.
        return scrapy.Request(
            url=self.login_url,
            callback=self.login,
        )

    def login(self, response):
        # Submit the login form, then tell InitSpider that initialization is done.
        yield scrapy.FormRequest.from_response(
            response=response,
            formid='login',
            formdata={
                'email_address': 'example',
                'password': 'example',
            },
            callback=self.initialized,
        )

    def parse(self, response):
        # Runs for each start URL once the spider has initialized.
        for content in response.css('#gm_attr_calc_price'):
            yield {
                'Price': content.css('span[itemprop="price"]::text').extract()
            }
Here are the results:
(Scrapy) C:\Users\Petros\Python\TaTechnix18>scrapy crawl StrongBot
2018-10-19 09:59:32 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: TaTechnix18)
2018-10-19 09:59:32 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.5.0, Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:23:52) [MSC v.1900 32 bit (Intel)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2o  27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.17134-SP0
2018-10-19 09:59:32 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'TaTechnix18', 'NEWSPIDER_MODULE': 'TaTechnix18.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['TaTechnix18.spiders']}
2018-10-19 09:59:32 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2018-10-19 09:59:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-10-19 09:59:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-10-19 09:59:33 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-10-19 09:59:33 [scrapy.core.engine] INFO: Spider opened
2018-10-19 09:59:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-10-19 09:59:33 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-10-19 09:59:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tatechnix.de/robots.txt> (referer: None)
2018-10-19 09:59:34 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.tatechnix.de/tatechnix/gx/login.php>
2018-10-19 09:59:34 [scrapy.core.engine] INFO: Closing spider (finished)
2018-10-19 09:59:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 1,
 'downloader/request_bytes': 225,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 7658,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 10, 19, 6, 59, 34, 360982),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 10, 19, 6, 59, 33, 758907)}
2018-10-19 09:59:34 [scrapy.core.engine] INFO: Spider closed (finished)

Here's another method, but it doesn't work either:
# -*- coding: utf-8 -*-
import scrapy

class StrongbotSpider(scrapy.Spider):
    name = 'StrongBot'
    login_url = 'https://www.tatechnix.de/tatechnix/gx/login.php'
    start_urls = ['https://www.tatechnix.de/tatechnix/gx/product_info.php?info=p44235_ta-technix-sport-suspension-kit-opel-astra-h-caravan-2-0t-1-7-1-9cdti--without-level-control-type-a-h-30-30mm.html']

    def login(self, response):
        data = {
            'email_address': '[email protected]',
            'password': 'example',
            }
        yield scrapy.FormRequest(url=self.login_url, formdata=data, callback=self.parse_products)

    def parse(self, response):
        for content in response.css('#gm_attr_calc_price'):
            yield {
                'Price': content.css('span[itemprop="price"]::text').extract()
            }
(Oct-19-2018, 07:43 AM)Baggelhsk95 Wrote:
2018-10-19 09:59:34 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.tatechnix.de/tatechnix/gx/login.php>
Looks like the website doesn't want bots to visit the login page.
If you want, you can tell Scrapy not to respect robots.txt using the ROBOTSTXT_OBEY setting.
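For example (a minimal sketch; whether ignoring the site's crawl rules is acceptable is up to you), in your project's settings.py:

# settings.py - stop Scrapy from checking robots.txt before every request
ROBOTSTXT_OBEY = False

or for this spider only, via the custom_settings class attribute:

class StrongbotSpider(InitSpider):
    name = 'StrongBot'
    custom_settings = {'ROBOTSTXT_OBEY': False}
    # ... rest of the spider unchanged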
If I run a normal bot without logging in, I'm getting the data just fine...
If you never manage to open the login page, the initialized() callback never gets called, so your spider never goes on to process the start requests...
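Your second spider has an extra problem on top of that: with a plain scrapy.Spider, nothing ever calls login(), because the default start_requests() goes straight to start_urls with parse() as the callback, and the parse_products callback it points to isn't defined anywhere. Something like this should work (a sketch only; after_login is just an illustrative name, and I'm assuming you disable ROBOTSTXT_OBEY as above, since robots.txt blocks the login page):

# -*- coding: utf-8 -*-
import scrapy

class StrongbotSpider(scrapy.Spider):
    name = 'StrongBot'
    login_url = 'https://www.tatechnix.de/tatechnix/gx/login.php'
    start_urls = ['https://www.tatechnix.de/tatechnix/gx/product_info.php?info=p44235_ta-technix-sport-suspension-kit-opel-astra-h-caravan-2-0t-1-7-1-9cdti--without-level-control-type-a-h-30-30mm.html']
    custom_settings = {'ROBOTSTXT_OBEY': False}  # assumption: the site's robots.txt forbids the login page

    def start_requests(self):
        # Fetch the login page first so the session cookie exists before crawling.
        yield scrapy.Request(self.login_url, callback=self.login)

    def login(self, response):
        # Fill in and submit the login form found on the page.
        yield scrapy.FormRequest.from_response(
            response,
            formid='login',
            formdata={'email_address': 'example', 'password': 'example'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Logged in (hopefully); now request the product pages with the same session.
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for content in response.css('#gm_attr_calc_price'):
            yield {
                'Price': content.css('span[itemprop="price"]::text').extract()
            }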