I know this isn't a direct fix for your problem, but I'm sending the code anyway to see if it might be useful:
I have some code for the new lxml that is not really a scraper, but close. It's a tool to help with scraping, and it uses lxml 4.2.5. It's still in development, but it has enough functionality to be a real help for scraping. For example, it has the XPath for every html line, and you can search for all links and/or images or text, etc. I ran it on the (downloaded, a cheat, but quick) coupon page and then just printed all the text, but everything else is in the tag_list dictionary. I'm only a few days away from releasing it as a package.
I didn't want to deal with the passwords, etc., but you already have that code, so I logged in, went to the coupon page, and did a 'save as' (to the same directory where the code resides).
This dictionary is structured like:
tag_list = {
    '001027': {
        'element': '<a href="/psf/sponsorship/sponsors/">Powered by Rackspace</a>',  # the serialized element (etree.tostring)
        'tag': 'a',                                        # tag name
        'attrib': {'href': '/psf/sponsorship/sponsors/'},  # attribute dictionary
        'text': 'Powered by Rackspace',                    # the text inside the element
        'tail': None,                                      # the text following this element's
                                                           # closing tag, up to the start tag of
                                                           # the next sibling element
        'base': None,                                      # the base URI from an xml:base attribute
                                                           # that this element contains or inherits
        'prefix': None,                                    # the namespace prefix of this element, if any
        'xpath': 'body/div/footer/div[2]/div/div/p/small/span[5]/a',  # XPath - can be used with
                                                           # lxml (find()/xpath()) to locate the element
    }
}
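Once build_tag_list() has run (as in the sample run further down, where the instance is gtl), you can do simple searches straight off this dictionary. A rough sketch for pulling out all the links and images (field names as in the structure above):

>>> links  = [d['attrib'].get('href') for d in gtl.tag_list.values() if d['tag'] == 'a']
>>> images = [d['attrib'].get('src') for d in gtl.tag_list.values() if d['tag'] == 'img']
>>> for href in links:
...     print(href)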
There is also a list of every tag used on the page with the number of occurrences of each (see show_unique_tags() in the sample run). As a further search aid there is self.types, which records the Python types encountered for each field and has the following structure:
self.types = {
'tag': [],
'attrib': [],
'text': [],
'tail': [],
'base': [],
'prefix': [],
}
Notice that it actually scrapes most of everything on the page, but doesn't follow links.
You can find all images (or links) by searching tag_list for the 'img' (or 'a') tag, as in the sketch above.
I tried it on this page and it seems to pull everything OK.
To try it quickly (without logging in), do a 'save page as' of the coupon page to the same directory as the code.
If you can use it, I don't see why it couldn't be used in conjunction with selenium (logins, etc.); a rough sketch of that follows.
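For instance (untested; it assumes you already have the selenium login/navigation code working, the url is just a placeholder, and I'm on Linux, which is what the PosixPath checks in the code expect):

from pathlib import Path
from selenium import webdriver

import ScrapeRake

driver = webdriver.Firefox()
driver.get('https://example.com/login')   # placeholder url; your existing login/navigation steps go here
# ... log in, click through to the coupon page ...

# dump the rendered page next to the code, then rake it
pagefile = Path('Coupons Gallery.html')
pagefile.write_text(driver.page_source, encoding='utf-8')
driver.quit()

gtl = ScrapeRake.ScrapeRake()
sfile = gtl.assert_path_is_pathlib('Coupons Gallery.html')
gtl.build_tag_list(entity=sfile, usecache=True, savepagefile=sfile)
gtl.show_unique_tags()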
Here's a sample run, and then a few things you can do with the dictionary (there are still debug statements in here):
>>> import ScrapeRake
>>> gtl = ScrapeRake.ScrapeRake()
>>> sfile = gtl.assert_path_is_pathlib('Coupons Gallery.html')
>>> gtl.build_tag_list(entity=sfile, usecache=True, textmode=True, savepagefile=sfile)
Types Encountered:
tag: [<class 'str'>, <class 'cython_function_or_method'>]
attrib: [<class 'lxml.etree._Attrib'>, <class 'lxml.etree._ImmutableMapping'>]
text: [<class 'NoneType'>, <class 'str'>]
tail: [<class 'NoneType'>, <class 'str'>]
base: [<class 'NoneType'>]
prefix: [<class 'NoneType'>]
True
>>> gtl.show_unique_tags()
head: occurs 1 times
strong: occurs 18 times
p: occurs 7 times
span: occurs 2 times
style: occurs 1 times
noscript: occurs 1 times
>>> mydict = gtl.tag_list
>>> for key in mydict.keys():
...     print(mydict[key]['text'])
...
Details:
Details:
on
ONE Gain Flings OR Gain Powder OR Gain Liquid Laundry Detergent
(excludes Gain Fabric Enhancer, Gain Fireworks, Gain FLINGS 9 ct, and
trial/travel size).
Details:
Details:
on
ONE Downy Liquid Fabric Conditioner 60 lds or smaller (includes Odor
Protect), Bounce/Downy sheets 105 ct and smaller, OR In Wash Scent
Boosters 6.5 oz or smaller (includes Downy Unstopables, Fresh Protect,
Infusions, Bounce Bursts, Dreft Blissfuls OR Gain Fireworks 4.9oz)
(excludes Gain Fireworks 6.5oz, Downy Libre Enjuague, and trial/travel
size).
Details:
on
ONE Tide PODS (excludes Tide Liquid/Powder Laundry Detergent, Tide
Simply, Tide Simply PODS, and Tide PODS 9 ct and below, and trial/travel
size).
Details:
on
ONE Gain Liquid Fabric Softener 48 ld or higher (Includes Gain
Botanicals) OR Gain Fireworks 5.7 oz or larger OR Gain Dryer Sheets 105
ct or higher. (excludes Flings, Liquid Detergent and trial/travel size).
Details:
Details:
Details:
on
ONE Tide Detergent 75 oz or lower (excludes Tide PODS, Tide Rescue,
Tide Simply, Tide Simply PODS, Tide Detergent 10 oz and trial/travel
size).
Details:
Details:
Details:
Details:
Details:
Details:
on
ONE Crest Toothpaste 3 oz or more (excludes 4.6oz Cavity, Regular,
Baking Soda, Tartar Control/Protection, all F&W Pep Gleem, 3DW
Whitening Therapy, Crest Detoxify, Gum & Enamel Repair and
trial/travel size).
Details:
Details:
Details:
Only
one manufacturer's coupons (printed, digital, or mobile) may be used on
a single item in a single transaction. For more information, see our
Product
availability, styles, colors, brands, promotions and prices may vary
between stores and online. Early sell-out possible on special purchase
items,and quantities may be otherwise limited. We reserve the right in
our sole discretion to limit quantities to normal retail and online
purchases. No rain checks available. Not responsiblefor typographical
errors.
Sorry,
the shopping list is too large for us to send! Please try removing some
items from your list and try again. (Please note that coupons will
remain in your shopping list until redeemed or expired.)
#olark-wrapper #olark-container .olark-button {
background-color: #ffff00 !important;
color: #333333 !important;
}
#olark-wrapper #olark-container .olark-button:hover {
background-color: #e5e600 !important;
}
#olark-wrapper #olark-container .olark-theme-bg {
background-color: #ffff00 !important;
}
#olark-wrapper #olark-container .olark-theme-text {
color: #333333 !important;
}
#olark-wrapper .olark-launch-button {
background-color: #ffff00 !important;
}
#olark-wrapper .olark-launch-button svg path {
fill: #333333 !important;
}
#olark-wrapper .olark-launch-button .olark-button-text {
color: #333333 !important;
}
#olark-wrapper .olark-top-bar {
background-color: #ffff00 !important;
color: #333333 !important;
border-color: #e5e600 !important;
}
#olark-wrapper .olark-top-bar-text {
color: #333333 !important;
}
#olark-wrapper .olark-top-bar-arrow {
fill: #333333 !important;
}
#olark-wrapper .olark-end-chat-button {
color: #333333 !important;
background-color: rgba(203, 204, 0, 0.5) !important;
}
#olark-wrapper .olark-end-chat-button:hover {
background-color: #cbcc00 !important;
}
#olark-wrapper #olark-container .olark-visitor-message:not(.olark-message-trans-bg) {
background-color: rgba(255, 255, 0, 0.25) !important;
}
#olark-wrapper #olark-container .olark-form-send-button {
background-color: #ffff00 !important;
color: #333333 !important;
}
#olark-wrapper #olark-container .olark-feedback-form-button {
background-color: #ffff00 !important;
color: #333333 !important;
}
#olark-wrapper #olark-container .olark-restart-button {
background-color: #ffff00 !important;
color: #333333 !important;
}
#olark-wrapper #olark-container .olark-branding-panel .olark-branding-cancel-button {
background-color: #ffff00 !important;
border: none !important;
color: #333333 !important;
}
#olark-wrapper #olark-container .olark-branding-panel .olark-branding-go-button {
border: none !important;
background: rgba(255, 255, 0, 0.35) !important;
}
#olark-wrapper #olark-container .olark-send-transcript-container .olark-send-transcript-form.olark-inline-form-valid .olark-form-input-container {
border-color: #ffff00 !important;
}
#olark-wrapper #olark-container .olark-send-transcript-container .olark-send-transcript-form.olark-inline-form-valid .olark-send-icon {
fill: #ffff00 !important;
}
#olark-wrapper #olark-container .olark-visitor-message:not(.olark-message-has-border) {
border: none !important;
}
None
>>>
The html is there, xpaths, etc. Details of each tag_list entry include:
sourceline
    Dictionary key, left zero-padded to a length of 6; the line number of this element when
    parsed, if known, otherwise None.
tag
    The element's name.
attrib
    A dictionary containing the element's attributes. The keys are the attribute names, and each
    corresponding value is the attribute's value.
text
    The text inside the element, up to the start tag of the first child element. If there was no
    text there, this attribute will have the value None.
tail
    The text following this element's closing tag, up to the start tag of the next sibling element.
    If there was no text there, this attribute will have the value None.
    This way of associating text with elements is not really typical of the way most XML processing
    models work; see the ElementTree documentation on how ElementTree represents XML.
base
    The base URI from an xml:base attribute that this element contains or inherits, if any; None
    otherwise.
prefix
    The namespace prefix of this element, if any, otherwise None.
xpath
    XML XPath to reach the element from the root of the tree.
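If you want to get from a stored xpath back to a live element, a minimal sketch with plain lxml (build_tag_list() keeps its parse tree local, so this re-parses the same saved file; which key you look up is up to you, and the xpath is None for comments):

>>> import lxml.etree as etree
>>> tree = etree.parse('Coupons Gallery.html', parser=etree.HTMLParser())
>>> root = tree.getroot()
>>> key = next(iter(gtl.tag_list))        # or any key you found by searching
>>> path = gtl.tag_list[key]['xpath']
>>> element = root.find(path)             # paths from getelementpath() work with find()
>>> element.tag, element.text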
Here's the code:
ScrapeRake.py
# MIT License
#
# Copyright (c) 2018 L. McCaig A.K.A. Larz60+
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#
import requests
import json
import os
import lxml.etree as etree
from lxml.etree import Element
import pathlib
import sys
from io import BytesIO
class ScrapeRake:
def __init__(self):
# need to set cwd to source path (will allow relative file addressing)
self.home = os.path.abspath(os.path.dirname(__file__))
os.chdir(self.home)
self.user_agent = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:60.0) '
'Gecko/20100101 Firefox/60.0 AppleWebKit/537.36 (KHTML, '
'like Gecko) Chrome/68.0.3440.75 Safari/537.36'
}
self.tree = None
self.tag_list = {}
self.types = {
'tag': [],
'attrib': [],
'text': [],
'tail': [],
'base': [],
'prefix': [],
}
def assert_path_is_pathlib(self, filename):
if isinstance(filename, pathlib.PosixPath):
return filename
if ('/' in filename):
return pathlib.Path(filename)
# otherwise assume localdir
homepath = pathlib.Path(self.home)
return homepath / filename
def fetch_page(self, entity, usecache=True, textmode=False, savepagefile=None):
"""
entity may be either a full path or a url.
If a url, you must specify a savepagefile,
This is where the html of the webpage is to be stored
"""
fmodes = {
'read': 'rb',
'write': 'wb'
}
page = None
if textmode:
fmodes['read'] = 'r'
fmodes['write'] = 'w'
if isinstance(entity, str) and entity.startswith('http'): # it's a URL
if savepagefile is None:
print('Savefile not specified, please specify and restart')
sys.exit(-1)
if usecache:
try:
with savepagefile.open(fmodes['read']) as fp:
page = fp.read()
except:
page = self.get_url(entity, textmode)
else:
page = self.get_url(entity, textmode)
with savepagefile.open(fmodes['write']) as fp:
fp.write(page)
elif isinstance(entity, pathlib.PosixPath):
with entity.open(fmodes['read']) as fp:
page = fp.read()
return page
def get_url(self, url, textmode=False):
"""
Fetch url and return contents
if textmode, response.text will be returned, otherwise
binary content.
"""
page = None
response = requests.get(url, headers=self.user_agent)
if response.status_code == 200:
if textmode:
page = response.text
else:
page = response.content
return page
def build_tag_list(self, entity, usecache=True, textmode=None, savepagefile=None):
"""
Construct tag_list dictionary and types dictionary.
entity may be either a full path or a url.
If a url, you must specify a savepagefile,
This is where the html of the webpage is to be stored
if usecache is True and entity is a url, will use cached file rather than
fetch a new insttance. If True will attempt to use cached file, if the
cached file can't be found, reverts back to url download.
"""
retval = False
types = self.types
mytextmode = False
page = self.fetch_page(entity, usecache, mytextmode, savepagefile)
conditioned_page = BytesIO(page)
tree = etree.parse(conditioned_page, parser=etree.HTMLParser())
tree_walker = tree.getiterator()
for n, element in enumerate(tree_walker):
tl = self.tag_list[f'{element.sourceline:06}'] = {}
# children = element.getchildren()
# print(f'lengthchildren: {len(children)}')
# for n1, kid in enumerate(children):
# grandchildren = kid.getchildren()
# for n2, grandkid in enumerate(grandchildren):
# print(f'\n\nelement_{n}, length grandkid: {len(grandkid)}, kid_{n1}, grandkid_{n2}: {etree.tostring(kid)}')
# else:
# print(f'\n\nelement_{n}, length kid: {len(kid)}, kid_{n1}: {etree.tostring(kid)}')
# input()
# # sys.exit(0)
tl['element'] = etree.tostring(element)
tag = str(element.tag)
if 'class' in tag:
tag = etree.tostring(element.tag)
tl['tag'] = tag
ttype = type(element.tag)
if ttype not in types['tag']:
types['tag'].append(ttype)
attrib = element.attrib
tl['attrib'] = attrib
ttype = type(element.attrib)
if ttype not in types['attrib']:
types['attrib'].append(ttype)
tl['text'] = element.text
ttype = type(element.text)
if ttype not in types['text']:
types['text'].append(ttype)
tail = element.tail
if tail is not None:
tail = tail.strip()
tl['tail'] = tail
ttype = type(element.tail)
if ttype not in types['tail']:
types['tail'].append(ttype)
tl['base'] = element.base
ttype = type(element.base)
if ttype not in types['base']:
types['base'].append(ttype)
tl['prefix'] = element.prefix
ttype = type(element.prefix)
if ttype not in types['prefix']:
types['prefix'].append(ttype)
# add element xpath
if 'Comment' in str(element.tag):
xpath = None
else:
xpath = tree.getelementpath(element)
tl['xpath'] = xpath
retval = True
print(f'\nTypes Encountered:')
for key, value in types.items():
print(f'{key}: {value}')
print()
return retval
def display_one(self, key):
print('\n============================================')
print(f'source line: {key} ')
element = self.tag_list[key]
for key1, value in element.items():
print(f'key: {key1}, value: {value}')
print('\n============================================')
def show_tags(self, thedict, level= 0):
"""
Show details about each tag
from the self.tag_list dictionary.
Details include:
sourceline - dictionary key, left zero padded to length of 6.
The line number of this element when parsed, if known, otherwise None.
tag
The element's name.
attrib
A dictionary containing the element's attributes. The keys are the attribute names, and each corresponding value is the attribute's value.
text
The text inside the element, up to the start tag of the first child element. If there was no text there,
this attribute will have the value None.
tail
The text following this element's closing tag, up to the start tag of the next sibling element. If there
was no text there, this attribute will have the value None.
This way of associating text with elements is not really typical of the way most XML processing models work; see the ElementTree documentation on how ElementTree represents XML.
base
The base URI from an xml:base attribute that this element contains or inherits, if any; None otherwise.
prefix
The namespace prefix of this element, if any, otherwise None.
XPath
XML XPath to reach the element from the root of the tree.
"""
for key, value in thedict.items():
if isinstance(value, dict):
if level == 0:
print('\n============================================')
print(f'source line: {key} ')
else:
print(f'key: {key} ', end = '')
self.show_tags(value, level + 1)
else:
print(f'key: {key}, value: {value}')
def show_unique_tags(self):
"""
Display a list of unique tags and number of times used
"""
tag_occurence = {}
if len(self.tag_list) == 0:
print('Build tag_list first')
else:
for key, details in self.tag_list.items():
tagid = details['tag']
if tagid not in tag_occurence:
tag_occurence[tagid] = 1
else:
tag_occurence[tagid] += 1
for tagid, count in tag_occurence.items():
print(f'{tagid}: occurs {count} times')
def find_element(self,
element=None,
element_partial=False,
startswith=False,
endswith=False,
XPath=None,
tag=None,
attribs=[],
attrib_any=True,
attrib_all=False,
text=None,
tail=None):
"""
Returns line number and scrape info for all lines matching element (In it's entirety).
"""
if isinstance(element, bytes):
element = element.decode()
nonefound = True
for key in self.tag_list.keys():
if key == '000545':
print('stop here')
telement = self.tag_list[key]['element']
if isinstance(telement, bytes):
telement = str(telement.decode())
telement = telement.strip()
print(f'telement: {telement}\n element: {element}')
if telement == element:
self.display_one(key)
nonefound = False
if nonefound:
print(f'\nelement: {element} was not found')
def testit():
"""
Test all
"""
gtl = ScrapeRake()
# Try url
sfile = gtl.assert_path_is_pathlib('Coupons Gallery.html')
if not gtl.build_tag_list(entity=sfile, usecache=True, textmode=True, savepagefile=sfile):
# if not gtl.build_tag_list(entity='https://www.python.org/', usecache=True, textmode=True, savepagefile=sfile):
print('build_tag_list from URL failed')
else:
# test show_unique_tag
gtl.show_unique_tags()
# try from savefile
        if not gtl.build_tag_list(entity=gtl.assert_path_is_pathlib('python_org.html')):
print('build_tag_list from URL failed')
else:
#test show_tags(gtl.tag_list)
gtl.show_tags(gtl.tag_list)
# test find_element
gtl.find_element('<p>Python is a programming language that lets you work quickly ')
if __name__ == '__main__':
testit()
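To run it against a live url instead of a saved file (essentially the commented-out line in testit(); the url has to be reachable without a login, and a savepagefile is required so the html gets cached next to the code):

import ScrapeRake

gtl = ScrapeRake.ScrapeRake()
sfile = gtl.assert_path_is_pathlib('python_org.html')
if gtl.build_tag_list(entity='https://www.python.org/', usecache=True, savepagefile=sfile):
    gtl.show_unique_tags()
    gtl.display_one(next(iter(gtl.tag_list)))   # show the first entry in full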