I know this isn't a direct fix for your problem, but I'm sending the code anyway to see if it might be useful:
I have some code for the new lxml that is not really a scraper, but close. It's a tool to help with scraping, and it uses lxml 4.2.5. It's still in development, but it has enough functionality to be a real help for scraping. For example, it has the XPath for every html line, and you can search for all links and/or images or text, etc. I ran it on the (downloaded, a cheat, but quick) coupon page and then just printed all the text, but everything else is in the tag_list dictionary. I'm only a few days away from releasing it as a package.
I didn't want to deal with the passwords, etc., but you already have that code, so I logged in, went to the coupon page, and did a 'save as' (to the same directory where the code resides).
This dictionary is structured like:
tag_list = {
    '001027': {
        'element': '<a href="/psf/sponsorship/sponsors/">Powered by Rackspace</a>',  # the serialized element (etree.tostring)
        'tag': 'a',                                        # tag name
        'attrib': {'href': '/psf/sponsorship/sponsors/'},  # attribute dictionary
        'text': 'Powered by Rackspace',                    # the text inside the element
        'tail': None,                                      # the text following this element's
                                                           # closing tag, up to the start tag of
                                                           # the next sibling element
        'base': None,                                      # the base URI from an xml:base attribute
                                                           # that this element contains or inherits
        'prefix': None,                                    # the namespace prefix of this element, if any
        'xpath': 'body/div/footer/div[2]/div/div/p/small/span[5]/a',  # XPath - can be used with
                                                           # lxml (find()/xpath()) to locate the element
    }
}
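Once build_tag_list() has run (as in the sample run further down, where the instance is gtl), you can do simple searches straight off this dictionary. A rough sketch for pulling out all the links and images (field names as in the structure above):

>>> links  = [d['attrib'].get('href') for d in gtl.tag_list.values() if d['tag'] == 'a']
>>> images = [d['attrib'].get('src') for d in gtl.tag_list.values() if d['tag'] == 'img']
>>> for href in links:
...     print(href)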
There is also a list of every tag used on the page with the number of occurrences of each (see show_unique_tags() in the sample run). As a further search aid there is self.types, which records the Python types encountered for each field and has the following structure:
self.types = {
'tag': [],
'attrib': [],
'text': [],
'tail': [],
'base': [],
'prefix': [],
}
Notice that it actually scrapes most of everything on the page, but doesn't follow links.
You can find all images (or links) by searching tag_list for the 'img' (or 'a') tag, as in the sketch above.
I tried it on this page and it seems to pull everything OK.
To try it quickly (without logging in), do a 'save page as' of the coupon page to the same directory as the code.
If you can use it, I don't see why it couldn't be used in conjunction with selenium (logins, etc.); a rough sketch of that follows.
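For instance (untested; it assumes you already have the selenium login/navigation code working, the url is just a placeholder, and I'm on Linux, which is what the PosixPath checks in the code expect):

from pathlib import Path
from selenium import webdriver

import ScrapeRake

driver = webdriver.Firefox()
driver.get('https://example.com/login')   # placeholder url; your existing login/navigation steps go here
# ... log in, click through to the coupon page ...

# dump the rendered page next to the code, then rake it
pagefile = Path('Coupons Gallery.html')
pagefile.write_text(driver.page_source, encoding='utf-8')
driver.quit()

gtl = ScrapeRake.ScrapeRake()
sfile = gtl.assert_path_is_pathlib('Coupons Gallery.html')
gtl.build_tag_list(entity=sfile, usecache=True, savepagefile=sfile)
gtl.show_unique_tags()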
Here's a sample run, and then a few things you can do with the dictionary (there are still debug statements in here):
>>> import ScrapeRake
>>> gtl = ScrapeRake.ScrapeRake()
>>> sfile = gtl.assert_path_is_pathlib('Coupons Gallery.html')
>>> gtl.build_tag_list(entity=sfile, usecache=True, textmode=True, savepagefile=sfile)
Types Encountered:
tag: [<class 'str'>, <class 'cython_function_or_method'>]
attrib: [<class 'lxml.etree._Attrib'>, <class 'lxml.etree._ImmutableMapping'>]
text: [<class 'NoneType'>, <class 'str'>]
tail: [<class 'NoneType'>, <class 'str'>]
base: [<class 'NoneType'>]
prefix: [<class 'NoneType'>]
True
>>> gtl.show_unique_tags()
head: occurs 1 times
strong: occurs 18 times
p: occurs 7 times
span: occurs 2 times
style: occurs 1 times
noscript: occurs 1 times
>>> mydict = gtl.tag_list
>>> for key in mydict.keys():
...     print(mydict[key]['text'])
...
Details:
Details:
on
ONE Gain Flings OR Gain Powder OR Gain Liquid Laundry Detergent
(excludes Gain Fabric Enhancer, Gain Fireworks, Gain FLINGS 9 ct, and
trial/travel size).
Details:
Details:
on
ONE Downy Liquid Fabric Conditioner 60 lds or smaller (includes Odor
Protect), Bounce/Downy sheets 105 ct and smaller, OR In Wash Scent
Boosters 6.5 oz or smaller (includes Downy Unstopables, Fresh Protect,
Infusions, Bounce Bursts, Dreft Blissfuls OR Gain Fireworks 4.9oz)
(excludes Gain Fireworks 6.5oz, Downy Libre Enjuague, and trial/travel
size).
Details:
on
ONE Tide PODS (excludes Tide Liquid/Powder Laundry Detergent, Tide
Simply, Tide Simply PODS, and Tide PODS 9 ct and below, and trial/travel
size).
Details:
on
ONE Gain Liquid Fabric Softener 48 ld or higher (Includes Gain
Botanicals) OR Gain Fireworks 5.7 oz or larger OR Gain Dryer Sheets 105
ct or higher. (excludes Flings, Liquid Detergent and trial/travel size).
Details:
Details:
Details:
on
ONE Tide Detergent 75 oz or lower (excludes Tide PODS, Tide Rescue,
Tide Simply, Tide Simply PODS, Tide Detergent 10 oz and trial/travel
size).
Details:
Details:
Details:
Details:
Details:
Details:
on
ONE Crest Toothpaste 3 oz or more (excludes 4.6oz Cavity, Regular,
Baking Soda, Tartar Control/Protection, all F&W Pep Gleem, 3DW
Whitening Therapy, Crest Detoxify, Gum & Enamel Repair and
trial/travel size).
Details:
Details:
Details:
Only
one manufacturer's coupons (printed, digital, or mobile) may be used on
a single item in a single transaction. For more information, see our
Product
availability, styles, colors, brands, promotions and prices may vary
between stores and online. Early sell-out possible on special purchase
items,and quantities may be otherwise limited. We reserve the right in
our sole discretion to limit quantities to normal retail and online
purchases. No rain checks available. Not responsiblefor typographical
errors.
Sorry,
the shopping list is too large for us to send! Please try removing some
items from your list and try again. (Please note that coupons will
remain in your shopping list until redeemed or expired.)
#olark-wrapper #olark-container .olark-button {
background-color: #ffff00 !important;
color: #333333 !important;
}
#olark-wrapper #olark-container .olark-button:hover {
background-color: #e5e600 !important;
}
#olark-wrapper #olark-container .olark-theme-bg {
background-color: #ffff00 !important;
}
#olark-wrapper #olark-container .olark-theme-text {
color: #333333 !important;
}
#olark-wrapper .olark-launch-button {
background-color: #ffff00 !important;
}
#olark-wrapper .olark-launch-button svg path {
fill: #333333 !important;
}
#olark-wrapper .olark-launch-button .olark-button-text {
color: #333333 !important;
}
#olark-wrapper .olark-top-bar {
background-color: #ffff00 !important;
color: #333333 !important;
border-color: #e5e600 !important;
}
#olark-wrapper .olark-top-bar-text {
color: #333333 !important;
}
#olark-wrapper .olark-top-bar-arrow {
fill: #333333 !important;
}
#olark-wrapper .olark-end-chat-button {
color: #333333 !important;
background-color: rgba(203, 204, 0, 0.5) !important;
}
#olark-wrapper .olark-end-chat-button:hover {
background-color: #cbcc00 !important;
}
#olark-wrapper #olark-container .olark-visitor-message:not(.olark-message-trans-bg) {
background-color: rgba(255, 255, 0, 0.25) !important;
}
#olark-wrapper #olark-container .olark-form-send-button {
background-color: #ffff00 !important;
color: #333333 !important;
}
#olark-wrapper #olark-container .olark-feedback-form-button {
background-color: #ffff00 !important;
color: #333333 !important;
}
#olark-wrapper #olark-container .olark-restart-button {
background-color: #ffff00 !important;
color: #333333 !important;
}
#olark-wrapper #olark-container .olark-branding-panel .olark-branding-cancel-button {
background-color: #ffff00 !important;
border: none !important;
color: #333333 !important;
}
#olark-wrapper #olark-container .olark-branding-panel .olark-branding-go-button {
border: none !important;
background: rgba(255, 255, 0, 0.35) !important;
}
#olark-wrapper #olark-container .olark-send-transcript-container .olark-send-transcript-form.olark-inline-form-valid .olark-form-input-container {
border-color: #ffff00 !important;
}
#olark-wrapper #olark-container .olark-send-transcript-container .olark-send-transcript-form.olark-inline-form-valid .olark-send-icon {
fill: #ffff00 !important;
}
#olark-wrapper #olark-container .olark-visitor-message:not(.olark-message-has-border) {
border: none !important;
}
None
>>>
The html is there, xpaths, etc. Details of each tag_list entry include:
sourceline
    Dictionary key, left zero-padded to a length of 6; the line number of this element when
    parsed, if known, otherwise None.
tag
    The element's name.
attrib
    A dictionary containing the element's attributes. The keys are the attribute names, and each
    corresponding value is the attribute's value.
text
    The text inside the element, up to the start tag of the first child element. If there was no
    text there, this attribute will have the value None.
tail
    The text following this element's closing tag, up to the start tag of the next sibling element.
    If there was no text there, this attribute will have the value None.
    This way of associating text with elements is not really typical of the way most XML processing
    models work; see the ElementTree documentation on how ElementTree represents XML.
base
    The base URI from an xml:base attribute that this element contains or inherits, if any; None
    otherwise.
prefix
    The namespace prefix of this element, if any, otherwise None.
xpath
    XML XPath to reach the element from the root of the tree.
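If you want to get from a stored xpath back to a live element, a minimal sketch with plain lxml (build_tag_list() keeps its parse tree local, so this re-parses the same saved file; which key you look up is up to you, and the xpath is None for comments):

>>> import lxml.etree as etree
>>> tree = etree.parse('Coupons Gallery.html', parser=etree.HTMLParser())
>>> root = tree.getroot()
>>> key = next(iter(gtl.tag_list))        # or any key you found by searching
>>> path = gtl.tag_list[key]['xpath']
>>> element = root.find(path)             # paths from getelementpath() work with find()
>>> element.tag, element.text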
Here's the code:
ScrapeRake.py
# MIT License
#
# Copyright (c) 2018 L. McCaig A.K.A. Larz60+
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#
import requests
import json
import os
import lxml.etree as etree
from lxml.etree import Element
import pathlib
import sys
from io import BytesIO
class ScrapeRake:
def __init__(self):
# need to set cwd to source path (will allow relative file addressing)
self.home = os.path.abspath(os.path.dirname(__file__))
os.chdir(self.home)
self.user_agent = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:60.0) '
'Gecko/20100101 Firefox/60.0 AppleWebKit/537.36 (KHTML, '
'like Gecko) Chrome/68.0.3440.75 Safari/537.36'
}
self.tree = None
self.tag_list = {}
self.types = {
'tag': [],
'attrib': [],
'text': [],
'tail': [],
'base': [],
'prefix': [],
}
def assert_path_is_pathlib(self, filename):
if isinstance(filename, pathlib.PosixPath):
return filename
if ('/' in filename):
return pathlib.Path(filename)
# otherwise assume localdir
homepath = pathlib.Path(self.home)
return homepath / filename
def fetch_page(self, entity, usecache=True, textmode=False, savepagefile=None):
"""
entity may be either a full path or a url.
If a url, you must specify a savepagefile,
This is where the html of the webpage is to be stored
"""
fmodes = {
'read': 'rb',
'write': 'wb'
}
page = None
if textmode:
fmodes['read'] = 'r'
fmodes['write'] = 'w'
if isinstance(entity, str) and entity.startswith('http'): # it's a URL
if savepagefile is None:
print('Savefile not specified, please specify and restart')
sys.exit(-1)
if usecache:
try:
with savepagefile.open(fmodes['read']) as fp:
page = fp.read()
except:
page = self.get_url(entity, textmode)
else:
page = self.get_url(entity, textmode)
with savepagefile.open(fmodes['write']) as fp:
fp.write(page)
elif isinstance(entity, pathlib.PosixPath):
with entity.open(fmodes['read']) as fp:
page = fp.read()
return page
def get_url(self, url, textmode=False):
"""
Fetch url and return contents
if textmode, response.text will be returned, otherwise
binary content.
"""
page = None
response = requests.get(url, headers=self.user_agent)
if response.status_code == 200:
if textmode:
page = response.text
else:
page = response.content
return page
def build_tag_list(self, entity, usecache=True, textmode=None, savepagefile=None):
"""
Construct tag_list dictionary and types dictionary.
entity may be either a full path or a url.
If a url, you must specify a savepagefile,
This is where the html of the webpage is to be stored
if usecache is True and entity is a url, will use cached file rather than
fetch a new insttance. If True will attempt to use cached file, if the
cached file can't be found, reverts back to url download.
"""
retval = False
types = self.types
mytextmode = False
page = self.fetch_page(entity, usecache, mytextmode, savepagefile)
conditioned_page = BytesIO(page)
tree = etree.parse(conditioned_page, parser=etree.HTMLParser())
tree_walker = tree.getiterator()
for n, element in enumerate(tree_walker):
tl = self.tag_list[f'{element.sourceline:06}'] = {}
# children = element.getchildren()
# print(f'lengthchildren: {len(children)}')
# for n1, kid in enumerate(children):
# grandchildren = kid.getchildren()
# for n2, grandkid in enumerate(grandchildren):
# print(f'\n\nelement_{n}, length grandkid: {len(grandkid)}, kid_{n1}, grandkid_{n2}: {etree.tostring(kid)}')
# else:
# print(f'\n\nelement_{n}, length kid: {len(kid)}, kid_{n1}: {etree.tostring(kid)}')
# input()
# # sys.exit(0)
tl['element'] = etree.tostring(element)
tag = str(element.tag)
if 'class' in tag:
tag = etree.tostring(element.tag)
tl['tag'] = tag
ttype = type(element.tag)
if ttype not in types['tag']:
types['tag'].append(ttype)
attrib = element.attrib
tl['attrib'] = attrib
ttype = type(element.attrib)
if ttype not in types['attrib']:
types['attrib'].append(ttype)
tl['text'] = element.text
ttype = type(element.text)
if ttype not in types['text']:
types['text'].append(ttype)
tail = element.tail
if tail is not None:
tail = tail.strip()
tl['tail'] = tail
ttype = type(element.tail)
if ttype not in types['tail']:
types['tail'].append(ttype)
tl['base'] = element.base
ttype = type(element.base)
if ttype not in types['base']:
types['base'].append(ttype)
tl['prefix'] = element.prefix
ttype = type(element.prefix)
if ttype not in types['prefix']:
types['prefix'].append(ttype)
# add element xpath
if 'Comment' in str(element.tag):
xpath = None
else:
xpath = tree.getelementpath(element)
tl['xpath'] = xpath
retval = True
print(f'\nTypes Encountered:')
for key, value in types.items():
print(f'{key}: {value}')
print()
return retval
def display_one(self, key):
print('\n============================================')
print(f'source line: {key} ')
element = self.tag_list[key]
for key1, value in element.items():
print(f'key: {key1}, value: {value}')
print('\n============================================')
def show_tags(self, thedict, level= 0):
"""
Show details about each tag
from the self.tag_list dictionary.
Details include:
sourceline - dictionary key, left zero padded to length of 6.
The line number of this element when parsed, if known, otherwise None.
tag
The element's name.
attrib
A dictionary containing the element's attributes. The keys are the attribute names, and each corresponding value is the attribute's value.
text
The text inside the element, up to the start tag of the first child element. If there was no text there,
this attribute will have the value None.
tail
The text following this element's closing tag, up to the start tag of the next sibling element. If there
was no text there, this attribute will have the value None.
This way of associating text with elements is not really typical of the way most XML processing models work; see the ElementTree documentation on how ElementTree represents XML.
base
The base URI from an xml:base attribute that this element contains or inherits, if any; None otherwise.
prefix
The namespace prefix of this element, if any, otherwise None.
XPath
XML XPath to reach the element from the root of the tree.
"""
for key, value in thedict.items():
if isinstance(value, dict):
if level == 0:
print('\n============================================')
print(f'source line: {key} ')
else:
print(f'key: {key} ', end = '')
self.show_tags(value, level + 1)
else:
print(f'key: {key}, value: {value}')
def show_unique_tags(self):
"""
Display a list of unique tags and number of times used
"""
tag_occurence = {}
if len(self.tag_list) == 0:
print('Build tag_list first')
else:
for key, details in self.tag_list.items():
tagid = details['tag']
if tagid not in tag_occurence:
tag_occurence[tagid] = 1
else:
tag_occurence[tagid] += 1
for tagid, count in tag_occurence.items():
print(f'{tagid}: occurs {count} times')
def find_element(self,
element=None,
element_partial=False,
startswith=False,
endswith=False,
XPath=None,
tag=None,
attribs=[],
attrib_any=True,
attrib_all=False,
text=None,
tail=None):
"""
Returns line number and scrape info for all lines matching element (In it's entirety).
"""
if isinstance(element, bytes):
element = element.decode()
nonefound = True
for key in self.tag_list.keys():
if key == '000545':
print('stop here')
telement = self.tag_list[key]['element']
if isinstance(telement, bytes):
telement = str(telement.decode())
telement = telement.strip()
print(f'telement: {telement}\n element: {element}')
if telement == element:
self.display_one(key)
nonefound = False
if nonefound:
print(f'\nelement: {element} was not found')
def testit():
"""
Test all
"""
gtl = ScrapeRake()
# Try url
sfile = gtl.assert_path_is_pathlib('Coupons Gallery.html')
if not gtl.build_tag_list(entity=sfile, usecache=True, textmode=True, savepagefile=sfile):
# if not gtl.build_tag_list(entity='https://www.python.org/', usecache=True, textmode=True, savepagefile=sfile):
print('build_tag_list from URL failed')
else:
# test show_unique_tag
gtl.show_unique_tags()
# try from savefile
        if not gtl.build_tag_list(entity=gtl.assert_path_is_pathlib('python_org.html')):
print('build_tag_list from URL failed')
else:
#test show_tags(gtl.tag_list)
gtl.show_tags(gtl.tag_list)
# test find_element
gtl.find_element('<p>Python is a programming language that lets you work quickly ')
if __name__ == '__main__':
testit()
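To run it against a live url instead of a saved file (essentially the commented-out line in testit(); the url has to be reachable without a login, and a savepagefile is required so the html gets cached next to the code):

import ScrapeRake

gtl = ScrapeRake.ScrapeRake()
sfile = gtl.assert_path_is_pathlib('python_org.html')
if gtl.build_tag_list(entity='https://www.python.org/', usecache=True, savepagefile=sfile):
    gtl.show_unique_tags()
    gtl.display_one(next(iter(gtl.tag_list)))   # show the first entry in full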