Posts: 214
Threads: 54
Joined: Sep 2019
Hello,
Here is the tag from which I want to extract a text fragment (in bold):
<a class="a-link-normal a-text-normal" href=" /Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1?keywords=9783319988412&qid=1574431833&sr=8-1">.
Here is my code:
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError
from bs4 import BeautifulSoup
import re

def download(url, user_agent='wswp', num_retries=2):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        html = urllib.request.urlopen(request)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors, keeping the same user agent
                return download(url, user_agent, num_retries - 1)
    return html

html = download('http://www.amazon.com/s?k=9783319988412&ref=nb_sb_noss')
bs = BeautifulSoup(html.read(), 'lxml')
nameList = bs.find_all('a', {'href':re.compile('"/.*(keywords).*')})
print(len(nameList))
for name in nameList:
    print(name.get_text())
It doesn't work.
Any suggestions?
Thanks.
Posts: 25
Threads: 7
Joined: Oct 2019
Hi :)
If this is your entire string, then you can do it like this with a regular expression:
import re
string = '<a class="a-link-normal a-text-normal" href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1?keywords=9783319988412&qid=1574431833&sr=8-1">'
# capture everything between the quotes of the href attribute
url = re.search(r'href="([^"]*)"', string)
print(url.group(1))
# or:
url_ = re.findall(r'href="([^"]*)"', string)
print(url_)
Here is the regexHelper I use :)
-> RegexHelper
Kr
Posts: 8,085
Threads: 153
Joined: Sep 2016
Nov-23-2019, 01:11 AM
(This post was last modified: Nov-23-2019, 01:11 AM by buran.)
Posts: 214
Threads: 54
Joined: Sep 2019
Nov-25-2019, 09:49 AM
(This post was last modified: Nov-25-2019, 10:07 AM by Pavel_47.)
(Nov-22-2019, 08:32 PM)Fre3k Wrote: Hi :)
If this is your entire string, then you can do it like this with a regular expression:
import re
string = '<a class="a-link-normal a-text-normal" href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1?keywords=9783319988412&qid=1574431833&sr=8-1">'
# capture everything between the quotes of the href attribute
url = re.search(r'href="([^"]*)"', string)
print(url.group(1))
# or:
url_ = re.findall(r'href="([^"]*)"', string)
print(url_)
Here is the regexHelper I use :)
-> RegexHelper
Kr
In my case the object being searched isn't a string but a BeautifulSoup object.
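For what it's worth, a compiled pattern passed to find_all is matched against the attribute value itself, which carries no surrounding quotes, so a pattern that starts with a literal " has nothing to match. A minimal sketch with plain re, assuming the href value from the first post (the variable names here are illustrative, not from the thread):

```python
import re

# The href value as BeautifulSoup would hand it to the pattern: no quotes around it.
href = ('/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/'
        'ref=sr_1_1?keywords=9783319988412&qid=1574431833&sr=8-1')

# Pattern from the first post: it requires a literal double quote before the slash.
broken = re.compile('"/.*(keywords).*')

# Same idea without the stray quote.
fixed = re.compile(r'/.*keywords.*')

print(broken.search(href))             # no match, because href contains no quote
print(fixed.search(href) is not None)  # the value is matched once the quote is dropped
```

If that assumption holds, re.compile(r'/.*keywords.*') could be passed to find_all in place of the original pattern.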
(Nov-23-2019, 01:11 AM)buran Wrote: https://python-forum.io/Thread-Learning-...2#pid96702
Also look at @snippsat's answer in the same thread:
https://python-forum.io/Thread-Learning-...9#pid96789
Thanks. I tried without a regexp, but it doesn't work.
html = download('http://www.amazon.com/s?k=9783319988412&ref=nb_sb_noss')
bs = BeautifulSoup(html.read(), 'lxml')
link = bs.select('a[href^⁼"/Cybersecurity"]')
print(link)
Here is the output:
Traceback (most recent call last):
  File "/home/pavel/python_code/BeautifulSoup_test1.py", line 23, in <module>
    link = bs.select('a[href^⁼"/Cybersecurity"]')
  File "/home/pavel/.local/lib/python3.6/site-packages/bs4/element.py", line 1358, in select
    return soupsieve.select(selector, self, namespaces, limit, **kwargs)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/__init__.py", line 114, in select
    return compile(select, namespaces, flags, **kwargs).select(tag, limit)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/__init__.py", line 63, in compile
    return cp._cached_css_compile(pattern, namespaces, custom, flags)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/css_parser.py", line 206, in _cached_css_compile
    CSSParser(pattern, custom=custom_selectors, flags=flags).process_selectors(),
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/css_parser.py", line 1062, in process_selectors
    return self.parse_selectors(self.selector_iter(self.pattern), index, flags)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/css_parser.py", line 911, in parse_selectors
    key, m = next(iselector)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/css_parser.py", line 1055, in selector_iter
    raise SelectorSyntaxError(msg, self.pattern, index)
  File "<string>", line None
soupsieve.util.SelectorSyntaxError: Malformed attribute selector at position 1
  line 1:
a[href^⁼"/Cybersecurity"]
Posts: 8,085
Threads: 153
Joined: Sep 2016
Nov-25-2019, 11:57 AM
(This post was last modified: Nov-25-2019, 03:25 PM by buran.)
import requests
from bs4 import BeautifulSoup

url = 'http://www.amazon.com/s?k=9783319988412&ref=nb_sb_noss'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': 'text/html,*/*',
    'Accept-Language': 'en,en-US;q=0.7,en;q=0.3',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive'}
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, 'lxml')
# using find
a = soup.find('a', {'class': 'a-link-normal a-text-normal'})
print(a.get('href'))
# using css selectors
a1 = soup.select('a.a-link-normal.a-text-normal')
print(a1[0].get('href'))
I'm not sure why you get the malformed selector error, but you probably don't want to select using part of the href attribute (would you change the selector for every page?).
Posts: 214
Threads: 54
Joined: Sep 2019
It works. Thanks.
But perhaps for other books the tag's attributes will be different (e.g. something other than 'a-link-normal a-text-normal').
Would it be possible to search using the keyword 3319988417, which is the ISBN for this book?
Thanks once more.
Posts: 7,068
Threads: 122
Joined: Sep 2016
In your 'a[href^⁼"/Cybersecurity"]' there is a Unicode ⁼ (superscript equals sign); the normal one is =.
This is not the best way to do it anyway, as it finds all Cybersecurity links and returns a list, which you then have to clean up some more.
Quote:Could it possible to search using keyword 3319988417, which is ISBN for this book.
I think I already answered this in post #6 of your other thread.
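A look-alike character such as this one is hard to spot by eye. A small sketch using only the standard library can list any non-ASCII characters in a selector string before it reaches select(); the helper name non_ascii_chars is made up for this example, and the character is assumed to be U+207C:

```python
import unicodedata

def non_ascii_chars(text):
    """Return (index, character, Unicode name) for every non-ASCII character in text."""
    return [(i, ch, unicodedata.name(ch, 'UNKNOWN'))
            for i, ch in enumerate(text) if ord(ch) > 127]

# Selector from the failing post, with the superscript equals sign written as an escape.
bad = 'a[href^\u207c"/Cybersecurity"]'
# The same selector with a plain ASCII '='.
good = 'a[href^="/Cybersecurity"]'

print(non_ascii_chars(bad))   # reports the offending character and its position
print(non_ascii_chars(good))  # empty list: nothing suspicious
```

Running a check like this on a selector that raises SelectorSyntaxError is a quick way to rule out copy-and-paste damage.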
Posts: 214
Threads: 54
Joined: Sep 2019
(Nov-25-2019, 01:09 PM)snippsat Wrote: In your 'a[href^⁼"/Cybersecurity"]' there is a Unicode ⁼ the normal is =.
Thanks.
Concerning post #6, that was a different question.
In that post the question was how to find the page using the ISBN.
My current question is about searching inside the page using the ISBN.
Posts: 8,085
Threads: 153
Joined: Sep 2016
Nov-25-2019, 03:17 PM
(This post was last modified: Nov-25-2019, 03:17 PM by buran.)
(Nov-25-2019, 12:43 PM)Pavel_47 Wrote: But perhaps for other books the attribute of tags will be different (i.e. instead of 'a-link-normal a-text-normal' something else)
Actually, the product page is a template, so the HTML markup (e.g. the class attribute) for that particular element is expected to be the same, at least for main parts like the book title/link; only the content will differ.
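On the question of searching inside the page by ISBN, one possible approach is a CSS "contains" attribute selector on the href. This is only a sketch: the sample HTML below is invented for illustration (only the first link is taken from the thread), and it assumes the product link contains /dp/ followed by the ISBN-10, as the link in the first post does.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the search results page; the second link is invented.
sample_html = '''
<div>
  <a class="a-link-normal a-text-normal" href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1?keywords=9783319988412&amp;qid=1574431833&amp;sr=8-1">Cybersecurity</a>
  <a class="a-link-normal" href="/Some-Other-Book/dp/0000000000/ref=sr_1_2">Other</a>
</div>
'''

soup = BeautifulSoup(sample_html, 'html.parser')
isbn10 = '3319988417'

# [href*="..."] matches when the attribute value contains the given substring.
links = soup.select(f'a[href*="/dp/{isbn10}"]')
hrefs = [a.get('href') for a in links]
print(hrefs)
```

Matching on /dp/ plus the ISBN rather than on the class names keeps the selector independent of the styling classes, at the cost of depending on the URL layout.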