Python Forum

Extract text from tag content using regular expression
Hello,

Here is the tag from which I want to extract a text fragment (in bold):
<a class="a-link-normal a-text-normal" href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1?keywords=9783319988412&qid=1574431833&sr=8-1">.

Here is my code:
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError
from bs4 import BeautifulSoup
import re

def download(url, user_agent='wswp', num_retries=2):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        html = urllib.request.urlopen(request)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors, keeping the same user agent
                return download(url, user_agent, num_retries - 1)
    return html

html = download('http://www.amazon.com/s?k=9783319988412&ref=nb_sb_noss')
bs = BeautifulSoup(html.read(), 'lxml')
nameList = bs.find_all('a', {'href':re.compile('"/.*(keywords).*')})
print(len(nameList))
for name in nameList:
    print(name.get_text())
This doesn't work.
Any suggestions?
Thanks.
Hi :)

If this is your entire string then you can do it like this with a regular expression.

import re

# single quotes outside, so the double quotes inside the tag need no escaping
string = '<a class="a-link-normal a-text-normal" href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1?keywords=9783319988412&qid=1574431833&sr=8-1">'

# capture everything between the quotes of the href attribute
url = re.search(r'href="([^"]*)"', string)
print(url.group(1))

#or:
url_ = re.findall(r'href="([^"]*)"', string)
print(url_)
Here is the regexHelper I use :)
-> RegexHelper

Kr
(Nov-22-2019, 08:32 PM)Fre3k Wrote: If this is your entire string then you can do it like this with a regular expression.

In my case the object being searched isn't a string but a BeautifulSoup object.
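For a BeautifulSoup object, a compiled pattern can be passed as the attribute filter to find_all. Below is a minimal sketch, assuming a small hard-coded HTML fragment in place of the downloaded page (the second link and the link texts are made up for illustration) and the built-in 'html.parser' instead of 'lxml'. The pattern is searched against the attribute value only, so it should not begin with the double quote that surrounds the value in the HTML source.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: one matching link, one unrelated link.
html = '''
<a class="a-link-normal a-text-normal" href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1?keywords=9783319988412&amp;qid=1574431833&amp;sr=8-1">Cybersecurity</a>
<a href="/help">Help</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# The regex is applied to the href value itself (no surrounding quotes).
links = soup.find_all('a', href=re.compile(r'^/.*keywords='))
hrefs = [a['href'] for a in links]
print(len(hrefs))
for href in hrefs:
    print(href)
```

Note that get_text() returns the text between <a> and </a>; the attribute value is read with a['href'] or a.get('href').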

(Nov-23-2019, 01:11 AM)buran Wrote: https://python-forum.io/Thread-Learning-...2#pid96702
Also look at @snippsat's answer in the same thread:
https://python-forum.io/Thread-Learning-...9#pid96789

Thanks. Tried without regexp. Doesn't work.

html = download('http://www.amazon.com/s?k=9783319988412&ref=nb_sb_noss')
bs = BeautifulSoup(html.read(), 'lxml')
link = bs.select('a[href^⁼"/Cybersecurity"]')
print(link)
Here is the output:
Output:
Traceback (most recent call last):
  File "/home/pavel/python_code/BeautifulSoup_test1.py", line 23, in <module>
    link = bs.select('a[href^⁼"/Cybersecurity"]')
  File "/home/pavel/.local/lib/python3.6/site-packages/bs4/element.py", line 1358, in select
    return soupsieve.select(selector, self, namespaces, limit, **kwargs)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/__init__.py", line 114, in select
    return compile(select, namespaces, flags, **kwargs).select(tag, limit)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/__init__.py", line 63, in compile
    return cp._cached_css_compile(pattern, namespaces, custom, flags)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/css_parser.py", line 206, in _cached_css_compile
    CSSParser(pattern, custom=custom_selectors, flags=flags).process_selectors(),
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/css_parser.py", line 1062, in process_selectors
    return self.parse_selectors(self.selector_iter(self.pattern), index, flags)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/css_parser.py", line 911, in parse_selectors
    key, m = next(iselector)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/css_parser.py", line 1055, in selector_iter
    raise SelectorSyntaxError(msg, self.pattern, index)
  File "<string>", line None
soupsieve.util.SelectorSyntaxError: Malformed attribute selector at position 1
  line 1:
a[href^⁼"/Cybersecurity"]
import requests
from bs4 import BeautifulSoup

url = 'http://www.amazon.com/s?k=9783319988412&ref=nb_sb_noss'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': 'text/html,*/*',
    'Accept-Language': 'en,en-US;q=0.7,en;q=0.3',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive'}

resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, 'lxml')

# using find
a = soup.find('a', {'class': 'a-link-normal a-text-normal'})
print(a.get('href'))

# using  css selectors
a1 = soup.select('a.a-link-normal.a-text-normal')
print(a1[0].get('href'))
Not sure why you get the malformed selector error, but you probably don't want to select using part of the href attribute (would you change the selector for every page?)
It works. Thanks.
But perhaps for other books the tag attributes will be different (i.e. something other than 'a-link-normal a-text-normal').
Would it be possible to search using the keyword 3319988417, which is the ISBN for this book?
Thanks once more.
In your 'a[href^⁼"/Cybersecurity"]' there is a Unicode ⁼ (superscript equals sign); the normal one is =.
This is not the best way to do this anyway, as it finds all Cybersecurity links and returns a list, which then has to be cleaned up some more.
Quote:Could it possible to search using keyword 3319988417, which is ISBN for this book.
I think I answered this in post #6 of your other thread.
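A look-alike character such as the one in that selector is hard to spot by eye. The following is a small sketch, using only the standard library, of one way to list every non-ASCII character in a selector string together with its position and Unicode name; the helper name non_ascii_chars is made up for this example.

```python
import unicodedata

def non_ascii_chars(text):
    """Return (index, character, Unicode name) for each non-ASCII character."""
    return [
        (i, ch, unicodedata.name(ch, 'UNKNOWN'))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

# The selector from the traceback, with the look-alike written as an escape (U+207C).
bad_selector = 'a[href^\u207c"/Cybersecurity"]'
# The same selector typed with a plain ASCII equals sign.
good_selector = 'a[href^="/Cybersecurity"]'

print(non_ascii_chars(bad_selector))
print(non_ascii_chars(good_selector))
```

An empty list for a selector means it contains only ASCII characters, so any remaining error has a different cause.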
(Nov-25-2019, 01:09 PM)snippsat Wrote: In your 'a[href^⁼"/Cybersecurity"]' there is a Unicode ⁼ the normal is =.
Thanks.
Concerning post #6, that was a different question.
In that post the question was how to find the page using the ISBN.
My actual question is about searching inside the page using the ISBN.
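One possible way to search inside an already parsed page by ISBN is a CSS substring attribute selector (href*=...). This is only a sketch: the HTML fragment is a hypothetical stand-in shaped like the tag quoted at the top of the thread, the second link is invented, the helper name links_for_isbn is made up, and it assumes the ISBN appears in the href as /dp/<isbn>/, as it does in that tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; only the first href is taken from the thread.
html = '''
<a class="a-link-normal a-text-normal" href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1">Cybersecurity</a>
<a class="a-link-normal" href="/Some-Other-Book/dp/0000000000/ref=sr_1_2">Other</a>
'''

def links_for_isbn(soup, isbn):
    """Return the href of every <a> tag whose href contains /dp/<isbn>/."""
    selector = f'a[href*="/dp/{isbn}/"]'
    return [a['href'] for a in soup.select(selector)]

soup = BeautifulSoup(html, 'html.parser')
result = links_for_isbn(soup, '3319988417')
print(result)
```

Unlike a class-based selector, this does not depend on the class names staying the same, but it does depend on the URL layout.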
(Nov-25-2019, 12:43 PM)Pavel_47 Wrote: But perhaps for other books the attribute of tags will be different (i.e. instead of 'a-link-normal a-text-normal' something else)
Actually, the product page is a template, so the HTML markup (e.g. the class attribute) for that particular element is expected to be the same (at least for main parts like the book title/link); only the content will differ.