Extract text from tag content using regular expression

Pavel_47 · Nov-22-2019, 02:20 PM

Hello,

Here is the tag from which I want to extract a text fragment (in bold):
<a class="a-link-normal a-text-normal" href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1?keywords=9783319988412&qid=1574431833&sr=8-1">.

Here is my code:

import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError
from bs4 import BeautifulSoup
import re

def download(url, user_agent='wswp', num_retries=2):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        html = urllib.request.urlopen(request)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, user_agent, num_retries - 1)
    return html

html = download('http://www.amazon.com/s?k=9783319988412&ref=nb_sb_noss')
bs = BeautifulSoup(html.read(), 'lxml')
nameList = bs.find_all('a', {'href':re.compile('"/.*(keywords).*')})
print(len(nameList))
for name in nameList:
    print(name.get_text())

It doesn't work.
Any suggestions?
Thanks.

Fre3k · Nov-22-2019, 08:32 PM

Hi :)

If this is your entire string, then you can do it like this with a regular expression.

import re

string = '<a class="a-link-normal a-text-normal" href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1?keywords=9783319988412&qid=1574431833&sr=8-1">'

# [^"]* stops at the closing quote of the attribute value
url = re.search(r'href="([^"]*)"', string)
print(url.group(1))

# or:
url_ = re.findall(r'href="([^"]*)"', string)
print(url_)

Here is the regexHelper I use :)
-> RegexHelper

Kr
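Once the href value has been captured, it can be split further with the standard library instead of more regular expressions. The following is only a sketch, assuming the tag is available as a plain string exactly as quoted in the first post; the variable names (tag, title_slug, query) are made up for illustration.

```python
import re
from urllib.parse import urlsplit, parse_qs

# The tag from the first post, as a plain string.
tag = ('<a class="a-link-normal a-text-normal" '
       'href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/'
       'ref=sr_1_1?keywords=9783319988412&qid=1574431833&sr=8-1">')

# Capture only the attribute value, without the closing quote or ">".
match = re.search(r'href="([^"]*)"', tag)
href = match.group(1)

# Split the relative URL into its path and query string.
parts = urlsplit(href)
segments = parts.path.strip('/').split('/')
title_slug = segments[0]        # first path segment
query = parse_qs(parts.query)   # dict mapping each parameter to a list of values

print(title_slug)
print(segments[2])
print(query['keywords'][0])
```

parse_qs returns a list for every parameter because a query string may repeat a key, hence the [0].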

**buran** · (This post was last modified: Nov-23-2019, 01:11 AM by buran.)

https://python-forum.io/Thread-Learning-...2#pid96702
Also look at @snippsat's answer in the same thread:
https://python-forum.io/Thread-Learning-...9#pid96789

Pavel_47 · (This post was last modified: Nov-25-2019, 10:07 AM by Pavel_47.)

(Nov-22-2019, 08:32 PM)Fre3k Wrote: If this is your entire string then you can do it like this with re. expr.

In my case the object being searched isn't a string, but a BeautifulSoup object.
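A compiled regular expression can also be passed to find_all as an attribute filter, so the pattern is applied to each href value rather than to the raw HTML. Note that the pattern in the first post begins with a literal ", which is not part of the attribute value, so it cannot match this href. The sketch below assumes a small hand-written HTML snippet in place of the downloaded page (it is not real Amazon markup) and uses the built-in 'html.parser' so that lxml is not required.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not real Amazon markup).
html = '''
<div>
  <a class="a-link-normal a-text-normal"
     href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1?keywords=9783319988412&amp;qid=1574431833&amp;sr=8-1">Cybersecurity</a>
  <a href="/gp/help/customer/display.html">Help</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# The pattern is matched against the attribute value only,
# so it must not include the surrounding quotes.
links = soup.find_all('a', href=re.compile(r'^/.*keywords='))
hrefs = [a['href'] for a in links]

print(len(links))
for a in links:
    print(a['href'])
```

Here a['href'] gives the attribute value, while a.get_text() would give the visible link text.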

(Nov-23-2019, 01:11 AM)buran Wrote: https://python-forum.io/Thread-Learning-...2#pid96702
Also look at @snippsat's answer in the same thread:
https://python-forum.io/Thread-Learning-...9#pid96789

Thanks. Tried without regexp. Doesn't work.

html = download('http://www.amazon.com/s?k=9783319988412&ref=nb_sb_noss')
bs = BeautifulSoup(html.read(), 'lxml')
link = bs.select('a[href^⁼"/Cybersecurity"]')
print(link)

Here is the output:

Traceback (most recent call last):
  File "/home/pavel/python_code/BeautifulSoup_test1.py", line 23, in <module>
    link = bs.select('a[href^⁼"/Cybersecurity"]')
  File "/home/pavel/.local/lib/python3.6/site-packages/bs4/element.py", line 1358, in select
    return soupsieve.select(selector, self, namespaces, limit, **kwargs)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/__init__.py", line 114, in select
    return compile(select, namespaces, flags, **kwargs).select(tag, limit)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/__init__.py", line 63, in compile
    return cp._cached_css_compile(pattern, namespaces, custom, flags)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/css_parser.py", line 206, in _cached_css_compile
    CSSParser(pattern, custom=custom_selectors, flags=flags).process_selectors(),
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/css_parser.py", line 1062, in process_selectors
    return self.parse_selectors(self.selector_iter(self.pattern), index, flags)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/css_parser.py", line 911, in parse_selectors
    key, m = next(iselector)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/css_parser.py", line 1055, in selector_iter
    raise SelectorSyntaxError(msg, self.pattern, index)
  File "<string>", line None
soupsieve.util.SelectorSyntaxError: Malformed attribute selector at position 1
  line 1:
a[href^⁼"/Cybersecurity"]

**buran** · (This post was last modified: Nov-25-2019, 03:25 PM by buran.)

import requests
from bs4 import BeautifulSoup

url = 'http://www.amazon.com/s?k=9783319988412&ref=nb_sb_noss'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': 'text/html,*/*',
    'Accept-Language': 'en,en-US;q=0.7,en;q=0.3',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive'}

resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, 'lxml')

# using find
a = soup.find('a', {'class': 'a-link-normal a-text-normal'})
print(a.get('href'))

# using  css selectors
a1 = soup.select('a.a-link-normal.a-text-normal')
print(a1[0].get('href'))

I'm not sure why you get the malformed selector error, but you probably don't want to select using part of the href attribute (would you change the selector for every page?).

Pavel_47 · Nov-25-2019, 12:43 PM

It works. Thanks.
But perhaps for other books the tag attributes will be different (i.e. something else instead of 'a-link-normal a-text-normal').
Would it be possible to search using the keyword 3319988417, which is the ISBN for this book?
Thanks once more.

***snippsat*** · Nov-25-2019, 01:09 PM

In your 'a[href^⁼"/Cybersecurity"]' there is a Unicode superscript equals sign (⁼); the normal one is =.
This is not the best way to do it anyway, as it finds all Cybersecurity links and returns a list, which you then have to clean up some more.
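Look-alike characters of this kind are hard to spot by eye. A minimal sketch for finding them with the standard library follows; the helper name non_ascii_chars is made up for illustration, and the superscript sign is written as the escape \u207c so that it is unambiguous in the source.

```python
import unicodedata

def non_ascii_chars(text):
    """Return (index, character, Unicode name) for each non-ASCII character."""
    return [(i, ch, unicodedata.name(ch, 'UNKNOWN'))
            for i, ch in enumerate(text) if ord(ch) > 127]

# Selector from the traceback; \u207c is the superscript equals sign.
bad_selector = 'a[href^\u207c"/Cybersecurity"]'
# Same selector with a plain ASCII "=".
good_selector = 'a[href^="/Cybersecurity"]'

print(non_ascii_chars(bad_selector))
print(non_ascii_chars(good_selector))
```

The first call reports one character, at index 7, named SUPERSCRIPT EQUALS SIGN; the second returns an empty list.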

Quote: Could it be possible to search using the keyword 3319988417, which is the ISBN for this book?

I think I answered this in your other thread, post #6.

Pavel_47 · Nov-25-2019, 01:22 PM

(Nov-25-2019, 01:09 PM)snippsat Wrote: In your 'a[href^⁼"/Cybersecurity"]' there is a Unicode ⁼ the normal is =.

Thanks.
Concerning post #6, that was another story.
In that post the question was how to find the page using the ISBN.
My actual question is about searching inside the page using the ISBN.
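One possible way to search inside an already parsed page by this number is a CSS "contains" attribute selector (*=) on the href. This is only a sketch: it assumes the number appears after /dp/ in the link, as it does in the href from the first post, and it runs on a hand-written snippet rather than the real page. The function name links_for_isbn and the second link are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the search results page (not real Amazon markup).
html = '''
<a class="a-link-normal a-text-normal" href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1">Cybersecurity</a>
<a class="a-link-normal" href="/Some-Other-Book/dp/0000000000/ref=sr_1_2">Other</a>
'''

def links_for_isbn(soup, isbn10):
    """Return the href of every <a> whose href contains '/dp/<isbn10>'."""
    return [a['href'] for a in soup.select(f'a[href*="/dp/{isbn10}"]')]

soup = BeautifulSoup(html, 'html.parser')
matches = links_for_isbn(soup, '3319988417')
print(matches)
```

Unlike a selector built from the book title, this one changes only in the number passed in, so the same function can be reused for other books.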

**buran** · (This post was last modified: Nov-25-2019, 03:17 PM by buran.)

(Nov-25-2019, 12:43 PM)Pavel_47 Wrote: But perhaps for other books the tag attributes will be different (i.e. something else instead of 'a-link-normal a-text-normal')

Actually, the product page is a template, so the HTML attributes (e.g. the class attribute) of that particular element are expected to be the same (at least for main parts like the book title/link); just the content will be different.
