Extract text from tag content using regular expression

Extract text from tag content using regular expression - Pavel_47 - Nov-22-2019

Hello,

Here is the tag from which I want to extract a text fragment (in bold):
<a class="a-link-normal a-text-normal" href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1?keywords=9783319988412&qid=1574431833&sr=8-1">.

Here is my code:

import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError
from bs4 import BeautifulSoup
import re

def download(url, user_agent='wswp', num_retries=2):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        html = urllib.request.urlopen(request)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, user_agent, num_retries - 1)
    return html

html = download('http://www.amazon.com/s?k=9783319988412&ref=nb_sb_noss')
bs = BeautifulSoup(html.read(), 'lxml')
nameList = bs.find_all('a', {'href':re.compile('"/.*(keywords).*')})
print(len(nameList))
for name in nameList:
    print(name.get_text())

It doesn't work.
Any suggestions?
Thanks.

RE: Extract text from tag content using regular expression - Fre3k - Nov-22-2019

Hi :)

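If this is your entire string then you can do it like this with a regular expression.

import re

# single quotes outside, so the double quotes inside the tag need no escaping
string = '<a class="a-link-normal a-text-normal" href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1?keywords=9783319988412&qid=1574431833&sr=8-1">'

# non-greedy group: capture everything up to the closing quote of href
url = re.search(r'href="(.*?)"', string)
print(url.group(1))

# or: findall() returns a list with every captured href value
url_ = re.findall(r'href="(.*?)"', string)
print(url_)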

Here is the regexHelper I use :)
-> RegexHelper

Kr

RE: Extract text from tag content using regular expression - buran - Nov-23-2019

https://python-forum.io/Thread-Learning-bs4-re-search-RegEx-string-cutoff?pid=96702#pid96702
Also look at @snippsat's answer in the same thread:
https://python-forum.io/Thread-Learning-bs4-re-search-RegEx-string-cutoff?pid=96789#pid96789

RE: Extract text from tag content using regular expression - Pavel_47 - Nov-25-2019

(Nov-22-2019, 08:32 PM)Fre3k Wrote: Hi :)

If this is your entire string then you can do it like this with a regular expression.

In my case the object being searched isn't a string but a BeautifulSoup object.

(Nov-23-2019, 01:11 AM)buran Wrote: https://python-forum.io/Thread-Learning-bs4-re-search-RegEx-string-cutoff?pid=96702#pid96702
Also look at @snippsat's answer in the same thread:
https://python-forum.io/Thread-Learning-bs4-re-search-RegEx-string-cutoff?pid=96789#pid96789

Thanks. I tried it without a regexp, but it doesn't work either.

html = download('http://www.amazon.com/s?k=9783319988412&ref=nb_sb_noss')
bs = BeautifulSoup(html.read(), 'lxml')
link = bs.select('a[href^⁼"/Cybersecurity"]')
print(link)

Here is output:

Traceback (most recent call last):
  File "/home/pavel/python_code/BeautifulSoup_test1.py", line 23, in <module>
    link = bs.select('a[href^⁼"/Cybersecurity"]')
  File "/home/pavel/.local/lib/python3.6/site-packages/bs4/element.py", line 1358, in select
    return soupsieve.select(selector, self, namespaces, limit, **kwargs)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/__init__.py", line 114, in select
    return compile(select, namespaces, flags, **kwargs).select(tag, limit)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/__init__.py", line 63, in compile
    return cp._cached_css_compile(pattern, namespaces, custom, flags)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/css_parser.py", line 206, in _cached_css_compile
    CSSParser(pattern, custom=custom_selectors, flags=flags).process_selectors(),
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/css_parser.py", line 1062, in process_selectors
    return self.parse_selectors(self.selector_iter(self.pattern), index, flags)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/css_parser.py", line 911, in parse_selectors
    key, m = next(iselector)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/css_parser.py", line 1055, in selector_iter
    raise SelectorSyntaxError(msg, self.pattern, index)
  File "<string>", line None
soupsieve.util.SelectorSyntaxError: Malformed attribute selector at position 1
  line 1:
a[href^⁼"/Cybersecurity"]

RE: Extract text from tag content using regular expression - buran - Nov-25-2019

import requests
from bs4 import BeautifulSoup

url = 'http://www.amazon.com/s?k=9783319988412&ref=nb_sb_noss'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': 'text/html,*/*',
    'Accept-Language': 'en,en-US;q=0.7,en;q=0.3',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive'}

resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, 'lxml')

# using find
a = soup.find('a', {'class': 'a-link-normal a-text-normal'})
print(a.get('href'))

# using  css selectors
a1 = soup.select('a.a-link-normal.a-text-normal')
print(a1[0].get('href'))

Not sure why you get the malformed selector error, but you probably don't want to select using part of the href attribute (would you change the selector for every page?).

RE: Extract text from tag content using regular expression - Pavel_47 - Nov-25-2019

It works. Thanks.
But perhaps for other books the tag attributes will be different (e.g. something else instead of 'a-link-normal a-text-normal').
Would it be possible to search using the keyword 3319988417, which is the ISBN for this book?
Thanks once more.

RE: Extract text from tag content using regular expression - snippsat - Nov-25-2019

In your 'a[href^⁼"/Cybersecurity"]' there is a Unicode ⁼ character; the normal one is =.
This is not the best way to do it anyway, as it finds all Cybersecurity links and returns a list that then has to be cleaned up some more.

Quote: Would it be possible to search using the keyword 3319988417, which is the ISBN for this book?

I think I answered this in your other thread, post #6.
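
A look-alike character like this is hard to see by eye, so one way to catch it is to list every non-ASCII character in the selector string. This is only a minimal sketch using the standard library; it assumes the stray character is U+207C (which is what it appears to be here), and find_non_ascii is a hypothetical helper name, not part of bs4 or soupsieve.

```python
import unicodedata


def find_non_ascii(text):
    """Return (index, character, Unicode name) for every non-ASCII character."""
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


# Assumption: the stray character is U+207C; it is written as an escape
# here so that it cannot be confused with a normal '='.
bad_selector = 'a[href^\u207c"/Cybersecurity"]'
for index, ch, name in find_non_ascii(bad_selector):
    print(f"position {index}: {ch!r} is {name}")

# Swapping in the ASCII equals sign gives a well-formed "starts with" selector.
good_selector = bad_selector.replace("\u207c", "=")
print(good_selector)
```

The cleaned-up string can then be passed to select() as usual.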

RE: Extract text from tag content using regular expression - Pavel_47 - Nov-25-2019

(Nov-25-2019, 01:09 PM)snippsat Wrote: In your 'a[href^⁼"/Cybersecurity"]' there is a Unicode ⁼ the normal is =.

Thanks.
Concerning post #6, that was another story.
In that post the question was how to find the page using the ISBN.
My current question is about searching inside the page using the ISBN.
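
For searching inside the page by ISBN, one possible approach is to match the ISBN against the href of every link, either with a regular expression or with a CSS "contains" selector. The sketch below runs on a small hand-written HTML snippet rather than the real page, so the markup, the second link and the link texts are assumptions for illustration only. Note that the pattern is matched against the attribute value itself, which does not include the surrounding quotes, so a pattern that starts with a literal " would never match.

```python
import re
from bs4 import BeautifulSoup

# Static stand-in for the downloaded page (assumed markup, not Amazon's real HTML).
html = """
<div>
  <a class="a-link-normal a-text-normal" href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1?keywords=9783319988412">Book title</a>
  <a class="a-link-normal" href="/Some-Other-Book/dp/0000000000/ref=sr_1_2">Other title</a>
  <a name="top">Anchor without href</a>
</div>
"""

isbn = "3319988417"
soup = BeautifulSoup(html, "html.parser")

# 1) Regular expression on the href value; tags without an href never match.
by_regex = soup.find_all("a", href=re.compile(re.escape(isbn)))

# 2) CSS attribute selector: *= means "href contains this substring".
by_css = soup.select(f'a[href*="{isbn}"]')

for a in by_regex:
    print(a.get_text(strip=True), "->", a["href"])
```

Both calls return a list, so it is still up to the caller to decide what to do when several links (or none) contain the ISBN.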

RE: Extract text from tag content using regular expression - buran - Nov-25-2019

(Nov-25-2019, 12:43 PM)Pavel_47 Wrote: But perhaps for other books the tag attributes will be different (e.g. something else instead of 'a-link-normal a-text-normal')

Actually, the product page is a template, so the HTML markup (e.g. the class attribute) for that particular element is expected to be the same, at least for main parts like the book title/link; only the content will be different.