 Extract text from tag content using regular expression
#1
Hello,

Here is the tag from where I want to extract text fragment (in bold):
<a class="a-link-normal a-text-normal" href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1?keywords=9783319988412&qid=1574431833&sr=8-1">.

Here is my code:
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError
from bs4 import BeautifulSoup
import re

def download(url, user_agent='wswp', num_retries=2):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        html = urllib.request.urlopen(request)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors, keeping the same user agent
                return download(url, user_agent, num_retries - 1)
    return html

html = download('http://www.amazon.com/s?k=9783319988412&ref=nb_sb_noss')
bs = BeautifulSoup(html.read(), 'lxml')
nameList = bs.find_all('a', {'href':re.compile('"/.*(keywords).*')})
print(len(nameList))
for name in nameList:
    print(name.get_text())
It doesn't work.
Any suggestions?
Thanks.
#2
Hi :)

If this is your entire string, then you can do it like this with a regular expression:

import re

string = '<a class="a-link-normal a-text-normal" href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1?keywords=9783319988412&qid=1574431833&sr=8-1">'

# capture everything between the quotes of the href attribute
url = re.search(r'href="([^"]*)"', string)
print(url.group(1))

# or:
url_ = re.findall(r'href="([^"]*)"', string)
print(url_)
Here is the regexHelper I use :)
-> RegexHelper

Kr
#3
https://python-forum.io/Thread-Learning-...2#pid96702
Also look at @snippsat's answer in the same thread:
https://python-forum.io/Thread-Learning-...9#pid96789
#4
(Nov-22-2019, 08:32 PM)Fre3k Wrote: Hi :)

If this is your entire string, then you can do it like this with a regular expression:

import re

string = '<a class="a-link-normal a-text-normal" href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1?keywords=9783319988412&qid=1574431833&sr=8-1">'

# capture everything between the quotes of the href attribute
url = re.search(r'href="([^"]*)"', string)
print(url.group(1))

# or:
url_ = re.findall(r'href="([^"]*)"', string)
print(url_)
Here is the regexHelper I use :)
-> RegexHelper

Kr

In my case the object being searched isn't a string but a BeautifulSoup object.
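
For what it's worth, a regular expression can still be used when the object being searched is a BeautifulSoup tree. The sketch below is only an illustration: it runs against a small static snippet standing in for the downloaded page (the real Amazon markup may differ, and the link texts are made up), and it uses 'html.parser' so it does not depend on lxml. The point it shows is that the pattern is matched against the attribute value itself, so it must not contain the quote character that surrounds the value in the HTML source.

```python
import re
from bs4 import BeautifulSoup

# Static stand-in for the downloaded page; link texts are hypothetical
html = '''
<a class="a-link-normal a-text-normal" href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1?keywords=9783319988412&amp;qid=1574431833&amp;sr=8-1">Cybersecurity book</a>
<a href="/gp/help/customer/display.html">Help</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# The regex sees only the href value, without the surrounding double quotes,
# so the pattern starts at "/" rather than at '"/'
links = soup.find_all('a', href=re.compile(r'^/.*keywords='))
hrefs = [a['href'] for a in links]
print(len(links))
print(hrefs)
```

With a pattern like '"/.*(keywords).*' the leading double quote can never match, which would explain an empty result list.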

(Nov-23-2019, 01:11 AM)buran Wrote: https://python-forum.io/Thread-Learning-...2#pid96702
Also look at @snippsat's answer in the same thread:
https://python-forum.io/Thread-Learning-...9#pid96789

Thanks. I tried it without a regexp, but it doesn't work either.

html = download('http://www.amazon.com/s?k=9783319988412&ref=nb_sb_noss')
bs = BeautifulSoup(html.read(), 'lxml')
link = bs.select('a[href^⁼"/Cybersecurity"]')
print(link)
Here is the output:
Output:
Traceback (most recent call last):
  File "/home/pavel/python_code/BeautifulSoup_test1.py", line 23, in <module>
    link = bs.select('a[href^⁼"/Cybersecurity"]')
  File "/home/pavel/.local/lib/python3.6/site-packages/bs4/element.py", line 1358, in select
    return soupsieve.select(selector, self, namespaces, limit, **kwargs)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/__init__.py", line 114, in select
    return compile(select, namespaces, flags, **kwargs).select(tag, limit)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/__init__.py", line 63, in compile
    return cp._cached_css_compile(pattern, namespaces, custom, flags)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/css_parser.py", line 206, in _cached_css_compile
    CSSParser(pattern, custom=custom_selectors, flags=flags).process_selectors(),
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/css_parser.py", line 1062, in process_selectors
    return self.parse_selectors(self.selector_iter(self.pattern), index, flags)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/css_parser.py", line 911, in parse_selectors
    key, m = next(iselector)
  File "/home/pavel/.local/lib/python3.6/site-packages/soupsieve/css_parser.py", line 1055, in selector_iter
    raise SelectorSyntaxError(msg, self.pattern, index)
  File "<string>", line None
soupsieve.util.SelectorSyntaxError: Malformed attribute selector at position 1
  line 1:
a[href^⁼"/Cybersecurity"]
#5
import requests
from bs4 import BeautifulSoup

url = 'http://www.amazon.com/s?k=9783319988412&ref=nb_sb_noss'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': 'text/html,*/*',
    'Accept-Language': 'en,en-US;q=0.7,en;q=0.3',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive'}

resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, 'lxml')

# using find
a = soup.find('a', {'class': 'a-link-normal a-text-normal'})
print(a.get('href'))

# using  css selectors
a1 = soup.select('a.a-link-normal.a-text-normal')
print(a1[0].get('href'))
I'm not sure why you get the malformed selector error, but you probably don't want to select by part of the href attribute anyway (would you change the selector for every page?).
#6
It works. Thanks.
But perhaps for other books the attributes of the tags will be different (i.e. something else instead of 'a-link-normal a-text-normal').
Would it be possible to search using the keyword 3319988417, which is the ISBN for this book?
Thanks once more.
#7
In your 'a[href^⁼"/Cybersecurity"]' there is a Unicode superscript ⁼; the normal character is =.
This is not the best way to do this anyway, as it finds all Cybersecurity links and returns a list, which then has to be cleaned up some more.
Quote:Could it possible to search using keyword 3319988417, which is ISBN for this book.
I think I answered this in post #6 of your other thread.
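
A stray character like this is hard to spot by eye, so here is a minimal sketch of one way to find it using only the standard library. The helper name non_ascii_chars is made up for this illustration, and the selector is written with a \u207c escape so the offending character is explicit. NFKC normalization is one possible repair for this particular character; retyping the = by hand works just as well.

```python
import unicodedata

# The selector from post #4; \u207c is the superscript equals sign
bad_selector = 'a[href^\u207c"/Cybersecurity"]'

def non_ascii_chars(text):
    """Return (index, char, Unicode name) for every non-ASCII character."""
    return [(i, ch, unicodedata.name(ch, 'UNKNOWN'))
            for i, ch in enumerate(text) if ord(ch) > 127]

# Lists the position and name of each suspicious character
suspects = non_ascii_chars(bad_selector)
print(suspects)

# NFKC folds compatibility characters such as U+207C into their plain equivalents
fixed_selector = unicodedata.normalize('NFKC', bad_selector)
print(fixed_selector)
```

The normalized string is the selector with a plain =, which is the form the CSS parser expects.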
#8
(Nov-25-2019, 01:09 PM)snippsat Wrote: In your 'a[href^⁼"/Cybersecurity"]' there is a Unicode ⁼ the normal is =.
Thanks.
Concerning post #6, that was a different question.
In that post, the question was how to find the page using the ISBN.
My current question is about searching inside the page using the ISBN.
#9
(Nov-25-2019, 12:43 PM)Pavel_47 Wrote: But perhaps for other books the attribute of tags will be differnt (i.e. instead of 'a-link-normal a-text-normal' something else)
Actually, the product page is a template, so the HTML attributes (e.g. the class attribute) for that particular element are expected to be the same (at least for the main parts like the book title/link); only the content will differ.
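
If searching inside the page by ISBN is still wanted, one possible approach is a CSS substring match on the href attribute. This is only a sketch under assumptions: the HTML below is a static stand-in for the search results page, the second link and both link texts are invented for contrast, and it assumes the ISBN appears after /dp/ in the link, as it does in the href from post #1.

```python
from bs4 import BeautifulSoup

# Static stand-in for the results page; the second link is hypothetical
html = '''
<a class="a-link-normal a-text-normal" href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1?keywords=9783319988412&amp;qid=1574431833&amp;sr=8-1">Cybersecurity book</a>
<a class="a-link-normal a-text-normal" href="/Some-Other-Book/dp/0000000000/ref=sr_1_2">Other book</a>
'''

isbn10 = '3319988417'
soup = BeautifulSoup(html, 'html.parser')

# [href*="..."] matches any <a> whose href contains the given substring
matches = soup.select(f'a[href*="/dp/{isbn10}"]')
for a in matches:
    print(a['href'])
```

This keeps the selector independent of the class names, at the cost of relying on the URL layout staying the same.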