Extract text from tag content using regular expression

Python Forum (https://python-forum.io)
Forum: Python Coding (https://python-forum.io/forum-7.html)
Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
Thread: Extract text from tag content using regular expression (/thread-22679.html)
Extract text from tag content using regular expression - Pavel_47 - Nov-22-2019

Hello,

Here is the tag from which I want to extract a text fragment (in bold):

    <a class="a-link-normal a-text-normal" href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1?keywords=9783319988412&qid=1574431833&sr=8-1">

Here is my code:

    import urllib.request
    from urllib.error import URLError, HTTPError, ContentTooShortError
    import re

    from bs4 import BeautifulSoup


    def download(url, user_agent='wswp', num_retries=2):
        print('Downloading:', url)
        request = urllib.request.Request(url)
        request.add_header('User-agent', user_agent)
        try:
            html = urllib.request.urlopen(request)
        except (URLError, HTTPError, ContentTooShortError) as e:
            print('Download error:', e.reason)
            html = None
            if num_retries > 0:
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    # recursively retry 5xx HTTP errors
                    return download(url, user_agent, num_retries - 1)
        return html


    html = download('http://www.amazon.com/s?k=9783319988412&ref=nb_sb_noss')
    bs = BeautifulSoup(html.read(), 'lxml')
    nameList = bs.find_all('a', {'href': re.compile('"/.*(keywords).*')})
    print(len(nameList))
    for name in nameList:
        print(name.get_text())

It doesn't work. Any suggestions? Thanks.

RE: Extract text from tag content using regular expression - Fre3k - Nov-22-2019

Hi :)

If this is your entire string, then you can do it like this with a regular expression:
    import re

    # Single quotes outside, so the double quotes inside the tag need no escaping.
    string = '<a class="a-link-normal a-text-normal" href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1?keywords=9783319988412&qid=1574431833&sr=8-1">'

    # The non-greedy group stops at the closing quote of the href value.
    url = re.search(r'href="(.*?)"', string)
    print(url.group(1))

    # or:
    url_ = re.findall(r'href="(.*?)"', string)
    print(url_)

Here is the regex helper I use :) -> RegexHelper

Kr

RE: Extract text from tag content using regular expression - buran - Nov-23-2019

https://python-forum.io/Thread-Learning-bs4-re-search-RegEx-string-cutoff?pid=96702#pid96702

Also look at @snippsat's answer in the same thread:
https://python-forum.io/Thread-Learning-bs4-re-search-RegEx-string-cutoff?pid=96789#pid96789

RE: Extract text from tag content using regular expression - Pavel_47 - Nov-25-2019

(Nov-22-2019, 08:32 PM)Fre3k Wrote: Hi :)

In my case the object being searched isn't a string but a BeautifulSoup object.

(Nov-23-2019, 01:11 AM)buran Wrote: https://python-forum.io/Thread-Learning-bs4-re-search-RegEx-string-cutoff?pid=96702#pid96702

Thanks. I tried without a regexp. It doesn't work.

    html = download('http://www.amazon.com/s?k=9783319988412&ref=nb_sb_noss')
    bs = BeautifulSoup(html.read(), 'lxml')
    link = bs.select('a[href^⁼"/Cybersecurity"]')
    print(link)

Here is the output:
RE: Extract text from tag content using regular expression - buran - Nov-25-2019

    import requests
    from bs4 import BeautifulSoup

    url = 'http://www.amazon.com/s?k=9783319988412&ref=nb_sb_noss'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
        'Accept': 'text/html,*/*',
        'Accept-Language': 'en,en-US;q=0.7,en;q=0.3',
        'X-Requested-With': 'XMLHttpRequest',
        'Connection': 'keep-alive'}
    resp = requests.get(url, headers=headers)
    soup = BeautifulSoup(resp.text, 'lxml')

    # using find
    a = soup.find('a', {'class': 'a-link-normal a-text-normal'})
    print(a.get('href'))

    # using css selectors
    a1 = soup.select('a.a-link-normal.a-text-normal')
    print(a1[0].get('href'))

Not sure why you get a malformed selector error, but you probably don't want to select using part of the href attribute (would you change the selector for every page?).

RE: Extract text from tag content using regular expression - Pavel_47 - Nov-25-2019

It works. Thanks.

But perhaps for other books the attributes of the tags will be different (i.e. something else instead of 'a-link-normal a-text-normal'). Would it be possible to search using the keyword 3319988417, which is the ISBN for this book?

Thanks once more.

RE: Extract text from tag content using regular expression - snippsat - Nov-25-2019

In your 'a[href^⁼"/Cybersecurity"]' there is a Unicode ⁼; the normal character is =.
This is not the best way to do this anyway, as it finds all Cybersecurity links and returns a list, which then has to be cleaned up some more.

Quote: Could it possible to search using keyword 3319988417, which is ISBN for this book.

I think I answered this in post #6 of your other thread.

RE: Extract text from tag content using regular expression - Pavel_47 - Nov-25-2019

(Nov-25-2019, 01:09 PM)snippsat Wrote: In your 'a[href^⁼"/Cybersecurity"]' there is a Unicode ⁼ the normal is =.

Thanks. Post #6 was a different story: in that post the question was how to find the page using the ISBN. My current question is about searching inside the page using the ISBN.
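For searching inside the page by ISBN, one possible approach is to match the identifier as a substring of the href. The following is only a minimal sketch: `SAMPLE_HTML` is a made-up stand-in for the downloaded page, `links_for_isbn` is a hypothetical helper name, and it assumes the identifier shows up in the link as `/dp/3319988417`, as in the tag quoted at the start of the thread. The real page markup may differ.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded results page (not real Amazon markup).
SAMPLE_HTML = """
<div>
  <a class="a-link-normal a-text-normal"
     href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/ref=sr_1_1?keywords=9783319988412">
    placeholder title
  </a>
  <a class="a-link-normal" href="/Some-Other-Book/dp/0000000000/ref=sr_1_2">other</a>
</div>
"""


def links_for_isbn(html, isbn):
    """Return <a> tags whose href contains /dp/<isbn>, found in two ways."""
    soup = BeautifulSoup(html, "html.parser")
    needle = f"/dp/{isbn}"
    # CSS attribute selector: *= means "attribute value contains this substring".
    by_css = soup.select(f'a[href*="{needle}"]')
    # Same idea with a compiled regex passed as the href filter.
    by_regex = soup.find_all("a", href=re.compile(re.escape(needle)))
    return by_css, by_regex


by_css, by_regex = links_for_isbn(SAMPLE_HTML, "3319988417")
for tag in by_css:
    print(tag.get("href"))
```

Both lookups depend only on the identifier, not on the class names, so they would keep working if the classes changed, at the cost of relying on the URL layout instead.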
RE: Extract text from tag content using regular expression - buran - Nov-25-2019

(Nov-25-2019, 12:43 PM)Pavel_47 Wrote: But perhaps for other books the attribute of tags will be differnt (i.e. instead of 'a-link-normal a-text-normal' something else)

Actually, the product page is a template, so the HTML attributes (e.g. the class attribute) of that particular element are expected to be the same, at least for main parts like the book title/link. Only the content will be different.
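The earlier attempts in this thread can be checked offline against a static copy of the tag. This is only a sketch under assumptions: `TAG` wraps the quoted tag with made-up link text and a closing `</a>`, and the live page may be structured differently. It illustrates that Beautiful Soup matches a regex against the attribute value itself (which contains no surrounding quotes), and that the attribute selector needs the ASCII `=`.

```python
import re
from bs4 import BeautifulSoup

# Static copy of the tag from the first post; link text and </a> are made up.
TAG = (
    '<a class="a-link-normal a-text-normal" '
    'href="/Cybersecurity-Intelligent-Systems-Reference-Library/dp/3319988417/'
    'ref=sr_1_1?keywords=9783319988412">placeholder text</a>'
)
soup = BeautifulSoup(TAG, "html.parser")

# Pattern from the first post: it begins with a literal double quote, but the
# href value seen by Beautiful Soup has no quotes, so nothing matches.
with_quote = soup.find_all("a", {"href": re.compile('"/.*(keywords).*')})

# Same idea without the stray quote, anchored at the start of the value.
without_quote = soup.find_all("a", {"href": re.compile(r"^/.*keywords=")})

# Prefix selector written with the ASCII equals sign.
by_prefix = soup.select('a[href^="/Cybersecurity"]')

# The selector with the Unicode superscript sign is expected to be rejected;
# the exact exception type depends on the installed bs4/soupsieve version.
try:
    soup.select('a[href^\u207c"/Cybersecurity"]')
    unicode_selector_rejected = False
except Exception:
    unicode_selector_rejected = True

print(len(with_quote), len(without_quote), len(by_prefix))
# The wanted fragment is in the href attribute, not in the link text.
print(without_quote[0].get("href"))
```

Note that `get_text()` returns the text between `<a>` and `</a>`, so a fragment that lives in the href has to be read with `tag.get("href")` instead.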