Python Forum
Extract Anchor Text (Scrapy)
#1
I tried the search box but didn't find a related post, and Googling didn't turn up an answer either.

How can I extract only the anchor text in a given hyperlink?

Quote: e.g. <a href='mydomain.com'>my anchor text</a>

Quote: example:
<div class="blog_next_page">
<a class="next_page" href="mydomain.com/page/2">my anchor text</a>
</div>

I opened the page with
scrapy shell 'website url'
and with
response.css('div.blog_next_page > a::attr(href)').extract_first()
I can extract the link, but how can I get "my anchor text"?

Many thanks for the help!
#2
try
response.css('div.blog_next_page > a::text').extract_first()
Scrapy Selectors docs
#3
(Jul-21-2018, 06:26 AM)buran Wrote: try
response.css('div.blog_next_page > a::text').extract_first()
Scrapy Selectors docs

It works!

I was messing around with putting 'text' inside attr(), or writing a::text(), geez...

So is '::text' on its own just part of the selector string, or does it pick up the string inside the a tag?

Thanks again!