Python Forum
Beautifulsoup parsing
#6
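I think splitting on whitespace is not enough, because some titles are longer and some entries have multiple authors. I tried a quick and dirty approach on your URL: take the text of each <b> tag as the title, then split the rest of the cell on "[" and keep the part before it as the author.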

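from bs4 import BeautifulSoup as bs
import requests

url = "https://www.rfc-editor.org/rfc-index.html"

# Download the index page and parse it with the built-in HTML parser.
soup = bs(requests.get(url).text, 'html.parser')
# Titles are the <b> tags inside table cells; [1:] skips the first match.
for btag in soup.select("td b")[1:]:
    title = btag.text
    # The rest of the cell follows the title; keep the part before the first "[".
    author = btag.parent.text[len(title)+1:].partition("[")[0].strip()
    print("Title: {}\nAuthor: {}\n".format(title, author))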
gives
Output:
Title: Augmented BNF for Syntax Specifications: ABNF
Author: D. Crocker, P. Overell

Title: Host Software
Author: S. Crocker

Title: Host software
Author: B. Duvall

Title: Documentation conventions
Author: S.D. Crocker

...
...

Title: ARPA Network Functional Specifications
Author: G. Deloche

Title: Host Software
Author: G. Deloche
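If you want to try the same idea without hitting the site, here is a minimal offline sketch. The sample rows and the `extract_entries` helper are made up for illustration, and the real index markup may differ. It is also a small variation on the snippet above: instead of slicing the cell text by the title length, it reads whatever follows the <b> tag in the same cell and keeps the part before the first "[".

```python
from bs4 import BeautifulSoup

# Hypothetical sample rows, loosely modelled on the index layout.
# The markup of the real page may differ.
SAMPLE_HTML = """
<table>
  <tr><td>0001</td><td><b>Host Software</b> S. Crocker [ April 1969 ] (Status: UNKNOWN)</td></tr>
  <tr><td>5234</td><td><b>Augmented BNF for Syntax Specifications: ABNF</b> D. Crocker, P. Overell [ January 2008 ] (Status: INTERNET STANDARD)</td></tr>
</table>
"""


def extract_entries(html):
    """Return (title, author) pairs, one per <b> tag found inside a table cell."""
    soup = BeautifulSoup(html, "html.parser")
    entries = []
    for btag in soup.select("td b"):
        title = btag.get_text(strip=True)
        # Join everything that follows the <b> tag in the same cell:
        # plain strings are used as-is, nested tags contribute their text.
        trailing = "".join(
            s if isinstance(s, str) else s.get_text()
            for s in btag.next_siblings
        )
        # The author list ends where the bracketed date begins.
        author = trailing.partition("[")[0].strip()
        entries.append((title, author))
    return entries


if __name__ == "__main__":
    for title, author in extract_entries(SAMPLE_HTML):
        print("Title: {}\nAuthor: {}\n".format(title, author))
```

Reading the siblings avoids assuming that exactly one character separates the title from the authors, but it still relies on "[" never appearing inside an author name.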


Messages In This Thread
Beautifulsoup parsing - by Larz60+ - Apr-04-2017, 09:28 PM
RE: Beautifulsoup parsing - by metulburr - Apr-04-2017, 09:35 PM
RE: Beautifulsoup parsing - by Larz60+ - Apr-04-2017, 09:43 PM
RE: Beautifulsoup parsing - by metulburr - Apr-04-2017, 09:57 PM
RE: Beautifulsoup parsing - by Larz60+ - Apr-04-2017, 10:41 PM
RE: Beautifulsoup parsing - by zivoni - Apr-04-2017, 11:15 PM
RE: Beautifulsoup parsing - by Larz60+ - Apr-04-2017, 11:27 PM
RE: Beautifulsoup parsing - by Larz60+ - Apr-05-2017, 03:07 AM
