Beautifulsoup parsing

**Larz60+** · (This post was last modified: Apr-04-2017, 09:28 PM by Larz60+.)

Line in the HTML:

Output:
<b>Host Software</b> S. Crocker

I want the title (Host Software) and the author separated.
I get the title with x.find('b').

I am tired and this is not popping out of my weary brain.
What about the author?

***metulburr*** · (This post was last modified: Apr-04-2017, 09:35 PM by metulburr.)

What's the next tag after </b>?

**Larz60+** · (This post was last modified: Apr-04-2017, 09:43 PM by Larz60+.)

Here are two sets of table entries:

Output:
<tr valign="top">
<td valign="top">
<script type="text/javascript">
          doMainDocLink('RFC0001');
        </script><noscript>0001</noscript>
</td>
<td>
<b>Host Software</b> S. Crocker 
        [ April 1969 ]
        (TXT = 21088)
        
        (Status: UNKNOWN)
        
          (Stream: Legacy)
        
           (DOI: 10.17487/RFC0001)
        </td>
</tr>
<tr valign="top">
<td valign="top">
<script type="text/javascript">
          doMainDocLink('RFC0002');
        </script><noscript>0002</noscript>
</td>
<td>
<b>Host software</b> B. Duvall 
        [ April 1969 ]
        (TXT = 17145)
        
        (Status: UNKNOWN)
        
          (Stream: Legacy)
        
           (DOI: 10.17487/RFC0002)
        </td>
</tr>

The text after the <b> tag varies in length and content.

The page is located here: https://www.rfc-editor.org/rfc-index.html

***metulburr*** · (This post was last modified: Apr-04-2017, 09:57 PM by metulburr.)

I'm actually not sure how to do that other than string splitting after getting that td.
But that assumes the structure is always
Host Software X. XXXXXX

from bs4 import BeautifulSoup

html = '''
<tr valign="top">
<td valign="top">
<script type="text/javascript">
          doMainDocLink('RFC0001');
        </script><noscript>0001</noscript>
</td>
<td>
<b>Host Software</b> S. Crocker 
        [ April 1969 ]
        (TXT = 21088)
        
        (Status: UNKNOWN)
        
          (Stream: Legacy)
        
           (DOI: 10.17487/RFC0001)
        </td>
</tr>
<tr valign="top">
<td valign="top">
<script type="text/javascript">
          doMainDocLink('RFC0002');
        </script><noscript>0002</noscript>
</td>
<td>
<b>Host software</b> B. Duvall 
        [ April 1969 ]
        (TXT = 17145)
        
        (Status: UNKNOWN)
        
          (Stream: Legacy)
        
           (DOI: 10.17487/RFC0002)
        </td>
</tr>
'''

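soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td')
td = tds[1]  # second cell of the first row holds the title and author
words = td.text.split()  # split the cell text on whitespace
print(words[:2])   # first two words: the title
print(words[2:4])  # next two words: the author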

Output:
['Host', 'Software']
['S.', 'Crocker']
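
One possible alternative to counting words, offered as a sketch: in BeautifulSoup the text that follows </b> is a text node, reachable through the tag's `next_sibling`. The snippet below uses a trimmed copy of the two rows posted above. The helper name `title_and_author` is made up for illustration, and cutting at the first "[" assumes the date always starts with a bracket, which has only been checked against these two rows.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the two table rows from this thread.
html = """
<table>
<tr><td>
<b>Host Software</b> S. Crocker
        [ April 1969 ]
        (TXT = 21088)
</td></tr>
<tr><td>
<b>Host software</b> B. Duvall
        [ April 1969 ]
        (TXT = 17145)
</td></tr>
</table>
"""


def title_and_author(btag):
    """Return (title, author) for a <b> tag; author is '' when nothing follows."""
    title = btag.get_text(strip=True)
    trailing = btag.next_sibling  # node directly after </b>, or None
    if not isinstance(trailing, NavigableString):
        return title, ""
    # Assumption: the author text ends where the bracketed date begins.
    author = str(trailing).partition("[")[0].strip()
    return title, author


soup = BeautifulSoup(html, "html.parser")
pairs = [title_and_author(b) for b in soup.find_all("b")]
print(pairs)
# [('Host Software', 'S. Crocker'), ('Host software', 'B. Duvall')]
```

Because the helper works on the text node itself, it does not depend on how many words the title or the author contains.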

**Larz60+** · Apr-04-2017, 10:41 PM

I thought so. Even if there is another way, that will work fine as long as I account for
the possibility that no author is listed.

***zivoni*** · Apr-04-2017, 11:15 PM

I think splitting on whitespace is not enough: there are longer titles and multiple authors. I tried a quick and dirty approach on your URL: take the <b> text as the title, then split the rest of the cell on "[".

from bs4 import BeautifulSoup as bs
import requests

url = "https://www.rfc-editor.org/rfc-index.html"
 
soup = bs(requests.get(url).text, 'html.parser')
for btag in soup.select("td b")[1:]:
    title = btag.text
    # Skip past the title in the cell text, then keep everything before the first "["
    author = btag.parent.text[len(title)+1:].partition("[")[0].strip()
    print("Title: {}\nAuthor: {}\n".format(title, author))

This gives:

Output:
Title: Augmented BNF for Syntax Specifications: ABNF
Author: D. Crocker, P. Overell

Title: Host Software
Author: S. Crocker

Title: Host software
Author: B. Duvall

Title: Documentation conventions
Author: S.D. Crocker

...
...

Title: ARPA Network Functional Specifications
Author: G. Deloche

Title: Host Software
Author: G. Deloche
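
As a possible follow-up, the text after the title could be broken down a little further so that multiple authors come back as a list and an entry with no author yields an empty list. This is only a sketch: `parse_trailing` is a hypothetical helper, and it assumes authors are separated by commas and the date sits inside the first pair of square brackets, which matches the samples in this thread but has not been verified against the whole index.

```python
def parse_trailing(text):
    """Split the text that follows the title into (authors, date).

    Assumptions: authors are comma-separated and come before the first '[';
    the date is whatever sits between the first '[' and the next ']'.
    """
    before, bracket, after = text.partition("[")
    # Drop empty pieces so a missing author yields an empty list.
    authors = [name.strip() for name in before.split(",") if name.strip()]
    date = after.partition("]")[0].strip() if bracket else ""
    return authors, date


sample = " S. Crocker \n        [ April 1969 ]\n        (TXT = 21088)"
print(parse_trailing(sample))
# (['S. Crocker'], 'April 1969')
print(parse_trailing("D. Crocker, P. Overell"))
# (['D. Crocker', 'P. Overell'], '')
```

A comma split would mishandle any name that itself contains a comma, so treat the author list as a best effort.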

**Larz60+** · Apr-04-2017, 11:27 PM

I like it. That works well.

**Larz60+** · Apr-05-2017, 03:07 AM

Now here's the funny part.
I did a little more looking around the web site, and voila, there it was, a text file with everything I was looking for.

Not a loss at all, though, because I learned just a little more.
Now if I can keep that up until I retire at 90, it will be good!
