Beautifulsoup parsing - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: Beautifulsoup parsing (/thread-2715.html)

Beautifulsoup parsing - Larz60+ - Apr-04-2017

Line in html:

I want "Host software" and the author separated. I get the title with x.find('b'). I am tired and this is not popping out of my weary brain: what about the author?

RE: Beautifulsoup parsing - metulburr - Apr-04-2017

What's the next tag after </b>?

RE: Beautifulsoup parsing - Larz60+ - Apr-04-2017

Here's two sets of table entries:

The text after the <b> tag varies in length and content. The page is located here: https://www.rfc-editor.org/rfc-index.html

RE: Beautifulsoup parsing - metulburr - Apr-04-2017

I'm actually not sure how to do that other than string splitting after getting that td. But that is assuming the structure is always "Host Software X. XXXXXX".

    from bs4 import BeautifulSoup

    html = '''
    <tr valign="top">
    <td valign="top">
    <script type="text/javascript">
    doMainDocLink('RFC0001');
    </script><noscript>0001</noscript>
    </td>
    <td>
    <b>Host Software</b> S. Crocker [ April 1969 ] (TXT = 21088) (Status: UNKNOWN) (Stream: Legacy) (DOI: 10.17487/RFC0001)
    </td>
    </tr>
    <tr valign="top">
    <td valign="top">
    <script type="text/javascript">
    doMainDocLink('RFC0002');
    </script><noscript>0002</noscript>
    </td>
    <td>
    <b>Host software</b> B. Duvall [ April 1969 ] (TXT = 17145) (Status: UNKNOWN) (Stream: Legacy) (DOI: 10.17487/RFC0002)
    </td>
    </tr>
    '''

    soup = BeautifulSoup(html, 'html.parser')
    tds = soup.find_all('td')
    td = tds[1]                    # second cell of the first row
    print(td.text.split()[:2])     # first two words: the title
    print(td.text.split()[2:4])    # next two words: the author

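On the question of what follows </b>: in the sample above it is not a tag but a plain text node, which Beautiful Soup exposes through `next_sibling`. A minimal sketch of reading it directly instead of counting words, assuming the author text always sits immediately after the <b> element (the trimmed `html` string and the variable names here are only illustrative):

```python
from bs4 import BeautifulSoup

# One cell copied from the sample rows in this thread.
html = '''
<td>
<b>Host Software</b> S. Crocker [ April 1969 ] (TXT = 21088) (Status: UNKNOWN) (Stream: Legacy) (DOI: 10.17487/RFC0001)
</td>
'''

soup = BeautifulSoup(html, 'html.parser')
b = soup.find('b')
title = b.get_text(strip=True)

# The node right after </b> is a text node (NavigableString), not a tag.
# Fall back to an empty string if nothing follows the <b> element.
trailing = str(b.next_sibling or '')

# Assumption: the author field is everything before the first '['.
author = trailing.partition('[')[0].strip()

print(title)
print(author)
```

Because this slices on the "[" that opens the date rather than on word positions, it should not depend on the title or author having a fixed number of words.
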
RE: Beautifulsoup parsing - Larz60+ - Apr-04-2017

I thought so. Even if there is another way, that will work fine, so long as I allow for entries where no author is presented.

RE: Beautifulsoup parsing - zivoni - Apr-04-2017

I think that splitting on whitespace is not enough: there are both longer titles and multiple authors. I tried a dirty way on your URL, extracting the <b> tag and splitting the rest of the cell on "[".

    from bs4 import BeautifulSoup as bs
    import requests

    url = "https://www.rfc-editor.org/rfc-index.html"
    soup = bs(requests.get(url).text, 'html.parser')

    # [1:] skips the first <b> match on the page
    for btag in soup.select("td b")[1:]:
        title = btag.text
        # drop the title from the cell text, then keep what precedes "["
        author = btag.parent.text[len(title)+1:].partition("[")[0].strip()
        print("Title: {}\nAuthor: {}\n".format(title, author))

This prints a Title/Author pair for each entry.

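The two concerns raised above (an entry with no author, and entries with longer titles or several authors) can be checked offline without fetching the page. A rough sketch under stated assumptions: `title_and_author` is a hypothetical helper, not part of Beautiful Soup, and only the first row below comes from this thread; the other two rows are made up for illustration.

```python
from bs4 import BeautifulSoup

# Row 1 is from the thread; rows 2 and 3 are invented test data.
html = '''
<table>
<tr><td>0001</td><td><b>Host Software</b> S. Crocker [ April 1969 ] (TXT = 21088)</td></tr>
<tr><td>9991</td><td><b>A Longer Example Title</b> A. Author, B. Writer [ May 1970 ] (TXT = 100)</td></tr>
<tr><td>9992</td><td><b>Not Issued</b></td></tr>
</table>
'''

def title_and_author(btag):
    """Return (title, author_field); author_field is None when absent."""
    title = btag.get_text(strip=True)
    # Join only the plain-text nodes that follow </b> in the same cell.
    rest = ''.join(s for s in btag.next_siblings if isinstance(s, str))
    # Assumption: the author field ends at the '[' that opens the date.
    author = rest.partition('[')[0].strip()
    return title, (author or None)

soup = BeautifulSoup(html, 'html.parser')
rows = [title_and_author(b) for b in soup.select('td b')]
for title, author in rows:
    print("Title: {}\nAuthor: {}\n".format(title, author))
```

The author field is returned as one string rather than split into names, since splitting on commas would need further assumptions about how the index formats multiple authors.
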
RE: Beautifulsoup parsing - Larz60+ - Apr-04-2017

I like it, that works well.

RE: Beautifulsoup parsing - Larz60+ - Apr-05-2017

Now here's the funny part. I did a little more looking around the web site, and voila, there it was: a text file with everything I was looking for. Not a loss at all, though, because I learned just a little more. Now if I can keep that up until I retire at 90, it will be good!