Python Forum
Extract contents from HTML
#3
(May-10-2018, 06:44 PM)snippsat Wrote: Use a parser, e.g. BeautifulSoup/lxml; see Web-Scraping part-1.
Example:
from bs4 import BeautifulSoup

html = '''\
<umpires>
  <umpire first="Mark" id="427533" last="Wegner" name="Mark Wegner" position="home"></umpire>
  <umpire first="Paul" id="427361" last="Nauert" name="Paul Nauert" position="first"></umpire>
  <umpire first="Gerry" id="427103" last="Davis" name="Gerry Davis" position="second"></umpire>
  <umpire first="Laz" id="427113" last="Diaz" name="Laz Diaz" position="third"></umpire>
  <umpire first="Bill" id="427344" last="Miller" name="Bill Miller" position="left"></umpire>
  <umpire first="Dan" id="427248" last="Iassogna" name="Dan Iassogna" position="right"></umpire>
</umpires>'''

soup = BeautifulSoup(html, 'lxml')
Use:
>>> soup.find('umpire')
<umpire first="Mark" id="427533" last="Wegner" name="Mark Wegner" position="home"></umpire>

>>> soup.find('umpire', position="second")
<umpire first="Gerry" id="427103" last="Davis" name="Gerry Davis" position="second"></umpire>
>>> soup.find('umpire', position="second").get('id')
'427103'

>>> [i.get('position') for i in soup.find_all('umpire')]
['home', 'first', 'second', 'third', 'left', 'right']
>>> [i.get('id') for i in soup.find_all('umpire')]
['427533', '427361', '427103', '427113', '427344', '427248']

>>> # Lastly, a little more advanced: name and position in a dictionary
>>> dict(zip([i.get('name') for i in soup.find_all('umpire')], [i.get('position') for i in soup.find_all('umpire')]))
{'Bill Miller': 'left',
 'Dan Iassogna': 'right',
 'Gerry Davis': 'second',
 'Laz Diaz': 'third',
 'Mark Wegner': 'home',
 'Paul Nauert': 'first'}
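
The dictionary step above calls find_all() twice and zips the results. A possible variant, offered only as a sketch, walks the tags once with a dict comprehension and also collects every attribute of each tag through .attrs. The shortened two-umpire html string, the variable names by_name and records, and the choice of the built-in 'html.parser' (so lxml is not required) are assumptions for illustration, not part of the original post.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample markup, just two entries for illustration
html = '''\
<umpires>
  <umpire first="Mark" id="427533" last="Wegner" name="Mark Wegner" position="home"></umpire>
  <umpire first="Gerry" id="427103" last="Davis" name="Gerry Davis" position="second"></umpire>
</umpires>'''

# 'html.parser' ships with Python; 'lxml' as in the post should work too if installed
soup = BeautifulSoup(html, 'html.parser')

# Find the tags once and reuse the result
umpires = soup.find_all('umpire')

# name -> position, built in a single pass
by_name = {u['name']: u['position'] for u in umpires}

# One plain dict per umpire holding all of its attributes
records = [dict(u.attrs) for u in umpires]

print(by_name)
print(records[0]['id'])
```

Note that u['name'] raises KeyError if the attribute is missing, while u.get('name') returns None, so .get() may be the safer choice on messier pages.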

Works great, thank you! I had tried using soup on my own but now realize I was approaching it completely wrong.

Messages In This Thread
Extract contents from HTML - by chisox721 - May-10-2018, 05:35 PM
RE: Extract contents from HTML - by snippsat - May-10-2018, 06:44 PM
RE: Extract contents from HTML - by chisox721 - May-10-2018, 09:50 PM
