Python Forum

Full Version: Extract contents from HTML
Hey guys, I'm trying to create a DataFrame from a portion of an HTML doc and can't figure out how to extract the data. It doesn't seem like it should be that hard, but I'm lost. All I'm looking to do is pull the values of the "id", "name", and "position" attributes. Also, I really only need the information when "position" is "home". Any help with this would be greatly appreciated.


[<umpires>
<umpire first="Mark" id="427533" last="Wegner" name="Mark Wegner" position="home"></umpire>
<umpire first="Paul" id="427361" last="Nauert" name="Paul Nauert" position="first"></umpire>
<umpire first="Gerry" id="427103" last="Davis" name="Gerry Davis" position="second"></umpire>
<umpire first="Laz" id="427113" last="Diaz" name="Laz Diaz" position="third"></umpire>
<umpire first="Bill" id="427344" last="Miller" name="Bill Miller" position="left"></umpire>
<umpire first="Dan" id="427248" last="Iassogna" name="Dan Iassogna" position="right"></umpire>
</umpires>]
Use a parser, e.g. BeautifulSoup with lxml; see Web-Scraping part-1.
Example:
from bs4 import BeautifulSoup

html = '''\
<umpires>
  <umpire first="Mark" id="427533" last="Wegner" name="Mark Wegner" position="home"></umpire>
  <umpire first="Paul" id="427361" last="Nauert" name="Paul Nauert" position="first"></umpire>
  <umpire first="Gerry" id="427103" last="Davis" name="Gerry Davis" position="second"></umpire>
  <umpire first="Laz" id="427113" last="Diaz" name="Laz Diaz" position="third"></umpire>
  <umpire first="Bill" id="427344" last="Miller" name="Bill Miller" position="left"></umpire>
  <umpire first="Dan" id="427248" last="Iassogna" name="Dan Iassogna" position="right"></umpire>
</umpires>'''

soup = BeautifulSoup(html, 'lxml')
Use:
>>> soup.find('umpire')
<umpire first="Mark" id="427533" last="Wegner" name="Mark Wegner" position="home"></umpire>

>>> soup.find('umpire', position="second")
<umpire first="Gerry" id="427103" last="Davis" name="Gerry Davis" position="second"></umpire>
>>> soup.find('umpire', position="second").get('id')
'427103'

>>> [i.get('position') for i in soup.find_all('umpire')]
['home', 'first', 'second', 'third', 'left', 'right']
>>> [i.get('id') for i in soup.find_all('umpire')]
['427533', '427361', '427103', '427113', '427344', '427248']

>>> # Lastly, a slightly more advanced example: name and position in a dictionary
>>> {i.get('name'): i.get('position') for i in soup.find_all('umpire')}
{'Bill Miller': 'left',
 'Dan Iassogna': 'right',
 'Gerry Davis': 'second',
 'Laz Diaz': 'third',
 'Mark Wegner': 'home',
 'Paul Nauert': 'first'}
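The original question asked for a DataFrame holding only the "home" umpire, so here is a minimal sketch of that last step. It assumes pandas is installed; the names `wanted`, `rows`, and `df` are just illustrative, and the sample markup is shortened to three umpires. It uses the built-in 'html.parser' so it runs without lxml, but 'lxml' as in the example above should work the same way if you have it.

```python
from bs4 import BeautifulSoup
import pandas as pd

html = '''\
<umpires>
  <umpire first="Mark" id="427533" last="Wegner" name="Mark Wegner" position="home"></umpire>
  <umpire first="Paul" id="427361" last="Nauert" name="Paul Nauert" position="first"></umpire>
  <umpire first="Gerry" id="427103" last="Davis" name="Gerry Davis" position="second"></umpire>
</umpires>'''

# 'html.parser' ships with Python (assumption: swapped in for 'lxml' to avoid an extra dependency)
soup = BeautifulSoup(html, 'html.parser')

# Attributes to keep for each matching tag (illustrative name)
wanted = ['id', 'name', 'position']

# Filter on the position attribute first, then build one dict per matching <umpire>
rows = [
    {key: tag.get(key) for key in wanted}
    for tag in soup.find_all('umpire', attrs={'position': 'home'})
]

# A list of dicts maps directly onto DataFrame rows; columns= fixes the column order
df = pd.DataFrame(rows, columns=wanted)
print(df)
```

Note that the id comes back as a string, so convert it (for example with `df['id'].astype(int)`) if you need numbers. Dropping the `attrs` filter gives you one row per umpire instead.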
(May-10-2018, 06:44 PM) snippsat wrote: Use a parser, e.g. BeautifulSoup with lxml; see Web-Scraping part-1.

Works great, thank you! I had tried using soup on my own but now realize I was approaching it completely wrong.