Extract contents from HTML

chisox721 · May-10-2018, 05:35 PM

Hey guys, I'm trying to create a dataframe from a portion of an HTML doc and can't figure out how to extract the data. It doesn't seem like it should be that hard, but I'm lost. All I'm looking to do is pull the values of the "id=", "name=", and "position=" attributes. I also only need the information when "position=" is "home". Any help with this would be greatly appreciated.

[<umpires>
<umpire first="Mark" id="427533" last="Wegner" name="Mark Wegner" position="home"></umpire>
<umpire first="Paul" id="427361" last="Nauert" name="Paul Nauert" position="first"></umpire>
<umpire first="Gerry" id="427103" last="Davis" name="Gerry Davis" position="second"></umpire>
<umpire first="Laz" id="427113" last="Diaz" name="Laz Diaz" position="third"></umpire>
<umpire first="Bill" id="427344" last="Miller" name="Bill Miller" position="left"></umpire>
<umpire first="Dan" id="427248" last="Iassogna" name="Dan Iassogna" position="right"></umpire>
</umpires>]

snippsat · May-10-2018, 06:44 PM

Use a parser, e.g. BeautifulSoup with lxml; see Web-Scraping part-1.
Example:

from bs4 import BeautifulSoup

html = '''\
<umpires>
  <umpire first="Mark" id="427533" last="Wegner" name="Mark Wegner" position="home"></umpire>
  <umpire first="Paul" id="427361" last="Nauert" name="Paul Nauert" position="first"></umpire>
  <umpire first="Gerry" id="427103" last="Davis" name="Gerry Davis" position="second"></umpire>
  <umpire first="Laz" id="427113" last="Diaz" name="Laz Diaz" position="third"></umpire>
  <umpire first="Bill" id="427344" last="Miller" name="Bill Miller" position="left"></umpire>
  <umpire first="Dan" id="427248" last="Iassogna" name="Dan Iassogna" position="right"></umpire>
</umpires>'''

soup = BeautifulSoup(html, 'lxml')

Use:

>>> soup.find('umpire')
<umpire first="Mark" id="427533" last="Wegner" name="Mark Wegner" position="home"></umpire>

>>> soup.find('umpire', position="second")
<umpire first="Gerry" id="427103" last="Davis" name="Gerry Davis" position="second"></umpire>
>>> soup.find('umpire', position="second").get('id')
'427103'

>>> [i.get('position') for i in soup.find_all('umpire')]
['home', 'first', 'second', 'third', 'left', 'right']
>>> [i.get('id') for i in soup.find_all('umpire')]
['427533', '427361', '427103', '427113', '427344', '427248']

>>> # Finally, a slightly more advanced example: name and position in a dictionary
>>> {u.get('name'): u.get('position') for u in soup.find_all('umpire')}
{'Bill Miller': 'left',
 'Dan Iassogna': 'right',
 'Gerry Davis': 'second',
 'Laz Diaz': 'third',
 'Mark Wegner': 'home',
 'Paul Nauert': 'first'}
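
To get from here to the dataframe asked for in the first post, one option is to filter on position="home" in find_all and collect the three attributes into a list of dicts. This is only a sketch: it assumes pandas is installed, the names wanted, rows and df are made up for illustration, and it uses the built-in 'html.parser' instead of 'lxml' so that it runs without the extra dependency.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Shortened copy of the markup from the post above
html = '''\
<umpires>
  <umpire first="Mark" id="427533" last="Wegner" name="Mark Wegner" position="home"></umpire>
  <umpire first="Paul" id="427361" last="Nauert" name="Paul Nauert" position="first"></umpire>
  <umpire first="Gerry" id="427103" last="Davis" name="Gerry Davis" position="second"></umpire>
</umpires>'''

soup = BeautifulSoup(html, 'html.parser')

# Attributes to keep (illustrative name, not from the original post)
wanted = ['id', 'name', 'position']

# Keyword arguments to find_all() filter on attribute values,
# so only <umpire position="home"> tags are returned
rows = [
    {key: ump.get(key) for key in wanted}
    for ump in soup.find_all('umpire', position='home')
]

# One row per matching tag, one column per attribute
df = pd.DataFrame(rows, columns=wanted)
print(df)
```

Note that the attribute values come back as strings, so id would need converting (e.g. with astype(int)) if a numeric column is wanted.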

chisox721 · May-10-2018, 09:50 PM

Works great, thank you! I had tried using soup on my own but now realize I was approaching it completely wrong.
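
Since the snippet is really XML, the same extraction may also be possible without BeautifulSoup. The following is a sketch using the standard library's xml.etree.ElementTree rather than the approach from the answer above. It assumes the markup is well-formed XML (ElementTree is strict and will raise on sloppy HTML), and the names xml_text and home are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the markup from the thread
xml_text = '''\
<umpires>
  <umpire first="Mark" id="427533" last="Wegner" name="Mark Wegner" position="home"></umpire>
  <umpire first="Paul" id="427361" last="Nauert" name="Paul Nauert" position="first"></umpire>
</umpires>'''

root = ET.fromstring(xml_text)  # root is the <umpires> element

# ElementTree's limited XPath supports [@attrib='value'] predicates,
# so this selects only the direct <umpire> children with position="home"
home = [
    {key: ump.get(key) for key in ('id', 'name', 'position')}
    for ump in root.findall("umpire[@position='home']")
]
print(home)
```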
