Python Forum
Extract contents from HTML
#1
Hey guys, I'm trying to create a dataframe from a portion of an HTML doc and can't figure out how to extract the data. It doesn't seem like it should be that hard, but I'm lost. All I want to do is pull the values of the "id", "name", and "position" attributes, and I really only need them where "position" is "home". Any help with this would be greatly appreciated.


[<umpires>
<umpire first="Mark" id="427533" last="Wegner" name="Mark Wegner" position="home"></umpire>
<umpire first="Paul" id="427361" last="Nauert" name="Paul Nauert" position="first"></umpire>
<umpire first="Gerry" id="427103" last="Davis" name="Gerry Davis" position="second"></umpire>
<umpire first="Laz" id="427113" last="Diaz" name="Laz Diaz" position="third"></umpire>
<umpire first="Bill" id="427344" last="Miller" name="Bill Miller" position="left"></umpire>
<umpire first="Dan" id="427248" last="Iassogna" name="Dan Iassogna" position="right"></umpire>
</umpires>]
#2
Use a parser, e.g. BeautifulSoup with lxml; see Web-Scraping part-1.
Example:
from bs4 import BeautifulSoup

html = '''\
<umpires>
  <umpire first="Mark" id="427533" last="Wegner" name="Mark Wegner" position="home"></umpire>
  <umpire first="Paul" id="427361" last="Nauert" name="Paul Nauert" position="first"></umpire>
  <umpire first="Gerry" id="427103" last="Davis" name="Gerry Davis" position="second"></umpire>
  <umpire first="Laz" id="427113" last="Diaz" name="Laz Diaz" position="third"></umpire>
  <umpire first="Bill" id="427344" last="Miller" name="Bill Miller" position="left"></umpire>
  <umpire first="Dan" id="427248" last="Iassogna" name="Dan Iassogna" position="right"></umpire>
</umpires>'''

soup = BeautifulSoup(html, 'lxml')
Use:
>>> soup.find('umpire')
<umpire first="Mark" id="427533" last="Wegner" name="Mark Wegner" position="home"></umpire>

>>> soup.find('umpire', position="second")
<umpire first="Gerry" id="427103" last="Davis" name="Gerry Davis" position="second"></umpire>
>>> soup.find('umpire', position="second").get('id')
'427103'

>>> [i.get('position') for i in soup.find_all('umpire')]
['home', 'first', 'second', 'third', 'left', 'right']
>>> [i.get('id') for i in soup.find_all('umpire')]
['427533', '427361', '427103', '427113', '427344', '427248']

>>> # Lastly, something a little more advanced: name and position in a dictionary
>>> {i.get('name'): i.get('position') for i in soup.find_all('umpire')}
{'Bill Miller': 'left',
 'Dan Iassogna': 'right',
 'Gerry Davis': 'second',
 'Laz Diaz': 'third',
 'Mark Wegner': 'home',
 'Paul Nauert': 'first'}
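
To get from here to the dataframe asked about in the first post, one possible approach is sketched below. It is not part of the original reply: it assumes pandas is installed, uses a shortened copy of the markup, and the names wanted, rows and df are only chosen for illustration. The position="home" filter is passed to find_all() the same way it is passed to find() above.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Shortened copy of the markup from the question
html = '''\
<umpires>
  <umpire first="Mark" id="427533" last="Wegner" name="Mark Wegner" position="home"></umpire>
  <umpire first="Paul" id="427361" last="Nauert" name="Paul Nauert" position="first"></umpire>
  <umpire first="Gerry" id="427103" last="Davis" name="Gerry Davis" position="second"></umpire>
</umpires>'''

# html.parser ships with Python; 'lxml' as used above should work here as well
soup = BeautifulSoup(html, 'html.parser')

# Attribute names requested in the question (illustrative variable name)
wanted = ['id', 'name', 'position']

# One dict per <umpire> tag whose position attribute is "home"
rows = [
    {key: tag.get(key) for key in wanted}
    for tag in soup.find_all('umpire', position='home')
]

# Build the dataframe; columns= fixes the column order
df = pd.DataFrame(rows, columns=wanted)
print(df)
```

Note that the id values come back as strings, so convert the column (for example with astype(int)) if numbers are needed. If no umpire has position "home", rows is empty and df is an empty dataframe with the three columns.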
#3
(May-10-2018, 06:44 PM)snippsat Wrote: Use a parser eg BeautifulSoup/lxml, Web-Scraping part-1

Works great, thank you! I had tried using soup on my own but now realize I was approaching it completely wrong.