 Extract contents from HTML
#1
Hey guys, I'm trying to create a dataframe from a portion of an HTML doc and can't figure out how to extract the data. It doesn't seem like it should be that hard, but I'm lost. All I'm looking to do is pull the values of the "id", "name", and "position" attributes, and I really only need them when "position" is "home". Any help with this would be greatly appreciated.


[<umpires>
<umpire first="Mark" id="427533" last="Wegner" name="Mark Wegner" position="home"></umpire>
<umpire first="Paul" id="427361" last="Nauert" name="Paul Nauert" position="first"></umpire>
<umpire first="Gerry" id="427103" last="Davis" name="Gerry Davis" position="second"></umpire>
<umpire first="Laz" id="427113" last="Diaz" name="Laz Diaz" position="third"></umpire>
<umpire first="Bill" id="427344" last="Miller" name="Bill Miller" position="left"></umpire>
<umpire first="Dan" id="427248" last="Iassogna" name="Dan Iassogna" position="right"></umpire>
</umpires>]
#2
Use a parser, e.g. BeautifulSoup with lxml; see Web-Scraping part-1.
Example:
from bs4 import BeautifulSoup

html = '''\
<umpires>
  <umpire first="Mark" id="427533" last="Wegner" name="Mark Wegner" position="home"></umpire>
  <umpire first="Paul" id="427361" last="Nauert" name="Paul Nauert" position="first"></umpire>
  <umpire first="Gerry" id="427103" last="Davis" name="Gerry Davis" position="second"></umpire>
  <umpire first="Laz" id="427113" last="Diaz" name="Laz Diaz" position="third"></umpire>
  <umpire first="Bill" id="427344" last="Miller" name="Bill Miller" position="left"></umpire>
  <umpire first="Dan" id="427248" last="Iassogna" name="Dan Iassogna" position="right"></umpire>
</umpires>'''

soup = BeautifulSoup(html, 'lxml')
Use:
>>> soup.find('umpire')
<umpire first="Mark" id="427533" last="Wegner" name="Mark Wegner" position="home"></umpire>

>>> soup.find('umpire', position="second")
<umpire first="Gerry" id="427103" last="Davis" name="Gerry Davis" position="second"></umpire>
>>> soup.find('umpire', position="second").get('id')
'427103'
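One thing to watch with the chained call above: `find()` returns `None` when no tag matches, so calling `.get()` directly on the result raises `AttributeError` for a position that is not in the document. A minimal sketch of a guard follows; the helper name `umpire_id` and the shortened sample string are made up for illustration, and `'html.parser'` is used only to avoid the lxml dependency (`'lxml'` should behave the same here).

```python
from bs4 import BeautifulSoup

# Shortened sample with a single umpire, just for this sketch.
html = '<umpires><umpire id="427533" name="Mark Wegner" position="home"></umpire></umpires>'
soup = BeautifulSoup(html, 'html.parser')


def umpire_id(soup, position):
    """Return the id of the umpire at `position`, or None if there is none."""
    tag = soup.find('umpire', position=position)  # None when nothing matches
    return tag.get('id') if tag is not None else None


home_id = umpire_id(soup, 'home')        # a position that exists
missing_id = umpire_id(soup, 'catcher')  # a position that does not exist
```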

>>> [i.get('position') for i in soup.find_all('umpire')]
['home', 'first', 'second', 'third', 'left', 'right']
>>> [i.get('id') for i in soup.find_all('umpire')]
['427533', '427361', '427103', '427113', '427344', '427248']

>>> # Lastly, a slightly more advanced example: name and position in a dictionary
>>> {i.get('name'): i.get('position') for i in soup.find_all('umpire')}
{'Bill Miller': 'left',
 'Dan Iassogna': 'right',
 'Gerry Davis': 'second',
 'Laz Diaz': 'third',
 'Mark Wegner': 'home',
 'Paul Nauert': 'first'}
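
To get from these lookups to the dataframe asked for in the first post, something like the following should work. It is a sketch under a few assumptions: pandas is installed, the sample is shortened to three umpires, and `'html.parser'` is used instead of `'lxml'` only to avoid the extra dependency. The variable names `rows` and `df` are just illustrative.

```python
from bs4 import BeautifulSoup
import pandas as pd

html = '''\
<umpires>
  <umpire first="Mark" id="427533" last="Wegner" name="Mark Wegner" position="home"></umpire>
  <umpire first="Paul" id="427361" last="Nauert" name="Paul Nauert" position="first"></umpire>
  <umpire first="Gerry" id="427103" last="Davis" name="Gerry Davis" position="second"></umpire>
</umpires>'''

soup = BeautifulSoup(html, 'html.parser')

# Filter to position="home" while searching, then keep only the
# three attributes of interest, one dict per matching <umpire> tag.
rows = [
    {'id': u.get('id'), 'name': u.get('name'), 'position': u.get('position')}
    for u in soup.find_all('umpire', position='home')
]

# One row per matching umpire; columns fixed so an empty result still has them.
df = pd.DataFrame(rows, columns=['id', 'name', 'position'])
print(df)
```

The `id` values come back as strings; if numbers are needed, `df['id'].astype(int)` is one way to convert them.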
#3
(May-10-2018, 06:44 PM)snippsat Wrote: Use a parser eg BeautifulSoup/lxml, Web-Scraping part-1

Works great, thank you! I had tried using soup on my own but now realize I was approaching it completely wrong.