I’m trying to scrape data from a page.
There are no classes in the HTML that I can use, but the elements do have data-stat attributes. How can I target them to scrape the text inside?
HTML example
<div id="contents">
<div class="text" data-stat="name">Joe Bloggs</div>
<div class="text" data-stat="address">123 fake street</div>
<div class="text" data-stat="phone_number">0881234567898</div>
</div>
I know how to scrape data when there is a class, but the classes, in this case, are all the same, and the data-stat is what's different.
I was using BeautifulSoup, as that’s what I normally use for scraping, but I’ve never targeted data-stat.
Any help would be great, thanks.
Target them with an attribute dictionary passed to find(), or with a CSS selector.
from bs4 import BeautifulSoup
html = '''\
<div id="contents">
<div class="text" data-stat="name">Joe Bloggs</div>
<div class="text" data-stat="address">123 fake street</div>
<div class="text" data-stat="phone_number">0881234567898</div>
</div>'''
soup = BeautifulSoup(html, 'lxml')
>>> soup.find('div', {'data-stat': 'name'})
<div class="text" data-stat="name">Joe Bloggs</div>
>>>
>>> soup.select('#contents > div:nth-child(1)')
[<div class="text" data-stat="name">Joe Bloggs</div>]
>>> soup.select('#contents > div:nth-child(2)')
[<div class="text" data-stat="address">123 fake street</div>]
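The nth-child selectors above depend on element order. As a minimal sketch, a CSS attribute selector can target data-stat directly and get_text() pulls out the text; this uses the built-in 'html.parser' only so it runs without lxml, and the variable names are just for illustration.

```python
from bs4 import BeautifulSoup

html = '''\
<div id="contents">
<div class="text" data-stat="name">Joe Bloggs</div>
<div class="text" data-stat="address">123 fake street</div>
<div class="text" data-stat="phone_number">0881234567898</div>
</div>'''

# 'html.parser' is used here only to avoid the lxml dependency
soup = BeautifulSoup(html, 'html.parser')

# Select one element by its data-stat value and take its text
name = soup.select_one('div[data-stat="name"]').get_text(strip=True)

# Collect every div that has a data-stat attribute into a dict
stats = {
    div['data-stat']: div.get_text(strip=True)
    for div in soup.select('#contents > div[data-stat]')
}

print(name)
print(stats)
```

The dict form keeps each value labelled by its data-stat key, so nothing relies on position in the page.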
Thanks for the reply.
I can't get this working. Here is my code.
I have it printing clubs first to see if I can connect. It was working and returning the whole table, but it isn't now, and I'm not sure why.
Even when it was returning the whole table, the print of "name" wasn't working.
from bs4 import BeautifulSoup
import requests
try:
    html = requests.get('https://fbref.com/en/squads/19538871/Manchester-United-Stats')
    html.raise_for_status()
    soup = BeautifulSoup(html.text, 'lxml')
    clubs = soup.find(class_='stats_table')
    for club in clubs:
        name = club.find('div', {'data-stat': 'player'})
    print(clubs)
except Exception as e:
    print(e)
I'm getting this message on print
slice indices must be integers or None or have an __index__ method
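You get that error because your loop is wrong: iterating over clubs yields its child nodes, and some of those are plain strings, so club.find() ends up calling the string method str.find(), which expects integer slice indices rather than a dictionary.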
Also, the player name is not in a div; it's in a table th tag.
Do not use try:except when testing stuff out; printing only the exception hides the full traceback.
from bs4 import BeautifulSoup
import requests
html = requests.get('https://fbref.com/en/squads/19538871/Manchester-United-Stats')
html.raise_for_status()
soup = BeautifulSoup(html.content, 'lxml')
clubs = soup.find(class_='stats_table')
players = clubs.find_all('th', {'data-stat': 'player'})
for name in players:
    print(name.text)
Output:
Player
David de Gea
Bruno Fernandes
Harry Maguire
Scott McTominay
Fred
Cristiano Ronaldo
Mason Greenwood
Aaron Wan-Bissaka
Luke Shaw
....
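To see where the earlier "slice indices" error comes from, here is a small offline sketch. The table is made-up sample markup, not the real page, and 'html.parser' is used only to avoid needing lxml. Looping directly over a tag yields its children, including the whitespace strings between tags, and calling find() on one of those strings is the plain str.find() method.

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for the real stats table
html = '''<table class="stats_table">
<tr><th data-stat="player">Player</th></tr>
<tr><th data-stat="player">Joe Bloggs</th></tr>
</table>'''

soup = BeautifulSoup(html, 'html.parser')
table = soup.find(class_='stats_table')

# Iterating over a Tag walks its direct children, strings included
child_types = [type(child).__name__ for child in table]
print(child_types)

# The first child is the newline after <table ...>, a NavigableString
first_child = next(iter(table))
raised = False
try:
    # This is str.find(), which wants integer start/end positions
    first_child.find('th', {'data-stat': 'player'})
except TypeError as err:
    raised = True
    print(err)

# find_all() on the table itself returns only matching tags
names = [th.get_text(strip=True) for th in table.find_all('th', {'data-stat': 'player'})]
print(names)
```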
Thanks, this works.
But now I've another problem.
I don't just want to get the players' names out; I want to add more stats, like assists, shots, etc.
But I'm not sure how to keep the data aligned with each player.
I can do this:
from bs4 import BeautifulSoup
import requests
html = requests.get('https://fbref.com/en/squads/19538871/Manchester-United-Stats')
html.raise_for_status()
soup = BeautifulSoup(html.content, 'lxml')
clubs = soup.find(class_='stats_table')
players = clubs.find_all('th', {'data-stat': 'player'})
assists = clubs.find_all('td', {'data-stat': 'assists'})
for name in players:
    print(name.text)
for assist in assists:
    print(assist.text)
This prints the assist values, but they appear below the players' names rather than beside them, so saving to Excel/CSV wouldn't work.
Convert each result into a list of text values, then zip() them together.
Example:
>>> players = [tag.text for tag in players]
>>> assists = [tag.text for tag in assists]
>>> zip(players, assists)
<zip object at 0x000000001D9ED580>
>>> record = dict(zip(players, assists))
>>> record
{'Aaron Wan-Bissaka': '2',
'Alex Telles': '1',
'Amad Diallo': '',
'Andreas Pereira': '',
'Anthony Elanga': '0',
'Anthony Martial': '0',
'Brandon Williams': '29',
'Bruno Fernandes': '0',
'Cristiano Ronaldo': '1',
'Daniel James': '0',
'David de Gea': '5',
'Dean Henderson': '',
'Diogo Dalot': '2',
'Donny van de Beek': '',
'Edinson Cavani': '0',
'Eric Bailly': '0',
'Fred': '3',
'Harry Maguire': '0',
'Jadon Sancho': '0',
'Jesse Lingard': '0',
'Juan Mata': '',
'Luke Shaw': '2',
'Marcus Rashford': '7',
'Mason Greenwood': '0',
'Nemanja Matić': '1',
'Paul Pogba': '1',
'Phil Jones': '0',
'Player': '0',
'Raphaël Varane': '0',
'Scott McTominay': '3',
'Squad Total': '24',
'Tom Heaton': '',
'Victor Lindelöf': '1'}
>>> record['Brandon Williams']
'29'
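One caveat: zip() pairs items purely by position, and the 'Player' header entry in the output above has a value attached, which suggests the two lists may not line up row for row. A possible alternative, sketched here on made-up sample markup (the names, values, and the 'html.parser' choice are all assumptions for illustration), is to walk the table one row at a time so each stat is read from the same tr as the player name, then write the rows with the csv module.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up sample markup; the real table has many more columns
html = '''<table class="stats_table">
<thead><tr><th data-stat="player">Player</th><th data-stat="assists">Ast</th></tr></thead>
<tbody>
<tr><th data-stat="player">Joe Bloggs</th><td data-stat="assists">3</td></tr>
<tr><th data-stat="player">Jane Doe</th><td data-stat="assists"></td></tr>
</tbody></table>'''

soup = BeautifulSoup(html, 'html.parser')
table = soup.find(class_='stats_table')

rows = []
for tr in table.find_all('tr'):
    player = tr.find('th', {'data-stat': 'player'})
    assists = tr.find('td', {'data-stat': 'assists'})
    if player is None or assists is None:
        continue  # e.g. the header row, which has no <td> cells
    rows.append({
        'player': player.get_text(strip=True),
        'assists': assists.get_text(strip=True),
    })

# Write to an in-memory buffer; swap in open('stats.csv', 'w', newline='') for a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['player', 'assists'])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Adding another stat is then one more tr.find(...) per row, and an empty cell stays an empty string next to the right player.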
Also look at Pandas; it's great for reading tables in from HTML.
It then gives you a lot of power for finding, e.g., statistics about players and games.
Example NoteBook.
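As a rough sketch of the Pandas suggestion, pandas.read_html() can parse a table straight into a DataFrame. The markup and values below are made up, and this assumes pandas plus an HTML parser it supports (such as lxml) are installed.

```python
from io import StringIO

import pandas as pd

# Made-up sample markup standing in for the real stats table
html = '''<table class="stats_table">
<thead><tr><th data-stat="player">Player</th><th data-stat="assists">Ast</th></tr></thead>
<tbody>
<tr><th data-stat="player">Joe Bloggs</th><td data-stat="assists">3</td></tr>
<tr><th data-stat="player">Jane Doe</th><td data-stat="assists">5</td></tr>
</tbody></table>'''

# read_html returns a list of DataFrames, one per matching table
tables = pd.read_html(StringIO(html), attrs={'class': 'stats_table'})
df = tables[0]
print(df)

# Columns are named after the header text, so rows stay aligned
top = df.sort_values('Ast', ascending=False).iloc[0]
print(top['Player'])
```

From there, df.to_csv() or df.to_excel() would save the aligned table in one call.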