How can I target and scrape a data-stat
#1
I’m trying to scrape data from a page.
There are no classes in the HTML that I can use, but the elements do have data-stat attributes. How can I target them to scrape the text inside?

HTML example:
<div id="contents">

<div class="text" data-stat="name">Joe Bloggs</div>
<div class="text" data-stat="address">123 fake street</div>
<div class="text" data-stat="phone_number">0881234567898</div>

</div>

I know how to scrape data when there is a class, but the classes in this case are all the same; the data-stat is what's different.
I was using BeautifulSoup, as that’s what I normally use for scraping, but I’ve never targeted a data-stat attribute.
Any help would be great – thanks.
#2
Target them with a dictionary of attributes or with a CSS selector.
from bs4 import BeautifulSoup

html = '''\
<div id="contents">
  <div class="text" data-stat="name">Joe Bloggs</div>
  <div class="text" data-stat="address">123 fake street</div>
  <div class="text" data-stat="phone_number">0881234567898</div>
</div>'''

soup = BeautifulSoup(html, 'lxml')
>>> soup.find('div', {'data-stat': 'name'})
<div class="text" data-stat="name">Joe Bloggs</div>
>>> 
>>> soup.select('#contents > div:nth-child(1)')
[<div class="text" data-stat="name">Joe Bloggs</div>]
>>> soup.select('#contents > div:nth-child(2)')
[<div class="text" data-stat="address">123 fake street</div>]
#3
Thanks for the reply.

I can't get this working; here is my code.

I have it trying to print clubs first, to see if I can connect. It was working and returning the whole table, but it isn't now, and I'm not sure why.
Even when it was returning the whole table in the print, the print on "name" wasn't working.

from bs4 import BeautifulSoup
import requests

try:
    html = requests.get('https://fbref.com/en/squads/19538871/Manchester-United-Stats')
    html.raise_for_status()

    soup = BeautifulSoup(html.text, 'lxml')

    clubs = soup.find(class_='stats_table')

    for club in clubs:
        name = club.find('div', {'data-stat': 'player'})

        print(clubs)

except Exception as e:
    print(e)
I'm getting this message when it prints:
slice indices must be integers or None or have an __index__ method
#4
You get that error because your loop is wrong: iterating over clubs (a single Tag) yields its children, including plain whitespace strings, and calling .find('div', {...}) on a string falls through to str.find(), which tries to use the dict as a slice index.
Also, it's not a div; the player names are in the table's th tags.
Don't use try/except when testing stuff out; it hides the full traceback.
from bs4 import BeautifulSoup
import requests

html = requests.get('https://fbref.com/en/squads/19538871/Manchester-United-Stats')
html.raise_for_status()
soup = BeautifulSoup(html.content, 'lxml')
clubs = soup.find(class_='stats_table')
players = clubs.find_all('th', {'data-stat': 'player'})
for name in players:
    print(name.text)
Output:
Player
David de Gea
Bruno Fernandes
Harry Maguire
Scott McTominay
Fred
Cristiano Ronaldo
Mason Greenwood
Aaron Wan-Bissaka
Luke Shaw
....
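To see where that exact message comes from: the whitespace strings between the table's tags are plain strings, and a string's find() treats the attribute dict as a slice index:
>>> '\n'.find('div', {'data-stat': 'player'})
Traceback (most recent call last):
  ...
TypeError: slice indices must be integers or None or have an __index__ method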
#5
Thanks, this works.

But now I have another problem.
I don't just want to get the players' names out; I will add more stats, like assists/shots etc.

But I'm not sure how to get the data aligned to the player.

I can do this:
from bs4 import BeautifulSoup
import requests

html = requests.get('https://fbref.com/en/squads/19538871/Manchester-United-Stats')
html.raise_for_status()
soup = BeautifulSoup(html.content, 'lxml')
clubs = soup.find(class_='stats_table')
players = clubs.find_all('th', {'data-stat': 'player'})
assists = clubs.find_all('td', {'data-stat': 'assists'})
for name in players:
    print(name.text)
for assist in assists:
    print(assist.text)
and it will print out the assist values, but they appear below the player names rather than beside them, so if I was to save to Excel/CSV it wouldn't line up.
#6
Make each into a list of text values, then zip() them together.
Example:
>>> players = [tag.text for tag in players]
>>> assists = [tag.text for tag in assists]
>>> zip(players, assists)
<zip object at 0x000000001D9ED580>
>>> record = dict(zip(players, assists))
>>> record
{'Aaron Wan-Bissaka': '2',
 'Alex Telles': '1',
 'Amad Diallo': '',
 'Andreas Pereira': '',
 'Anthony Elanga': '0',
 'Anthony Martial': '0',
 'Brandon Williams': '29',
 'Bruno Fernandes': '0',
 'Cristiano Ronaldo': '1',
 'Daniel James': '0',
 'David de Gea': '5',
 'Dean Henderson': '',
 'Diogo Dalot': '2',
 'Donny van de Beek': '',
 'Edinson Cavani': '0',
 'Eric Bailly': '0',
 'Fred': '3',
 'Harry Maguire': '0',
 'Jadon Sancho': '0',
 'Jesse Lingard': '0',
 'Juan Mata': '',
 'Luke Shaw': '2',
 'Marcus Rashford': '7',
 'Mason Greenwood': '0',
 'Nemanja Matić': '1',
 'Paul Pogba': '1',
 'Phil Jones': '0',
 'Player': '0',
 'Raphaël Varane': '0',
 'Scott McTominay': '3',
 'Squad Total': '24',
 'Tom Heaton': '',
 'Victor Lindelöf': '1'}
>>> record['Brandon Williams']
'29'
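If the goal is CSV, here is a minimal sketch using the players and assists text lists built above (note that the table's 'Player' header row and 'Squad Total' row ride along in the output, so you may want to filter those out first):
import csv

# players/assists are the lists of .text values from the zip() example above
with open('players.csv', 'w', newline='', encoding='utf-8') as fp:
    writer = csv.writer(fp)
    writer.writerow(['player', 'assists'])
    writer.writerows(zip(players, assists))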
Also look at Pandas; it's great for reading tables in from HTML.
Then you have a lot of power for finding e.g. statistics about players and games.
Example Notebook.
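A minimal sketch of the Pandas route, assuming the page serves its tables as plain HTML (fbref column headers may come back as a MultiIndex that needs flattening):
import pandas as pd

url = 'https://fbref.com/en/squads/19538871/Manchester-United-Stats'
# read_html() returns one DataFrame per <table> found on the page
tables = pd.read_html(url)
df = tables[0]  # assuming the squad stats table is the first one
print(df.head())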