cartonics wrote:
```python
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

parser = 'html.parser'
# Fetch the Serie A league page
resp = requests.get('https://www.sbostats.com/soccer/league/italy/serie-a')
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

data = []
table = soup.find_all('table', attrs={'class': 'updated_next_results_table'})
print(table)
# Collect the non-empty cell texts of every row on the page
rows = soup.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
print(data)
```
I am able to get the links and the data, but my expected result,
from this link https://www.sbostats.com/soccer/league/italy/serie-a
is, for each match, the names of the teams from the table and the related link.
cartonics wrote:
Can no one help me?
If I use attrs={'class': 'widget-results__team-name match-name'} the result is empty: []
snippsat wrote (Oct-09-2023, 12:12 PM):
Here is an example of how to print out the whole table.
The href comes back as a relative path; if you need the full address, just prepend https://www.sbostats.com to it.
```python
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

parser = 'html.parser'
# Fetch the Serie A league page
resp = requests.get('https://www.sbostats.com/soccer/league/italy/serie-a')
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

table = soup.find_all('table', attrs={'class': 'updated_next_results_table'})
table = table[0]
tr = table.find_all('tr')
for row in tr:
    # Rows without a link are skipped
    if row.find('a') is None:
        continue
    print(row.text)
    print(f"{row.find('a')['href']}\n")
```
Output:
Monza STATS Salernitana 1.73 3.80 4.50
/soccer/stats?country=italy&league=serie-a"e=1.73&direction=home&id=NDAwOTcwOQ==
Lazio STATS Atalanta 2.55 3.40 2.63
/soccer/stats?country=italy&league=serie-a"e=2.55&direction=home&id=NDAwOTcxMA==
Frosinone STATS Verona 2.15 3.40 3.30
/soccer/stats?country=italy&league=serie-a"e=2.15&direction=home&id=NDAxMDQ4Mg==
Cagliari STATS AS Roma 3.80 3.40 1.95
/soccer/stats?country=italy&league=serie-a"e=1.95&direction=away&id=NDAxMDQ4Mw==
.....
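For the full address mentioned above, one option besides plain string concatenation is urljoin from the standard library. This is a minimal sketch; the href is a shortened sample of the relative links in the output, not a complete link from the page.

```python
from urllib.parse import urljoin

base_url = 'https://www.sbostats.com'
# Shortened sample of a relative href as returned by row.find('a')['href']
href = '/soccer/stats?country=italy&league=serie-a'

# urljoin combines the site root with the relative path
full_url = urljoin(base_url, href)
print(full_url)
```

Because the href starts with a slash, urljoin puts it directly after the host name, so it does not matter whether base_url ends with a slash.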
cartonics wrote (Oct-09-2023, 02:50 PM):
Thanks so much, now I'll try to understand the code better...
If I want to remove some data, for example to go from
Monza STATS Salernitana 1.73 3.80 4.50
to
Monza - Salernitana
do I have to save the lines in a txt file and edit them afterwards, or can it be done on the fly by removing that "td" of the table with BeautifulSoup?
I also have to replace some text in the URL, x = c.replace('"e', '&quote'), but I can solve this later.
snippsat wrote:
(Oct-09-2023, 02:50 PM)cartonics Wrote: If I want to remove some data, for example to go from
Monza STATS Salernitana 1.73 3.80 4.50
to
Monza - Salernitana
do I have to save the lines in a txt file and edit them afterwards, or can it be done on the fly by removing that "td" of the table with BeautifulSoup?

Once you call row.text you just have a string, and BS has done its job.
So if you want to change the output from here on, you have to use Python string methods or e.g. regex.
```python
>>> tr[2].text
' Monza STATS Salernitana 1.73 3.80 4.50 '
>>> tr[2].text.replace('STATS', '-').split()
['Monza', '-', 'Salernitana', '1.73', '3.80', '4.50']
>>> ' '.join(row.text.replace('STATS', '-').split())
'Fiorentina - Empoli 1.44 4.33 7.50'
```
Then the code will be:
```python
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

parser = 'html.parser'
# Fetch the Serie A league page
resp = requests.get('https://www.sbostats.com/soccer/league/italy/serie-a')
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

table = soup.find_all('table', attrs={'class': 'updated_next_results_table'})
table = table[0]
tr = table.find_all('tr')
for row in tr:
    # Rows without a link are skipped
    if row.find('a') is None:
        continue
    # Swap the STATS button text for a dash and collapse the whitespace
    print(' '.join(row.text.replace('STATS', '-').split()))
    print(f"{row.find('a')['href']}\n")
```
Output:
Monza - Salernitana 1.73 3.80 4.50
/soccer/stats?country=italy&league=serie-a"e=1.73&direction=home&id=NDAwOTcwOQ==
Lazio - Atalanta 2.55 3.40 2.63
/soccer/stats?country=italy&league=serie-a"e=2.55&direction=home&id=NDAwOTcxMA==
Frosinone - Verona 2.15 3.40 3.30
/soccer/stats?country=italy&league=serie-a"e=2.15&direction=home&id=NDAxMDQ4Mg==
Cagliari - AS Roma 3.80 3.40 1.95
/soccer/stats?country=italy&league=serie-a"e=1.95&direction=away&id=NDAxMDQ4Mw==
.....
cartonics wrote (Oct-10-2023, 08:38 AM):
```python
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

parser = 'html.parser'
# Fetch the Serie A league page
resp = requests.get('https://www.sbostats.com/soccer/league/italy/serie-a')
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

table = soup.find_all('table', attrs={'class': 'updated_next_results_table'})
table = table[0]
tr = table.find_all('tr')
for row in tr:
    if row.text is None:
        pass
    if row.find('a') is None:
        pass
    else:
        y = f"{row.find('a')['href']}\n"
        x = ' '.join(row.text.replace('STATS', '-').split())
        # Drop every digit from the line (the decimal points stay behind)
        q = ''.join([i for i in x if not i.isdigit()])
        # Build the full link and restore the mangled "quote" parameter
        c = 'https://www.sbostats.com' + y
        z = c.replace('"e', '&quote')
        f = open("matches.txt", "a")
        f.write(str(q) + ' ' + str(z))
        f.close()
```
I edited the link for my needs; now I only have to understand how to remove all the odds numbers.
I tried q = ''.join([i for i in x if not i.isdigit()]), but in the output the decimal points of the odds remain,
so I added q1 = ' '.join(q.replace('.', '').split())
It does the job, but I think it is a very dirty solution... probably the worst one :)
snippsat wrote (Oct-10-2023, 11:33 AM):
Some tips: you should not open the file inside the loop; the same goes for https://www.sbostats.com, which is a value that doesn't change.
So here is an example where I use with open (it closes the file object automatically).
```python
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

parser = 'html.parser'
# Constant site root, defined once outside the loop
base_url = 'https://www.sbostats.com'
resp = requests.get(f'{base_url}/soccer/league/italy/serie-a')
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

table = soup.find_all('table', attrs={'class': 'updated_next_results_table'})
table = table[0]
tr = table.find_all('tr')
# Open the file once; the with block closes it automatically
with open('matches.txt', 'a') as fp:
    for row in tr:
        # Rows without a link are skipped
        if row.find('a') is None:
            continue
        # Keep only the first three tokens: home team, dash, away team
        fp.write(f"{' '.join(row.text.replace('STATS', '-').split()[:3])}\n")
        fp.write(f"{base_url}{row.find('a')['href']}\n\n")
```
Output:
Verona - Napoli
https://www.sbostats.com/soccer/stats?country=italy&league=serie-a"e=1.50&direction=away&id=NDAxMTg3OA==
Torino - Inter
https://www.sbostats.com/soccer/stats?country=italy&league=serie-a"e=1.83&direction=away&id=NDAxMTg3OQ==
Sassuolo - Lazio
https://www.sbostats.com/soccer/stats?country=italy&league=serie-a"e=2.30&direction=away&id=NDAxMTg4MA==
.....
I like the output better like this, but you can just change it to have everything on one line as in your example.
Quote: I edited the link for my needs; now I only have to understand how to remove all the odds numbers.
Also, I guess you tested all of this inside the loop. This is how I tested a single value in the interactive interpreter first and then added it to the loop.
```python
>>> tr[2]
<tr> <td class="widget-results__team-details ovf updated_m130"> <span class="widget-results__team-name match-name" data-original-title="Verona" data-placement="bottom" data-toggle="tooltip">Verona</span> </td> <td class="widget-results__score text-center limitstats"> <a class="btn btn-primary btn-xs" href='/soccer/stats?country=italy&league=serie-a"e=1.50&direction=away&id=NDAxMTg3OA=='>STATS</a> </td> <td class="widget-results__team-details ovf updated_m130 text-right"> <div class="row"> <div class="col-sm-3"> </div> <div class="col-sm-9"> <span class="widget-results__team-name match-name" data-original-title="Napoli" data-placement="bottom" data-toggle="tooltip"> Napoli </span> </div> </div> </td> <td class="widget-results__quote"> <span class="" style="">6.50</span> </td> <td class="widget-results__quote"> <span class="">4.00</span> </td> <td class="widget-results__quote"> <span class="match_fav" style="">1.50</span> </td> </tr>
>>>
>>> ' '.join(tr[2].text.replace('STATS', '-').split())
'Verona - Napoli 6.50 4.00 1.50'
>>>
>>> ' '.join(tr[2].text.replace('STATS', '-').split()[:3])
'Verona - Napoli'
```
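One caveat with the [:3] slice: a team name that contains a space, such as AS Roma in the earlier output, would be cut off after "AS". A hedged alternative, assuming the odds are always decimals like 1.95, is to strip them with a regex and leave the names untouched. The helper name strip_odds is made up for this sketch.

```python
import re

def strip_odds(line):
    """Drop decimal odds such as 1.95, then collapse the whitespace."""
    no_odds = re.sub(r'\b\d+\.\d+\b', '', line)
    return ' '.join(no_odds.split())

print(strip_odds('Cagliari - AS Roma 3.80 3.40 1.95'))
```

In the loop this could replace the slice, for example strip_odds(row.text.replace('STATS', '-')).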
cartonics wrote (Oct-10-2023, 01:32 PM):
A stupid question... why, if the link in the page source contains
league=serie-a&quote
does the scraped result become
league=serie-a"e
Is it an encoding problem?
snippsat wrote (Oct-10-2023, 03:44 PM):
(Oct-10-2023, 01:32 PM)cartonics Wrote: A stupid question... why, if the link in the page source contains
league=serie-a&quote
does the scraped result become
league=serie-a"e
Is it an encoding problem?

Yes, and the reason is your code 😉
Remove the encoding stuff you start with and use lxml as the parser, then the links will work.
```python
from bs4 import BeautifulSoup
import requests

parser = 'lxml'
# Constant site root, defined once outside the loop
base_url = 'https://www.sbostats.com'
resp = requests.get(f'{base_url}/soccer/league/italy/serie-a')
soup = BeautifulSoup(resp.content, parser)

table = soup.find_all('table', attrs={'class': 'updated_next_results_table'})
table = table[0]
tr = table.find_all('tr')
# Open the file once; the with block closes it automatically
with open('matches.txt', 'a') as fp:
    for row in tr:
        # Rows without a link are skipped
        if row.find('a') is None:
            continue
        # Keep only the first three tokens: home team, dash, away team
        fp.write(f"{' '.join(row.text.replace('STATS', '-').split()[:3])}\n")
        fp.write(f"{base_url}{row.find('a')['href']}\n\n")
```
Output:
Verona - Napoli
https://www.sbostats.com/soccer/stats?country=italy&league=serie-a&quote=1.50&direction=away&id=NDAxMTg3OA==
Torino - Inter
https://www.sbostats.com/soccer/stats?country=italy&league=serie-a&quote=1.83&direction=away&id=NDAxMTg3OQ==
Sassuolo - Lazio
https://www.sbostats.com/soccer/stats?country=italy&league=serie-a&quote=2.30&direction=away&id=NDAxMTg4MA==
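A likely mechanism behind the broken links, offered as an assumption rather than a confirmed diagnosis: the href carries a parameter named quote, so the raw text contains &quote=, and a lenient decoder can read the leading &quot as the HTML entity for a double quote even though the semicolon is missing. The standard library's html.unescape shows the effect on a shortened sample string:

```python
import html

# Shortened sample of the query string as it appears in the page source
raw = '&league=serie-a&quote=1.50'

# "&quot" is accepted as a legacy entity without its semicolon,
# so it turns into a double quote and the "e" is left behind
decoded = html.unescape(raw)
print(decoded)  # &league=serie-a"e=1.50
```

This would explain why a stricter parser keeps the parameter intact while the first version of the script did not.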
cartonics wrote:
Thank you so much for your help, and because now I can understand it... I am very new to Python, only a few days in, and it seems really promising...
Now it does everything I needed, but I have a "didactic" question.
From here:
<tr> <td class="widget-results__team-details ovf updated_m130"> <span class="widget-results__team-name match-name" data-original-title="Verona" data-placement="bottom" data-toggle="tooltip">Verona</span> </td> <td class="widget-results__score text-center limitstats"> <a class="btn btn-primary btn-xs" href='/soccer/stats?country=italy&league=serie-a"e=1.50&direction=away&id=NDAxMTg3OA=='>STATS</a> </td> <td class="widget-results__team-details ovf updated_m130 text-right"> <div class="row"> <div class="col-sm-3"> </div> <div class="col-sm-9"> <span class="widget-results__team-name match-name" data-original-title="Napoli" data-placement="bottom" data-toggle="tooltip"> Napoli </span> </div> </div> </td> <td class="widget-results__quote"> <span class="" style="">6.50</span> </td> <td class="widget-results__quote"> <span class="">4.00</span> </td> <td class="widget-results__quote"> <span class="match_fav" style="">1.50</span> </td> </tr>
my first idea was to take only the tags with the classes widget-results__team-name match-name and btn btn-primary btn-xs.
Is there a way to achieve that?
Another question:
if there is more than one table on the page, for example here:
https://www.sbostats.com/soccer/league/i...-c-group-c
is it possible to scrape only the second one?
I think the trick can be done here: table = table[0]
but I always want the table after the words "PARTITE CONCLUSE", and it is not always table[0].
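Both questions can probably be handled inside BeautifulSoup. The sketch below is not tested against the live site: the HTML is a made-up miniature that only borrows the class names and the "PARTITE CONCLUSE" text from this thread, and the helper names team_rows and table_after are hypothetical. It uses CSS selectors to pick only the team-name spans and the STATS link, and find_next to take the first table that follows a given piece of text.

```python
import re
from bs4 import BeautifulSoup

# Made-up miniature of the page, for illustration only
html_doc = """
<table class="updated_next_results_table">
  <tr><td><span class="widget-results__team-name match-name">Verona</span></td>
      <td><a class="btn btn-primary btn-xs" href="/soccer/stats?id=1">STATS</a></td>
      <td><span class="widget-results__team-name match-name"> Napoli </span></td></tr>
</table>
<h3>PARTITE CONCLUSE</h3>
<table>
  <tr><td><span class="widget-results__team-name match-name">Monza</span></td>
      <td><a class="btn btn-primary btn-xs" href="/soccer/stats?id=2">STATS</a></td>
      <td><span class="widget-results__team-name match-name">Salernitana</span></td></tr>
</table>
"""

def team_rows(table):
    """Return (home, away, href) for every row that has two names and a link."""
    rows = []
    for row in table.find_all('tr'):
        names = [s.get_text(strip=True) for s in row.select('span.match-name')]
        link = row.select_one('a.btn-primary')
        if link is not None and len(names) == 2:
            rows.append((names[0], names[1], link['href']))
    return rows

def table_after(soup, text):
    """Return the first table that comes after the given text, or None."""
    marker = soup.find(string=re.compile(re.escape(text)))
    return marker.find_next('table') if marker is not None else None

soup = BeautifulSoup(html_doc, 'html.parser')
print(team_rows(soup.find('table')))
print(team_rows(table_after(soup, 'PARTITE CONCLUSE')))
```

Searching for the text first means the position of the table in the find_all list no longer matters, so a fixed index like table[0] is not needed.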