Help Scraping links and table from link

cartonics · Oct-06-2023, 08:32 AM

from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests
 
parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("https://www.sbostats.com/soccer/league/italy/serie-a")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)
 
#print (soup)
 
##for link in soup.select('a[href^="/soccer/stats?"]'):
##    #print ('https://www.sbostats.com/soccer/stats?country=italy&league=serie-a&quote=1.44&direction=away&id=Mzk5OTk5MQ==')
##    href1 = ['href']
##    # a"e
##    c = ('https://www.sbostats.com'+link['href'])
##    x = c.replace('"e', "&quote")
##    print (x)
 
 
data = []
table = soup.find_all('table',attrs={'class':'updated_next_results_table'}) #, 
 
print (table)
 
 
rows = soup.find_all('tr')
for row in rows:
    cols = row.find_all('td') #, attrs={'class':'widget-results__team-name match-name'}
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values
 
print (data)

I am able to get the links and the data, but that is not yet the result I expect
from this link: https://www.sbostats.com/soccer/league/italy/serie-a

For each match I want the names of the teams from the table together with the corresponding link.

cartonics · Oct-09-2023, 08:33 AM

Can no one help me?

If I use attrs={'class':'widget-results__team-name match-name'}, the result is empty: []

***snippsat*** · (This post was last modified: Oct-09-2023, 12:13 PM by snippsat.)

Here is an example of how to print out the whole table.
The href values come back as relative paths; if you need the full address, just concatenate each one with https://www.sbostats.com.

from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests
 
parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("https://www.sbostats.com/soccer/league/italy/serie-a")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)
table = soup.find_all('table',attrs={'class':'updated_next_results_table'})
 
table = table[0]
tr = table.find_all('tr')
for row in tr:
    link = row.find('a')
    # Rows without a link are skipped
    if link is None:
        continue
    print(row.text)
    print(f"{link['href']}\n")

Output:
  Monza   STATS        Salernitana      1.73   3.80   4.50
/soccer/stats?country=italy&league=serie-a"e=1.73&direction=home&id=NDAwOTcwOQ==

  Lazio   STATS        Atalanta      2.55   3.40   2.63  
/soccer/stats?country=italy&league=serie-a"e=2.55&direction=home&id=NDAwOTcxMA==

  Frosinone   STATS        Verona      2.15   3.40   3.30  
/soccer/stats?country=italy&league=serie-a"e=2.15&direction=home&id=NDAxMDQ4Mg==

  Cagliari   STATS        AS Roma      3.80   3.40   1.95  
/soccer/stats?country=italy&league=serie-a"e=1.95&direction=away&id=NDAxMDQ4Mw==
.....
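
For the full address mentioned above, plain concatenation works; `urllib.parse.urljoin` from the standard library is an alternative that also copes with hrefs that are already absolute. A minimal sketch, where `BASE_URL` and `absolute_url` are names made up for this example:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.sbostats.com"  # site root used in the thread


def absolute_url(href: str) -> str:
    """Turn an href from the table into a full address."""
    return urljoin(BASE_URL, href)


# Relative href copied from the output above (still containing the "e glitch)
href = '/soccer/stats?country=italy&league=serie-a"e=1.73&direction=home&id=NDAwOTcwOQ=='
print(absolute_url(href))
```

The query string is passed through unchanged, so this only adds the host; it does not repair the `"e` part of the link.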

cartonics · (This post was last modified: Oct-09-2023, 02:50 PM by cartonics.)

Thanks so much. Now I'll try to understand the code better...

If I want to remove some data, for example going
from
Monza STATS Salernitana 1.73 3.80 4.50
to
Monza - Salernitana

do I have to save the rows to a txt file and edit them afterwards, or can it be done on the fly by removing that "td" from the table with BeautifulSoup?

I also have to replace some text in the URL with x = c.replace('"e', "&quote"), but I can solve this later.

***snippsat*** · Oct-09-2023, 03:37 PM

(Oct-09-2023, 02:50 PM)cartonics Wrote: If I want to remove some data, for example going
from
Monza STATS Salernitana 1.73 3.80 4.50
to
Monza - Salernitana

do I have to save the rows to a txt file and edit them afterwards, or can it be done on the fly by removing that "td" from the table with BeautifulSoup?

Once you call row.text, the result is just a string and BS has done its job.
So if you want to change the output from that point on, you have to use Python string methods or, e.g., regex.

        
>>> tr[2].text
'  Monza   STATS        Salernitana      1.73   3.80   4.50  '
>>> tr[2].text.replace('STATS', '-').split()
['Monza', '-', 'Salernitana', '1.73', '3.80', '4.50']
>>> ' '.join(tr[2].text.replace('STATS', '-').split())
'Monza - Salernitana 1.73 3.80 4.50'

Then the code will be:

from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests
 
parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("https://www.sbostats.com/soccer/league/italy/serie-a")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)
table = soup.find_all('table',attrs={'class':'updated_next_results_table'})
 
table = table[0]
tr = table.find_all('tr')
for row in tr:
    link = row.find('a')
    # Rows without a link are skipped
    if link is None:
        continue
    #print(row.text)
    # Swap the STATS button text for a dash and collapse the whitespace
    print(' '.join(row.text.replace('STATS', '-').split()))
    print(f"{link['href']}\n")

Output:
Monza - Salernitana 1.73 3.80 4.50
/soccer/stats?country=italy&league=serie-a"e=1.73&direction=home&id=NDAwOTcwOQ==

Lazio - Atalanta 2.55 3.40 2.63
/soccer/stats?country=italy&league=serie-a"e=2.55&direction=home&id=NDAwOTcxMA==

Frosinone - Verona 2.15 3.40 3.30
/soccer/stats?country=italy&league=serie-a"e=2.15&direction=home&id=NDAxMDQ4Mg==

Cagliari - AS Roma 3.80 3.40 1.95
/soccer/stats?country=italy&league=serie-a"e=1.95&direction=away&id=NDAxMDQ4Mw==
.....
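
The regex route mentioned above could look roughly like this. It is only a sketch: the helper name `parse_row_text` is invented here, and it assumes that every decimal number in a row is an odds value and that the button text is always `STATS`.

```python
import re


def parse_row_text(text: str) -> tuple[str, list[float]]:
    """Split one row's text into a 'Home - Away' label and a list of odds."""
    # Collapse runs of whitespace, like ' '.join(text.split())
    flat = re.sub(r"\s+", " ", text).strip()
    # Decimal numbers such as 1.73 are assumed to be the odds
    odds = [float(n) for n in re.findall(r"\b\d+\.\d+\b", flat)]
    # Remove the odds, then swap the STATS button text for a dash
    label = re.sub(r"\s*\b\d+\.\d+\b", "", flat).replace(" STATS ", " - ")
    return label, odds


label, odds = parse_row_text("  Monza   STATS        Salernitana      1.73   3.80   4.50  ")
print(label)
print(odds)
```

Keeping the odds as floats in a separate list means the label and the numbers can be written out independently later.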

cartonics · (This post was last modified: Oct-10-2023, 08:38 AM by cartonics.)

from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests
  
parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("https://www.sbostats.com/soccer/league/italy/serie-a")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)
table = soup.find_all('table',attrs={'class':'updated_next_results_table'})
  
table = table[0]
tr = table.find_all('tr')
for row in tr:
    if row.text == None:
        pass
    if row.find('a') == None:
        pass
    else:
        #print(row.text)
        #print(' '.join(row.text.replace('STATS', '-').split()))
        #print(f"{row.find('a')['href']}\n")
        y= f"{row.find('a')['href']}\n"
        x= ' '.join(row.text.replace('STATS', '-').split())
        q= ''.join([i for i in x if not i.isdigit()])
        c = ('*https://www.sbostats.com' + y)
        z = c.replace('"e', "&quote")
        #print(x + z)
        f = open("matches.txt", "a")
    #f.write([x] +[y])
        f.write(str(q) + ' ' + str(z))
        f.close()

I edited the link for my needs. I only have to understand how to remove all the odds numbers.

I tried q = ''.join([i for i in x if not i.isdigit()]), but in the output the decimal points of the odds are left behind,
so I added q1 = ' '.join(q.replace('.', '').split())
It does the job, but I think it is a very dirty solution, probably the worst one :)

***snippsat*** · (This post was last modified: Oct-10-2023, 11:34 AM by snippsat.)

Some tips: you should not open the file inside the loop, and the same goes for *https://www.sbostats.com, which is a value that doesn't change.
So here is an example where I use with open (it closes the file object automatically).

from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests
 
parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("https://www.sbostats.com/soccer/league/italy/serie-a")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)
table = soup.find_all('table',attrs={'class':'updated_next_results_table'})
 
table = table[0]
tr = table.find_all('tr')
base_url = '*https://www.sbostats.com'
# Open the file once, outside the loop; "with" closes it automatically
with open('matches.txt', 'a') as fp:
    for row in tr:
        link = row.find('a')
        # Rows without a link are skipped
        if link is None:
            continue
        #print(' '.join(row.text.replace('STATS', '-').split()[:3]))
        #print(f"{base_url}{link['href']}\n")
        fp.write(f"{' '.join(row.text.replace('STATS', '-').split()[:3])}\n")
        fp.write(f"{base_url}{link['href']}\n\n")

Output:
Verona - Napoli
*https://www.sbostats.com/soccer/stats?country=italy&league=serie-a"e=1.50&direction=away&id=NDAxMTg3OA==

Torino - Inter
*https://www.sbostats.com/soccer/stats?country=italy&league=serie-a"e=1.83&direction=away&id=NDAxMTg3OQ==

Sassuolo - Lazio
*https://www.sbostats.com/soccer/stats?country=italy&league=serie-a"e=2.30&direction=away&id=NDAxMTg4MA==
.....

I like the output better like this, but you can easily change it to put everything on one line as in your example.

Quote: I edited the link for my needs. I only have to understand how to remove all the odds numbers.

Also, I guess you tested all of this inside the loop. This is how I tested a single value first (in the interactive interpreter) and then added it to the loop.

        
>>> tr[2]
<tr> <td class="widget-results__team-details ovf updated_m130"> <span class="widget-results__team-name match-name" data-original-title="Verona" data-placement="bottom" data-toggle="tooltip">Verona</span> </td> <td class="widget-results__score text-center limitstats"> <a class="btn btn-primary btn-xs" href='/soccer/stats?country=italy&amp;league=serie-a"e=1.50&amp;direction=away&amp;id=NDAxMTg3OA=='>STATS</a> </td> <td class="widget-results__team-details ovf updated_m130 text-right"> <div class="row"> <div class="col-sm-3"> </div> <div class="col-sm-9"> <span class="widget-results__team-name match-name" data-original-title="Napoli" data-placement="bottom" data-toggle="tooltip"> Napoli </span> </div> </div> </td> <td class="widget-results__quote"> <span class="" style="">6.50</span> </td> <td class="widget-results__quote"> <span class="">4.00</span> </td> <td class="widget-results__quote"> <span class="match_fav" style="">1.50</span> </td> </tr>
>>> 
>>> ' '.join(tr[2].text.replace('STATS', '-').split())
'Verona - Napoli 6.50 4.00 1.50'
>>> # Remove odds
>>> ' '.join(tr[2].text.replace('STATS', '-').split()[:3])
'Verona - Napoli'
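
One caveat with `split()[:3]`: a team name containing a space, such as `AS Roma` in the earlier output, is cut after its first word. A possible alternative is to drop the trailing odds instead of keeping the first three tokens. This is a sketch, with `match_label` an invented name, and it assumes every match row ends with exactly three odds columns, as in the `tr[2]` shown above.

```python
def match_label(row_text: str, n_odds: int = 3) -> str:
    """Return 'Home - Away' from a row's text by dropping the trailing odds."""
    tokens = row_text.replace("STATS", "-").split()
    # Assumption: the last n_odds tokens are the odds columns
    return " ".join(tokens[:len(tokens) - n_odds])


print(match_label("  Verona   STATS   Napoli   6.50   4.00   1.50  "))
print(match_label("  Cagliari   STATS   AS Roma   3.80   3.40   1.95  "))

# For comparison, keeping only the first three tokens truncates the name
truncated = " ".join("Cagliari - AS Roma 3.80 3.40 1.95".split()[:3])
print(truncated)
```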

cartonics · Oct-10-2023, 01:32 PM

A stupid question... why is it that, when the source code of the page has

serie-a&quote
in the link, after scraping it becomes
serie-a"e

Is it an encoding problem?

***snippsat*** · (This post was last modified: Oct-10-2023, 03:44 PM by snippsat.)

(Oct-10-2023, 01:32 PM)cartonics Wrote: A stupid question... why is it that, when the source code of the page has

serie-a&quote
in the link, after scraping it becomes
serie-a"e

Is it an encoding problem?

Yes, and the reason is your code 😉
Remove the encoding stuff you start with and use lxml as the parser; then the links will work.

from bs4 import BeautifulSoup
import requests
 
parser = 'lxml'  # needs the lxml package installed
resp = requests.get("https://www.sbostats.com/soccer/league/italy/serie-a")
soup = BeautifulSoup(resp.content, parser)
 
table = soup.find_all('table', attrs={'class':'updated_next_results_table'})
table = table[0]
tr = table.find_all('tr')
base_url = '*https://www.sbostats.com'
# Open the file once, outside the loop; "with" closes it automatically
with open('matches.txt', 'a') as fp:
    for row in tr:
        link = row.find('a')
        # Rows without a link are skipped
        if link is None:
            continue
        #print(' '.join(row.text.replace('STATS', '-').split()[:3]))
        #print(f"{base_url}{link['href']}\n")
        fp.write(f"{' '.join(row.text.replace('STATS', '-').split()[:3])}\n")
        fp.write(f"{base_url}{link['href']}\n\n")

Output:
Verona - Napoli
*https://www.sbostats.com/soccer/stats?country=italy&league=serie-a&quote=1.50&direction=away&id=NDAxMTg3OA==

Torino - Inter
*https://www.sbostats.com/soccer/stats?country=italy&league=serie-a&quote=1.83&direction=away&id=NDAxMTg3OQ==

Sassuolo - Lazio
*https://www.sbostats.com/soccer/stats?country=italy&league=serie-a&quote=2.30&direction=away&id=NDAxMTg4MA==
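
A likely explanation for the `"e` glitch is HTML entity decoding rather than character encoding: `&quot` is one of the legacy entity names that may be written without a trailing semicolon, so a bare `&quote=` can be read as `"` followed by `e=`. The sketch below uses `html.unescape` from the standard library, not either parser, so it only illustrates the entity rule; that `html.parser` applies it to attribute values while `lxml` does not is an inference from the outputs in this thread.

```python
from html import unescape

# The href as it might appear in the raw page source, with a bare "&quote="
raw_href = "/soccer/stats?country=italy&league=serie-a&quote=1.50&direction=away"

decoded = unescape(raw_href)
print(decoded)

# When the ampersands are written as &amp; in the source, there is no ambiguity
safe = unescape("/soccer/stats?country=italy&amp;league=serie-a&amp;quote=1.50")
print(safe)
```

Only `&quote` is affected because `quot` is a known entity name; `&league` and `&direction` do not start with one and pass through untouched.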

cartonics · Oct-11-2023, 07:08 AM

Thank you so much for your help and for explaining things in a way I can understand. I am very new to Python, only a few days in, and it seems really promising...

Now it does everything I needed, but I have a "didactic" question.

from here:
<tr> <td class="widget-results__team-details ovf updated_m130"> <span class="widget-results__team-name match-name" data-original-title="Verona" data-placement="bottom" data-toggle="tooltip">Verona</span> </td> <td class="widget-results__score text-center limitstats"> <a class="btn btn-primary btn-xs" href='/soccer/stats?country=italy&league=serie-a"e=1.50&direction=away&id=NDAxMTg3OA=='>STATS</a> </td> <td class="widget-results__team-details ovf updated_m130 text-right"> <div class="row"> <div class="col-sm-3"> </div> <div class="col-sm-9"> <span class="widget-results__team-name match-name" data-original-title="Napoli" data-placement="bottom" data-toggle="tooltip"> Napoli </span> </div> </div> </td> <td class="widget-results__quote"> <span class="" style="">6.50</span> </td> <td class="widget-results__quote"> <span class="">4.00</span> </td> <td class="widget-results__quote"> <span class="match_fav" style="">1.50</span> </td> </tr>

My first idea was to take only the tags with the classes widget-results__team-name match-name and btn btn-primary btn-xs.

Is there a way to achieve that?

Another question:
if there is more than one table on the page, for example here:
https://www.sbostats.com/soccer/league/i...-c-group-c
is it possible to scrape only the second one?
[Image: img.png]

I think the trick can be done here: table = table[0]

but I always want the table after the words "PARTITE CONCLUSE", and it is not always table[0].
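
Both ideas can be sketched with CSS selectors and `find_next`. In the `<tr>` shown above, the class `widget-results__team-name match-name` sits on a `<span>`, not on a `<td>`, which may be why searching `td` tags with it returned an empty list. The markup below is a simplified, hypothetical stand-in for the real page (the first heading is a placeholder), and the helper names `table_after` and `matches` are invented; how the real page places "PARTITE CONCLUSE" relative to its table is an assumption.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the <tr> shown above
html = """
<h3>FIRST TABLE (placeholder heading)</h3>
<table class="updated_next_results_table"><tr><td>other data</td></tr></table>
<h3>PARTITE CONCLUSE</h3>
<table class="updated_next_results_table">
 <tr>
  <td><span class="widget-results__team-name match-name">Verona</span></td>
  <td><a class="btn btn-primary btn-xs" href="/soccer/stats?id=NDAxMTg3OA==">STATS</a></td>
  <td><span class="widget-results__team-name match-name"> Napoli </span></td>
 </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")


def table_after(soup, heading_text):
    """Return the first <table> that follows the given text, or None."""
    marker = soup.find(string=re.compile(re.escape(heading_text)))
    return marker.find_next("table") if marker else None


def matches(table):
    """Collect (home, away, href) for each row that has two names and a link."""
    found = []
    for row in table.find_all("tr"):
        names = row.select("span.match-name")  # the class is on <span>, not <td>
        link = row.select_one("a.btn-xs")
        if len(names) == 2 and link is not None:
            home, away = (n.get_text(strip=True) for n in names)
            found.append((home, away, link["href"]))
    return found


table = table_after(soup, "PARTITE CONCLUSE")
print(matches(table))
```

Reading the names from the spans avoids the string clean-up altogether, and locating the table by the text before it does not depend on its index in `find_all('table')`.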
