Python Forum
Unable to gather data using BeautifulSoup() [Output shows blank file]
Unable to gather data using BeautifulSoup() [Output shows blank file]
#1
I am trying to use the code below to extract gaming data. The code runs without errors, but no data is grabbed. I believe the class passed to soup.find() is wrong, but to my understanding it contains the data.
This is my first time using BeautifulSoup; the output file contains no data.
Any help in getting this code running would be greatly appreciated.


from bs4 import BeautifulSoup
import urllib.request
import pandas as pd

pages = 18
rec_count = 0
rank = []
gname = []
platform = []
year = []
genre = []
publisher = []
sales_na = []
sales_eu = []
sales_jp = []
sales_ot = []
sales_gl = []

urlhead = 'http://www.vgchartz.com/gamedb/?page='
urltail = '&results=1000&name=&platform=&minSales=0.01&publisher=&genre=&sort=GL'

for page in range(1,pages):
	surl = urlhead + str(page) + urltail
	r = urllib.request.urlopen(surl).read()
	soup = BeautifulSoup(r,"lxml")
	print(page)
	chart = soup.find("div", class_="container-fluid")
	for row in chart.find_all('tr')[1:]:
		try: 
			col = row.find_all('td')
		
			#extract data into column data
			column_1 = col[0].string.strip()
			column_2 = col[1].string.strip()		
			column_3 = col[2].string.strip()		
			column_4 = col[3].string.strip()		
			column_5 = col[4].string.strip()	
			column_6 = col[5].string.strip()
			column_7 = col[6].string.strip()		
			column_8 = col[7].string.strip()		
			column_9 = col[8].string.strip()		
			column_10 = col[9].string.strip()		
			column_11 = col[10].string.strip()

			#Add Data to columns
			#Adding data only if able to read all of the columns
			rank.append(column_1)
			gname.append(column_2)
			platform.append(column_3)
			year.append(column_4)
			genre.append(column_5)
			publisher.append(column_6)
			sales_na.append(column_7)
			sales_eu.append(column_8)
			sales_jp.append(column_9)
			sales_ot.append(column_10)
			sales_gl.append(column_11)
		
			rec_count += 1
	
		except:
			continue

columns = {'Rank': rank, 'Name': gname, 'Platform': platform, 'Year': year, 'Genre': genre, 'Publisher': publisher, 'NA_Sales': sales_na, 'EU_Sales': sales_eu, 'JP_Sales': sales_jp, 'Other_Sales': sales_ot, 'Global_Sales': sales_gl}
print(rec_count)
df = pd.DataFrame(columns)
df = df[['Rank','Name','Platform','Year','Genre','Publisher','NA_Sales','EU_Sales','JP_Sales','Other_Sales','Global_Sales']]
df.index.name = None
df.to_csv("vgsales.csv",sep=",",encoding='utf-8')
Output:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0
#2
When using soup.find(), it stops at the first hit, and there are 6 tags with class="container-fluid" on the page.
Find a tag that has more specific info and contains the table.
Example:
from bs4 import BeautifulSoup
import requests

url = 'http://www.vgchartz.com/gamedb/?page=&results=1000&name=&platform=&minSales=0.01&publisher=&genre=&sort=GL'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
chart = soup.find("div", id="generalBody")
tr_tag = chart.find_all('tr')
Test:
>>> tr_tag[4]
<tr style="background-image:url(../imgs/chartBar_alt_large.gif); height:70px">
<td>2</td>
<td>
<div id="photo3">
<a href="/games/game.php?id=6455&amp;region=All">
<div style="height:60px; width:60px; overflow:hidden;"> <img alt="Boxart Missing" border="0" src="/games/boxart/8972270ccc.jpg" width="60"/>
</div>
</a>
</div>
</td> <td style="font-size:12pt;"> <a href="http://www.vgchartz.com/game/6455/super-mario-bros/?region=All">Super Mario Bros.    </a> </td>
<td>
<center>
<img alt="NES" src="/images/consoles/NES_b.png"/>
</center>
</td> <td width="100">Nintendo  </td> <td align="center">N/A  </td> <td align="center">10.0  </td> <td align="center">N/A  </td> <td align="center">40.24m</td> <td align="center" width="75">18th Oct 85  </td> <td align="center" width="75">N/A</td></tr>

>>> tr_tag[4].find_all('a')[1].text
'Super Mario Bros.    '

>>> td = tr_tag[4].find_all('td', align="center")
>>> for item in td:
...     item.text
...     
'N/A  '
'10.0  '
'N/A  '
'40.24m'
'18th Oct 85  '
'N/A'
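To get from those rows to a DataFrame like the one in the first post, pull the td cells out of each tr and collect them into records. Below is a minimal sketch run against a mocked-up row in the same shape as the markup shown above (the real code would loop over chart.find_all('tr') from the generalBody div; the column meanings are an assumption inferred from that sample row, so verify them against the live table header):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Mocked-up row in the same shape as the vgchartz table markup shown above.
html = """
<table>
<tr><td>2</td>
<td><a href="#">boxart</a></td>
<td><a href="#">Super Mario Bros.    </a></td>
<td><img alt="NES"/></td>
<td>Nintendo  </td>
<td align="center">N/A  </td>
<td align="center">10.0  </td>
<td align="center">N/A  </td>
<td align="center">40.24m</td>
<td align="center">18th Oct 85  </td>
<td align="center">N/A</td></tr>
</table>
"""

# 'html.parser' is the stdlib parser; the thread uses 'lxml', which also works.
soup = BeautifulSoup(html, 'html.parser')

records = []
for row in soup.find_all('tr'):
    col = row.find_all('td')
    if len(col) < 11:
        continue  # skip header or malformed rows
    records.append({
        'Rank': col[0].text.strip(),
        'Name': col[2].text.strip(),
        'Platform': col[3].img['alt'],      # platform name lives in the img alt text
        'Publisher': col[4].text.strip(),
        'NA_Sales': col[5].text.strip(),
        'PAL_Sales': col[6].text.strip(),
        'JP_Sales': col[7].text.strip(),
        'Global_Sales': col[8].text.strip(),
    })

df = pd.DataFrame(records)
print(df.iloc[0]['Name'], df.iloc[0]['Global_Sales'])
```

Building a list of dicts and passing it to pd.DataFrame once at the end is also simpler than maintaining eleven parallel lists as in the first post.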
Take a look at Web-Scraping part-1;
as you see, it never uses urllib, always Requests.
