I have a Python scraper for ranker.com that scrapes the name and image URL from the site.
import requests
import pandas as pd
from pandas import DataFrame
#Check "ListID" for api id
r = requests.get('https://api.ranker.com/lists/538997/items?limit=100')
data = r.json()['listItems']
for i in data: print(i['name'], i['image'])
I want this output saved into a ranks Excel file. I don't have much experience with Python, so could someone please help me with the Excel part of the code? Thank you.
Here is a script I wrote a couple of days ago (some parts elided). It scrapes a site to retrieve part descriptions, and it uses openpyxl.
import requests
import re
import sys
import openpyxl
ErrorLog = []
debug = False
# *******************************************************************
# logError
# Inputs:
# s: a string to add to the error log
# *******************************************************************
def logError(s):
    ErrorLog.append(s)
# *******************************************************************
# msgProlog
# Inputs:
# C. which is a cell value
# Result: string
# The standard prolog to a message, which includes a trailing space
# 'A1: "123" '
# *******************************************************************
def msgProlog(C):
    return C.coordinate + ': "' + str(C.value) + '" '
# *******************************************************************
# deHTML
# Inputs:
# s: A string
# Result: string
# The input string with all the &...; codes replaced
# *******************************************************************
def deHTML(s):
    # Note: html.unescape() in the standard library does this as well,
    # and covers all named entities
    t = s
    t = t.replace('&quot;', '"')
    t = t.replace('&lt;', '<')
    t = t.replace('&gt;', '>')
    t = t.replace('&amp;', '&')
    return t
# *******************************************************************
# Main program
# *******************************************************************
# Open the workspace
workspace_name = "inventory.xlsx"
try:
    wb = openpyxl.load_workbook(filename=workspace_name)
except:
    print("Error opening file ", sys.exc_info()[0])
    raise
sheet = wb.active
# Set the default columns in case we can't figure out what is going on
colPartNumber = 1
captionPartNumber = "#"
colManufacturer = 2
captionManufacturer = "Mfg"
colDescription = 5
captionDescription = "Description"
# First, identify the columns that have the headings
# "#" colPartNumber
# "Mfg" colManufacturer
# "Description" colDescription
for col in range(1, 20):
    cell = sheet.cell(1, col)
    caption = cell.value
    if caption is None:
        print(cell.coordinate, end=": ")
        print("<None>")
        break
    print(cell.coordinate + ':"' + caption + '"')
    if caption == captionPartNumber:
        colPartNumber = col
        print(captionPartNumber + ' = ' + str(col))
    elif caption == captionManufacturer:
        colManufacturer = col
        print(captionManufacturer + ' = ' + str(col))
    elif caption == captionDescription:
        colDescription = col
        print(captionDescription + ' = ' + str(col))
print("")
# Now we can iterate through the cells
row = 1
try:
    while True:
        row += 1
        # Get the part #
        partNo = sheet.cell(row=row, column=colPartNumber)
        part = partNo.value
        if part is None:
            print(partNo.coordinate, end=": ")
            print("<None>")
            break
        # Do we already have a description?
        desc = sheet.cell(row=row, column=colDescription)
        if desc.value is not None:
            # We have a description. If it has not been deHTMLed, do so
            nicedesc = deHTML(desc.value)
            if desc.value != nicedesc:
                msg = msgProlog(partNo) + '"' + desc.value + '"=>"' + nicedesc + '"****************'
                print(msg)
                logError(msg)
                desc.value = nicedesc
                continue
            msg = msgProlog(partNo) + ' already has description "' + desc.value + '"'
            print(msg)
            continue
        # We do not have a description. Go find it
        try:
            r = requests.get('https://www.elided.com/products/' + str(partNo.value))
            # If there was an error, raise an exception
            r.raise_for_status()
        except requests.RequestException as e:
            # Catch the requests exception; r may be unbound if get() itself failed
            msg = msgProlog(partNo) + ' "' + str(e) + '"'
            print(msg)
            logError(msg)
            continue
        # The best description is found in an entry that says
        # twitter:title<useless stuff>content="(.*)"
        x = re.search('twitter:title.*content="(.*)"', r.text)
        if x is not None:
            msg = msgProlog(partNo) + '"' + x.group(1) + '"'
            print(msg)
            caption = deHTML(x.group(1))
            desc.value = caption
        else:
            msg = msgProlog(partNo) + "<No attribute found>"
            logError(msg)
            print(msg)
except:
    print("Exception occurred", sys.exc_info()[0])
# Now save the workbook
wb.save(filename = workspace_name)
print("File saved")
print("=========================== Errors ===========================")
for msg in ErrorLog:
    print(msg)
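Applied to the original question, the same openpyxl approach is much shorter. A minimal sketch, assuming the listItems entries have the name and image fields shown in the OP (sample data inlined here instead of the live request):

```python
import openpyxl

# Sample data standing in for r.json()['listItems'] from the OP
list_items = [
    {"name": "Item A", "image": "https://example.com/a.jpg"},
    {"name": "Item B", "image": "https://example.com/b.jpg"},
]

wb = openpyxl.Workbook()
ws = wb.active
ws.append(["name", "image"])  # header row
for item in list_items:
    ws.append([item["name"], item["image"]])
wb.save("ranks.xlsx")
```

ws.append takes one list per row, so no cell coordinates are needed for a simple dump like this.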
I suggest using Pandas as an intermediary.
Convert your JSON to a pandas DataFrame, then write the DataFrame out to an .xlsx file.
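For the OP's case that would look something like this (a sketch, with the assumed listItems shape inlined instead of the live request; to_excel needs openpyxl installed to write .xlsx):

```python
import pandas as pd

# Sample data standing in for r.json()['listItems'] from the OP
list_items = [
    {"name": "Item A", "image": "https://example.com/a.jpg"},
    {"name": "Item B", "image": "https://example.com/b.jpg"},
]

# A list of dicts converts directly; columns= keeps just the two fields
df = pd.DataFrame(list_items, columns=["name", "image"])
df.to_excel("ranks.xlsx", index=False)  # index=False drops the row numbers
```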
(Mar-21-2021, 12:19 PM)jefsummers Wrote: I suggest using Pandas as an intermediary.
Convert your json to a Pandas dataframe, then convert the pandas dataframe to an xls file.
If I had JSON, this might apply, but I don't.
I used openpyxl because it was the first one I found. I've heard of Pandas; what is the advantage over what I have? I'm curious, since I don't actually know anything about Pandas.
See OP line 6. The response is converted to JSON.
Pandas is widely used for tabular data, machine learning, graphing (via matplotlib), and more.
Here is an introduction:
Brandon Rhodes - Pandas from PyCon 2015
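To make the advantage concrete: the manual column lookup and row scan in the openpyxl script above collapse to a couple of lines once the sheet is in a DataFrame. A sketch with hypothetical inventory data standing in for pd.read_excel("inventory.xlsx"):

```python
import pandas as pd

# Hypothetical stand-in for pd.read_excel("inventory.xlsx")
df = pd.DataFrame({
    "#": ["P1", "P2", "P3"],
    "Mfg": ["Acme", "Beta", "Acme"],
    "Description": [None, "Widget", None],
})

# Columns are addressed by header name, so no header-detection loop is needed
missing = df[df["Description"].isna()]
print(missing["#"].tolist())
```

Here missing holds just the rows still needing a description, which is exactly the set the while loop in the script walks the sheet to find.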