I have a Python scraper for ranker.com that scrapes the name and image URL from the site.
import requests
import pandas as pd
from pandas import DataFrame
#Check "ListID" for api id
r = requests.get('https://api.ranker.com/lists/538997/items?limit=100')
data = r.json()['listItems']
for i in data: print(i['name'], i['image'])
I want this output saved into a ranks Excel file. I don't have much experience with Python, so could someone please help me with the Excel part of the code? Thank you.
Here is a script I wrote a couple of days ago (some parts elided). It scrapes a site to retrieve part descriptions, and it uses openpyxl.
import requests
import re
import sys
import openpyxl
ErrorLog = []
debug = False
# *******************************************************************
# logError
# Inputs:
# s: a string to add to the error log
# *******************************************************************
def logError(s):
    ErrorLog.append(s)
# *******************************************************************
# msgProlog
# Inputs:
# C. which is a cell value
# Result: string
# The standard prolog to a message, which includes a trailing space
# 'A1: "123" '
# *******************************************************************
def msgProlog(C):
    return C.coordinate + ': "' + str(C.value) + '" '
# *******************************************************************
# deHTML
# Inputs:
# s: A string
# Result: string
# The input string with all the &...; codes replaced
# *******************************************************************
def deHTML(s):
    # Note: html.unescape() in the standard library does this as well,
    # and covers all named entities
    t = s
    t = t.replace('&quot;', '"')
    t = t.replace('&lt;', '<')
    t = t.replace('&gt;', '>')
    t = t.replace('&amp;', '&')
    return t
# *******************************************************************
# Main program
# *******************************************************************
# Open the workspace
workspace_name = "inventory.xlsx"
try:
    wb = openpyxl.load_workbook(filename=workspace_name)
except:
    print("Error opening file ", sys.exc_info()[0])
    raise
sheet = wb.active
# Set the default columns in case we can't figure out what is going on
colPartNumber = 1
captionPartNumber = "#"
colManufacturer = 2
captionManufacturer = "Mfg"
colDescription = 5
captionDescription = "Description"
# First, identify the columns that have the headings
# "#" colPartNumber
# "Mfg" colManufacturer
# "Description" colDescription
for col in range(1, 20):
    cell = sheet.cell(1, col)
    caption = cell.value
    if caption is None:
        print(cell.coordinate, end=": ")
        print("<None>")
        break
    print(cell.coordinate + ':"' + caption + '"')
    if caption == captionPartNumber:
        colPartNumber = col
        print(captionPartNumber + ' = ' + str(col))
    elif caption == captionManufacturer:
        colManufacturer = col
        print(captionManufacturer + ' = ' + str(col))
    elif caption == captionDescription:
        colDescription = col
        print(captionDescription + ' = ' + str(col))
print("")
# Now we can iterate through the cells
row = 1
try:
    while True:
        row += 1
        # Get the part #
        partNo = sheet.cell(row=row, column=colPartNumber)
        part = partNo.value
        if part is None:
            print(partNo.coordinate, end=": ")
            print("<None>")
            break
        # Do we already have a description?
        desc = sheet.cell(row=row, column=colDescription)
        if desc.value is not None:
            # We have a description. If it has not been deHTMLed, do so
            nicedesc = deHTML(desc.value)
            if desc.value != nicedesc:
                msg = msgProlog(partNo) + '"' + desc.value + '"=>"' + nicedesc + '"****************'
                print(msg)
                logError(msg)
                desc.value = nicedesc
                continue
            msg = msgProlog(partNo) + ' already has description "' + desc.value + '"'
            print(msg)
            continue
        # We do not have a description. Go find it
        try:
            r = requests.get('https://www.elided.com/products/' + str(partNo.value))
            # If there was an error, raise an exception
            r.raise_for_status()
        except requests.RequestException as e:
            # Catch the requests exception; r may be unbound if get() itself failed
            msg = msgProlog(partNo) + ' "' + str(e) + '"'
            print(msg)
            logError(msg)
            continue
        # The best description is found in an entry that says
        # twitter:title<useless stuff>content="(.*)"
        x = re.search('twitter:title.*content="(.*)"', r.text)
        if x is not None:
            msg = msgProlog(partNo) + '"' + x.group(1) + '"'
            print(msg)
            caption = deHTML(x.group(1))
            desc.value = caption
        else:
            msg = msgProlog(partNo) + "<No attribute found>"
            logError(msg)
            print(msg)
except:
    print("Exception occurred", sys.exc_info()[0])
# Now save the workbook
wb.save(filename = workspace_name)
print("File saved")
print("=========================== Errors ===========================")
for msg in ErrorLog:
    print(msg)
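Applied to the original question, the same openpyxl approach is much shorter. A minimal sketch, assuming the listItems entries have the name and image fields shown in the OP (sample data inlined here instead of the live request):

```python
import openpyxl

# Sample data standing in for r.json()['listItems'] from the OP
list_items = [
    {"name": "Item A", "image": "https://example.com/a.jpg"},
    {"name": "Item B", "image": "https://example.com/b.jpg"},
]

wb = openpyxl.Workbook()
ws = wb.active
ws.append(["name", "image"])  # header row
for item in list_items:
    ws.append([item["name"], item["image"]])
wb.save("ranks.xlsx")
```

ws.append takes one list per row, so no cell coordinates are needed for a simple dump like this.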
I suggest using Pandas as an intermediary.
Convert your JSON to a pandas DataFrame, then write the DataFrame out to an .xlsx file.
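For the OP's case that would look something like this (a sketch, with the assumed listItems shape inlined instead of the live request; to_excel needs openpyxl installed to write .xlsx):

```python
import pandas as pd

# Sample data standing in for r.json()['listItems'] from the OP
list_items = [
    {"name": "Item A", "image": "https://example.com/a.jpg"},
    {"name": "Item B", "image": "https://example.com/b.jpg"},
]

# A list of dicts converts directly; columns= keeps just the two fields
df = pd.DataFrame(list_items, columns=["name", "image"])
df.to_excel("ranks.xlsx", index=False)  # index=False drops the row numbers
```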
(Mar-21-2021, 12:19 PM)jefsummers Wrote: I suggest using Pandas as an intermediary.
Convert your json to a Pandas dataframe, then convert the pandas dataframe to an xls file.
If I had JSON, this might apply, but I don't.
I used openpyxl because it was the first one I found. I've heard of Pandas; what is the advantage over what I have? I'm curious, since I don't actually know anything about Pandas.
See OP line 6. The response is converted to JSON.
Pandas is widely used for tabular data, machine learning, graphing (via matplotlib), and more.
Here is an introduction:
Brandon Rhodes - Pandas from PyCon 2015
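To make the advantage concrete: the manual column lookup and row scan in the openpyxl script above collapse to a couple of lines once the sheet is in a DataFrame. A sketch with hypothetical inventory data standing in for pd.read_excel("inventory.xlsx"):

```python
import pandas as pd

# Hypothetical stand-in for pd.read_excel("inventory.xlsx")
df = pd.DataFrame({
    "#": ["P1", "P2", "P3"],
    "Mfg": ["Acme", "Beta", "Acme"],
    "Description": [None, "Widget", None],
})

# Columns are addressed by header name, so no header-detection loop is needed
missing = df[df["Description"].isna()]
print(missing["#"].tolist())
```

Here missing holds just the rows still needing a description, which is exactly the set the while loop in the script walks the sheet to find.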