Python Forum
How can I use the following script?
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
How can I use the following script?
#1
Lightbulb 
Hello guys,

This is my first time using Python and I'm very excited about it.
I'm trying Scrape some data from a website (www.zapimoveis.com.br), and I found a developer who created a project on Pypi jus to do it. His project can be accessed via (https://pypi.org/project/zapimoveis-scraper/), and installed via (pip install zapimoveis-scraper).

My question is, how can I extract the desired data and export it to an Excel Spreadsheet? There is no tutorial or infos available on his website. Basically I would like to know how can I use his project/ codes?

At the Usage Example of his page he just says:
zapimoveis_scraper.search(localization=”go+goiania++setor-oeste”, num_pages=5)

The objects returned from search contain the following attributes:
description: property description
price: property price (monthly)
bedrooms: number of bedrooms on property
bathrooms: number of bathrooms on property
total_area_m2: property area (square meters)
vacancies: parking spots available on property
address: property address

Maybe it's a very simple and stupid question for you guys... but as I mentioned, I'm just beggining and I'm learning a lot!

Thank you
Reply
#2
You use it like this after install.
# zap.py
import zapimoveis_scraper as zap

result  = zap.search(localization="go+goiania++setor-oeste", num_pages=5) 
print(result[0].description)
print(result[0].price)
print(result[0].bedrooms)
Output:
λ python zap.py A casa no bairro Setor Oeste possui 100 metros quadrados com 3 quartos sendo 1 suite e 2 banheiros, 02 salas, cozinha, area de serviço. mais R$ 1.700 3
brunolelli Wrote:My question is, how can I extract the desired data and export it to an Excel Spreadsheet?
There are serval ways i find pandas most easy as get a look like Excel and has read_excel() to_excel() methods.
A other way is openpyxl
Reply
#3
Thank you so much!
It's working...

Would you mind providing me an example using pandas?

How can I extract all outputs and send the results to an excel .csv file?


Thank you!
Reply
#4
(Mar-17-2021, 12:34 AM)brunolelli Wrote: Would you mind providing me an example using pandas?

How can I extract all outputs and send the results to an excel .csv file?
Can look this post it will be in similar way.
So you can loop over result as show in post and save in eg a dictionary for easier import to pandas.
As this is your first time using Python,so can there be some struggle with this if lack the basic knowledge.
Reply
#5
Thank you again!
I'm using the following code. What are your thoughts?

df = pd.DataFrame([], columns=['Address', 'Price', 'Total Area', 'Bathrooms', 'Badrooms', 'Vacancies', 'Description'])

for i in range(5):
    df = df.append({'Address': result[i].address, 'Price': result[i].price, 'Total Area': result[i].total_area_m2, 'Bathrooms': result[i].bathrooms, 'Badrooms':result[i].bedrooms, 'Vacancies':result[i].vacancies, 'Description':result[i].description}, ignore_index=True)

df.to_csv (r'C:\Users\bruno\Desktop\FII\Zap.csv', index = False, header=True, sep=';')
print(df)
As you can see here, I'm just looping through 5 items. How can I loop through all items? And it's not a fixed number, it varies...
Thank you
Reply
#6
I think I found a solution:

What do you think about it? Is it appropriate?

ofertas = len(result)
print(ofertas)

df = pd.DataFrame([], columns=['Address', 'Price', 'Total Area', 'Bathrooms', 'Badrooms', 'Vacancies', 'Description'])

for i in range(ofertas):
    df = df.append({'Address': result[i].address, 'Price': result[i].price, 'Total Area': result[i].total_area_m2, 'Bathrooms': result[i].bathrooms, 'Badrooms':result[i].bedrooms, 'Vacancies':result[i].vacancies, 'Description':result[i].description}, ignore_index=True)

df.to_csv (r'C:\Users\bruno\Desktop\FII\Zap.csv', index = False, header=True, sep=';')
print(df)
Reply
#7
Here is a script I wrote a couple days ago. Its purpose is to go out to a Web site, find the product page, and add the product description to my spreadsheet. The spreadsheet starts out with an ID number in column A and an (empty) description in column E. If the description is blank or empty, it queries the Web site. I've elided some details.
import requests
import re
import sys
import openpyxl

ErrorLog = []
debug = False
# *******************************************************************
# logError
# Inputs:
#    s: a string to add to the error log
# *******************************************************************
def logError(s):
    ErrorLog.append(s)
# *******************************************************************
# msgProlog
# Inputs:
#    C. which is a cell value
# Result: string
#    The standard prolog to a message, which includes a trailing space
#    'A1: "123" '
# *******************************************************************
def msgProlog(C):
    return (C.coordinate + ": \"" + str(C.value) + '" ')
# *******************************************************************
# deHTML
# Inputs:
#    s: A string
# Result: string
#    The input string with all the &...; codes replaced
# *******************************************************************
def deHTML(s):
    t = s
    t = t.replace('"', '"')
    t = t.replace('&lt', '<')
    t = t.replace('&gt', ">")
    t = t.replace('&amp;', '&')
    return t
# *******************************************************************
# Main program
# *******************************************************************
# Open the workspace
workspace_name = "inventory.xlsx"
try:
   wb = openpyxl.load_workbook(filename = workspace_name)
except:
    print("Error opening file ", sys.exc_info()[0])
    raise
sheet = wb.active

# Set the default columns in case we can't figure out what is going on
colPartNumber = 1
captionPartNumber = "#"
colManufacturer = 2
captionManufacturer = "Mfg"
colDescription = 5
captionDescription = "Description"

# First, identify the columns that have the headings
# "#" colPartNumber
# "Mfg" colManufacturer
# "Description" colDescription

for col in range(1,20):
    cell = sheet.cell(1, col)
    caption = sheet.cell(1, col).value
    if caption == None:
       print(cell.coordinate, end=": ")
       print("<None>")
       break

    print(cell.coordinate + ":\"" + caption + "\"")
    if caption == captionPartNumber:
        colPartNumber = col
        print(captionPartNumber + ' = ' + str(col))
    elif caption == captionManufacturer:
        colManufacturer = col
        print(captionManufacturer + ' = ' + str(col))
    elif caption == captionDescription:
        colDescription = col
        print(captionDescription + ' = ' + str(col))

print("")

# Now we can iterate through the cells
row = 1
try: 
    while(True):
        row += 1
        # Get the part #
        partNo = sheet.cell(row = row, column = colPartNumber)
        part = partNo.value
        if(part == None):
           print(partNo.coordinate, end=": ")
           print("<None>")
           break;

    # do we already have a description?
        desc = sheet.cell(row = row, column = colDescription)
        if desc.value != None:
            # We have a description.  If it has not been deHTMLed, do so
            nicedesc = deHTML(desc.value)
            if(desc.value != nicedesc):
               msg = msgProlog(partNo) + '"' + desc.value + '"=>"' + nicedesc + '"****************'
               print(msg)
               logError(msg)
               desc.value = nicedesc
               continue
            msg = msgProlog(partNo) + ' already has description "' + desc.value + '"'
            print(msg)
            continue

    # We do not have a description.  Go find it
        try:
            r = requests.get('https://www.elided.com/products/' + str(partNo.value))

            # If there was an error, throw an exception
            r.raise_for_status()
        except:
            msg = msgProlog(partNo) + ' "' + str(r) + '"'
            print(msg)
            logError(msg)
            print(r)
            continue
        # The best description is found in an entry that says
        # twitter:title<useless stuff>content="(.*)"
        x = re.search('twitter:title.*content="(.*)"', r.text)
        if x != None:
            msgProlog(partNo) + '"' + x.group(1) + '"'
            print(msg)
            caption = x.group(1)
            caption = deHTML(caption)
            desc.value = caption
        else:
            msg = msgProlog(partNo) + "<No attribute found>"
            logError(msg)
            print(msg)
except:
    print("Exception occurred", sys.exc_info()[0])

# Now save the workbook
wb.save(filename = workspace_name)
print("File saved")
print("=========================== Errors ===========================")
for msg in ErrorLog:
   print(msg)
 
brunolelli likes this post
Reply
#8
(Mar-17-2021, 01:50 AM)brunolelli Wrote: What do you think about it? Is it appropriate?
If it give you what wanted then is okay👍
I can not run it as it not all code,line length get way to long.
ofertas = len(result)
print(ofertas)

df = pd.DataFrame(
    [],
    columns=[
        "Address",
        "Price",
        "Total Area",
        "Bathrooms",
        "Badrooms",
        "Vacancies",
        "Description",
    ],
)

for i in range(ofertas):
    df = df.append(
        {
            "Address": result[i].address,
            "Price": result[i].price,
            "Total Area": result[i].total_area_m2,
            "Bathrooms": result[i].bathrooms,
            "Badrooms": result[i].bedrooms,
            "Vacancies": result[i].vacancies,
            "Description": result[i].description,
        },
        ignore_index=True,
    )

df.to_csv(r"C:\Users\bruno\Desktop\FII\Zap.csv", index=False, header=True, sep=";",)
print(df)
brunolelli likes this post
Reply
#9
Thank you so much guys!

You guys are the best!

Thank you
Reply


Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020