Python/BeautifulSoup. list of urls -> parse -> extract data to csv. getting ERROR
#1
I have a list of urls in a csv file (I can either host said file on my local machine or online). I need to pull biz name, address, and phone # from the web pages in the list. I have all of the correct class names. I want to extract this data to a csv with the aforementioned columns.

From the csv:

https://slicelife.com/restaurants/wi/mil...aukee/menu
https://slicelife.com/restaurants/nj/nor...hvale/menu
https://slicelife.com/restaurants/mn/man...pizza/menu
https://slicelife.com/restaurants/pa/new...k-hut/menu


When I run the code, it will create a csv with the desired column headers, but no data due to errors. I CAN pull data from the scraped urls one at a time like this:


# locationRawData = soup.find('div', attrs={"class": "f19xeu2d"}).text.encode('utf-8'), 
# pizzeriaName = soup.find('h1', attrs={"class": "f13p7rsj"}).text.encode('utf-8'),
# address = soup.find('address', attrs={"class": "f1lfckhr"}).text.encode('utf-8'),
# phoneNumber = soup.find('button', attrs={"class": "f12gt8lx"}).text.encode('utf-8'),
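For reference, a minimal self-contained sketch of that one-page-at-a-time approach, assuming the class names in the snippet above are still present on the page (the url is a placeholder for one full address from the csv):

from bs4 import BeautifulSoup
import requests

url = 'https://slicelife.com/restaurants/...'  # placeholder: paste one full url from aliveSlice.csv
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# These assume every element exists; a missing class name would raise AttributeError here.
print('Name    :', soup.find('h1', class_='f13p7rsj').get_text(strip=True))
print('Address :', soup.find('address', class_='f1lfckhr').get_text(strip=True))
print('Phone   :', soup.find('button', class_='f12gt8lx').get_text(strip=True))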
I have tried:

from bs4 import BeautifulSoup
import json
import csv
from urllib.request import urlopen


TrattoriArray = []
with open('aliveSlice.csv', 'r') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        TrattoriArray.append(url)  # Add each csv row (a one-element list holding the url) to the list

pizzaArray = []
for url in TrattoriArray:  # Fetch and parse each url in the list.
    page = urlopen(url[0]).read()  # read() returns bytes
    content = BeautifulSoup(page, "html.parser")  # pass the bytes straight in; bytes has no .content

    # Scrape inside the loop, otherwise only the last page parsed gets used.
    for pizzeria in content.findAll('div', attrs={"class": "f19xeu2d"}):
        pizzeriaObject = {
            "pizzeriaName": pizzeria.find('h1', attrs={"class": "f13p7rsj"}).text,
            "address": pizzeria.find('address', attrs={"class": "f1lfckhr"}).text,
            # tag and class were swapped here; rc-c2d-number looks like the class on a span
            "phoneNumber": pizzeria.find('span', attrs={"class": "rc-c2d-number"}).text,
        }
        pizzaArray.append(pizzeriaObject)

with open('pizzeriaData.json', 'w') as outfile:
    json.dump(pizzaArray, outfile)  # plain str values; bytes from .encode() would make json.dump raise TypeError
and

import requests
from bs4 import BeautifulSoup
import csv

with open('aliveSCRAPE.csv', newline='') as f_urls, open('output.csv', 'w', newline='') as f_output:
    csv_urls = csv.reader(f_urls)
    csv_output = csv.writer(f_output)
    csv_output.writerow(['locationRawData' , 'pizzeriaName' , 'address', 'Phone'])

    for line in csv_urls:
        r = requests.get(line[0])  # keep the Response object; .text would be a str with no .content
        soup = BeautifulSoup(r.content, 'lxml')

        locationRawData = soup.find('h1')
        print('RAW :', locationRawData.text)

        pizzeriaName = soup.find('h1', class_='f13p7rsj').text
        pizzeria_name = pizzeriaName.split(':')  # split the h1 text, not the undefined name 'pizzeria'
        print('pizzeriaName:', pizzeria_name[1])

        address = soup.find_all('address', class_='f1lfckhr')
        print('Address :', address[2].text)

        phoneNumber = soup.find_all('button', class_='f12gt8lx')
        print('Phone :', phoneNumber[3].text)

        locationRawData = soup.find_all('div', class_='f19xeu2d')
        print('RAW :', locationRawData[4].text)

        csv_output.writerow([locationRawData.text, pizzeria_name[1], address[2].text, phoneNumber[3].text])
And... a few other methods. Which is the easiest? This is literally the first thing I have ever programmed in Python.
...\Desktop\scrapeYourPlate\test\Code>Python scrape.py
RAW : Bakers Buck Hut
Traceback (most recent call last):
  File "scrape.py", line 98, in <module>
    print('pizzeriaName:', pizzeriaName[1].text)
  File ...AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\bs4\element.py", line 1016, in __getitem__
    return self.attrs[key]
KeyError: 1
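For comparison, a combined sketch of the whole csv-in -> csv-out pipeline described at the top of the post, assuming requests actually receives the rendered markup (see the reply below about JavaScript). The first_text helper is introduced here for illustration, not taken from the original code; it writes an empty cell when an element is missing instead of crashing as above:

from bs4 import BeautifulSoup
import requests
import csv

def first_text(soup, tag, cls):
    """Return the stripped text of the first matching element, or '' if it is missing."""
    el = soup.find(tag, class_=cls)
    return el.get_text(strip=True) if el else ''

with open('aliveSlice.csv', newline='') as f_urls, open('output.csv', 'w', newline='') as f_output:
    writer = csv.writer(f_output)
    writer.writerow(['pizzeriaName', 'address', 'phoneNumber'])

    for row in csv.reader(f_urls):
        soup = BeautifulSoup(requests.get(row[0]).content, 'html.parser')
        writer.writerow([
            first_text(soup, 'h1', 'f13p7rsj'),
            first_text(soup, 'address', 'f1lfckhr'),
            first_text(soup, 'button', 'f12gt8lx'),
        ])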
#2
This page has enough JavaScript in it that I would get the first page using Selenium, then use Beautiful Soup to get the details.
There are examples of this on this forum under tutorials/web scraping (by snippsat).
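A minimal sketch of that Selenium-then-BeautifulSoup approach, assuming Chrome plus a matching chromedriver are installed and reusing the class names from the first post; the fixed time.sleep is a crude stand-in for a proper wait:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import csv
import time

options = Options()
options.add_argument('--headless')          # run without opening a browser window
driver = webdriver.Chrome(options=options)  # assumes chromedriver is on PATH

with open('aliveSlice.csv', newline='') as f_urls, open('output.csv', 'w', newline='') as f_output:
    writer = csv.writer(f_output)
    writer.writerow(['pizzeriaName', 'address', 'phoneNumber'])

    for row in csv.reader(f_urls):
        driver.get(row[0])
        time.sleep(3)  # give the JavaScript time to render the page
        soup = BeautifulSoup(driver.page_source, 'html.parser')

        name = soup.find('h1', class_='f13p7rsj')
        address = soup.find('address', class_='f1lfckhr')
        phone = soup.find('button', class_='f12gt8lx')
        writer.writerow([
            name.get_text(strip=True) if name else '',
            address.get_text(strip=True) if address else '',
            phone.get_text(strip=True) if phone else '',
        ])

driver.quit()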
#3
(Jul-04-2019, 01:44 AM)Larz60+ Wrote: This page has enough JavaScript in it that I would get the first page using Selenium, then use Beautiful Soup to get the details.
There are examples of this on this forum under tutorials/web scraping (by snippsat)

Okay, I will research that! Thank you for your reply! Right now I feel like I don't know what I don't know, and I don't even know what to search for unless I am pointed in the right direction, as you took the time to do! Thanks!

