Python Forum
Python/BeautifulSoup. list of urls ->parse->extract data to csv. getting ERROR
#1
I have a list of urls in a csv file (I can either host said file on my local machine or online). I need to pull biz name, address, and phone # from the web pages in the list. I have all of the correct class names. I want to extract this data to a csv with the aforementioned columns.

From the csv:

https://slicelife.com/restaurants/wi/mil...aukee/menu
https://slicelife.com/restaurants/nj/nor...hvale/menu
https://slicelife.com/restaurants/mn/man...pizza/menu
https://slicelife.com/restaurants/pa/new...k-hut/menu


When I run the code, it will create a csv with the desired column headers, but no data due to errors. I CAN pull data from the scraped urls one at a time like this:


locationRawData = soup.find('div', attrs={"class": "f19xeu2d"}).text.encode('utf-8')
pizzeriaName = soup.find('h1', attrs={"class": "f13p7rsj"}).text.encode('utf-8')
address = soup.find('address', attrs={"class": "f1lfckhr"}).text.encode('utf-8')
phoneNumber = soup.find('button', attrs={"class": "f12gt8lx"}).text.encode('utf-8')
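One thing worth guarding against: when a selector misses, `soup.find()` returns None and the chained `.text` raises AttributeError, which is one way a batch run can die before writing any rows. A small guard helper could look like this (a sketch; `safe_text` and `FakeTag` are made-up names, and `FakeTag` only stands in for a bs4 Tag so the snippet runs offline):

```python
def safe_text(tag, default=''):
    """Return tag.text stripped, or default when find() returned None."""
    return tag.text.strip() if tag is not None else default


class FakeTag:
    """Minimal stand-in for a bs4 Tag; only .text is used here."""
    def __init__(self, text):
        self.text = text


name = safe_text(FakeTag('  Bakers Buck Hut '))  # a tag that was found
phone = safe_text(None, 'n/a')                   # a selector that matched nothing
```

With bs4, the call site becomes `safe_text(soup.find('h1', class_='f13p7rsj'))`, so a missing element yields an empty cell instead of a crash.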
I have tried:

from bs4 import BeautifulSoup
import json
import csv
from urllib.request import urlopen


TrattoriArray = []
with open('aliveSlice.csv', 'r') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        TrattoriArray.append(url)  # Add each row (a one-element list) to the list

pizzaArray = []
for url in TrattoriArray:  # Parse each url in the list.
    page = urlopen(url[0]).read()  # read() returns bytes; pass them straight to BeautifulSoup
    content = BeautifulSoup(page, "html.parser")

    # Scrape inside the loop, otherwise only the last page is parsed
    for pizzeria in content.findAll('div', attrs={"class": "f19xeu2d"}):
        pizzeriaObject = {
            # .text (not .text.encode()) keeps the values JSON-serializable
            "pizzeriaName": pizzeria.find('h1', attrs={"class": "f13p7rsj"}).text,
            "address": pizzeria.find('address', attrs={"class": "f1lfckhr"}).text,
            # tag name and attribute were swapped in the original
            "phoneNumber": pizzeria.find('span', attrs={"class": "rc-c2d-number"}).text,
        }
        pizzaArray.append(pizzeriaObject)

with open('pizzeriaData.json', 'w') as outfile:
    json.dump(pizzaArray, outfile)
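Since the stated goal is a CSV rather than JSON, the same list of dicts could also be written with `csv.DictWriter` (a sketch using an in-memory buffer and made-up sample data, so it runs without scraping anything):

```python
import csv
import io

pizzaArray = [
    {"pizzeriaName": "Bakers Buck Hut", "address": "1 Main St", "phoneNumber": "555-0100"},
]

buf = io.StringIO()  # in practice: open('pizzeriaData.csv', 'w', newline='')
writer = csv.DictWriter(buf, fieldnames=["pizzeriaName", "address", "phoneNumber"])
writer.writeheader()          # header row from fieldnames
writer.writerows(pizzaArray)  # one row per dict
csv_text = buf.getvalue()
```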
and

import requests
from bs4 import BeautifulSoup
import csv

with open('aliveSCRAPE.csv', newline='') as f_urls, open('output.csv', 'w', newline='') as f_output:
    csv_urls = csv.reader(f_urls)
    csv_output = csv.writer(f_output)
    csv_output.writerow(['locationRawData', 'pizzeriaName', 'address', 'Phone'])

    for line in csv_urls:
        r = requests.get(line[0])  # keep the Response; a str from .text has no .content
        soup = BeautifulSoup(r.content, 'lxml')

        locationRawData = soup.find('h1')
        print('RAW :', locationRawData.text)

        pizzeriaName = soup.find('h1', class_='f13p7rsj').text
        pizzeria_name = pizzeriaName.split(':')  # was pizzeria.split(':'), an undefined name
        print('pizzeriaName:', pizzeria_name[1])

        address = soup.find_all('address', class_='f1lfckhr')
        print('Address :', address[2].text)

        phoneNumber = soup.find_all('button', class_='f12gt8lx')
        print('Phone :', phoneNumber[3].text)

        rawDivs = soup.find_all('div', class_='f19xeu2d')  # find_all returns a list
        print('RAW :', rawDivs[4].text)

        csv_output.writerow([rawDivs[4].text, pizzeria_name[1], address[2].text, phoneNumber[3].text])
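For what it's worth, the traceback that follows comes from indexing a single Tag: `pizzeriaName[1]` on a Tag is an attribute lookup, hence `KeyError: 1`. The hard-coded list indices (`address[2]`, `phoneNumber[3]`) fail similarly when `find_all()` returns fewer matches than expected. One way to make each row tolerant of both (a sketch; `pick` and `FakeTag` are made-up names, with `FakeTag` standing in for a bs4 Tag so this runs offline):

```python
def pick(items, index, default='n/a'):
    """Return items[index].text stripped, or default when the list is too short."""
    try:
        return items[index].text.strip()
    except IndexError:
        return default


class FakeTag:
    """Minimal stand-in for a bs4 Tag; only .text is used here."""
    def __init__(self, text):
        self.text = text


tags = [FakeTag(' Pizza Place '), FakeTag('123 Main St')]
row = [pick(tags, 0), pick(tags, 1), pick(tags, 5)]  # index 5 is out of range
```

With bs4, `tags` would be the result of `soup.find_all(...)`, and a short page produces a placeholder cell instead of a traceback.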
And... a few other methods. Which is the easiest? This is literally the first thing I have ever programmed in Python.
...\Desktop\scrapeYourPlate\test\Code>Python scrape.py
RAW : Bakers Buck Hut
Traceback (most recent call last):
  File "scrape.py", line 98, in <module>
    print('pizzeriaName:', pizzeriaName[1].text)
  File ...AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\bs4\element.py", line 1016, in __getitem__
    return self.attrs[key]
KeyError: 1
#2
This page has enough JavaScript in it that I would get the first page using Selenium, then use BeautifulSoup to get the details.
There are examples of this on this forum under Tutorials/Web Scraping (by snippsat).
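The usual pattern behind that suggestion is to let Selenium render the page and hand `driver.page_source` to a parser. Since neither a browser nor bs4 can run in this snippet, the Selenium lines are shown as comments and the class-based text extraction is demonstrated with the stdlib `html.parser` on a canned string instead (the class name comes from the posts above; in real use you would parse `driver.page_source` with BeautifulSoup):

```python
from html.parser import HTMLParser

# In practice (requires selenium plus a browser driver):
#   from selenium import webdriver
#   from bs4 import BeautifulSoup
#   driver = webdriver.Chrome()
#   driver.get(url)
#   soup = BeautifulSoup(driver.page_source, 'html.parser')
#   driver.quit()


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given class attribute."""
    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0   # > 0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if self.depth or self.wanted_class in classes:
            self.depth += 1
            if self.depth == 1:
                self.texts.append('')

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


html = '<div><h1 class="f13p7rsj">Bakers Buck Hut</h1></div>'
parser = ClassTextExtractor('f13p7rsj')
parser.feed(html)
name = parser.texts[0].strip() if parser.texts else 'not found'
```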
#3
(Jul-04-2019, 01:44 AM)Larz60+ Wrote: This page has enough JavaScript in it that I would get the first page using Selenium, then use BeautifulSoup to get the details.
There are examples of this on this forum under Tutorials/Web Scraping (by snippsat).

Okay, I will research that! Thank you for your reply! Right now I feel like I don't know what I don't know, and I don't even know what to search for unless I am pointed in the right direction, as you took the time to do. Thanks!