Python Forum
Read url from CSV and Scrape website
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Read url from CSV and Scrape website
#1
hi,

i have written a code to read data from CSV file and scrape data, but whenever i am running this code i receive HTTP error 400

I have url like 16k but in this i am posting only 5 url.

my csv file has two columns, 1 st one is ID and second one is URL.

whenever i am running this code it prints invalid url 400 http error.


these are my codes:
import csv
from bs4 import BeautifulSoup
import requests
import time
import os

data_obj = {}
final_data = []

def readfile():
    global data_obj
    file = "BOOK.CSV"
    f = open("./"+ file, "r")
    for row in f.readlines():
        lst = row.split(",")
        data_obj[lst[0]] = lst[1]#here reading dictionary
        #print(data_obj)
        f.close()

def writedata(alldata1, filename):
    print(" >>>> FINAL PRINTING DATA >>>> ")
    #import pdb; pdb.set_trace()
    with open("./"+filename,'w') as csvfile:
        csvfile = csv.writer(csvfile, delimiter=',')
        #csvfile.writerow(titleRow)
        csvfile.writerow("")
        for i in range(0, len( alldata1 )):
            #print(alldata1[i])
            csvfile.writerow( alldata1[i]  )

def parsedata():
    global data_obj, final_data
    for sublist in data_obj.keys():
        url = data_obj[sublist]
        #print(url)
        data = getdata(url,{})
        soup = BeautifulSoup(data, "html.parser")
        print(soup)
        
def getdata(url, values):
    r = requests.post(url, data=values, timeout=10)
    text = r.text
    r.close()
    return text


def main():
    readfile()
    parsedata()

main()
this is the error i received:
Error:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN""http://www.w3.org/TR/html4/strict.dtd"> <html><head><title>Bad Request</title> <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/></head> <body><h2>Bad Request - Invalid URL</h2> <hr/><p>HTTP Error 400. The request URL is invalid.</p> </body></html>
i am attaching the file, can someone please tell what should i do?

Attached Files

.csv   BOOK.csv (Size: 1,004 bytes / Downloads: 981)
Reply
#2
def getdata(url, values=None):
    r = requests.post(url, data=values, timeout=10)
    text = r.text
    r.close()
    return text
Your code is overcomplicated.

import csv
from bs4 import BeautifulSoup
import requests

def get(urls):
    for url in urls:
        yield requests.get(url).content.decode('utf-8')

with open('BOOK.csv') as csv_:
    reader = csv.reader(csv_)
    urls = [line[1] for line in urls if line]
    
    webpages = list(get(urls))
    
    for html in webpages:
        soup = BeautifulSoup(html, 'lxml')
        print(soup.pretify)
You will be able to put together the rest.
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply
#3
Thank you so much sir, it is great learning here.
Reply
#4
urls not defined i get this error while running this code
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Unable to Scrape Website muhamdasim 2 2,562 Dec-27-2021, 07:49 PM
Last Post: JohnRaven
  how to scrape a website from a keyword list greenpine 2 2,326 Dec-04-2020, 03:50 PM
Last Post: greenpine
  scrape data 1 go to next page scrape data 2 and so on alkaline3 6 5,087 Mar-13-2020, 07:59 PM
Last Post: alkaline3
  why I can't scrape a website? kmkim319 7 7,393 Sep-27-2019, 03:14 PM
Last Post: kmkim319
  How do i scrape website whose page changes using javsacript _dopostback function and Prince_Bhatia 1 7,157 Aug-06-2018, 09:45 AM
Last Post: wavic
  Scrape A tags from a website Prince_Bhatia 1 4,177 Oct-15-2017, 12:56 AM
Last Post: metulburr

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020