Web Crawler Not Working
#1
Hi All, Not sure why my crawler isn't working. It's pretty simple: it pulls out the href, the brand, and the fit of t-shirts from a website. It manages to get the fit correct, but the href and the brand just repeat themselves for every output. I can't find the error. Thanks, Chris

I've had to remove the http and www from in front of 'theiconic' as it wouldn't let me post with web links.


import requests
from bs4 import BeautifulSoup



def iconic_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = theiconic.com.au/mens-clothing-tshirts-singlets/?page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a', {'class': 'product-details'}):
            href = theiconic.com.au/' + link.get('href')
        for link in soup.findAll('span', {'class': 'brand'}):
            brand = link.string
        for link in soup.findAll('span', {'class': 'name'}):
            fit = link.string
            print(href)
            print(brand)
            print(fit)
        page += 1

iconic_spider(2)
#2
Haven't run your code just yet... doing some much-needed cleaning before I reboot, but at a quick look: you did not close the quote in your url = ... line.

(Jan-24-2017, 12:54 PM)chrisdas Wrote: Hi All, Not sure why my crawler isn't working. [...]
OKAY! I got your code to work with a couple of edits... simple mistakes, really... But before I point them out, I'd ask you to run your script and read the error: 97% of the time, in my experience, the stack trace states the immediate underlying error right at its very beginning or end...
#3
The url is not quoted

url = theiconic.com.au/mens-clothing-tshirts-singlets/?page=' + str(page)
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply
#4
(Feb-02-2017, 11:23 PM)wavic Wrote: The url is not quoted

theiconic.com.au/mens-clothing-tshirts-singlets/?page='

LOL, well, since we're making things easy for the... poster! =( I was trying to have an educational moment here! Though I did point that out already... and that's not the only error.
Reply
#5
The page is JS generated.
Requests can't handle such a site.
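If the page really is rendered client-side, the usual workaround is a headless browser. Below is a minimal sketch with Selenium 4 (which fetches a matching chromedriver on its own); the URL and CSS classes are simply the ones from this thread, so treat it as an illustration rather than a tested scraper:

from bs4 import BeautifulSoup
from selenium import webdriver

# Render the JS-generated page in headless Chrome, then hand the
# resulting HTML to BeautifulSoup exactly as before.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('http://theiconic.com.au/mens-clothing-tshirts-singlets/?page=1')
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

for link in soup.findAll('a', {'class': 'product-details'}):
    print(link.get('href'))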
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply
#6
(Feb-03-2017, 12:39 AM)wavic Wrote: The page is JS generated.
Requests can't handle such a site.

I didn't check the actual page in a browser, and though I don't doubt what you're saying, you can use requests to get the entire JS script and regex out what you need! But yeah... I did run it with some minor edits... like closing the URL string (though there was more to it, if you're familiar with bs4), another edit here and there... boom


Output:
theiconic.com.au//afends-tee-402251.html
Afends
90s Short Sleeve T-Shirt
theiconic.com.au//afends-tee-402251.html
Afends
LA Skull Tee
theiconic.com.au//afends-tee-402251.html
Afends
California Tee
theiconic.com.au//afends-tee-402251.html
Afends
New May Crew-Neck Tee
theiconic.com.au//afends-tee-402251.html
Afends
Crew Neck Print T-Shirt
theiconic.com.au//afends-tee-402251.html
Afends
Breathe Hyper Dry SS Training Top
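For what it's worth, the requests-plus-regex idea mentioned above would look roughly like the sketch below. The window.__INITIAL_STATE__ variable name is purely hypothetical and would have to be checked against the page's actual source; the non-greedy match is also naive about nested '};' sequences:

import json
import re

import requests

html = requests.get('http://theiconic.com.au/mens-clothing-tshirts-singlets/?page=1').text
# Hypothetical: many JS-rendered pages embed their data as a JSON blob
# inside a <script> tag. The variable name below is illustrative only.
match = re.search(r'window\.__INITIAL_STATE__\s*=\s*(\{.*?\});', html, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    print(list(data.keys()))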
#7
Let's see the changes.
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply
#8
(Feb-03-2017, 01:34 AM)wavic Wrote: Let's see the changes.

As you wish! Though I was hoping to show the OP that the stack trace says exactly where the error was... but whatever.

__author__ = 'Erick'
import requests
from bs4 import BeautifulSoup


def iconic_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://theiconic.com.au/mens-clothing-tshirts-singlets/?page=' + str(page)  # quote closed, http:// added
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a', {'class': 'product-details'}):
            href = 'theiconic.com.au/' + link.get('href')  # the other unclosed quote, now fixed
        for link in soup.findAll('span', {'class': 'brand'}):
            brand = link.string
        for link in soup.findAll('span', {'class': 'name'}):
            fit = link.string
            print(href)
            print(brand)
            print(fit)
        page += 1

iconic_spider(2)
I swear there was some other minor edit made but yeah, there ya go.
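Worth noting: even with those edits, the output above repeats the same href and brand for every product, which is exactly the behaviour the OP asked about. Each for loop runs to completion before the final loop prints anything, so href and brand are left holding only the last match on the page while fit varies. A minimal sketch of one way to keep the three fields in step, zipping the result lists (assuming, as the run above suggests, that the markup is present in the fetched HTML):

import requests
from bs4 import BeautifulSoup

url = 'http://theiconic.com.au/mens-clothing-tshirts-singlets/?page=1'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Collect each product's pieces once, then walk the three lists in
# lockstep so href/brand are not overwritten by the page's last match.
links = soup.findAll('a', {'class': 'product-details'})
brands = soup.findAll('span', {'class': 'brand'})
names = soup.findAll('span', {'class': 'name'})
for link, brand, name in zip(links, brands, names):
    print('theiconic.com.au/' + link.get('href'))
    print(brand.string)
    print(name.string)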
#9
This is strange. I am unable to get even the product-details class.
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply
#10
Nice that you fixed the code @scriptso.
(Feb-03-2017, 03:33 AM)wavic Wrote: This is strange. I am unable to get even the product-details class.
It does work.
Some formatting to see the data better:
__author__ = 'Erick'
import requests
from bs4 import BeautifulSoup

def iconic_spider(max_pages):
    page = 1
    print('******* page 1 ********')
    while page <= max_pages:
        url = 'http://theiconic.com.au/mens-clothing-tshirts-singlets/?page={}'.format(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a', {'class': 'product-details'}):
            href = 'theiconic.com.au/' + link.get('href')
        for link in soup.findAll('span', {'class': 'brand'}):
            brand = link.string
        for link in soup.findAll('span', {'class': 'name'}):
            fit = link.string
            print('-----------')
            print(href)
            print(brand)
            print(fit)
        print('******* page {} ********'.format(page + 1))
        page += 1

if __name__ == '__main__':
    pages = 2
    iconic_spider(pages)
Output:
******* page 1 ********
-----------
theiconic.com.au//basic-crew-neck-pima-tee-363464.html
Lacoste
90s Short Sleeve T-Shirt
-----------
theiconic.com.au//basic-crew-neck-pima-tee-363464.html
Lacoste
LA Skull Tee
-----------
theiconic.com.au//basic-crew-neck-pima-tee-363464.html
Lacoste
The Original Print Tee
-----------
theiconic.com.au//basic-crew-neck-pima-tee-363464.html
Lacoste
Men's Zonal Cooling Relay SS Tee
-----------
.......... etc
******* page 2 ********
-----------
theiconic.com.au//venice-address-tee-199234.html
Deus Ex Machina
Basic Crew-Neck Pima Tee
-----------
theiconic.com.au//venice-address-tee-199234.html
Deus Ex Machina
Crawley Tee
.......... etc
******* page 3 ********