Thread: Web Crawler Not Working (/thread-1758.html), Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development
Web Crawler Not Working - chrisdas - Jan-24-2017

Hi All,

Not sure why my crawler isn't working. It's pretty simple, pulling out the href, the brand, and the fit of t-shirts from a website. It manages to get the fit correct, but the href and the brand just loop and repeat themselves for every output. Can't find the error.

Thanks,
Chris

I've had to remove the http and www from in front of 'theiconic' as it wouldn't let me post with web links.

import requests
from bs4 import BeautifulSoup

def iconic_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = theiconic.com.au/mens-clothing-tshirts-singlets/?page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a', {'class': 'product-details'}):
            href = theiconic.com.au/' + link.get('href')
        for link in soup.findAll('span', {'class': 'brand'}):
            brand = link.string
        for link in soup.findAll('span', {'class': 'name'}):
            fit = link.string
            print(href)
            print(brand)
            print(fit)
        page += 1

iconic_spider(2)

RE: Web Crawler Not Working - scriptso - Feb-02-2017

Have not run your code just yet... doing some much-needed cleaning, but before I reboot and do run it: at a quick look, you did not close the url ... in url = ....

(Jan-24-2017, 12:54 PM)chrisdas Wrote: Hi All, Not sure why my crawler isn't working. It's pretty simply pulling out the href, the brand, and the fit of t-shirts from a website. It manages to get the fit correct but the href and the brand just loop and repeat themselves for every output. Can't find the error. Thanks, Chris

RE: Web Crawler Not Working - wavic - Feb-02-2017

The url is not quoted:

url = theiconic.com.au/mens-clothing-tshirts-singlets/?page=' + str(page)

RE: Web Crawler Not Working - scriptso - Feb-02-2017

(Feb-02-2017, 11:23 PM)wavic Wrote: The url is not quoted

LOL well since we are making things easy for the ... poster! =( I was trying to have an educational moment here! though I pointed that out ...
but that's not the only error

RE: Web Crawler Not Working - wavic - Feb-03-2017

The page is JS generated. Requests can't handle such a site.

RE: Web Crawler Not Working - scriptso - Feb-03-2017

(Feb-03-2017, 12:39 AM)wavic Wrote: The page is JS generated.

I didn't check the actual page in a browser, and though I do not doubt what you're saying, you can use requests to get the entire JS script, and regex to get what ya need! But yeah... I did run the code with some minor edits, like the closing of the URL (but there was more to it if you're familiar with bs4), another edit here and there... boom

[Image: 17wIa]
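
The idea mentioned above, fetching the raw page source and pulling data out of an embedded script with a regex, could look roughly like the sketch below. Everything specific in it is an assumption for illustration: the variable name window.__PRODUCTS__, the helper name extract_embedded_json, and the sample HTML are made up, and nothing here is confirmed about how theiconic.com.au actually embeds its data. In real use the html string would come from requests.get(url).text.

```python
import json
import re


def extract_embedded_json(html, var_name):
    """Return the JSON object assigned to var_name in the page source, or None."""
    # Non-greedy match from "var_name = {" up to the first "}" that is followed by ";".
    pattern = re.escape(var_name) + r"\s*=\s*(\{.*?\})\s*;"
    match = re.search(pattern, html, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        # The captured text was not valid JSON (e.g. a plain JS object literal).
        return None


# Hypothetical page source; a real site will use its own variable name and layout.
SAMPLE_HTML = """
<html><body>
<script>
window.__PRODUCTS__ = {"items": [{"brand": "Acme", "name": "Slim Fit Tee"}]};
</script>
</body></html>
"""

data = extract_embedded_json(SAMPLE_HTML, "window.__PRODUCTS__")
for item in data["items"]:
    print(item["brand"], item["name"])
```

A regex like this is fragile: it breaks if the data contains "};" inside a string, so it is only worth trying when the page really does ship its data as JSON inside a script tag.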
RE: Web Crawler Not Working - wavic - Feb-03-2017

Let's see the changes

RE: Web Crawler Not Working - scriptso - Feb-03-2017

(Feb-03-2017, 01:34 AM)wavic Wrote: Lets see the changes

As you wish! Though I was hoping to point out to the OP that the stack trace says exactly where the error was, but w.ez

_author_ = 'Erick'
import requests
from bs4 import BeautifulSoup

def iconic_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://theiconic.com.au/mens-clothing-tshirts-singlets/?page=' + str(page)  # closed out, add http.
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a', {'class': 'product-details'}):
            href = 'theiconic.com.au/' + link.get('href')  # another error in not closing!
        for link in soup.findAll('span', {'class': 'brand'}):
            brand = link.string
        for link in soup.findAll('span', {'class': 'name'}):
            fit = link.string
            print(href)
            print(brand)
            print(fit)
        page += 1

iconic_spider(2)

I swear there was some other minor edit made but yeah, there ya go.

RE: Web Crawler Not Working - wavic - Feb-03-2017

This is strange. I am unable to get even the product-details class.

RE: Web Crawler Not Working - snippsat - Feb-03-2017

Nice that you fixed the code @scriptso.

(Feb-03-2017, 03:33 AM)wavic Wrote: This is strange. I am unable to get even product-details class

It does work. Some formatting to better see the data.
_author_ = 'Erick'
import requests
from bs4 import BeautifulSoup

def iconic_spider(max_pages):
    page = 1
    print('******* page 1 ********')
    while page <= max_pages:
        url = 'http://theiconic.com.au/mens-clothing-tshirts-singlets/?page={}'.format(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a', {'class': 'product-details'}):
            href = 'theiconic.com.au/' + link.get('href')
        for link in soup.findAll('span', {'class': 'brand'}):
            brand = link.string
        for link in soup.findAll('span', {'class': 'name'}):
            fit = link.string
            print('-----------')
            print(href)
            print(brand)
            print(fit)
        print('******* page {} ********'.format(page+1))
        page += 1

if __name__ == '__main__':
    pages = 2
    iconic_spider(pages)
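
A possible explanation for the symptom in the original post (fit correct, href and brand repeating): if the three findAll loops run one after another, href and brand get overwritten until only the last value on the page is left, and only the final loop prints anything. One way around that, sketched here under the assumption that the three result lists line up one-to-one in page order, is to collect the values first and pair them with zip(). The helper name pair_products and the sample values are hypothetical; with bs4 the lists would come from the same three findAll calls used in the thread.

```python
def pair_products(hrefs, brands, fits, base_url="http://theiconic.com.au"):
    """Combine three parallel lists into one record per product."""
    products = []
    # zip() walks the lists in lockstep, so each record takes the i-th href,
    # brand and fit together; it stops at the shortest list.
    for href, brand, fit in zip(hrefs, brands, fits):
        products.append({
            "href": base_url.rstrip("/") + "/" + href.lstrip("/"),
            "brand": brand,
            "fit": fit,
        })
    return products


# Made-up sample values standing in for what the three findAll() calls collect, e.g.
# hrefs = [a.get('href') for a in soup.findAll('a', {'class': 'product-details'})]
hrefs = ["/tee-one-123.html", "/tee-two-456.html"]
brands = ["Brand A", "Brand B"]
fits = ["Slim Fit Tee", "Relaxed Tee"]

for product in pair_products(hrefs, brands, fits):
    print("-----------")
    print(product["href"])
    print(product["brand"])
    print(product["fit"])
```

If the lists do not line up (for example, a product without a brand span), looping over each product's container element and looking up the three fields inside it would be the safer route.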