I found some thing new that I am not sure about
Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html), Thread: /thread-3873.html
I found some thing new that I am not sure about - Blue Dog - Jul-04-2017

I have been working with code from a tutorial and picking it apart. I got everything to work fine, then the tutorial made a few changes, and that is when the problem started. There are a few functions I have never seen, and my code is identical to the one in the tutorial. One thing I have never seen before is the function re.compile. Anyway, here is the code and the error.

    import urllib2
    from BeautifulSoup import BeautifulSoup
    import re

    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]

    url = ('https://en.wikipedia.org/wiki/List_of_American_comedy_films')

    ourUrl = opener.open(url).read()

    soup = BeautifulSoup(ourUrl)

    for link in soup.findall('a', attrs = ('href', re.compile("^/wiki/^"))):
        print link

    #body = soup.find(text="Origin").findNext('td')

    #outfile = open('savestuff_2.txt', 'w')
    #outfile.write(body.text)

Now the error:

    Traceback (most recent call last):
      File "C:/Users/renny/Desktop/python27/My_ webscraper/movie_scraper.py", line 15, in <module>
        for link in soup.findall('a', attrs = ('href', re.compile("^/wiki/^"))):
    TypeError: 'NoneType' object is not callable

I am not sure what object cannot be called. What is a NoneType object?

Thank you,
renny

RE: I found some thing new that I am not sure about - metulburr - Jul-04-2017

Change soup.findall to soup.find_all. I'm actually not sure why that fixes it, but I know that findall isn't a method that is used; it was once findAll, and is now find_all. Changing it to find_all removes the error, but you still get no match. Try this:

    for link in soup.find_all('a', href=re.compile('^(/wiki/)')):
        print link

You should also be using BeautifulSoup 4, which means changing the import from this

    from BeautifulSoup import BeautifulSoup

to this

    from bs4 import BeautifulSoup

and you should state a parser:

    soup = BeautifulSoup(html, 'html.parser')

or, if you have lxml installed:
    soup = BeautifulSoup(html, 'lxml')

RE: I found some thing new that I am not sure about - snippsat - Jul-04-2017

Also take a look at Web-Scraping part-1.

    from bs4 import BeautifulSoup
    import requests
    import re

    url = 'https://en.wikipedia.org/wiki/List_of_American_comedy_films'
    url_get = requests.get(url)
    soup = BeautifulSoup(url_get.content, 'lxml')
    for link in soup.find_all('a', href=re.compile(r"/wiki/")):
        print(link)

RE: I found some thing new that I am not sure about - Blue Dog - Jul-04-2017

Thanks, metulburr. I made the change and it did not work; I got the same error. Then I changed the import like you said, to "from bs4 import BeautifulSoup", and it worked fine.

Snippsat, that is a very nice web scraping tutorial. Part 2 is good too.

OK, I am trying to learn this stuff to make some money. Do you think a guy can make some money scraping on his own?

RE: I found some thing new that I am not sure about - metulburr - Jul-04-2017

(Jul-04-2017, 08:16 AM)Blue Dog Wrote: OK, I am trying to learn this stuff to make some money. Do you think a guy can make some money scraping on his own?

Once you figure out HTML/CSS/JavaScript and the combined use of BeautifulSoup and Selenium to bypass JavaScript, there isn't much left to stop you from automating around the web. These kinds of scripts have become my most useful ones, since they take care of such mundane tasks.

RE: I found some thing new that I am not sure about - Blue Dog - Jul-04-2017

Thanks. For anyone like me who is just picking up on this stuff, here is what re.compile does.

Quote: re.compile(pattern, flags=0)

RE: I found some thing new that I am not sure about - snippsat - Jul-04-2017

(Jul-04-2017, 05:50 PM)Blue Dog Wrote: For anyone like me who is just picking up on this stuff, here is what re.compile does.

You cannot look in the regular expression documentation to see how re.compile() works in BeautifulSoup. You must look in the BeautifulSoup docs.
re.compile() works as a helper in bs4, which is really using re.search() with it. bs4 does most of the searching in the HTML code; the regex is meant to step in as a helper in some cases.

In case you wonder: you cannot, or at least should not, use regex alone with HTML/XML. Read the best and most diabolic answer ever.

RE: I found some thing new that I am not sure about - Blue Dog - Jul-04-2017

Hey, that was good. So much stuff to learn and so little time.
renny

RE: I found some thing new that I am not sure about - ichabod801 - Jul-04-2017

(Jul-04-2017, 08:47 PM)snippsat Wrote: Read the best and most diabolic answer ever.

I would not call that the best answer ever, as it is not very informative. The best answer would give good reasons for not doing it, hopefully with examples. Then a newbie could be convinced by the logic, rather than being expected to swallow the dogma based on cool phraseology.

RE: I found some thing new that I am not sure about - snippsat - Jul-04-2017

Yes, I can agree; "coolest answer" would be a better description. I have linked to that answer a lot of times, always as a little side kick against using regex with HTML. Examples and reasons I like to give myself on this topic.
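
A note on two loose ends in the thread, offered as a sketch rather than a definitive account. First, the original error: as far as the BeautifulSoup documentation describes it, an attribute name that is not a real method is treated as a lookup for a child tag of that name, so soup.findall is read like soup.find('findall'), which returns None, and calling None raises the TypeError. The same would apply to find_all under the old BeautifulSoup 3 import, which only had findAll, and that would explain why the error persisted until the import was changed to bs4. Second, the patterns: bs4 applies a compiled regex with search(), so "^/wiki/^" can never match (the second ^ demands the start of the string after six characters have already been consumed), '^(/wiki/)' matches only hrefs that begin with /wiki/, and r"/wiki/" matches that text anywhere in the href. The block below uses only the standard library; ToyTag is a hypothetical stand-in written for this illustration, not bs4 code, and the hrefs are made-up sample values.

```python
import re

# Made-up sample hrefs of the kind a Wikipedia page contains (not scraped).
hrefs = [
    "/wiki/Airplane!",
    "/w/index.php?title=Special:Search",
    "https://en.wikipedia.org/wiki/Main_Page",
    "#cite_note-1",
]


def matching(pattern, values):
    """Return the values that the compiled pattern finds with search()."""
    regex = re.compile(pattern)
    return [value for value in values if regex.search(value)]


original = matching("^/wiki/^", hrefs)   # pattern from the first post
anchored = matching("^(/wiki/)", hrefs)  # metulburr's pattern
loose = matching(r"/wiki/", hrefs)       # snippsat's pattern

print(original)  # []
print(anchored)  # ['/wiki/Airplane!']
print(loose)     # ['/wiki/Airplane!', 'https://en.wikipedia.org/wiki/Main_Page']


class ToyTag:
    """Hypothetical stand-in for a parsed tag (NOT bs4 code).

    Unknown attribute names are treated as a search for a child tag
    with that name, which is the behaviour assumed in the note above.
    """

    def __init__(self, children):
        self.children = children

    def find(self, name):
        # Returns None when there is no child with that name.
        return self.children.get(name)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        return self.find(name)


soup = ToyTag({"title": "<title>Example</title>"})

try:
    # 'findall' is not a method, so this becomes soup.find('findall'),
    # which is None, and None is then called with ('a').
    soup.findall("a")
except TypeError as exc:
    error = str(exc)

print(error)  # 'NoneType' object is not callable
```

With real bs4 the fix is the one already given in the thread: import from bs4 and call find_all with href=re.compile(...).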