Posts: 146
Threads: 34
Joined: Oct 2016
I have been working with this code from a tut and picking the code apart. I got everything to work fine, then in the tut they made a few changes. That is when the problem started. There are a few functions I have never seen, and my code is identical to the one in the tut. One thing I have never seen before is the function re.compile. Well, anyway, here is the code and the error.
import urllib2
from BeautifulSoup import BeautifulSoup
import re
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
url = ('https://en.wikipedia.org/wiki/List_of_American_comedy_films')
ourUrl = opener.open(url).read()
soup = BeautifulSoup(ourUrl)
for link in soup.findall('a', attrs = ('href', re.compile("^/wiki/^"))):
    print link
#body = soup.find(text="Origin").findNext('td')
#outfile = open('savestuff_2.txt', 'w')
#outfile.write(body.text)
Now the error:
Traceback (most recent call last):
File "C:/Users/renny/Desktop/python27/My_ webscraper/movie_scraper.py", line 15, in <module>
for link in soup.findall('a', attrs = ('href', re.compile("^/wiki/^"))):
TypeError: 'NoneType' object is not callable
I am not sure what object cannot be called. What is a NoneType object?
Thank you
renny
Posts: 5,150
Threads: 396
Joined: Sep 2016
Jul-04-2017, 02:26 AM
(This post was last modified: Jul-04-2017, 02:26 AM by metulburr.)
Change soup.findall to soup.find_all.
I'm actually not sure why that fixes it, but I know that findall isn't a method used... it was once findAll, but now it's find_all. Changing it to find_all removes the error, but you still have no match. Try this:
for link in soup.find_all('a', href=re.compile('^(/wiki/)')):
    print link
And you should be using BeautifulSoup 4, which changes the import from this:
from BeautifulSoup import BeautifulSoup
to this:
from bs4 import BeautifulSoup
As well, you should be stating a parser:
soup = BeautifulSoup(html, 'html.parser')
or, if you have lxml installed:
soup = BeautifulSoup(html, 'lxml')
Posts: 7,068
Threads: 122
Joined: Sep 2016
Jul-04-2017, 05:27 AM
(This post was last modified: Jul-04-2017, 05:27 AM by snippsat.)
Also take a look at Web-Scraping part-1.
from bs4 import BeautifulSoup
import requests
import re
url = 'https://en.wikipedia.org/wiki/List_of_American_comedy_films'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
for link in soup.find_all('a', href=re.compile(r"/wiki/")):
    print(link)
Posts: 146
Threads: 34
Joined: Oct 2016
Thanks Metulburr. I made the change and it did not work; I got the same error. Then I changed this like you said, "from bs4 import BeautifulSoup", and it worked fine.
Snippsat, that is a very nice web scraping tut. Part 2 is good too.
Ok, I am trying to learn this stuff to make some money. Do you think a guy can make some money scraping on his own?
Posts: 5,150
Threads: 396
Joined: Sep 2016
(Jul-04-2017, 08:16 AM)Blue Dog Wrote: Ok, I am trying to learn this stuff to make some money. Do you think a guy can make some money scraping on his own?
Once you figure out HTML/CSS/JavaScript and the combined use of BeautifulSoup and selenium to get past JavaScript, there isn't much left to stop you from automating around the web. These kinds of scripts have become my most useful ones; they do the mundane tasks for me.
Posts: 146
Threads: 34
Joined: Oct 2016
Jul-04-2017, 05:50 PM
(This post was last modified: Jul-04-2017, 08:50 PM by snippsat.)
Thanks
For anyone like me who is just picking up on this stuff, here is what re.compile does.
Quote:re.compile(pattern, flags=0)
Compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods, described below.
The expression’s behaviour can be modified by specifying a flags value. Values can be any of the following variables, combined using bitwise OR (the | operator).
The sequence
prog = re.compile(pattern)
result = prog.match(string)
is equivalent to
result = re.match(pattern, string)
but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.
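The equivalence the docs describe can be checked directly. A minimal sketch, with a made-up pattern and string in the spirit of the thread:

```python
import re

pattern = r"/wiki/\w+"
text = "/wiki/Monty_Python"

# Compile once and reuse the pattern object...
prog = re.compile(pattern)
result = prog.match(text)

# ...which is equivalent to the one-shot module-level call.
result2 = re.match(pattern, text)

print(result.group())   # /wiki/Monty_Python
print(result2.group())  # /wiki/Monty_Python
```

The compiled form pays off when the same pattern runs against many strings, such as every href in a page full of links, because the pattern is parsed only once.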
Posts: 7,068
Threads: 122
Joined: Sep 2016
(Jul-04-2017, 05:50 PM)Blue Dog Wrote: For anyone like me who is just picking up on this stuff, here is what re.compile does.
You can't look in the regular expression documentation to see how re.compile() works in BeautifulSoup; you must look in the BeautifulSoup doc.
re.compile() works as a helper in bs4, which is really using re.search() under the hood.
bs4 does most of the searching in the HTML code; the regex in bs4 is meant to step in as a helper in some cases.
If you are wondering whether you can or should use regex alone on HTML/XML, read the best and most diabolic answer ever.
Posts: 146
Threads: 34
Joined: Oct 2016
Hey, that was good, so much stuff to learn and so little time.
renny
Posts: 4,229
Threads: 97
Joined: Sep 2016
(Jul-04-2017, 08:47 PM)snippsat Wrote: Read the best and most diabolic answer ever.
I would not call that the best answer ever, as it is not very informative. The best answer would give good reasons for not doing it, hopefully with examples. Then a newbie could be convinced by the logic, rather than being expected to swallow the dogma based on cool phraseology.
Posts: 7,068
Threads: 122
Joined: Sep 2016
Yes, I can agree; "coolest answer" would be a better description. I have linked to that answer a lot of times, always as a little side kick against using regex with HTML. The examples and reasons on this topic I like to give myself.