Python Forum
I found something new that I am not sure about
#1
I have been working with this code from a tutorial and picking it apart. I got everything to work fine, then in the tutorial they made a few changes. That is when the problem started. There are a few functions I have never seen, and the code is identical to the one in the tutorial. One thing I have never seen before is the function re.compile. Well, anyway, here is the code and the error.

import urllib2
from BeautifulSoup import BeautifulSoup
import re


opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

url = ('https://en.wikipedia.org/wiki/List_of_American_comedy_films')

ourUrl = opener.open(url).read()

soup = BeautifulSoup(ourUrl)

for link in soup.findall('a', attrs = ('href',  re.compile("^/wiki/^"))):
   print link
  

#body = soup.find(text="Origin").findNext('td')
#outfile = open('savestuff_2.txt', 'w')
#outfile.write(body.text)
Now the error

Traceback (most recent call last):
 File "C:/Users/renny/Desktop/python27/My_ webscraper/movie_scraper.py", line 15, in <module>
   for link in soup.findall('a', attrs = ('href',  re.compile("^/wiki/^"))):
TypeError: 'NoneType' object is not callable
I am not sure what object cannot be called. What is a NoneType object?
Thank you
renny Wall
#2
Change soup.findall to
soup.find_all
I'm actually not sure why that fixes it, but I know that findall isn't a method; it was once findAll and is now find_all. Changing it to find_all removes the error, but you still get no match. Try this:
for link in soup.find_all('a', href=re.compile('^(/wiki/)')):
  print link
And you should be using BeautifulSoup 4, which changes the import from this
from BeautifulSoup import BeautifulSoup
to this
from bs4 import BeautifulSoup
You should also specify a parser:
soup = BeautifulSoup(html, 'html.parser') 
or, if you have lxml installed:
soup = BeautifulSoup(html, 'lxml') 
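Putting those pieces together, here is a minimal sketch of the original script with these changes applied (my own assembly, not from the tutorial, assuming Python 2.7 with BeautifulSoup 4 installed):

import urllib2
import re
from bs4 import BeautifulSoup  # bs4, not the old BeautifulSoup package

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
html = opener.open('https://en.wikipedia.org/wiki/List_of_American_comedy_films').read()

# Explicit parser, find_all() instead of findall, and href= instead of the attrs tuple
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a', href=re.compile('^(/wiki/)')):
    print link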
#3
Also take a look at Web-Scraping part-1.
from bs4 import BeautifulSoup
import requests
import re

url = 'https://en.wikipedia.org/wiki/List_of_American_comedy_films'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
for link in soup.find_all('a', href=re.compile(r"/wiki/")):
   print(link)
#4
Thanks Metulburr, I made the change and it did not work; I got the same error. Then I changed the import like you said, to "from bs4 import BeautifulSoup", and it worked fine.
Snippsat, that is a very nice web scraping tutorial. Part 2 is good too.

OK, I am trying to learn this stuff to make some money. Do you think a guy can make money scraping on his own?
#5
(Jul-04-2017, 08:16 AM)Blue Dog Wrote: OK, I am trying to learn this stuff to make some money. Do you think a guy can make money scraping on his own?
Once you figure out HTML/CSS/JavaScript and how to combine BeautifulSoup and Selenium to get past JavaScript, there isn't much left to stop you from automating around the web. These kinds of scripts have become my most useful ones for doing such mundane tasks.
#6
Thanks
For anyone like me who is just picking up on this stuff, here is what re.compile does.
Quote:re.compile(pattern, flags=0)

  Compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods, described below.

  The expression’s behaviour can be modified by specifying a flags value. Values can be any of the following variables, combined using bitwise OR (the | operator).

  The sequence

  prog = re.compile(pattern)
  result = prog.match(string)

  is equivalent to

  result = re.match(pattern, string)

  but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.
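A small illustration of that last point (my own example, not part of the quoted docs), compiling a pattern once and reusing it on several strings:

import re

wiki_link = re.compile(r'^/wiki/(.+)')   # compile the pattern once
# made-up sample hrefs for illustration
for href in ['/wiki/Airplane!', '/w/index.php', '/wiki/Caddyshack']:
    result = wiki_link.match(href)       # reuse the compiled object on each string
    if result:
        print(result.group(1))           # prints: Airplane!  Caddyshack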
#7
(Jul-04-2017, 05:50 PM)Blue Dog Wrote: For anyone like me who is just picking up on this stuff, here is what re.compile does
You can't just look at the regular expression documentation to see how re.compile() works in BeautifulSoup; you must look at the BeautifulSoup docs.
In bs4 a compiled pattern works as a helper, and bs4 really applies it with re.search().
bs4 does most of the searching of the HTML code itself; the regex is meant to step in as a helper in some cases.
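For example (my own sketch on a made-up snippet of HTML), because bs4 hands each href to the pattern's search() method, r'/wiki/' matches anywhere in the attribute value and no anchors are needed:

import re
from bs4 import BeautifulSoup

html = '<a href="/wiki/Duck_Soup">Duck Soup</a> <a href="/w/index.php?title=Duck_Soup">edit</a>'
soup = BeautifulSoup(html, 'html.parser')

# Only the first link matches; the second href does not contain '/wiki/'.
for link in soup.find_all('a', href=re.compile(r'/wiki/')):
    print(link['href'])    # /wiki/Duck_Soup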

If you are wondering: no, you cannot (or at least shouldn't) use regex alone with HTML/XML.
Read the best and most diabolic answer ever.
#8
Hey, that was good. So much stuff to learn and so little time.
renny
#9
(Jul-04-2017, 08:47 PM)snippsat Wrote: Read the best and most diabolic answer ever.

I would not call that the best answer ever, as it is not very informative. The best answer would give good reasons for not doing it, hopefully with examples. Then a newbie could be convinced by the logic, rather than being expected to swallow the dogma based on cool phraseology.
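For instance (my own illustration, not taken from the linked answer), a hand-written regex tends to be tied to one particular formatting of the markup, while a parser is not:

import re
from bs4 import BeautifulSoup

# Two valid ways of writing the same link; only the formatting differs (made-up samples).
samples = ['<a href="/wiki/Annie_Hall">Annie Hall</a>',
           "<a class='new' href='/wiki/Annie_Hall'>Annie Hall</a>"]

pattern = re.compile(r'<a href="(/wiki/[^"]+)">')        # naive regex: expects double quotes, href first
for html in samples:
    print(pattern.search(html))                          # matches the first, returns None for the second
    print(BeautifulSoup(html, 'html.parser').a['href'])  # the parser handles both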
Craig "Ichabod" O'Brien - xenomind.com
I wish you happiness.
Recommended Tutorials: BBCode, functions, classes, text adventures
#10
Yes, I can agree; "coolest answer" would be a better description.
I have linked to that answer a lot of times, always as a little side kick against using regex with HTML.
Examples and reasons I like to do myself on this topic. Wink

