I found some thing new that I am not sure about

I found some thing new that I am not sure about - Blue Dog - Jul-04-2017

I have been working through the code from a tutorial and picking it apart. I got everything to work fine, then the tutorial made a few changes. That is when the problems started. There are a few functions I have never seen, and my code is identical to the one in the tutorial. One thing I have never seen before is the function re.compile. Anyway, here is the code and the error.

import urllib2
from BeautifulSoup import BeautifulSoup
import re


opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

url = ('https://en.wikipedia.org/wiki/List_of_American_comedy_films')

ourUrl = opener.open(url).read()

soup = BeautifulSoup(ourUrl)

for link in soup.findall('a', attrs = ('href',  re.compile("^/wiki/^"))):
   print link
  

#body = soup.find(text="Origin").findNext('td')
#outfile = open('savestuff_2.txt', 'w')
#outfile.write(body.text)

Now the error

Traceback (most recent call last):
 File "C:/Users/renny/Desktop/python27/My_ webscraper/movie_scraper.py", line 15, in <module>
   for link in soup.findall('a', attrs = ('href',  re.compile("^/wiki/^"))):
TypeError: 'NoneType' object is not callable

I am not sure which object cannot be called. What is a NoneType object?
Thank you
renny Wall

RE: I found some thing new that I am not sure about - metulburr - Jul-04-2017

change soup.findall to

soup.find_all

I'm actually not sure why that fixes it, but I know that findall isn't a method BeautifulSoup uses; it was once findAll and is now find_all. Changing it to find_all removes the error, but you still get no match. Try this:

for link in soup.find_all('a', href=re.compile('^(/wiki/)')):
  print link
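A small sketch of why the original pattern "^/wiki/^" never matches while the corrected one does. It uses only the standard re module and Python 3 syntax; the sample href strings are made up for illustration and are not taken from the Wikipedia page. The second ^ asks for the start of the string after six characters have already been consumed, which cannot happen.

```python
import re

# Hypothetical href values, chosen only to illustrate the two patterns.
hrefs = [
    "/wiki/Airplane!",
    "/w/index.php",
    "https://en.wikipedia.org/wiki/Comedy",
]

# Original pattern: the trailing ^ demands "start of string" after
# "/wiki/" has been consumed, so it can never succeed.
broken = re.compile("^/wiki/^")

# Corrected pattern from the reply above: href must start with /wiki/.
anchored = re.compile("^(/wiki/)")

broken_hits = [h for h in hrefs if broken.search(h)]
anchored_hits = [h for h in hrefs if anchored.search(h)]

print(broken_hits)
print(anchored_hits)
```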

You should also be using BeautifulSoup 4, which changes the import from this

from BeautifulSoup import BeautifulSoup

to this

from bs4 import BeautifulSoup

You should also specify a parser

soup = BeautifulSoup(html, 'html.parser')

or, if you have lxml installed:

soup = BeautifulSoup(html, 'lxml')
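On the "why" of the original TypeError, one plausible explanation: as far as I know, BeautifulSoup treats an unknown attribute on a soup or tag object as a request for a child tag with that name. soup.findall would then look for a <findall> tag, find nothing, and return None, and calling None raises the error. The same would apply to find_all under the old BeautifulSoup 3 import, which has findAll but no find_all. The sketch below imitates that behaviour with ToyTag, a hypothetical stand-in class that is not BeautifulSoup's real implementation. It is written for Python 3.

```python
class ToyTag(object):
    """Hypothetical stand-in for a parsed document, for illustration only."""

    def __init__(self, children):
        # Maps a tag name to its markup, e.g. {"title": "<title>..</title>"}.
        self.children = children

    def find(self, name):
        # Return the child with this name, or None when there is none.
        return self.children.get(name)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so any
        # misspelled method name ends up here and becomes a tag lookup.
        return self.find(name)


soup = ToyTag({"title": "<title>Demo</title>"})

print(soup.title)    # attribute access works as a shortcut for find()
print(soup.findall)  # no such child, so this is None

try:
    soup.findall("a")  # same as calling None("a")
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```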

RE: I found some thing new that I am not sure about - snippsat - Jul-04-2017

Also take a look at Web-Scraping part-1

from bs4 import BeautifulSoup
import requests
import re

url = 'https://en.wikipedia.org/wiki/List_of_American_comedy_films'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
for link in soup.find_all('a', href=re.compile(r"/wiki/")):
   print(link)

RE: I found some thing new that I am not sure about - Blue Dog - Jul-04-2017

Thanks Metulburr. I made the change and it did not work; I got the same error. Then I changed the import like you said, "from bs4 import BeautifulSoup", and it worked fine.
Snippsat, that is a very nice web scraping tutorial. Part 2 is good too.

OK, I am trying to learn this stuff to make some money. Do you think a guy can make some money scraping on his own?

RE: I found some thing new that I am not sure about - metulburr - Jul-04-2017

(Jul-04-2017, 08:16 AM)Blue Dog Wrote: OK, I am trying to learn this stuff to make some money. Do you think a guy can make some money scraping on his own?

Once you figure out HTML/CSS/JavaScript and the combined use of BeautifulSoup and Selenium to handle JavaScript, there isn't much left to stop you from automating around the web. These kinds of scripts have become my most useful ones, since they take care of mundane tasks.

RE: I found some thing new that I am not sure about - Blue Dog - Jul-04-2017

Thanks
For anyone like me who is just picking up on this stuff, here is what re.compile does.

Quote: re.compile(pattern, flags=0)

Compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods, described below.

The expression’s behaviour can be modified by specifying a flags value. Values can be any of the following variables, combined using bitwise OR (the | operator).

The sequence

prog = re.compile(pattern)
result = prog.match(string)

is equivalent to

result = re.match(pattern, string)

but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.
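A short runnable sketch of the equivalence quoted above, using only the standard re module and Python 3 syntax. The pattern and the href strings are made-up examples, not part of the quoted documentation.

```python
import re

# Hypothetical pattern: capture the page name after a leading /wiki/.
pattern = r"^/wiki/(\w+)"

# Compile once, then reuse the compiled object for every string.
prog = re.compile(pattern)

titles = []
for href in ["/wiki/Ghostbusters", "/w/index.php", "/wiki/Caddyshack"]:
    result = prog.match(href)
    if result:  # match() returns None when the pattern does not match
        titles.append(result.group(1))

# The module-level function gives the same answer for a single string.
compiled_result = prog.match("/wiki/Ghostbusters").group(1)
direct_result = re.match(pattern, "/wiki/Ghostbusters").group(1)

print(titles)
print(compiled_result == direct_result)
```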

RE: I found some thing new that I am not sure about - snippsat - Jul-04-2017

(Jul-04-2017, 05:50 PM)Blue Dog Wrote: For anyone like me who is just picking up on this stuff, here is what re.compile does

You cannot rely on the regular expression documentation alone to see how re.compile() works in BeautifulSoup.
You must look in the BeautifulSoup docs.
There, re.compile() works as a helper, and bs4 really uses re.search() on the compiled pattern.
bs4 does most of the searching in the HTML code; the regex is meant to step in as a helper in some cases.
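To make the search() point concrete, here is a sketch of how the two methods differ on an unanchored pattern such as r"/wiki/". It uses only the re module with Python 3 syntax, and the two href strings are made up. If bs4 does apply search() as described, an unanchored pattern would also pick up absolute URLs that merely contain /wiki/.

```python
import re

pattern = re.compile(r"/wiki/")

# Hypothetical href values: one relative link, one absolute URL.
relative = "/wiki/Airplane!"
absolute = "https://en.wikipedia.org/wiki/Comedy"

# search() scans the whole string for the pattern.
search_hits = [bool(pattern.search(h)) for h in (relative, absolute)]

# match() only succeeds if the pattern matches at the very start.
match_hits = [bool(pattern.match(h)) for h in (relative, absolute)]

print(search_hits)
print(match_hits)
```

Anchoring the pattern with ^ makes search() behave like match() here, which is why the earlier reply used '^(/wiki/)'.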

If you wonder why you cannot or should not use regex alone with HTML/XML,
read the best and most diabolic answer ever.

RE: I found some thing new that I am not sure about - Blue Dog - Jul-04-2017

Hey, that was good, so much stuff to learn and so little time.
renny

RE: I found some thing new that I am not sure about - ichabod801 - Jul-04-2017

(Jul-04-2017, 08:47 PM)snippsat Wrote: Read the best and most diabolic answer ever.

I would not call that the best answer ever, as it is not very informative. The best answer would give good reasons for not doing it, hopefully with examples. Then a newbie could be convinced by the logic, rather than being expected to swallow the dogma based on cool phraseology.
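In that spirit, here is one small example of the kind of reason usually given. It is only a sketch: the HTML snippet and the naive regex are made up, and the standard library's html.parser stands in for BeautifulSoup so that it runs without extra installs (Python 3). The regex misses a link that uses single quotes and an extra attribute, and it wrongly picks up a link inside a comment; the parser handles both.

```python
import re
from html.parser import HTMLParser

# Made-up HTML: a plain link, a link with single quotes and an
# extra attribute, and a link that is commented out.
HTML = """
<a href="/wiki/Airplane!">Airplane!</a>
<a class='x' href='/wiki/Ghostbusters'>Ghostbusters</a>
<!-- <a href="/wiki/Commented_out">old link</a> -->
"""

# Naive regex approach: assumes one exact way of writing the tag.
naive = re.findall(r'<a href="([^"]*)">', HTML)


class LinkCollector(HTMLParser):
    """Collect the href of every real <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


collector = LinkCollector()
collector.feed(HTML)
collector.close()
parsed = collector.hrefs

print(naive)
print(parsed)
```

The regex could be patched for each of these cases, but every patch covers one more way of writing the same tag, while a parser already understands all of them.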

RE: I found some thing new that I am not sure about - snippsat - Jul-04-2017

Yes, I can agree; "one of the coolest answers" would be a better description.
I have linked to that answer a lot of times,
always as a little sidekick against using regex with HTML.
Examples and reasons I like to give myself on this topic ;)