Python Forum

Full Version: Web page Extractor
Hi All,

I am trying to read a web page from a given URL and count how often a particular word appears in that page.
I am using BeautifulSoup to do this.

Code snippet:

import requests
from bs4 import BeautifulSoup


def count_words(url, the_word):
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    # find() returns only the first text node containing the_word (or None)
    words = soup.find(text=lambda text: text and the_word in text)
    print(words)
    # len() of that node is its length in characters, not a count of occurrences
    return len(words)


def main():
    url = input("Enter URL Link: ")
    #url = 'https://en.wikipedia.org/wiki/Page'
    word = input("Word to Count: ")
    #word = 'Page'
    count = count_words(url, word)
    print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, word))

However, it does not return the actual number of occurrences of the given word.
It also looks like it is reading words inside the HTML source rather than only the text shown on the page.
I need to count within the visible content of the web page alone.

Can anyone help me with this?

Thanks,
>>> text = """Python is an interpreted, interactive, object-oriented programming language. It incorporates modules, exceptions, dynamic typing, very high level dynamic data types, and classes. Python combines remarkable power with very clear syntax. It has interfaces to many system calls and libraries, as well as to various window systems, and is extensible in C or C++. It is also usable as an extension language for applications that need a programmable interface. Finally, Python is portable: it runs on many Unix variants, on the Mac, and on Windows 2000 and later."""
>>> text.count('Python')
3
>>>
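
To apply the same counting idea to a web page, the visible text has to be separated from the markup first. Below is a minimal sketch of one way to do that, assuming the HTML has already been downloaded into a string (for example `r.text` from `requests`). It uses only the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages; with BeautifulSoup, `soup.get_text()` plays a similar role. The names `VisibleTextParser` and `count_word_in_html` are made up for this example, and matching whole words with `\b` is an assumption about what "count" should mean here (plain `str.count` would also count `Pages`).

```python
import re
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self.chunks = []       # text pieces found outside skipped tags

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Tag names and attribute values never arrive here, only text content.
        if self._skip_depth == 0:
            self.chunks.append(data)


def count_word_in_html(html, the_word):
    """Count whole-word, case-sensitive occurrences of the_word in the page text."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    visible_text = " ".join(parser.chunks)
    # \b anchors the match at word boundaries, so 'Page' does not match 'Pages'.
    pattern = r"\b" + re.escape(the_word) + r"\b"
    return len(re.findall(pattern, visible_text))


# Hypothetical sample page: 'Page' also appears in a script and in an attribute.
sample = """<html><head><title>Demo</title>
<script>var page = "Page Page";</script></head>
<body><p class="Page">Page one. Another Page, and some Pages.</p></body></html>"""

print(count_word_in_html(sample, "Page"))  # prints 2
```

Only the two occurrences in the paragraph text are counted. The ones in the `<script>` block and the `class` attribute are ignored, and `Pages` is excluded by the word-boundary match. Note that text in `<title>` is still included in this sketch.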