Python Forum

Web page Extractor - sathiyarajmca - Oct-26-2018

Hi All,

I am trying to read a web page from a given URL and search for a particular text in that page.
I am using BeautifulSoup to achieve this.

Code snippet:

import requests
from bs4 import BeautifulSoup


def count_words(url, the_word):
    # Fetch the page and parse the raw response with the lxml parser
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    # Look up a text node that contains the word
    words = soup.find(text=lambda text: text and the_word in text)
    print(words)
    return len(words)


def main():
    url = input("Enter URL Link")
    #url = 'https://en.wikipedia.org/wiki/Page'
    word = input("Word to Count:")
    #word = 'Page'
    count = count_words(url, word)
    print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, word))

However, it does not return the exact count of the given word.
It looks like it is also matching text inside the HTML source.
I need to count the word only in the visible content of the web page.

Can anyone help me with this?

Thanks,


RE: Web page Extractor - wavic - Oct-26-2018

>>> text = """Python is an interpreted, interactive, object-oriented programming language. It incorporates modules, exceptions, dynamic typing, very high level dynamic data types, and classes. Python combines remarkable power with very clear syntax. It has interfaces to many system calls and libraries, as well as to various window systems, and is extensible in C or C++. It is also usable as an extension language for applications that need a programmable interface. Finally, Python is portable: it runs on many Unix variants, on the Mac, and on Windows 2000 and later."""
>>> text.count('Python')
3
>>>
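
The same str.count() idea can be applied to a web page once the markup has been stripped away. The following is only a rough sketch under some assumptions: it uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages, it treats everything outside <script> and <style> as page content, and the names VisibleTextParser, count_word_in_html and sample_html are made up for illustration. In the original code, the HTML string would come from the requests response instead of a hard-coded sample.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []      # text pieces found outside skipped tags

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Attribute values (e.g. class="Page") never reach this method,
        # so only text between tags is collected.
        if self._skip_depth == 0:
            self.chunks.append(data)


def count_word_in_html(html, the_word):
    """Count occurrences of the_word in the text content of an HTML string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    visible_text = " ".join(parser.chunks)
    # str.count() counts substrings, so "Page" also matches inside "Pages".
    return visible_text.count(the_word)


# Hypothetical sample page: "Page" also appears in a script, a style rule
# and an attribute value, none of which should be counted.
sample_html = (
    "<html><head><script>var Page = 1;</script>"
    "<style>.Page { color: red; }</style></head>"
    '<body><p class="Page">A Page is one side of a leaf.</p>'
    "<p>Page two.</p></body></html>"
)

print(count_word_in_html(sample_html, "Page"))
```

Note that str.count() is case-sensitive and matches substrings; if whole-word or case-insensitive matching is needed, the text would have to be normalized or split into words first.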