Python Forum
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Web page Extractor
#1
Hi All,

I am trying to read the web page from the given URL and search for a particular text in that page.
I am using beautifulsoup to achieve this.

code snippet:

def count_words(url, the_word):
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    #words = soup.find(text=lambda text: text and the_word in text)
    words = soup.find(text=lambda text: text and the_word in text)
    print(words)
    return len(words)
 
 
def main():
    url = input("Enter URL Link")
    #url = 'https://en.wikipedia.org/wiki/Page'
    word = input("Word to Count:")
    #word = 'Page'
    count = count_words(url, word)
    print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, word))
But it is not counting exact count of the given word.
It looks like it is trying to read the words inside the HTML source as well.
I need to get the content from Webpage Alone.


Can anyone help me on this.

Thanks,
Reply
#2
>>> text = """Python is an interpreted, interactive, object-oriented programming language. It incorporates modules, exceptions, dynamic typing, very high level dynamic data types, and classes. Python combines remarkable power with very cle
ar syntax. It has interfaces to many system calls and libraries, as well as to various window systems, and is extensible in C or C++. It is also usable as an extension language for applications that need a programmable interface. Finally, 
Python is portable: it runs on many Unix variants, on the Mac, and on Windows 2000 and later."""
>>> text.count('Python')
3
>>> 
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  use Xpath in Python :: libxml2 for a page-to-page skip-setting apollo 2 3,578 Mar-19-2020, 06:13 PM
Last Post: apollo

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020