Web page Extractor

sathiyarajmca · (This post was last modified: Oct-26-2018, 11:15 AM by buran.)

Hi All,

I am trying to read a web page from a given URL and count how many times a particular word appears on that page.
I am using BeautifulSoup to do this.

Code snippet:

import requests
from bs4 import BeautifulSoup


def count_words(url, the_word):
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    # find() returns only the first text node that contains the_word
    words = soup.find(text=lambda text: text and the_word in text)
    print(words)
    # len() here is the character length of that one string
    return len(words)


def main():
    url = input("Enter URL Link")
    #url = 'https://en.wikipedia.org/wiki/Page'
    word = input("Word to Count:")
    #word = 'Page'
    count = count_words(url, word)
    print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, word))


if __name__ == '__main__':
    main()

But it does not return the exact count of the given word.
It looks like it is also reading text inside the HTML source (for example scripts and styles).
I need to count the word only in the visible content of the web page.

Can anyone help me with this?

Thanks,
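
A possible direction, offered as a sketch rather than a definitive fix: soup.find() returns only the first matching text node, so len(words) is the number of characters in that string, not a count of occurrences. Text inside script and style tags is also part of the parsed tree, which may explain the matches from the HTML source. The helper below, count_word_in_html, is a hypothetical name and not part of the original post. It assumes whole-word, case-sensitive matching is wanted, and it uses Python's built-in 'html.parser' instead of 'lxml' only to avoid the extra dependency.

```python
import re

from bs4 import BeautifulSoup


def count_word_in_html(html, the_word):
    """Count whole-word, case-sensitive occurrences of the_word in page text.

    Hypothetical helper: the name and the matching rules are assumptions.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Drop elements whose contents are not rendered as page text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    # Join all remaining text nodes, separated by spaces so that words
    # from adjacent elements do not run together.
    text = soup.get_text(separator=" ")

    # \b restricts matches to whole words; re.escape guards against
    # regex metacharacters in the user's input.
    pattern = r"\b{}\b".format(re.escape(the_word))
    return len(re.findall(pattern, text))
```

In the script above, count_words could return count_word_in_html(r.content, the_word) instead of len(words). Note that get_text() still includes the contents of the title tag, and with allow_redirects=False a redirecting URL returns the redirect response rather than the final page.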
