Jun-16-2020, 08:01 PM
I have been scraping a website to check whether it uses WordPress by searching for "/wp-", and it partially works, but it also partially doesn't.
The problem is that when it searches for and counts /wp-, it gives way too many results on all the sites I am looking at. If I manually inspect https://arstechnica.com/ and search for /wp- using Ctrl+F, it brings up around 46 results.
If I use the code, it brings up 922 results.
Is there a way to stop it from bringing up so many results?
Also, is there a way to bring up only the first result of /wp-?
I would like to incorporate both approaches in future code.
Thank you very much for your help and any advice you might have on how to fix this!
#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup


def count_words(url, the_word):
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    # find() returns the first text node that contains the_word
    words = soup.find(text=lambda text: text and the_word in text)
    print(words)
    # This value is what gets reported as the count
    return len(words)


def main():
    url = 'https://arstechnica.com/'
    word = '/wp-'
    count = count_words(url, word)
    print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, word))


if __name__ == '__main__':
    main()
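A possible explanation for the large number: soup.find() returns only the first matching text node, and len() of a string is its character count, so 922 is probably the length of that one string rather than the number of matches. Below is a minimal sketch of one way to get both things asked for (the total count and only the first result) by searching the raw HTML directly. The names count_occurrences, first_occurrence, fetch_html and sample_html are made up for illustration, and the sample markup is invented so the example runs without a network connection. The number may still differ from a Ctrl+F search, since the served HTML is not necessarily what the browser searches.

```python
def count_occurrences(html, needle):
    """Count non-overlapping occurrences of needle in the raw HTML source."""
    return html.count(needle)


def first_occurrence(html, needle, context=30):
    """Return the first occurrence of needle with some surrounding
    characters for context, or None if needle does not appear."""
    index = html.find(needle)
    if index == -1:
        return None
    start = max(0, index - context)
    end = index + len(needle) + context
    return html[start:end]


def fetch_html(url):
    """Download a page and return its HTML as text.

    Hypothetical helper: requests is imported here so the rest of the
    sketch works even where requests is not installed.
    """
    import requests
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


# Invented sample markup standing in for a downloaded page.
sample_html = (
    '<link href="/wp-content/themes/a.css">'
    '<script src="/wp-includes/js/b.js"></script>'
    '<p>No match here</p>'
)

print(count_occurrences(sample_html, '/wp-'))
print(first_occurrence(sample_html, '/wp-', context=5))
```

To try it on a real page, something like count_occurrences(fetch_html('https://arstechnica.com/'), '/wp-') should work, assuming the site serves the same HTML to a script as it does to a browser.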