Python Forum
BS4 - Is There A More Efficient Way Of Doing This?
#1
Say I want to search for 30 keywords within each set of scraped HTML data. What would be the best way to go about it? Should I keep repeating the same re.compile and if statement I'm using?

from bs4 import BeautifulSoup
import urllib.request
import re
import pandas as pd

# COUNTER TO INCREMENT THROUGH URL_LIST
list_counter = 0

# LOAD URL LIST FROM CSV
df = pd.read_csv('example.csv')  # df = dataframe of URLs

# GET URL TOTAL FROM CSV
url_total = len(df.index) - 1  # subtract 1 because positions are zero-indexed

# MAIN LOOP TO CHECK FOR COMMENTS
while list_counter <= url_total:
    scrape = urllib.request.Request(df.iloc[list_counter, 0],
                                    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'})
    html = urllib.request.urlopen(scrape)
    soup = BeautifulSoup(html, 'lxml')

    comment_search = soup.body.find_all(string=re.compile("keyword1", re.IGNORECASE))
    if len(comment_search) > 0:
        df.iloc[list_counter, 1] = 'Keyword1 Found'

    comment_search = soup.body.find_all(string=re.compile("keyword2", re.IGNORECASE))
    if len(comment_search) > 0:
        df.iloc[list_counter, 1] = 'Keyword2 Found'

    comment_search = soup.body.find_all(string=re.compile("keyword3", re.IGNORECASE))
    if len(comment_search) > 0:
        df.iloc[list_counter, 1] = 'Keyword3 Found'

    print(list_counter)
    list_counter = list_counter + 1

df.to_csv("example2.csv")
df = pd.read_csv('example2.csv')
print(df)
#2
Have you tried putting all the keywords in a list and then running just one if statement instead of 30 separate ones? i.e.:

keywordlist = ['keyword1', 'keyword2', 'keyword3']
comment_search = soup.body.find_all(text=re.compile((keywordlist), re.IGNORECASE))
EDIT: Nvm. My suggestion didn't work.
#3
I would probably precompile them beforehand, not in the middle of your loop. Also, each of your if conditions is the same, which warrants a function that takes the soup, the compiled pattern, and the output string.
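A minimal sketch of that idea, assuming the same layout as the original code (URL in column 0, result written to column 1; the keyword names are placeholders):

```python
import re

# Compile each pattern once, before the loop, instead of on every iteration.
keywords = ['keyword1', 'keyword2', 'keyword3']
patterns = {kw: re.compile(kw, re.IGNORECASE) for kw in keywords}

def keyword_in_body(soup, pattern):
    """Return True if any text node in the page body matches the pattern."""
    return bool(soup.body.find_all(string=pattern))

# Inside the main loop, the 30 repeated if blocks collapse to:
# for kw, pattern in patterns.items():
#     if keyword_in_body(soup, pattern):
#         df.iloc[list_counter, 1] = f'{kw.capitalize()} Found'
```

This keeps the per-keyword work in one place, so adding a 31st keyword is a one-line change to the list.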
#4
I'd also recommend using regular expressions, and since you want to find all the keywords, I would search for:
re.compile(r'keyword\d', re.IGNORECASE)
The \d will match the string keyword followed by a single digit.
Regular expressions are extremely useful in this case. It's worth spending time learning them!
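For example (note that \d matches exactly one digit, so a bare keyword with no digit is not matched; use \d+ if the suffix can have more than one digit):

```python
import re

# One pattern covers every digit-suffixed keyword at once.
pattern = re.compile(r'keyword\d', re.IGNORECASE)

text = 'found Keyword1 and KEYWORD2, but not keyword alone'
print(pattern.findall(text))  # ['Keyword1', 'KEYWORD2']
```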
#5
You are not doing any scraping, just counting, so regex alone can be okay.
\b performs a whole-words-only search,
so it matches sunny but not sunnyboy.
What could break this is JavaScript mixed in with the HTML that uses these words;
then you may need to use BS first to extract just the text.
import re

html = '''\
<html>
<body>
  <div id='foo'>today is a sunny day</div>
  <div>I love when it's sunny outside</div>
  Call me sunnyboy
  <div>sunny is a cool word sunny</div>
</body>
</html>'''

r = re.compile(r'\bsunny\b|\btoday\b|\bis\b', flags=re.I)
print(r.findall(html))
print(len(r.findall(html)))
Output:
['today', 'is', 'sunny', 'sunny', 'sunny', 'is', 'sunny']
7
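If JavaScript is the concern, one sketch of the BS-first approach (assuming bs4 is installed; the sample markup is made up) is to drop script/style tags and run the regex over the extracted text only:

```python
from bs4 import BeautifulSoup
import re

html = '''<html><body>
  <script>var sunny = "not real page text";</script>
  <div>today is a sunny day</div>
</body></html>'''

soup = BeautifulSoup(html, 'html.parser')
# Remove script/style blocks so their contents are not counted.
for tag in soup(['script', 'style']):
    tag.decompose()

text = soup.get_text()
r = re.compile(r'\bsunny\b')
print(r.findall(text))  # the occurrence inside <script> is gone
```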