Python Forum

Full Version: BS4 Not Able To Find Text In CSS Comments
Random example:

import requests
from bs4 import BeautifulSoup
import re

scrape = requests.get('http://www.seacoastonline.com/news/20171113/lets-not-let-politics-divide-us', headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"})
html = scrape.content
soup = BeautifulSoup(html, 'html.parser')

'''

if you search manually in the source (soup) you can find the string "houzz page"

but when I use find_all, it returns nothing

'''

# print(soup)

comment_search = soup.body.find_all(string=re.compile("houzz page", re.IGNORECASE))
if len(comment_search) > 0:
    print("houzz found")
else:
    print("houzz not found")
Also, is my technique OK for checking the results (if len > 0)?
Try this:
soup = BeautifulSoup(html, 'lxml')
styles = soup.find_all('style')
for style in styles:
    # filter out what you want here
    print (style.text, style.next_sibling)
(Feb-27-2018, 12:20 AM)Larz60+ Wrote: [ -> ]try:
soup = BeautifulSoup(html, 'lxml')
styles = soup.find_all('style')
for style in styles:
    # filter out what you want here
    print (style.text, style.next_sibling)

I can't get it to work. I always get this error:

"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

styles = soup.find_all('style')

for style in styles.find_all(string=re.compile("houzz", re.IGNORECASE)):
    print(style.text)
styles = soup.find_all('style')

for style in styles:
    styles.find_all(string=re.compile("houzz", re.IGNORECASE))
    print(style.text)
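For anyone hitting the same error: find_all() returns a ResultSet, which behaves like a list of tags and has no find_all() of its own, so the search has to run on each tag inside the loop. A minimal sketch, assuming a small inline HTML sample instead of the real page (sample_html and styles_mentioning are made-up names for illustration, not part of BeautifulSoup):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
sample_html = """
<html><head>
<style>/* Houzz page */ .a { color: red; }</style>
<style>.b { color: blue; }</style>
</head><body><p>Hello</p></body></html>
"""

def styles_mentioning(html, keyword):
    """Return the CSS text of every <style> tag containing keyword (case-insensitive)."""
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    matches = []
    # Loop over the ResultSet and inspect each Tag, not the ResultSet itself.
    for style in soup.find_all("style"):
        css = style.string or ""  # .string is None for an empty <style> tag
        if pattern.search(css):
            matches.append(str(css))
    return matches

print(styles_mentioning(sample_html, "houzz"))
```

Note that the sample puts the style tags in the head; if the real page does the same, a search that starts from soup.body would never reach them.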
Why are you trying to parse CSS comments?
You can do it like this: first find the style tag, then write a regex for the comments.
import requests
from bs4 import BeautifulSoup
import re
from pprint import pprint

scrape = requests.get('http://www.seacoastonline.com/news/20171113/lets-not-let-politics-divide-us')
html = scrape.content
soup = BeautifulSoup(html, 'lxml')
style = soup.find('style')
css_comments = re.findall(r'\/\*(.*)\*\/', str(style))
pprint(css_comments)
Output:
['houzz page', 'legacy-header', '==== ARTICLE ======', 'story strip article ad', ' cssUpdates branch', ' cssUpdates branch', ' Buzz widget ', ' TERMS OF SERVICE LINK - under viafoura comments submit button ', ' TOUT MID ARTICLE PLAYER ', ' MOBILE article story stack ', ' margin: 0 3vw 0 0; ']
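One caveat on the pattern above: (.*) is greedy and the dot does not match newlines by default, so it relies on each comment sitting alone on a single line. A hedged variant, assuming comments may span several lines or share a line with other comments (extract_css_comments and sample_css are illustrative names, and the sample is not taken from the real page):

```python
import re

# Non-greedy body plus DOTALL, so "." also matches newlines.
CSS_COMMENT = re.compile(r"/\*(.*?)\*/", re.DOTALL)

def extract_css_comments(css):
    """Return the stripped text of every /* ... */ comment found in css."""
    return [body.strip() for body in CSS_COMMENT.findall(css)]

# Hypothetical stylesheet text used only for this example.
sample_css = """
/* houzz page */ .a { color: red; } /* legacy-header */
/* multi
   line */
.b { margin: 0; }
"""

print(extract_css_comments(sample_css))
```

This is still a plain regex, not a CSS parser, so a "/*" inside a quoted string in the stylesheet would be misread as a comment.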
(Feb-27-2018, 03:16 AM)snippsat Wrote: [ -> ]Why are you trying to parse CSS comments?

Thanks! This method works for me.

I should have explained what I'm trying to do better in my original post (my bad). I just need to scan the entire source code, including CSS comments, for a keyword (in this case "houzz"), and take an action if it exists.

I had a script that was working for lots of keywords, but since this specific keyword is located in CSS comments it didn't work.

Here's the working code if anyone comes across this thread and needs it:

import requests
from bs4 import BeautifulSoup
import re

scrape = requests.get('http://www.seacoastonline.com/news/20171113/lets-not-let-politics-divide-us', headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"})
html = scrape.content
soup = BeautifulSoup(html, 'lxml')

css_comments = re.findall("houzz", str(soup))

if len(css_comments) > 0:
    print("houzz keyword found")
else:
    print("houzz keyword not found")
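
Since the goal is only to know whether the keyword appears anywhere in the source, the check can also run on the page text directly, without parsing it first. A minimal sketch, assuming the page source is already a string (for example scrape.text from requests); contains_keyword and sample_source are made-up names for illustration. It ignores case, like the first attempt in this thread, and treats the keyword as literal text:

```python
import re

def contains_keyword(page_source, keyword):
    """Return True if keyword occurs anywhere in page_source, ignoring case."""
    # re.escape makes the keyword match literally rather than as a regex.
    return re.search(re.escape(keyword), page_source, re.IGNORECASE) is not None

# Hypothetical page source standing in for scrape.text.
sample_source = "<html><head><style>/* Houzz page */</style></head><body></body></html>"

if contains_keyword(sample_source, "houzz"):
    print("houzz keyword found")
else:
    print("houzz keyword not found")
```

On the earlier question about len(...) > 0: an empty list is falsy in Python, so "if css_comments:" behaves the same as "if len(css_comments) > 0:".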