Python Forum

I have a webpage in the variable soup

I try to search it with a regex but get an error:
TypeError: expected string or bytes-like object

If I set soup equal to some garbage characters with my 'target' in the middle...
soup = "adklfjdkd 12345 xyz target dfkdkfj"
it works.

I think I need to convert the webpage soup to a "raw" string, but I'm unsure how to do this for a string that is already in a variable.

Normally I would do something like... myString = r"bla bla bla"  # I think

But if you already have the string in a variable, how do you make it a "raw" string?

I'm guessing concatenation, but an example would really be helpful as it has been a very long day :-)

Thanks for any help.

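A string is a string, regardless of whether it's written in quotes or stored in a variable. The r prefix only changes how backslashes in a literal are read in your source code; it doesn't create a different kind of string.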
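A minimal sketch of that point, using only the standard library. The pattern and variable names are made up for illustration; the sample text is the one from the question.

```python
import re

plain = "adklfjdkd 12345 xyz target dfkdkfj"
raw = r"adklfjdkd 12345 xyz target dfkdkfj"

# No backslashes are involved, so both literals produce equal str values.
print(plain == raw)
print(type(raw).__name__)

# The r prefix only matters when the literal itself contains backslashes:
# "\n" is one newline character, r"\n" is a backslash followed by "n".
print(len("\n"), len(r"\n"))

# A string that is "already in a variable" can be searched as-is.
text = plain
match = re.search(r"(\d+) xyz (target)", text)
print(match.groups())
```

So there is nothing to convert: once the value exists, re.search() only cares that it is a str (or bytes).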

What's more likely is that "soup" isn't a string. That particular name is used in BeautifulSoup's documentation and examples, so it's probably a BeautifulSoup object, which makes sense, since bs parses webpages. What you should do is search the soup for the specific tag you're after, and then use something like .text on that tag to get its contents, which is a string.
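Here is a rough sketch of what is probably happening, assuming soup really is a BeautifulSoup object. The HTML is invented for the demo, and 'html.parser' is used only because it ships with Python.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page
html = "<html><body><p>adklfjdkd 12345 xyz target dfkdkfj</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Passing the soup object itself to re.search() fails, because it is not a str.
got_type_error = False
try:
    re.search(r"target", soup)
except TypeError as exc:
    got_type_error = True
    print("TypeError:", exc)

# Find the tag first, then take its text, which is an ordinary string.
paragraph = soup.find("p")
text = paragraph.get_text()
print(type(text).__name__)

number = re.search(r"\d+", text).group()
print(number)
```

The same idea applies to any tag you pull out of the soup: .text or get_text() gives you a str, and that is what the re functions expect.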

Quote: I try to search it with a regex but get an error

This is your first problem here. If you are using BeautifulSoup to parse the site, then you shouldn't need to use regex. Otherwise, you don't need BeautifulSoup at all. BeautifulSoup parses the site into tags, and from those tags you can get the string content. Search the BeautifulSoup website for tutorials on how to use it. We also have some tutorials here.

I think nilamo is correct and it is a Beautiful Soup object and not a string.

I Googled "how to search beautiful soup with regex" and found a thread on another forum suggesting Beautiful Soup has a find_all method that accepts a regex, and the code might look something like the below

>>> import re
>>> soup.find_all(re.compile("(a|div)"))

Nope, problems with this too. I'll research more tonight, but if anyone knows how to search a Beautiful Soup object using a regex, let me know. In particular, I'm looking for the following on a web page...

my text 1
misc html code
my text 2

If I use Chrome to copy the page source and put it into a string, I can use the regex 'search' method to do this and return the above in three groups, with 'my text 1' in the first, the misc html code in the second, and 'my text 2' in the third, or use the 'findall' method to return the three items in a tuple. But this doesn't work with soup.

thanks for any help and some good code examples :-)
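If the regex approach on the page source is what you want, one option is to turn the soup back into a string with str(soup) first. This is only a sketch: the HTML and the pattern are hypothetical stand-ins for the real page, and 'html.parser' is used just to avoid extra installs.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page containing the three pieces described above
html = ("<html><body><p>my text 1</p>"
        "<span>misc html code</span>"
        "<p>my text 2</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

# str(soup) serialises the parsed document back to an ordinary string.
page = str(soup)

# Three groups: first text, whatever lies in between, second text.
pattern = re.compile(r"(my text 1)(.*?)(my text 2)", re.DOTALL)

match = pattern.search(page)
print(match.group(1))
print(match.group(2))
print(match.group(3))

# With several groups, findall() returns a list of tuples, one per match.
found = pattern.findall(page)
print(found)
```

Note that findall() gives a list containing one 3-tuple per match rather than a bare tuple, so the three items are in found[0].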

(Aug-16-2017, 01:27 PM) Fran_3 Wrote: I'll research more tonight, but if anyone knows how to search a Beautiful Soup object using a regex, let me know

In bs4 you only use regex as a helper in rare cases.
BeautifulSoup has all of this built in with soup.find(), soup.find_all(), and CSS selectors used through soup.select(), e.g. soup.select("div > a").
As an example, getting my text 1 and my text 2 from an HTML page could be done like this.

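from bs4 import BeautifulSoup

# Simulate a web page
html = '''\
<html>
  <head>
     <title>My Site</title>
  </head>
  <body>
     <h1>First chapter</h1>
     <p>my text 1</p>
     <p>Page2</p>
     <b id="foo">my text 2</b>
  </body>
</html>'''

# The 'lxml' parser needs the lxml package installed
soup = BeautifulSoup(html, 'lxml')
# select() returns a list of matching tags; [0] takes the first one
print(soup.select('body > p')[0].text)  # first <p> directly inside <body>
print(soup.select('#foo')[0].text)      # the tag with id="foo"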

Output:
my text 1
my text 2

There are many more examples in Web-Scraping part-1.
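For the rare cases where a regex does help inside bs4, here is a small sketch of find_all() with a compiled pattern. The HTML is invented for the demo. When a compiled regex is passed as the first argument, it is matched against tag names using search(), so an unanchored "(a|div)" also hits any tag whose name merely contains an "a".

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration
html = """
<div>
  <a href="/one">my text 1</a>
  <span>misc html code</span>
  <p>my text 2</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Unanchored: matches every tag whose name contains "a" or "div".
loose = [tag.name for tag in soup.find_all(re.compile(r"(a|div)"))]
print(loose)

# Anchored: matches only tags named exactly "a" or "div".
exact = [tag.name for tag in soup.find_all(re.compile(r"^(a|div)$"))]
print(exact)

# string= filters on the text inside tags instead of on tag names.
texts = [str(s) for s in soup.find_all(string=re.compile(r"^my text \d$"))]
print(texts)
```

The string= keyword is the spelling used in current Beautiful Soup 4 releases; very old versions called it text=.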

Provide your full code and the URL where you are getting your content, as well as what content you want.

snippsat,

Your example works for the sample HTML in your demo code, but can you give me an explanation of the syntax following soup.select in these two lines...

print(soup.select('body > p')[0].text)
print(soup.select('#foo')[0].text)

I searched for the Beautiful Soup select method but guess I missed explanations for terms like...
('body > p')[0].text
('#foo')[0].text
Can you give me some explanation and maybe a link to the arguments/syntax for the bs select method?
Thanks

Fran_3 Wrote: I searched for the Beautiful Soup select method but guess I missed explanations for terms like...

They're called CSS selectors; I have a demo of their usage in the tutorial I posted.
See also: CSS Selector Reference, Beautiful Soup 4 Cheatsheet.

Knowing at least the basics of HTML and CSS is mandatory if you want to parse a web page.
Because there can be many p tags in the body of the page, select() returns a list of them. [0] is the first element of that list, and .text gives you its content.
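To make the pieces of those two lines concrete, here is a short sketch that takes them apart step by step. It reuses a cut-down version of the sample page from earlier in the thread, and uses 'html.parser' only so nothing extra has to be installed.

```python
from bs4 import BeautifulSoup

html = """<html><body>
<p>my text 1</p>
<p>Page2</p>
<b id="foo">my text 2</b>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# 'body > p' is a CSS selector: every <p> that is a direct child of <body>.
paragraphs = soup.select("body > p")
print(len(paragraphs))      # select() always returns a list of tags

first = paragraphs[0]       # [0] picks the first tag in that list
print(first.text)           # .text is the string inside the tag

# '#foo' is a CSS selector for the element whose id attribute is "foo".
# select_one() returns the first match directly instead of a list.
foo = soup.select_one("#foo")
print(foo.text)

# When nothing matches, select() returns an empty list,
# so indexing it with [0] would raise IndexError.
missing = soup.select("#missing")
print(missing)
```

So ('body > p') and ('#foo') are just the selector strings passed to select(), and [0].text is ordinary Python list indexing followed by an attribute lookup on the tag.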

Posters, in thread order: Fran_3, nilamo, metulburr, Fran_3, snippsat, metulburr, Fran_3, snippsat, wavic