Python Forum
how to convert string soup to raw string ? - Printable Version


how to convert string soup to raw string ? - Fran_3 - Aug-15-2017

I have a webpage in the variable soup.

When I try to search it with a regex, I get an error:
TypeError: expected string or bytes-like object

If I set soup equal to some garbage characters with my 'target' in the middle...
soup = ("adklfjdkd 12345 xyz target dfkdkfj")
it works.

I think I need to convert the webpage soup to a "raw" string but I'm unsure how to do this for a string already in a variable.

Normally I would do something like... myString = (r"bla bla bla") # I think

But if you already have the string in a variable, how do you make it a "raw" string?

I'm guessing concatenate but an example would really be helpful as it has been a very long day :-)

Thanks for any help.


RE: how to convert string soup to raw string ? - nilamo - Aug-16-2017

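A string is a string, regardless of whether it's written in quotes or stored in a variable. The r prefix only changes how a literal is read from source code; it doesn't create a different kind of string.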
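
To illustrate, here is a minimal sketch (the variable names are made up for this example) showing that a raw literal and a normally escaped literal end up as the same string, and that a string already held in a variable can be searched without any conversion:

```python
import re

# The r prefix only affects how the literal is typed in source code.
raw_literal = r"\d+ xyz"
plain_literal = "\\d+ xyz"

# Both spellings produce exactly the same string contents.
same = raw_literal == plain_literal

# A string that already lives in a variable can be searched as-is.
page_text = "adklfjdkd 12345 xyz target dfkdkfj"
match = re.search(raw_literal, page_text)
found = match.group(0) if match else None

print(same, found)
```

So there is nothing to "convert"; if re complains about the type, the variable is probably not a str at all.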

What's more likely is that "soup" isn't a string. That particular name is used in BeautifulSoup's documentation and examples, so it's probably a BeautifulSoup object, which makes sense, since bs parses webpages. What you should do is search the soup for the specific tag you're after, and then use something like .text on that tag to get its contents, which will be a string.
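
A rough sketch of what is probably going on, assuming soup was built with bs4 (the sample HTML and variable names here are invented, and the built-in html.parser is used only so the snippet has no extra dependency):

```python
import re
from bs4 import BeautifulSoup

html = "<html><body><p>adklfjdkd 12345 xyz target dfkdkfj</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# soup is a BeautifulSoup object, not a str, so re.search rejects it.
try:
    re.search(r"target", soup)
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

# Option 1: take the text of one specific tag.
paragraph_text = soup.find("p").get_text()

# Option 2: turn the whole document back into markup.
markup = str(soup)

in_text = re.search(r"target", paragraph_text) is not None
in_markup = re.search(r"target", markup) is not None
print(error_name, in_text, in_markup)
```

Either option gives re a real string to work on; which one fits depends on whether you want the visible text or the tags as well.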


RE: how to convert string soup to raw string ? - metulburr - Aug-16-2017

Quote: I try to search it with a regex but get an error
This is your first problem here. If you are using BeautifulSoup to parse the site, then you shouldn't need to use regex. Otherwise, you don't need BeautifulSoup at all. BeautifulSoup parses the site into tags, and you can get the string content from those tags. Search the BeautifulSoup website for tutorials on how to use it. We also have some tutorials here.


RE: how to convert string soup to raw string ? - Fran_3 - Aug-16-2017

I think nilamo is correct and it is a Beautiful Soup object and not a string.

I Googled "how to search beautiful soup with regex" and found a thread on another forum suggesting Beautiful Soup has a find_all method that accepts a regex, and the code might look something like the below

import re
soup.find_all(re.compile("(a|div)"))

Nope, problems with this too. I'll research more tonight, but if anyone knows how to search a beautiful soup object using a regex, let me know. In particular I'm looking for the following on a web page...

my text 1
misc html code
my text 2

If I use Chrome to copy the page source and put it into a string, I can use the regex
'search' method to do this and return the above in three groups, with
'my text 1' in the first, misc html code in the 2nd, and 'my text 2' in the third,
or use the
'findall' method to return the three items in a tuple,
but this doesn't work with soup.

thanks for any help and some good code examples :-)
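
For reference, the regex approach described above might look like the following sketch when the page source is held in a plain string. The sample markup and the pattern are assumptions made up for illustration, not the actual page:

```python
import re

# Stand-in for page source copied from the browser into a string.
page_source = "<p>my text 1</p><div>misc html code</div><p>my text 2</p>"

# Three groups: first text, whatever lies in between, second text.
pattern = re.compile(r"(my text 1)(.*?)(my text 2)", re.DOTALL)

# search returns a match object; groups() gives the three captured parts.
match = pattern.search(page_source)
groups = match.groups() if match else ()

# findall returns a list with one tuple of groups per match.
all_matches = pattern.findall(page_source)

print(groups)
print(all_matches)
```

Note that findall gives back a list of tuples rather than a single tuple, and that both calls need a str (or bytes) as input.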


RE: how to convert string soup to raw string ? - snippsat - Aug-16-2017

(Aug-16-2017, 01:27 PM)Fran_3 Wrote: I'll research more tonight, but if anyone knows how to search a beautiful soup object using a regex, let me know
Regex is used in bs4 only as a helper in rare cases.
BeautifulSoup has all this built in with soup.find(), soup.find_all(), and CSS selectors used through soup.select("div > a").
As an example, getting my text 1 and my text 2 from an HTML page could be done like this:
from bs4 import BeautifulSoup

# Simulate a web page
html = '''\
<html>
  <head>
     <title>My Site</title>
  </head>
  <body>
     <title>First chapter</title>
     <p>my text 1</p>
     <p>Page2</p>
     <b id="foo">my text 2</b>
  </body>
</html>'''

soup = BeautifulSoup(html, 'lxml')
print(soup.select('body > p')[0].text)
print(soup.select('#foo')[0].text)
Output:
my text 1
my text 2
There are many more examples in Web-Scraping part-1.
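
For the rare cases where a regex does help inside bs4, a sketch along these lines may be useful. It assumes the same kind of sample page as above (trimmed down here) and uses the built-in html.parser instead of lxml just to avoid the extra install:

```python
import re
from bs4 import BeautifulSoup

html = """
<body>
  <p>my text 1</p>
  <p>Page2</p>
  <b id="foo">my text 2</b>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# string= filters on the text inside tags; a compiled pattern is allowed.
hits = soup.find_all(string=re.compile(r"^my text \d+$"))
texts = [str(hit) for hit in hits]

# A compiled pattern as the first argument is matched against tag names.
# Anchoring with ^ and $ keeps "b" from also matching "body".
names = [tag.name for tag in soup.find_all(re.compile(r"^(p|b)$"))]

print(texts)
print(names)
```

The pattern is applied with a search, so an unanchored pattern such as "(a|div)" would also pick up any tag whose name merely contains an "a", for example head or table.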


RE: how to convert string soup to raw string ? - metulburr - Aug-16-2017

Provide your full code and the URL where you are getting your content, as well as what content you want.


RE: how to convert string soup to raw string ? - Fran_3 - Aug-16-2017

snippsat,

your example works for the sample html in your demo code but can you give me an explanation of the syntax following soup.select in these two lines...

print(soup.select('body > p')[0].text)
print(soup.select('#foo')[0].text)

I searched for the Beautiful Soup select method but guess I missed explanations for terms like...
('body > p')[0].text)
('#foo')[0].text)
Can you give me some explanation and maybe a link to the arguments/syntax for the bs select method?
thanks


RE: how to convert string soup to raw string ? - snippsat - Aug-17-2017

Fran_3 Wrote: I searched for the Beautiful Soup select method but guess I missed explanations for terms like...
They're called CSS selectors; I have a demo of their usage in the tutorial I posted.
CSS Selector Reference, Beautiful Soup 4 Cheatsheet.


RE: how to convert string soup to raw string ? - wavic - Aug-18-2017

Knowing at least the basics of HTML and CSS is mandatory if you want to parse a web page.
Because there are many p tags in the body of the page, you get a list of them as a result. [0] is the first element of that list, and you get its content with .text.
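
Broken into steps, the two lines being asked about could be read like this. It is only a sketch on a cut-down copy of the sample page, parsed with the built-in html.parser:

```python
from bs4 import BeautifulSoup

html = """
<body>
  <p>my text 1</p>
  <p>Page2</p>
  <b id="foo">my text 2</b>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# 'body > p' is a CSS selector: every <p> that is a direct child of <body>.
# select() always returns a list, even if only one tag matches.
paragraphs = soup.select("body > p")

# [0] picks the first tag from that list; .text gives its string content.
first_text = paragraphs[0].text

# '#foo' selects the tag with id="foo"; select_one() returns the first
# match directly (or None), so no [0] is needed.
foo_text = soup.select_one("#foo").text

print(len(paragraphs), first_text, foo_text)
```

If nothing matches, select() returns an empty list and [0] raises IndexError, so checking the result before indexing is usually a good idea on real pages.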