Python Forum
How to convert string soup to raw string?
#1
I have a webpage in the variable soup.

I try to search it with a regex but get an error:
TypeError: expected string or bytes-like object

If I set soup equal to some garbage characters with my 'target' in the middle...
soup = ("adklfjdkd 12345 xyz target dfkdkfj")
it works.

I think I need to convert the webpage soup to a "raw" string, but I'm unsure how to do this for a string that is already in a variable.

Normally I would do something like... myString = (r"bla bla bla") # I think

But if you already have the string in a variable, how do you make it a "raw" string?

I'm guessing concatenation, but an example would really be helpful as it has been a very long day :-)

Thanks for any help.
#2
A string is a string, regardless of whether it comes from a quoted literal or from a variable.
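To illustrate the point, here is a minimal sketch using only the standard library. The pattern and the NotAString class are made up for the demonstration. The r prefix only changes how a literal is typed in source code, so there is nothing to "convert" once the value is in a variable.

```python
import re

# The r prefix only affects how the literal is read from source code.
raw_literal = r"(\d+)\s+\w+\s+target"
plain_literal = "(\\d+)\\s+\\w+\\s+target"
print(raw_literal == plain_literal)  # both spellings give the same str

# A string held in a variable can be searched as-is.
page_text = "adklfjdkd 12345 xyz target dfkdkfj"
match = re.search(raw_literal, page_text)
print(match.group(1))


class NotAString:
    """Hypothetical stand-in for any object that is not str or bytes."""


# Searching a non-string object raises the TypeError from the question.
type_error_message = None
try:
    re.search(raw_literal, NotAString())
except TypeError as exc:
    type_error_message = str(exc)
print("TypeError:", type_error_message)
```

So the error is about the type of the object being searched, not about raw versus non-raw strings.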

What's more likely is that "soup" isn't a string. That particular name is used in BeautifulSoup's documentation and examples, so it's probably a BeautifulSoup object, which makes sense, since bs parses webpages. What you should do is search the soup for the specific tag you're after, and then use something like that tag's .text attribute to get its contents, which is a string.
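A rough sketch of what that could look like, assuming bs4 is installed. The HTML and the pattern are invented for the example, and the built-in "html.parser" is used only so that no extra parser is needed.

```python
import re
from bs4 import BeautifulSoup

# Made-up page content for the demonstration.
html = "<html><body><p>adklfjdkd 12345 xyz target dfkdkfj</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
print(type(soup).__name__)  # the object is not a str

pattern = re.compile(r"\d+ xyz target")

# Searching the soup object itself fails, as in the question.
search_failed = False
try:
    pattern.search(soup)
except TypeError as exc:
    search_failed = True
    print("TypeError:", exc)

# Find the tag first, then search its text, which is a plain str.
paragraph = soup.find("p")
text = paragraph.get_text()
match = pattern.search(text)
print(match.group(0))

# str(soup) gives the whole document back as markup, if that is really needed.
markup = str(soup)
print(pattern.search(markup) is not None)
```

Searching the text of one tag is usually easier to get right than running a regex over the whole markup.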
#3
Quote: I try to search it with a regex but get an error
This is your first problem. If you are using BeautifulSoup to parse the site, then you shouldn't need regex. Otherwise you don't need BeautifulSoup at all. BeautifulSoup parses the page so you can find tags, and from those tags you can get the string content. Search the BeautifulSoup website for tutorials on how to use it. We also have some tutorials here.
#4
I think nilamo is correct: it is a Beautiful Soup object and not a string.

I Googled "how to search beautiful soup with regex" and found a thread on another forum suggesting that Beautiful Soup's find_all method accepts a regex, and that the code might look something like the below

import re
soup.find_all(re.compile("(a|div)"))

Nope, problems with this too. I'll research more tonight, but if anyone knows how to search a Beautiful Soup object using a regex, let me know. In particular I'm looking for the following on a web page...

my text 1
misc html code
my text 2

If I use Chrome to copy the page source and put it into a string, I can use the regex 'search' method to return the above in three groups, with 'my text 1' in the first, the misc html code in the second, and 'my text 2' in the third. Or I can use the 'findall' method to return the three items in a tuple. But neither works with soup.

thanks for any help and some good code examples :-)
#5
(Aug-16-2017, 01:27 PM)Fran_3 Wrote: I'll research more tonight but if anyone knows how to search a beautiful soup object using regx expression let me know
In bs4 you only use regex as a helper, in rare cases.
BeautifulSoup has all of this built in with soup.find(), soup.find_all() and CSS selectors, used through soup.select("div > a").
As an example, getting "my text 1" and "my text 2" from an HTML page could be done like this.
from bs4 import BeautifulSoup

# Simulate a web page
html = '''\
<html>
  <head>
     <title>My Site</title>
  </head>
  <body>
     <title>First chapter</title>
     <p>my text 1</p>
     <p>Page2</p>
     <b id="foo">my text 2</b>
  </body>
</html>'''

soup = BeautifulSoup(html, 'lxml')
print(soup.select('body > p')[0].text)
print(soup.select('#foo')[0].text)
Output:
my text 1
my text 2
There are many more examples in Web-Scraping part-1.
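For the rare cases where a regex does help, here is a small sketch of passing a compiled pattern to find_all(). The HTML is invented and "html.parser" is assumed so nothing beyond bs4 is required. Note that a pattern given as the first argument is matched against tag names using search(), so an unanchored "(a|div)" would also match names such as "table" or "head".

```python
import re
from bs4 import BeautifulSoup

# Made-up markup loosely based on the three items asked about.
html = """
<body>
  <p>my text 1</p>
  <div>misc html code</div>
  <b id="foo">my text 2</b>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# First argument: the pattern is tested against tag names.
# Anchors keep it from also matching "body".
tags = soup.find_all(re.compile(r"^(p|b)$"))
tag_names = [tag.name for tag in tags]
print(tag_names)

# string= argument: the pattern is tested against the text inside the tags.
hits = soup.find_all(string=re.compile(r"^my text \d$"))
hit_texts = [str(hit) for hit in hits]
print(hit_texts)
```

In both calls BeautifulSoup does the tree walking and the regex only filters, which is the "helper" role described above.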
#6
Provide your full code and the URL you are getting your content from, as well as what content you want.
#7
snippsat,

Your example works for the sample HTML in your demo code, but can you give me an explanation of the syntax following soup.select in these two lines...

print(soup.select('body > p')[0].text)
print(soup.select('#foo')[0].text)

I searched for the Beautiful Soup select method but guess I missed the explanations for terms like...
('body > p')[0].text
('#foo')[0].text
Can you give me some explanation, and maybe a link to the arguments/syntax for the bs select method?
Thanks
#8
Fran_3 Wrote: I searched for the Beautiful Soup select method but guess I missed the explanations for terms like...
They are called CSS selectors. I have a demo of their usage in the tutorial I posted.
CSS Selector Reference, Beautiful Soup 4 Cheatsheet.
#9
Knowing at least the basics of HTML and CSS is mandatory if you want to parse a web page.
Because there can be many p tags in the body of the page, select() returns a list of them. [0] is the first element of that list, and .text gives you its content.
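A short sketch of those pieces taken one at a time, assuming bs4 is installed. The HTML mirrors the earlier demo, and "html.parser" is used here only to avoid needing lxml.

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <p>my text 1</p>
  <p>Page2</p>
  <b id="foo">my text 2</b>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# 'body > p' is a CSS selector: every <p> that is a direct child of <body>.
paragraphs = soup.select("body > p")
print(len(paragraphs))        # select() always returns a list-like result

# [0] picks the first match, .text returns its content as a str.
first_text = paragraphs[0].text
print(first_text)

# '#foo' is a CSS selector for the element whose id is "foo".
foo_text = soup.select("#foo")[0].text
print(foo_text)

# select_one() returns the first match directly, or None when nothing matches.
missing = soup.select_one("#missing")
print(missing)
```

When only one element is expected, select_one() avoids the [0] and makes the "nothing found" case easier to check.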