Python Forum
find chars after chars in 'pre' tag
#1
Given text between two 'pre' tags, as shown in the example code...

I want to search for specified text... ABCD in the example
while ignoring any preceding text on the line... "XYZ57 " in the example
then put the six digits afterwards into a string... "141800" in the example

and also... how do I put the entire line of text in a string... "XYZ57 ABCD 141800" so I can manage it with the Python string methods?

thanks for any help.


html = '''\
<html>
<body>

<pre>
Text in a pre element
is displayed in a fixed-width
XYZ57 ABCD 141800
font, and it preserves
both      spaces and
line breaks
</pre>

</body>
</html>'''

import bs4 as bs
soup = bs.BeautifulSoup(html,'lxml')
y = soup.find('pre')
print (y)

print ("\ndebug break 1\n")

# goal = find ABCD ###### and ignore the
# first 5 characters and space at line beginning
# as they may not always be the same

y=str(y)
print (type(y))
if "ABCD" in y:
    print ('found it')
    
# OK ABCD is in the string now how do I capture the
# six digits after it? 141800 in this example
Reply
#2
You need to convert it from a BeautifulSoup object to a string. There isn't much more BeautifulSoup can do after acquiring the pre tag content. Most of the time there are more nested tags to narrow the search for the needle. You would get the text as a string via
print(y.text)
Then you just have to use Python string methods to split it out:
print(y.text.split())
Output:
['Text', 'in', 'a', 'pre', 'element', 'is', 'displayed', 'in', 'a', 'fixed-width', 'XYZ57', 'ABCD', '141800', 'font,', 'and', 'it', 'preserves', 'both', 'spaces', 'and', 'line', 'breaks']
If you are looking for ABCD and it's always the 12th element, you can grab it that way. Is ABCD always static, or does that change too? What are a few different outcomes, so we can work out the proper target?
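If the position is not fixed, one option is to look up the marker in the split list and take the token right after it. This is only a rough sketch: the helper name token_after is made up for illustration, and it assumes the marker and the digits are separated by whitespace.

```python
text = """Text in a pre element
is displayed in a fixed-width
XYZ57 ABCD 141800
font, and it preserves
both      spaces and
line breaks
"""


def token_after(text, marker):
    """Return the whitespace-separated token that follows marker, or None."""
    tokens = text.split()
    if marker in tokens:
        i = tokens.index(marker)
        # Guard against the marker being the very last token
        if i + 1 < len(tokens):
            return tokens[i + 1]
    return None


print(token_after(text, "ABCD"))
```

This does not depend on ABCD being the 12th element, only on the digits being the next token.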
Reply
#3
You never call .text. If you do, the text inside the pre tag will have a \n at the end of each line.
You can then split on \n and pick out the line you want.
from bs4 import BeautifulSoup

html = '''\
<html>
<body>

<pre>
Text in a pre element
is displayed in a fixed-width
XYZ57 ABCD 141800
font, and it preserves
both spaces and
line breaks
</pre>

</body>
</html>'''

soup = BeautifulSoup(html,'lxml')
y = soup.find('pre')
# Call text
text = y.text
my_line = ''
for line in text.split('\n'):
    if line.startswith('XYZ'):
        my_line += line

print(my_line)
Output:
XYZ57 ABCD 141800
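Since the first characters on the line may not always be XYZ, a variation is to match on the marker itself and cut the line there with str.partition. A minimal sketch, assuming the six digits come right after the marker and a space; the function name line_and_code is hypothetical.

```python
text = """Text in a pre element
is displayed in a fixed-width
XYZ57 ABCD 141800
font, and it preserves
both      spaces and
line breaks
"""


def line_and_code(text, marker="ABCD", width=6):
    """Return (whole_line, code) for the first line containing marker."""
    for line in text.splitlines():
        # partition() splits once: text before, the marker, text after
        _, found, rest = line.partition(marker)
        if found:
            code = rest.strip()[:width]
            return line.strip(), code
    return None, None


whole_line, code = line_and_code(text)
print(whole_line)
print(code)
```

Whatever comes before the marker is ignored, so the prefix can change freely.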
Reply
#4
snippsat,

What is the alternative to using 'find' inside pre tag so as to avoid \n new line ?

thanks again for your kind assistance.
Reply
#5
(Aug-19-2017, 04:27 PM)Fran_3 Wrote: What is the alternative to using 'find' inside pre tag so as to avoid \n new line ?
To pick out something more specific than whole lines inside a tag,
you can use Python string methods or regex.
Example:
import re

text = '''\
<pre>
Text in a pre element
is displayed in a fixed-width
XYZ57 ABCD 141800
font, and it preserves
both spaces and
line breaks
</pre>'''

line = re.search(r"XYZ.*", text)
print(line.group())
number = re.search(r'\d{6}', text)
print(number.group())
Output:
XYZ57 ABCD 141800
141800
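To tie the six digits to ABCD rather than to the first six digits anywhere in the text, a capture group can be used. This is a sketch under the assumption that the marker and the digits sit on the same line, separated by spaces or tabs; PATTERN and find_code are names made up for the example.

```python
import re

text = '''\
<pre>
Text in a pre element
is displayed in a fixed-width
XYZ57 ABCD 141800
font, and it preserves
both spaces and
line breaks
</pre>'''

# re.MULTILINE makes ^ and $ match at the start and end of every line
PATTERN = re.compile(r"^.*\bABCD[ \t]+(\d{6})\b.*$", re.MULTILINE)


def find_code(text):
    """Return (whole_line, six_digits) or (None, None) if nothing matches."""
    m = PATTERN.search(text)
    if m is None:
        return None, None
    return m.group(0), m.group(1)


whole_line, digits = find_code(text)
print(whole_line)
print(digits)
```

group(0) is the whole matching line and group(1) is only the digits, so one search gives both strings.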
Reply
#6
I think this gets me back to (one of) my original problems/questions...

1 - If I'm using bs to capture the contents of a pre tag... then that is a hierarchical bs object... or some such... right?
And as such I can't use regex to search it... right?

2 - Your earlier code in this thread seems to be a valid solution for dealing with \n issue when bs 'finds' pre tag contents... right?

3 - But since I invested a bunch of time in learning regex it would be nice to know that when bs does not provide an obvious (to me) way to drill down and get my target text... how do I convert the thing bs returns via using the find, find_all or select method to a string upon which a regex search will work?
Reply
#7
(Aug-19-2017, 08:30 PM)Fran_3 Wrote: 1 - If I'm using bs to capture the contents of a pre tag... then that is a hierarchical bs object... or some such... right? And as such I can't use regex to search it... right?
By calling .text you get a plain string, which can be used with Python string methods or regex.
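A small sketch of that conversion, in case it helps. Note that 'html.parser' is used here only as an assumption so the snippet runs without lxml installed; the variable names are made up for the example.

```python
import re
from bs4 import BeautifulSoup

html = """<html><body>
<pre>
Text in a pre element
XYZ57 ABCD 141800
line breaks
</pre>
</body></html>"""

# Assumption: the built-in html.parser instead of lxml, to avoid an extra install
soup = BeautifulSoup(html, "html.parser")
pre = soup.find("pre")

inner_text = pre.text   # str: only the text inside the tag
markup = str(pre)       # str: the tag itself, including <pre>...</pre>

# Either string can be searched with regex; inner_text is usually the one you want
match = re.search(r"ABCD\s+(\d{6})", inner_text)
digits = match.group(1) if match else None
print(type(pre).__name__, type(inner_text).__name__, digits)

# find_all() and select() return a list of tags, so convert each tag in turn
texts = [tag.text for tag in soup.find_all("pre")]
```

str(tag) keeps the markup while tag.text drops it, which is why .text is normally the better input for a regex search.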
(Aug-19-2017, 08:30 PM)Fran_3 Wrote: 2 - Your earlier code in this thread seems to be a valid solution for dealing with \n issue when bs 'finds' pre tag contents... right?
There is no \n issue. If there is only one line, there is no \n.
Multiple lines are separated by \n, just like any multi-line text in Python.
Fran_3 Wrote: 3 - But since I invested a bunch of time in learning regex it would be nice to know that when bs does not provide an obvious (to me) way to drill down and get my target text... how do I convert the thing bs returns via using the find, find_all or select method to a string upon which a regex search will work?
Often, the way HTML/XML is structured, there is no need to search further with regex.
If you need a more specific search then, as mentioned before, you call .text (and use your tools on that text).
For example, this is a typical layout where the text and the values are in separate tags.
from bs4 import BeautifulSoup

html = '''\
<html>
  <head>
    <meta charset="UTF-8">
    <title>Title of the document</title>
  </head>
  <body>
    <p id="calc_text">Calculation Results is</p>
    <span class="BMIScore">158</span>
  </body>
</html>
'''

soup = BeautifulSoup(html, 'lxml')
Use it:
>>> p = soup.find('p')
>>> p
<p id="calc_text">Calculation Results is</p>
>>> type(p)
<class 'bs4.element.Tag'>

>>> # Calling .text takes it out of BS as a plain string
>>> p = soup.find('p').text
>>> p
'Calculation Results is'
>>> type(p)
<class 'str'>
The value is in a separate tag.
>>> bmi = soup.select_one('.BMIScore')
>>> bmi
<span class="BMIScore">158</span>
>>> bmi.text
'158'
>>> # Or if an integer is needed
>>> int(bmi.text)
158
Reply
#8
Snippsat, this has been _very_ helpful!

Can you tell me where the documentation on the .text method is? I Googled & looked in the BS docs but couldn't find it.

Also, what is the difference between get_text and .text?

Thanks.
Reply
#9
(Aug-20-2017, 02:56 PM)Fran_3 Wrote: Can you tell me where the documentation on the .text method is? I Googled & looked in the BS docs but couldn't find it.
Also, what is the difference between get_text and .text?
There is no difference. They should have updated their docs on get_text() to mention that .text is preferred.
.text is just a @property that calls get_text().
They also have the old getText(), to keep backward compatibility.
>>> p.text
'Calculation Results is'
>>> p.get_text()
'Calculation Results is'
>>> p.getText()
'Calculation Results is'
However, get_text() also accepts keyword arguments (separator, strip, types) that change how it behaves,
if you need more control over the result.
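A quick sketch of what separator and strip do. The HTML here is made up for illustration (a nested b tag and extra spaces were added so the difference shows), and 'html.parser' is again just an assumption to avoid needing lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: nested tag plus leading/trailing spaces
html = '<p id="calc_text"> Calculation <b>Results</b> is </p>'

soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

plain = p.get_text()                             # same as p.text
stripped = p.get_text(strip=True)                # each piece stripped, joined with ''
spaced = p.get_text(separator=" ", strip=True)   # each piece stripped, joined with ' '

print(repr(plain))
print(repr(stripped))
print(repr(spaced))
```

With strip=True alone the pieces run together, so passing a separator along with it is usually what you want when the tag has nested tags.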
Reply
#10
Thanks snippsat!!!
Reply

