string parsing with re.search()

delahug · Jun-03-2020, 08:24 AM

hi,

i've been trying to parse a string using the re.search function but am running into trouble when it encounters ½ (the numeral representation of a half)...

l = re.search(r'[[]',str(viola.text),re.I).start()+1

UnicodeEncodeError: 'ascii' codec can't encode character u'\xbd' in position 1: ordinal not in range(128)

how should i proceed here or is there another way to do the string parsing?

thanks

**Gribouillis** · Jun-03-2020, 01:12 PM

Make sure you are using python 3.

***snippsat*** · Jun-03-2020, 01:35 PM

The u'' prefix does nothing in Python 3, so follow the advice above.

# Python 3.8
>>> s = u'\xbd' 
>>> s
'½'

# The u prefix can be removed; it makes no difference
>>> s = '\xbd' 
>>> s
'½'

# Python 2.7
>>> s = u'\xbd' 
>>> s
u'\xbd'
>>> s.encode()
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbd' in position 0: ordinal not in range(128)

# Try the obvious one first
>>> s.encode('utf-8')
'\xc2\xbd'
>>> print(s.encode('utf-8'))
Â½

# Make a guess
>>> print(s.encode('latin-1'))
½

One of the biggest changes in moving to Python 3 was to make Unicode handling better.
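For anyone reading this on Python 3, the same experiment can be sketched with explicit bytes. This is only an illustration and not part of the session above; the variable names are made up for the example. It also suggests why the UTF-8 bytes printed as Â½ earlier: that would happen if the console interpreted the two bytes with a Latin-1 style encoding, which is an assumption about that terminal.

```python
# Python 3: str is always Unicode, and encoding to bytes is explicit.
half = "\xbd"  # the character U+00BD (vulgar fraction one half)

utf8_bytes = half.encode("utf-8")      # two bytes: 0xC2 0xBD
latin1_bytes = half.encode("latin-1")  # one byte: 0xBD

# Decoding UTF-8 bytes with the wrong codec gives two characters
# (U+00C2 followed by U+00BD), which is the mojibake seen above.
mojibake = utf8_bytes.decode("latin-1")

# ASCII cannot represent U+00BD, which is the error from the first post.
try:
    half.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(utf8_bytes, latin1_bytes, ascii_ok)
```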

delahug · Jun-03-2020, 09:23 PM

Thanks for your help.

But I don't get where this will fit in my code?

Specifically what I am looking at is this:


<span class="rp-horseTable__pos__length">
<span>2½</span>
<span>[5½]</span>
</span>


I want what's in the square brackets, within the second nested span.
If I grab the whole lot by referencing the span class, I then run into the problem above when using re.search() on the square bracket. It's caused (apparently) by the fraction in the first span.

Can I get at the second span directly?

thanks

***snippsat*** · (This post was last modified: Jun-03-2020, 10:20 PM by snippsat.)

That is HTML, so you should not be parsing it with regex in the first place. Use a parser instead:

from bs4 import BeautifulSoup

html = '''\
<span class="rp-horseTable__pos__length">
<span>2½</span>
<span>[5½]</span>
</span>'''

soup = BeautifulSoup(html, 'lxml')

Usage:

>>> tag = soup.select_one('span > span:nth-child(2)')
>>> tag
<span>[5½]</span>
>>> tag.text
'[5½]'

So here the second span tag is found directly with a CSS selector.
After .text, the parser has done its job, so now you can use regex if you want what's inside the square brackets.

>>> import re
>>> 
>>> r = re.search(r"\[(.*)\]", tag.text)
>>> r.group(1)
'5½'
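One caveat with the pattern above: .* is greedy, so if the text ever holds more than one bracketed value, it matches from the first [ to the last ]. A possible alternative, assuming you want each bracketed value on its own, is sketched below. The helper name bracket_contents is made up for this example.

```python
import re

def bracket_contents(text):
    """Return the text inside every [...] pair, in order of appearance."""
    # [^\]]* matches anything except a closing bracket, so each match
    # stops at the first ']' instead of running on to the last one.
    return re.findall(r"\[([^\]]*)\]", text)

single = bracket_contents("2\xbd [5\xbd]")   # one bracketed value
several = bracket_contents("[1] and [2]")    # ['1', '2']

# For comparison, the greedy pattern spans both brackets at once.
greedy = re.search(r"\[(.*)\]", "[1] and [2]").group(1)  # '1] and [2'

print(several, greedy)
```

With a single bracketed value, as in the span here, both patterns give the same result.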

In larger code you may want to match on something like the class name first and then do what I posted above.
Or you can use find_all() as another approach.

>>> tag = soup.find(class_="rp-horseTable__pos__length")
>>> tag
<span class="rp-horseTable__pos__length">
<span>2½</span>
<span>[5½]</span>
</span>

>>> tag.find_all('span')
[<span>2½</span>, <span>[5½]</span>]
>>> tag.find_all('span')[1]
<span>[5½]</span>
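If BeautifulSoup is not available, roughly the same thing can be done with the standard library's html.parser. This is a different technique from the one shown above, and only a rough sketch: the class name SpanTextCollector is made up, and it assumes the values always sit in spans nested exactly one level inside an outer span, as in this snippet.

```python
from html.parser import HTMLParser

class SpanTextCollector(HTMLParser):
    """Collect the text found inside nested <span> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <span> tags are currently open
        self.texts = []   # text found inside spans at depth 2 or deeper

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span":
            self.depth -= 1

    def handle_data(self, data):
        # Skip the whitespace between tags and anything outside inner spans.
        if self.depth >= 2 and data.strip():
            self.texts.append(data.strip())

html = '''\
<span class="rp-horseTable__pos__length">
<span>2\xbd</span>
<span>[5\xbd]</span>
</span>'''

collector = SpanTextCollector()
collector.feed(html)
collector.close()
second = collector.texts[1]  # text of the second inner span
```

For real pages, BeautifulSoup is usually the more forgiving choice, since it copes with malformed markup and supports CSS selectors.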

**Gribouillis** · (This post was last modified: Jun-04-2020, 03:09 AM by Gribouillis.)

delahug Wrote:I then run into the problem above when using re.search() on the square bracket. It's caused (apparently) by the fraction in the first span.

If you are running Python 2.7, the problem is not caused by the fraction as such. It is caused by the str() call, which implicitly tries to encode the unicode string as ASCII, and the fraction character cannot be encoded that way because it is not an ASCII character. In Python 3 there would be no such problem, because str() doesn't try to encode the string.

>>> # python 2.7
>>> text = u"\xbd"
>>> str(text)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbd' in position 0: ordinal not in range(128)

You could perhaps try a unicode regex like u"[[]" and remove the call to str(), or better, switch to Python 3, because Python 2 is no longer supported.
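As a rough Python 3 sketch of the original line without the str() call: the helper name text_inside_first_bracket and the sample string are made up for the example, and the sample only stands in for whatever viola.text holds. It also guards against re.search() returning None, which the original one-liner would not survive.

```python
import re

def text_inside_first_bracket(text):
    """Return the text between the first '[' and the next ']', or None."""
    match = re.search(r"\[", text)  # text is already a str; no str() call
    if match is None:
        return None                 # no opening bracket at all
    start = match.start() + 1
    end = text.find("]", start)
    if end == -1:
        return None                 # opening bracket without a closing one
    return text[start:end]

sample = "2\xbd [5\xbd]"  # hypothetical stand-in for viola.text
inside = text_inside_first_bracket(sample)
```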

delahug · Jun-04-2020, 09:39 AM

Thank you, sir!

Taking out str() gets me going forward again...

As for Python version. I installed an Anaconda environment (is that what it's called?!) so I have the version of Python (2.something) which came with this...

**Gribouillis** · Jun-04-2020, 10:12 AM

delahug Wrote:I have the version of Python (2.something) which came with this

I think Anaconda can use Python 3. Carrying on with Python 2 exposes your code to a myriad of tiny issues like this one that simply don't exist in Python 3. I cannot stress enough how absurd it is to write code in an obsolete language.

***snippsat*** · (This post was last modified: Jun-04-2020, 10:35 AM by snippsat.)

(Jun-04-2020, 09:39 AM)delahug Wrote: I installed an Anaconda environment (is that what it's called?!)

When you install, use the Python 3.7 version of Anaconda.

To use the parsing code from my post with BeautifulSoup and lxml, there is nothing extra to install, as Anaconda comes with both pre-installed.
You can check the package list.
See also: Anaconda and other ways to run Python

delahug · Jun-04-2020, 07:02 PM

Thanks for this. Apologies that I appeared to overlook your previous reply - I didn't notice it because there was another after it.
