XML (utf-8) question

DPaul · (This post was last modified: Mar-25-2020, 10:02 AM by DPaul.)

Hi,
After a lot of reading and searching to no avail, this is my problem:

1. I can read & write XML files using ElementTree: no problems as long as I stick to "normal" chars (ASCII < 128).
2. Normally utf-8 is the standard encoding; I tested that in IDLE with sys.getdefaultencoding() => it says "utf-8".
3. But now I try to sneak in a "French" char like so:
elementx.text = "test" works, but as soon as I do elementx.text = "testç", my Python program throws an error while building the tree:
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 4, column 16

The thing is that while building the tree in code, I can find no place where I can specify one encoding or another.
For completeness' sake, I build a tree like this:

import xml.etree.ElementTree as ET

# x, xx and xxx stand for the tag names (strings)
yx = ET.Element(x)
yxx = ET.SubElement(yx, xx)
yxxx = ET.SubElement(yxx, xxx)
yxxx.text = "testç"

UPDATE: a new search suggested that I should install "unidecode" => import unidecode
then I can

dat = unidecode.unidecode("testç")
yxxx.text = dat

No more error thrown by Python, but it changes the string into "testc". The c-cédille is gone!
So this is something, but not everything!
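
A small sketch of why transliteration should not be needed: even when ElementTree serialises to plain ASCII (the default of ET.tostring), it writes non-ASCII characters as numeric character references, so the ç survives a round trip. The tag name is a hypothetical placeholder.

```python
import xml.etree.ElementTree as ET

elem = ET.Element("field")  # hypothetical tag name
elem.text = "testç"

# Default serialisation is ASCII bytes; ç becomes the reference &#231;.
ascii_xml = ET.tostring(elem)
print(ascii_xml)

# Parsing turns the character reference back into the original character.
restored = ET.fromstring(ascii_xml).text
print(restored)
```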

Any suggestions ?
thx,
Paul

***snippsat*** · (This post was last modified: Mar-25-2020, 05:29 PM by snippsat.)

Using BS4 (Beautiful Soup 4) is better in most respects, and its Unicode support is very good.
I never use the parsers in the standard library; the same goes for urllib, where I use Requests instead.
The standard library has strong modules that sit on a more stable platform and do not need much changing,
but for parsing and HTTP work it is better to use modules that keep up with the rapid changes of the web.

doc Wrote:Beautiful Soup uses a sub-library called Unicode, Dammit to detect a document’s encoding and convert it to Unicode
When you write out a document from Beautiful Soup, you get a UTF-8 document, even if the document wasn’t in UTF-8 to begin with

As I just posted an answer here, I can reuse that code and do some Unicode stuff.

from bs4 import BeautifulSoup

xml = '''\
<provider>
  <identity>chess king♟♜♞</identity>
  <endpoint>some point.com</endpoint>
</provider>'''

soup = BeautifulSoup(xml, 'xml')

>>> result = soup.find('identity')
>>> result
<identity>chess king♟♜♞</identity>
>>> result.string.replace_with("testç")
'chess king♟♜♞'

>>> soup
<?xml version="1.0" encoding="utf-8"?>
<provider>
<identity>testç</identity>
<endpoint>some point.com</endpoint>
</provider>

>>> result = soup.find('identity')
>>> result
<identity>testç</identity>
>>> result.string = '♟♜♞'

>>> soup
<?xml version="1.0" encoding="utf-8"?>
<provider>
<identity>♟♜♞</identity>
<endpoint>some point.com</endpoint>
</provider>

DPaul · Mar-25-2020, 06:49 PM

Hi,

Thanks, your Beautiful Soup method seems very strong, especially with the chess symbols!
I've used Beautiful Soup for scraping, but this problem is about simple XML database writing and reading, no www.

After some searching and testing I found a solution, and I have drawn some conclusions:
1. Python is not so good at writing utf-8
2. I convert the text string before attaching it to the tree, like so:

s = '>testç"<@/€#ê'
s.encode('utf-8')  # note: returns a new bytes object; s itself is unchanged
xxx.text = s

3. When reading the database from file with ElementTree, the formatting is flawed when using ET.dump(...) to print.
But this works perfectly :

for field in child:
    print('Field:', field.tag,':', field.text)

4. It would seem that the Python default for reading is utf-8

5. No, I can't do chess symbols with that, I tried. :-(
Dpaul
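
A hedged sketch relating to points 2 and 5 above: str.encode returns a new bytes object and does not alter the string it is called on, so the text assigned to the element is still the original str. Under that assumption the tree can carry accented characters and chess symbols directly. The tag name is hypothetical.

```python
import xml.etree.ElementTree as ET

s = '>testç"<@/€#ê ♟♜♞'

# str.encode returns a new bytes object and leaves s untouched,
# so calling it without using the result has no effect on s.
encoded = s.encode("utf-8")

field = ET.Element("field")  # hypothetical tag name
field.text = s               # assign the str itself, not the bytes

# Serialise to UTF-8 bytes; ElementTree escapes <, > and & in text.
xml_bytes = ET.tostring(field, encoding="utf-8")

# Parse the bytes back and compare with the original string.
round_tripped = ET.fromstring(xml_bytes).text
print(round_tripped == s)
```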

***snippsat*** · (This post was last modified: Mar-25-2020, 07:59 PM by snippsat.)

(Mar-25-2020, 06:49 PM)DPaul Wrote: 1. Python is not so good at writing utf-8

This is not right; one of the biggest changes in the move to Python 3 was fixing Unicode.
The default encoding for Python 3 source code is now UTF-8.
In Python 3 all strings are stored as Unicode; for input/output from Python it may be necessary, or at least recommended, to specify the encoding as utf-8.
Example:

s = 'Crème and Spicy jalapeño ☂ ⛄日本語のキ'
with open('unicode.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(s)

with open('unicode.txt', encoding='utf8') as f:
    data = f.read()
    print(data)

Output:
Crème and Spicy jalapeño ☂ ⛄日本語のキ

Quote:3. When reading the database from file with elementtree, the formatting is flawed when using ET.dump(...) to print.
But this works perfectly :

How ElementTree handles this I do not know; as mentioned, I never use it.
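
For what it is worth, a sketch of one way to get readable output from ElementTree, assuming Python 3.9 or later where ET.indent is available. ET.dump writes the element to sys.stdout as-is, so a tree without whitespace between elements comes out on a single line; indenting it first changes that. The sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample document, parsed from a str.
root = ET.fromstring("<db><record><field>testç ♟♜♞</field></record></db>")

# ET.indent (Python 3.9+) inserts whitespace in place so that
# nested elements end up on separate, indented lines.
ET.indent(root, space="  ")

# encoding="unicode" returns a str instead of bytes.
pretty = ET.tostring(root, encoding="unicode")
print(pretty)

# ET.dump writes the same serialisation to sys.stdout (meant for debugging).
ET.dump(root)
```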

DPaul · (This post was last modified: Mar-26-2020, 07:37 AM by DPaul.)

"1. Python is not good at writing utf-8"
Wrong, I should have written: ... at writing utf-8 by default.
You have to specify the encoding at some point.
This seems not to be the case when reading utf-8 files.

It took me hours to get both right,
and the ElementTree documentation in particular does not have much specific info on that subject.
But, your comments were appreciated.
Dpaul
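
A short sketch of the distinction discussed above, offered as an illustration rather than a definitive account: sys.getdefaultencoding() reports the encoding used for str/bytes conversions, while open() without an encoding argument typically falls back to the locale's preferred encoding, which is platform dependent. ElementTree, when parsing bytes, takes the encoding from the XML declaration, which may be why reading needed no extra argument. The sample document is made up.

```python
import locale
import sys
import xml.etree.ElementTree as ET

# Encoding used by str.encode() / bytes.decode() when none is given.
print(sys.getdefaultencoding())

# Encoding that open() typically uses when none is given (platform dependent).
print(locale.getpreferredencoding(False))

# A made-up document stored as ISO-8859-1 bytes, with a matching declaration.
latin1_doc = (
    '<?xml version="1.0" encoding="iso-8859-1"?><field>testç</field>'
).encode("iso-8859-1")

# The parser reads the declaration and decodes the bytes accordingly.
text = ET.fromstring(latin1_doc).text
print(text)
```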
