Python Forum
XML (utf-8) question
#1
Hi,
After a lot of reading and searching to no avail, this is my problem:

1. I can read & write XML files using ElementTree: no problems as long as I stick to "normal" chars (ASCII < 128).
2. Normally UTF-8 is the standard encoding; I tested that in IDLE with sys.getdefaultencoding() => it says "utf-8".
3. But now I try to sneak in a "French" char like so:
elementx.text = "test" works, but as soon as I do elementx.text = "testç", my Python program throws an error while building the tree:
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 4, column 16

The thing is that while building the tree in code, I can find no place where I can specify one encoding or another.
For completeness' sake, I build the tree like this:
yx = ET.Element(x)
yxx = ET.SubElement(yx,xx)
yxxx = ET.SubElement(yxx, xxx)
yxxx.text = "testç"
UPDATE: a new search suggested that I should install "unidecode" => import unidecode
then I can do
dat = unidecode.unidecode("testç")
yxxx.text = dat
No more error thrown by Python, but it changes the string into "testc". The c-cedilla is gone!
So this is something, but not everything!
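
A note on the tree-building code above: a minimal sketch, assuming the placeholder tag names x, xx and xxx (the real ones are not shown), suggests that ElementTree accepts "testç" as-is, with no transliteration step. The encoding only comes into play when the tree is serialized, and that is where it can be specified.

```python
import xml.etree.ElementTree as ET

# Placeholder tag names; the real ones are not shown in the post.
root = ET.Element("x")
child = ET.SubElement(root, "xx")
leaf = ET.SubElement(child, "xxx")
leaf.text = "testç"  # a plain Python 3 str, no conversion needed

# Default serialization is ASCII bytes; non-ASCII becomes a character reference.
as_ascii = ET.tostring(root)

# encoding="unicode" returns a str and keeps the character as-is.
as_text = ET.tostring(root, encoding="unicode")

# encoding="utf-8" returns bytes in which ç is the two-byte sequence C3 A7.
as_utf8 = ET.tostring(root, encoding="utf-8")

print(as_ascii)
print(as_utf8)
```

All three forms describe the same document; they differ only in how the non-ASCII character is represented on output.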

Any suggestions?
thx,
Paul
Reply
#2
Use BS4 (Beautiful Soup 4); it is better in most respects and its Unicode support is very good.
I never use the parsers in the standard library; likewise I use Requests instead of urllib.
The standard library has strong modules that sit on a stable platform and do not need much changing,
but for parsing and HTTP work it is better to use modules that keep up with the rapid changes of the web.
doc Wrote: Beautiful Soup uses a sub-library called Unicode, Dammit to detect a document’s encoding and convert it to Unicode.
When you write out a document from Beautiful Soup, you get a UTF-8 document, even if the document wasn’t in UTF-8 to begin with.
As I just posted an answer here, I can reuse that code and do some Unicode stuff.
from bs4 import BeautifulSoup

xml = '''\
<provider>
  <identity>chess king♟♜♞</identity>
  <endpoint>some point.com</endpoint>
</provider>'''

soup = BeautifulSoup(xml, 'xml')
>>> result = soup.find('identity')
>>> result
<identity>chess king♟♜♞</identity>
>>> result.string.replace_with("testç")
'chess king♟♜♞'

>>> soup
<?xml version="1.0" encoding="utf-8"?>
<provider>
<identity>testç</identity>
<endpoint>some point.com</endpoint>
</provider>

>>> result = soup.find('identity')
>>> result
<identity>testç</identity>
>>> result.string = '♟♜♞'

>>> soup
<?xml version="1.0" encoding="utf-8"?>
<provider>
<identity>♟♜♞</identity>
<endpoint>some point.com</endpoint>
</provider>
Reply
#3
Hi,

Thanks, your Beautiful Soup method seems very strong, especially with the chess symbols!
I've used Beautiful Soup for scraping, but this problem is about simple XML database writing and reading, no WWW.

After some searching and testing I found a solution, and I have drawn some conclusions:
1. Python is not so good at writing utf-8
2. I convert the text string before attaching it to the tree, like so:

s = '>testç"<@/€#ê'
s.encode('utf-8')  # returns a new bytes object; s itself is left unchanged
xxx.text = s
3. When reading the database from file with elementtree, the formatting is flawed when using ET.dump(...) to print.
But this works perfectly :
for field in child:
    print('Field:', field.tag,':', field.text)
4. It would seem that the Python default for reading is UTF-8.

5. No, I can't do chess symbols with that; I tried. :-(
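
The points in the list above can be sketched with ElementTree alone. This is a minimal sketch, not the poster's actual code: the tag names and the temporary file are hypothetical, and it assumes the encoding is passed to write() rather than applied to the string. Under that assumption the chess symbols appear to survive the round trip as well.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical tag names, chosen for illustration only.
root = ET.Element("database")
record = ET.SubElement(root, "record")
ET.SubElement(record, "special").text = '>testç"<@/€#ê'
ET.SubElement(record, "chess").text = "♟♜♞"

path = os.path.join(tempfile.mkdtemp(), "database.xml")

# The encoding is specified here, at write time, not on the text itself.
ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

# ET.parse reads bytes and decodes them according to the XML declaration.
loaded = ET.parse(path).getroot()
fields = {field.tag: field.text for field in loaded.find("record")}
```

Markup characters such as < and > in the text are escaped on write and restored on read, so they need no special handling either.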
Dpaul
Reply
#4
(Mar-25-2020, 06:49 PM)DPaul Wrote: 1. Python is not so good at writing utf-8
This is not right; one of the biggest changes when moving to Python 3 was to fix Unicode.
The default encoding for Python 3 source code is now UTF-8.
In Python 3 all strings are stored as Unicode; for input/output from Python you may need, and it is recommended, to specify the encoding as utf-8.
Example.
s = 'Crème and Spicy jalapeño ☂ ⛄日本語のキ'
with open('unicode.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(s)

with open('unicode.txt', encoding='utf8') as f:
    data = f.read()
    print(data)
Output:
Crème and Spicy jalapeño ☂ ⛄日本語のキ
Quote:3. When reading the database from file with elementtree, the formatting is flawed when using ET.dump(...) to print.
But this works perfectly :
How ElementTree handles this I do not know; as mentioned, I never use it.
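
On the ET.dump(...) formatting point: as far as I can tell, ET.dump writes the element to standard output without adding any indentation, so a tree built in code comes out on a single line. A minimal sketch, assuming Python 3.9 or later for ET.indent and hypothetical tag names:

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names, for illustration only.
root = ET.Element("database")
record = ET.SubElement(root, "record")
ET.SubElement(record, "name").text = "testç"

# Without indentation the whole tree serializes onto a single line.
flat = ET.tostring(root, encoding="unicode")

# ET.indent (Python 3.9+) inserts whitespace in place so the output is readable.
if hasattr(ET, "indent"):
    ET.indent(root, space="  ")
pretty = ET.tostring(root, encoding="unicode")
```

The text content is the same in both strings; only the whitespace between elements differs.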
Reply
#5
"1. Python is not good at writing utf-8"
Wrong, I should have written: ... at writing in UTF-8 by default.
You have to specify the encoding at some point.
This seems not to be the case when reading UTF-8 files.
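
The difference between writing and reading can be sketched as follows. sys.getdefaultencoding() is not what open() uses when no encoding is passed; that fallback typically comes from the locale and varies by platform. The second half is only an assumption about what may have produced the ParseError in post #1, since the thread does not show how the file was written: bytes in a legacy encoding such as cp1252, parsed as UTF-8.

```python
import locale
import sys
import xml.etree.ElementTree as ET

# Encoding used for implicit str/bytes conversions: "utf-8" in Python 3.
internal_default = sys.getdefaultencoding()

# Encoding open() typically falls back to when none is passed;
# this depends on the platform and locale settings.
file_default = locale.getpreferredencoding(False)

# The same text as bytes in two encodings, with no XML declaration.
good_bytes = "<a>testç</a>".encode("utf-8")
bad_bytes = "<a>testç</a>".encode("cp1252")  # assumed legacy encoding

# Without a declaration the parser assumes UTF-8, so only the first parses.
parsed_text = ET.fromstring(good_bytes).text

try:
    ET.fromstring(bad_bytes)
    error_message = None
except ET.ParseError as exc:
    error_message = str(exc)

print(internal_default, file_default)
print(error_message)
```

Reading works without an explicit encoding because the XML parser works from the bytes and the declaration, while a file written through open() without encoding= depends on the platform default.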

It took me hours to get both right,
and the ElementTree documentation in particular does not have much specific info on that subject.
But your comments were appreciated.
Dpaul
Reply

