Python Forum
Remove escape characters / Unicode characters from string
#1
I retrieve some JSON from a web page. It is real JSON, but because of all the backslash escape characters it won't format correctly.
There are two fixes I've thought of, although I'm not sure how to do either.

Here's a snippet of the JSON:
\"text_with_blanks\":\"<b>Tagesmen\\u00fc im Restaurant<\\\/b><br\\\/>\\u00a0Samstag, 12. August<br\\\/>\\u00a0<b>Suppen<\\\/b><br\\\/>Tomatensuppe \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0
As you can see, not only is it full of "\", it's also full of Unicode escape sequences.

The first method was to remove backslashes. If I did that with .replace() it would get rid of every backslash, of course, so I need a way to get rid of only one backslash every time it encounters one.

The other method is to first convert the Unicode escapes to actual characters and then replace all the backslashes, but I'm not sure how I would go about converting the characters. Sadly, decode("utf-8") doesn't work.

What's the best way to do this?
Reply
#2
Please show what you've tried so far.
Reply
#3
(May-14-2020, 06:11 PM)Larz60+ Wrote: Please show what you've tried so far.

As I said, I have already tried combinations of .encode()/.decode(), with no luck.

My next idea was to have a function which loops through the string until it finds a backslash. It would then remove that backslash, and that backslash only. (Every time it found a run of backslashes it would remove just one of them.) So if it was an escaped Unicode id like this: \\u... , once it had gone through the function it would become: \u...
The issue with that is that the number of backslashes is inconsistent: there's one backslash before a quotation mark, but there are three in the HTML line breaks, meaning my function wouldn't work.

The simplest way, I think, is this:
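# Map each escape sequence as it appears in the text to its character
chars = {
        "\\u00a0": " ",  # no-break space, replaced by a plain space
        "\\u00fc": "ü",
        "...": "...",
        "...": "..."
}

# str.replace returns a new string, so assign the result back
for char in chars:
    if char in json:
        json = json.replace(char, chars[char])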
Just replace the Unicode ids with their respective characters. I have to collect all the Unicode characters which are likely to show up, which isn't too much of an issue since I know there aren't that many.

The reason I haven't done this is that I'm not a big fan of hard-coding values, because if you hard-code and a change happens, for instance the json changes, then nine times out of ten it's going to break your program.
This may well have to be the route I have to take, however.
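
One way to avoid the hard-coded table might be a regular expression that converts any \uXXXX sequence, whatever the code point. This is only a sketch: the name decode_unicode_escapes is made up for illustration, the sample string is a shortened piece of the snippet above, and it does not combine surrogate pairs (characters outside the Basic Multilingual Plane).

```python
import re

# Matches a literal backslash, the letter u, and exactly four hex digits.
UNICODE_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_unicode_escapes(text):
    """Replace each literal backslash-u-XXXX sequence with its character."""
    return UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# Raw string, so the backslashes stay in the text exactly as written.
sample = r"Tagesmen\u00fc im Restaurant<br/>\u00a0Samstag"
decoded = decode_unicode_escapes(sample)
print(decoded)
```

Text without any escape sequences passes through unchanged, so the function can be run over the whole string.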


I'll also give a little more info on where this json comes from:
The json is stored in a JavaScript file, in a variable called 'ACTIVITY_DATA'.
When I use requests to get the whole page, I start by removing the assignment from the string:
answer_json = requests.get(src).text.replace("ACTIVITY_DATA = ", "")
The json has some weird quotes in it that make it invalid (the ones around the square brackets):

{"1":"[{.........}]"}

So I do some substringing to remove them:
new_answers = answer_json[:5] + answer_json[6:-2] + answer_json[-3:-2]
And then that leaves me with the json I have now, but still full of escape characters.

Never mind, ignore everything. Turns out I'm an idiot. The json is valid with those quotation marks. The only issue I was having was that it wouldn't format with those in there. But formatting doesn't even matter when you don't see the json anyway.
The reason I was getting errors when parsing the json was that there are some characters in it (maybe some of the no-break spaces?) that the library doesn't like when you try to parse it as json.
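
The varying backslash counts (one before a quote, two before a u, three before a slash) look consistent with JSON that was encoded twice: the value under "1" is itself a JSON document stored as a string. If that is the case, parsing twice should undo both layers without any manual replacing. A minimal sketch, assuming a payload shaped like the one described above (the raw string below is a hypothetical, shortened stand-in for the real data):

```python
import json

# Hypothetical payload: the value under "1" is a JSON document stored as a
# string, so its quotes and backslashes are escaped a second time.
raw = r'{"1":"[{\"text_with_blanks\":\"<b>Tagesmen\\u00fc im Restaurant<\\\/b><br\\\/>\\u00a0Samstag, 12. August\"}]"}'

outer = json.loads(raw)          # first pass: dict whose value is a str
inner = json.loads(outer["1"])   # second pass: list of dicts

text = inner[0]["text_with_blanks"]
print(text)
```

After the second pass the Unicode escapes and the escaped slashes are already decoded, so only the HTML markup is left to deal with.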
Reply
#4
It looks like you are using text or content to get the data back.
JSON does not look like that if the website gives back real JSON.
Example:
>>> import requests
>>> 
>>> r = requests.get('http://httpbin.org/get')
>>> r.json()
{'args': {},
 'headers': {'Accept': '*/*',
             'Accept-Encoding': 'gzip, deflate',
             'Host': 'httpbin.org',
             'User-Agent': 'python-requests/2.22.0',
             'X-Amzn-Trace-Id': 'Root=1-5ebe589e-a1eff28e599b122b81950144'},
 'origin': '46.246.118.243',
 'url': 'http://httpbin.org/get'}

# Now you can access the data like this
>>> r.json()['origin']
'46.246.118.243'
Without .json(), using text or content instead, you still will not see \ or \\\ like you get.
>>> r.content
(b'{\n  "args": {}, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encodi'
 b'ng": "gzip, deflate", \n    "Host": "httpbin.org", \n    "User-Agent": "py'
 b'thon-requests/2.22.0", \n    "X-Amzn-Trace-Id": "Root=1-5ebe589e-a1eff28e'
 b'599b122b81950144"\n  }, \n  "origin": "46.246.118.243", \n  "url": "http://'
 b'httpbin.org/get"\n}\n')

>>> r.text
('{\n'
 '  "args": {}, \n'
 '  "headers": {\n'
 '    "Accept": "*/*", \n'
 '    "Accept-Encoding": "gzip, deflate", \n'
 '    "Host": "httpbin.org", \n'
 '    "User-Agent": "python-requests/2.22.0", \n'
 '    "X-Amzn-Trace-Id": "Root=1-5ebe589e-a1eff28e599b122b81950144"\n'
 '  }, \n'
 '  "origin": "46.246.118.243", \n'
 '  "url": "http://httpbin.org/get"\n'
 '}\n')
Reply
#5
(May-15-2020, 08:59 AM)snippsat Wrote: It looks like you are using text or content to get the data back.
JSON does not look like that if the website gives back real JSON.

The .json() method would be very helpful, but unfortunately, as I mentioned, the json data is stored inside a JavaScript variable, so I have to get the data as a string to be able to use .replace()
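
For data assigned to a JavaScript variable, one possible approach is to find the assignment and let the json module parse the value that follows, instead of slicing the string by hand. This is a sketch under assumptions: the helper name extract_assigned_json and the sample source are invented for illustration, and the real file may use different spacing around the equals sign.

```python
import json


def extract_assigned_json(js_source, variable="ACTIVITY_DATA"):
    """Parse the JSON value assigned to `variable` in a JavaScript snippet."""
    marker = variable + " = "
    start = js_source.index(marker) + len(marker)
    # raw_decode parses one JSON value starting at `start` and ignores
    # whatever follows it, such as a trailing semicolon.
    value, _end = json.JSONDecoder().raw_decode(js_source, start)
    return value


# Hypothetical stand-in for requests.get(src).text
js_text = r'var ACTIVITY_DATA = {"1": "[{\"a\": 1}]"};'

data = extract_assigned_json(js_text)
print(data)
print(json.loads(data["1"]))
```

str.index raises ValueError if the variable is not found, which is easier to notice than a .replace() call that silently does nothing.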
Reply
#6
Maybe there is a better way if I could have looked at the source, or maybe not.
I can take a quick run at that string data, as I have cleaned up much worse stuff than this.
>>> s = '\"text_with_blanks\":\"<b>Tagesmen\\u00fc im Restaurant<\\\/b><br\\\/>\\u00a0Samstag, 12. August<br\\\/>\\u00a0<b>Suppen<\\\/b><br\\\/>Tomatensuppe \\u00a0 \\u00a0' 
>>> ss = s.replace('\\u00a0', '').replace('\\\\', '').strip()
>>> ss = ss.replace('\\u00fc', '\u00fc')
>>> print(ss)
"text_with_blanks":"<b>Tagesmenü im Restaurant</b><br/>Samstag, 12. August<br/><b>Suppen</b><br/>Tomatensuppe

# Now we need a parser
>>> from bs4 import BeautifulSoup
>>>
>>> soup = BeautifulSoup(ss, 'lxml')
>>> print(soup.prettify())
<html>
 <body>
  <p>
   "text_with_blanks":"
   <b>
    Tagesmenü im Restaurant
   </b>
   <br/>
   Samstag, 12. August
   <br/>
   <b>
    Suppen
   </b>
   <br/>
   Tomatensuppe
  </p>
 </body>
</html>

>>> soup.select_one('p > b')
<b>Tagesmenü im Restaurant</b>
>>> print(soup.select_one('p > b').text)
Tagesmenü im Restaurant
Reply
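
If installing BeautifulSoup and lxml is not an option, the standard library's html.parser can collect the text pieces in a similar way. A rough sketch, assuming the escapes have already been decoded; the class name TextCollector is made up for illustration and the fragment is taken from the snippet in post #1.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text between tags, one entry per non-empty chunk."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Strip ordinary whitespace and no-break spaces around each chunk.
        cleaned = data.strip(" \t\r\n\u00a0")
        if cleaned:
            self.chunks.append(cleaned)


fragment = (
    "<b>Tagesmen\u00fc im Restaurant</b><br/>\u00a0Samstag, 12. August"
    "<br/>\u00a0<b>Suppen</b><br/>Tomatensuppe \u00a0 \u00a0"
)

collector = TextCollector()
collector.feed(fragment)
collector.close()  # flush any text still buffered at the end
print(collector.chunks)
```

This only gathers text and drops the tag structure, so BeautifulSoup remains the better fit when selecting elements such as 'p > b'.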

