Remove escape characters / Unicode characters from string
I retrieve some JSON from a web page. It is real JSON, but due to all the backslash escape characters it doesn't want to format correctly.
There are two fixes I've thought of, although I'm not sure how to do either.

Here's a snippet of the JSON:
\"text_with_blanks\":\"<b>Tagesmen\\u00fc im Restaurant<\\\/b><br\\\/>\\u00a0Samstag, 12. August<br\\\/>\\u00a0<b>Suppen<\\\/b><br\\\/>Tomatensuppe \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0
As you can see, not only is it full of "\" characters, it's also full of Unicode escape sequences.

The first method was to remove backslashes. If I did that with .replace() it would get rid of every backslash, of course, so I need a way to get rid of only one backslash every time it encounters a backslash.

The other method is to first convert the Unicode escape sequences to actual characters, and then replace all the backslashes, but I'm not sure how I would go about converting the characters. Sadly, decode("utf-8") doesn't work.
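Maybe the 'unicode_escape' codec is what's needed here rather than 'utf-8' - I haven't tested it on the full data. A rough sketch, with raw as a made-up stand-in for my string:

# Sketch: decode literal \uXXXX escapes with the 'unicode_escape' codec.
# Note: 'unicode_escape' assumes latin-1 for non-ASCII bytes, so this is
# only safe when the input is pure ASCII (as escaped JSON usually is).
raw = "Tagesmen\\u00fc im Restaurant"  # contains a literal backslash-u sequence
decoded = raw.encode("ascii").decode("unicode_escape")
print(decoded)  # Tagesmenü im Restaurant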

What's the best way to do this?
Please show what you've tried so far.
(May-14-2020, 06:11 PM) Larz60+ Wrote: Please show what you've tried so far.

As I said, I have already tried combinations of .encode()/.decode(), with no luck.

My next idea was to have a function which loops through the string until it finds a backslash. It would then replace that backslash, and that backslash only (every time it found a backslash it would just replace a single one of them). That would mean an escaped Unicode id like \\u... would come out of the function as \u...
The issue with that is that the number of backslashes is inconsistent - there's one backslash before a quotation mark, but there are three in the HTML line breaks, meaning my function wouldn't work.
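For the record, a regex can do the "drop exactly one backslash per run" part, whatever the run length - but as I said, that still leaves \\/ in the line breaks, so it doesn't fix the inconsistency. A rough sketch (the function name is mine):

import re

# Sketch: shrink every run of backslashes by exactly one,
# so \" becomes " and a run of three becomes a run of two.
def drop_one_backslash(s: str) -> str:
    return re.sub(r"\\(\\*)", r"\1", s)

print(drop_one_backslash(r"<br\\\/>"))  # <br\\/> - still broken
print(drop_one_backslash(r"\"text\""))  # "text"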

The most simple way, I think, is this:
chars = {
    "\\u00a0": " ",  # no-break space (key is the literal backslash sequence)
    "\\u00fc": "ü",
    # ... the rest of the escapes that are likely to show up
}

for escape, char in chars.items():
    if escape in json_text:
        # str.replace returns a new string, so the result must be assigned
        json_text = json_text.replace(escape, char)
Just replace the Unicode ids with their respective characters. I would have to get all the Unicode escapes which are likely to show up - this isn't too much of an issue since I know there aren't that many.

The reason I haven't done this is that I'm not a big fan of hard-coding values, because if you hard-code and a change happens - for instance, the json changes - then nine times out of ten it's going to break your program.
This may well have to be the route I take, however.
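If I do end up going this way, a regex could at least avoid hard-coding the table by computing each character from its hex digits. A rough sketch (again, the function name is mine):

import re

# Sketch: turn every literal \uXXXX sequence into its real character
# by parsing the four hex digits, so there is no lookup table to maintain.
def expand_unicode_escapes(s: str) -> str:
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        s,
    )

print(expand_unicode_escapes(r"Tagesmen\u00fc"))  # Tagesmenü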


I'll also give a little more info on where this json comes from:
The json is stored in a JavaScript file, in a variable called 'ACTIVITY_DATA'.
When I use requests to get the whole page, I start by replacing that string:
answer_json = requests.get(src).text.replace("ACTIVITY_DATA = ", "")
The json has some weird quotes in it that make it invalid - the ones wrapping the inner list:

{"1":"[{.........}]"}

So I do some string slicing to remove them:
new_answers = answer_json[:5] + answer_json[6:-2] + answer_json[-3:-2]  # cut the stray quote characters out by index
And then that leaves me with the json I have now, but still full of escape characters.
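Thinking about it, since the value of "1" is itself a JSON-encoded string, json.loads applied twice should unescape everything for me. A rough sketch - the literal below is a made-up stand-in for the real payload:

import json

# Sketch: the payload is double-encoded - the value of "1" is a JSON
# string that itself contains JSON. Parsing twice removes every backslash
# escape and turns \uXXXX into real characters automatically.
answer_json = '{"1": "[{\\"text\\": \\"<b>Tagesmen\\\\u00fc<\\\\/b>\\"}]"}'
outer = json.loads(answer_json)      # dict whose value is still a string
inner = json.loads(outer["1"])       # list of dicts, fully unescaped
print(inner[0]["text"])              # <b>Tagesmenü</b>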

Never mind, ignore everything. Turns out I'm an idiot. The json is valid with those quotation marks. The only issue I was having was that it wouldn't format with them in there - but formatting doesn't even matter when you don't see the json anyway.
The reason I was getting errors when parsing the json was that there are some characters in it (maybe some of the no-break spaces?) that the library doesn't like when you try to parse it as json.
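In case it helps anyone else hitting the same thing: json.loads has a strict parameter, and strict=False lets raw control characters (tabs, newlines) sit inside string values without raising "Invalid control character". A rough sketch:

import json

# Sketch: strict=False makes the parser tolerate raw control characters
# inside string values instead of raising an error.
bad = '{"menu": "Suppen\tTomatensuppe"}'  # raw tab inside the value
data = json.loads(bad, strict=False)      # json.loads(bad) would raise
print(data["menu"])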
Looks like you are using text or content to get the data back.
The json would not look like that if the website gave back real json.
Example:
>>> import requests
>>> 
>>> r = requests.get('http://httpbin.org/get')
>>> r.json()
{'args': {},
 'headers': {'Accept': '*/*',
             'Accept-Encoding': 'gzip, deflate',
             'Host': 'httpbin.org',
             'User-Agent': 'python-requests/2.22.0',
             'X-Amzn-Trace-Id': 'Root=1-5ebe589e-a1eff28e599b122b81950144'},
 'origin': '46.246.118.243',
 'url': 'http://httpbin.org/get'}

# Now we can access data like this
>>> r.json()['origin']
'46.246.118.243'
This is not json; even when using text or content here, you will not see \ or \\\ like you get.
>>> r.content
(b'{\n  "args": {}, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encodi'
 b'ng": "gzip, deflate", \n    "Host": "httpbin.org", \n    "User-Agent": "py'
 b'thon-requests/2.22.0", \n    "X-Amzn-Trace-Id": "Root=1-5ebe589e-a1eff28e'
 b'599b122b81950144"\n  }, \n  "origin": "46.246.118.243", \n  "url": "http://'
 b'httpbin.org/get"\n}\n')

>>> r.text
('{\n'
 '  "args": {}, \n'
 '  "headers": {\n'
 '    "Accept": "*/*", \n'
 '    "Accept-Encoding": "gzip, deflate", \n'
 '    "Host": "httpbin.org", \n'
 '    "User-Agent": "python-requests/2.22.0", \n'
 '    "X-Amzn-Trace-Id": "Root=1-5ebe589e-a1eff28e599b122b81950144"\n'
 '  }, \n'
 '  "origin": "46.246.118.243", \n'
 '  "url": "http://httpbin.org/get"\n'
 '}\n')
(May-15-2020, 08:59 AM) snippsat Wrote: Looks like you are using text or content to get the data back. The json would not look like that if the website gave back real json. [...]

The .json() method would be very helpful, but unfortunately, as I mentioned, the json data is stored inside a JavaScript variable, so I have to get the data back as a string to be able to use .replace()
Maybe there is a better way - I could tell if I had been able to look at the source, or maybe not.
I can take a quick run at that string data, as I have cleaned up much worse stuff than this ;)
>>> s = '\"text_with_blanks\":\"<b>Tagesmen\\u00fc im Restaurant<\\\/b><br\\\/>\\u00a0Samstag, 12. August<br\\\/>\\u00a0<b>Suppen<\\\/b><br\\\/>Tomatensuppe \\u00a0 \\u00a0' 
>>> ss = s.replace('\\u00a0', '').replace('\\\\', '').strip()
>>> ss = ss.replace('\\u00fc', '\u00fc')
>>> print(ss)
"text_with_blanks":"<b>Tagesmenü im Restaurant</b><br/>Samstag, 12. August<br/><b>Suppen</b><br/>Tomatensuppe

# Now we need a parser for the HTML
>>> from bs4 import BeautifulSoup
>>>
>>> soup = BeautifulSoup(ss, 'lxml')
>>> print(soup.prettify())
<html>
 <body>
  <p>
   "text_with_blanks":"
   <b>
    Tagesmenü im Restaurant
   </b>
   <br/>
   Samstag, 12. August
   <br/>
   <b>
    Suppen
   </b>
   <br/>
   Tomatensuppe
  </p>
 </body>
</html>

>>> soup.select_one('p > b')
<b>Tagesmenü im Restaurant</b>
>>> print(soup.select_one('p > b').text)
Tagesmenü im Restaurant