Remove escape characters / Unicode characters from string

DreamingInsanity · May-14-2020, 02:54 PM

I retrieve some JSON from a web page. It is real JSON, but because of all the backslash escape characters it won't format correctly.
There are two fixes I've thought of, although I'm not sure how to do either.

Here's a snippet of the JSON:

\"text_with_blanks\":\"<b>Tagesmen\\u00fc im Restaurant<\\\/b><br\\\/>\\u00a0Samstag, 12. August<br\\\/>\\u00a0<b>Suppen<\\\/b><br\\\/>Tomatensuppe \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0

As you can see, not only is it full of "\", it's also full of Unicode escape sequences.

The first method was to remove backslashes. If I did that with .replace() it would get rid of every backslash of course, so I need a way to get rid of only one backslash every time it encounters a run of backslashes.

The other method is to first convert the Unicode escapes to actual characters and then replace all the backslashes, but I'm not sure how I would go about converting the characters. Sadly, decode("utf-8") doesn't work.

What's the best way to do this?
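
A minimal sketch of the second idea, assuming the fragment is a single JSON string value with only one layer of escaping. The `fragment` value is a shortened, made-up stand-in, not the real page data. The standard `json` module already understands `\u00fc`, `\/` and `\"`, so no manual replacing is needed for that layer:

```python
import json

# Hypothetical one-layer fragment: a JSON string literal, including its quotes.
fragment = r'"<b>Tagesmen\u00fc im Restaurant<\/b><br\/>\u00a0Samstag"'

# json.loads turns the escape sequences into the characters they stand for.
decoded = json.loads(fragment)
print(decoded)
```

If the text has been escaped twice, as the snippet above appears to be, one pass like this only removes one layer.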

**Larz60+** · May-14-2020, 06:11 PM

Please show what you've tried so far.

DreamingInsanity · (This post was last modified: May-15-2020, 08:43 AM by DreamingInsanity.)

(May-14-2020, 06:11 PM)Larz60+ Wrote: Please show what you've tried so far.

As I said, I have already tried combinations of .encode()/.decode(), with no luck.

My next idea was to have a function which loops through the string until it finds a backslash. It would then remove that backslash, and that backslash only (every time it found a run of backslashes it would remove just one of them). So that would mean, if it was an escaped Unicode id like this: \\u... , once it had gone through the function it would become: \u...
The issue with that is that the number of backslashes is inconsistent - there's one backslash before a quotation mark, but there's three in the html line breaks, meaning my function wouldn't work.

The simplest way, I think, is this:

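# Map each literal escape sequence (backslash, "u", four hex digits)
# to the character it stands for.
chars = {
    "\\u00a0": "\u00a0",  # no-break space
    "\\u00fc": "ü",
    "...": "...",
    "...": "..."
}

# "json" is the string holding the page text, not the json module.
for char in chars:
    if char in json:
        # str.replace returns a new string, so the result must be assigned back
        json = json.replace(char, chars[char])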

Just replace the Unicode ids with their respective characters. I have to collect all the Unicode characters which are likely to show up, but this isn't too much of an issue since I know there aren't that many.

The reason I haven't done this is that I'm not a big fan of hard-coding values, because if you hard-code and a change happens, for instance the json changes, then nine times out of ten it's going to break your program.
This may well have to be the route I have to take, however.
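
One way to avoid the hard-coded table, sketched under the assumption that the text only contains plain four-digit `\uXXXX` escapes. The helper name `decode_u_escapes` is made up for illustration:

```python
import re

def decode_u_escapes(text):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )

# Hypothetical sample with one layer of escaping.
sample = r"Tagesmen\u00fc im Restaurant<br>\u00a0Samstag"
print(decode_u_escapes(sample))
```

This does not combine surrogate pairs (characters outside the Basic Multilingual Plane written as two escapes), and with doubly escaped text the extra leading backslash would be left behind.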

I'll also give a little more info on where this json comes from:
The json is stored in a JavaScript file, in a variable called 'ACTIVITY_DATA'
When I use requests to get the whole page, I start by replacing that string:
answer_json = requests.get(src).text.replace("ACTIVITY_DATA = ", "")
The json has some weird quotes in it that make it invalid (the ones wrapped around the [ ... ] array):

{"1":"[{.........}]"}

So I do some substringing to remove them:
new_answers = answer_json[:5] + answer_json[6:-2] + answer_json[-3:-2]
And then that leaves me with the json I have now, but still full of escape characters.

Never mind, ignore everything. Turns out I'm an idiot. The json is valid with those quotation marks. The only issue I was having was that it wouldn't format with those in there. But formatting doesn't even matter when you don't see the json anyway.
The reason I was getting errors when parsing the json was that there are some characters in it (maybe some of the no-break spaces?) that the library doesn't like when you try to parse it as json.
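
A sketch of why those quotation marks are valid, assuming the value under "1" is itself a JSON document stored as a string, which would also explain the doubled backslashes. The sample data here is invented to mimic the snippet's shape:

```python
import json

# Hypothetical stand-in for the page data: inner JSON serialized to a string,
# then stored as a value in an outer JSON object.
inner = [{"text_with_blanks": "<b>Tagesmen\u00fc im Restaurant</b><br/>\u00a0Samstag"}]
raw = json.dumps({"1": json.dumps(inner)})
print(raw)  # escaped quotes and doubled backslashes, like the snippet

outer = json.loads(raw)          # first pass: the outer object
items = json.loads(outer["1"])   # second pass: the embedded JSON string
print(items[0]["text_with_blanks"])
```

Each `json.loads` call removes exactly one layer of escaping, so no backslashes need to be stripped by hand.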

***snippsat*** · May-15-2020, 08:59 AM

It looks like you are using text or content to get the data back.
JSON does not look like that if the website gives back real JSON.
Example:

>>> import requests
>>> 
>>> r = requests.get('http://httpbin.org/get')
>>> r.json()
{'args': {},
 'headers': {'Accept': '*/*',
             'Accept-Encoding': 'gzip, deflate',
             'Host': 'httpbin.org',
             'User-Agent': 'python-requests/2.22.0',
             'X-Amzn-Trace-Id': 'Root=1-5ebe589e-a1eff28e599b122b81950144'},
 'origin': '46.246.118.243',
 'url': 'http://httpbin.org/get'}

# Now you can access data like this
>>> r.json()['origin']
'46.246.118.243'

Without .json(), using text or content, you still do not see \ or \\\ like you are getting:

>>> r.content
(b'{\n  "args": {}, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encodi'
 b'ng": "gzip, deflate", \n    "Host": "httpbin.org", \n    "User-Agent": "py'
 b'thon-requests/2.22.0", \n    "X-Amzn-Trace-Id": "Root=1-5ebe589e-a1eff28e'
 b'599b122b81950144"\n  }, \n  "origin": "46.246.118.243", \n  "url": "http://'
 b'httpbin.org/get"\n}\n')

>>> r.text
('{\n'
 '  "args": {}, \n'
 '  "headers": {\n'
 '    "Accept": "*/*", \n'
 '    "Accept-Encoding": "gzip, deflate", \n'
 '    "Host": "httpbin.org", \n'
 '    "User-Agent": "python-requests/2.22.0", \n'
 '    "X-Amzn-Trace-Id": "Root=1-5ebe589e-a1eff28e599b122b81950144"\n'
 '  }, \n'
 '  "origin": "46.246.118.243", \n'
 '  "url": "http://httpbin.org/get"\n'
 '}\n')
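
A small sketch of what the output above shows, assuming a shortened body of the same shape (no network call is made here). The `\n` sequences are only how the REPL displays newlines; the string itself contains no backslashes, and parsing it gives roughly what `r.json()` would return:

```python
import json

# Shortened, hand-written stand-in for r.text from the example above.
text = '{\n  "origin": "46.246.118.243"\n}\n'

shown = repr(text)         # what the REPL echoes: newlines displayed as \n
parsed = json.loads(text)  # the parsed data

print(shown)
print(parsed["origin"])
```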

DreamingInsanity · May-15-2020, 12:55 PM

(May-15-2020, 08:59 AM)snippsat Wrote: It looks like you are using text or content to get the data back.
JSON does not look like that if the website gives back real JSON.

The .json() method would be very helpful, but unfortunately, as I mentioned, the JSON data is stored inside a JavaScript variable, so I have to get the data as a string to be able to use .replace()
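
A sketch of pulling the value out of the JavaScript text without slicing by index, assuming the file contains something like `ACTIVITY_DATA = {...};`. The `js_source` string is a made-up stand-in, since the real file is not shown:

```python
import json

# Hypothetical JavaScript source mimicking the described layout.
js_source = r'var ACTIVITY_DATA = {"1":"[{\"text_with_blanks\":\"Tagesmen\\u00fc\"}]"};'

# Keep only what follows the assignment.
_, _, tail = js_source.partition("ACTIVITY_DATA = ")

# raw_decode parses one JSON value from the start of the string and
# reports where it ended, so the trailing ";" is simply ignored.
data, end = json.JSONDecoder().raw_decode(tail)

print(data["1"])   # still a string holding the inner JSON
print(tail[end:])  # whatever followed the JSON value
```

`raw_decode` expects the JSON value to start at the first character, so any whitespace after the `=` would need stripping first.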

***snippsat*** · May-15-2020, 01:37 PM

Maybe there is a better way if I could have looked at the source, or maybe not.
I can take a quick run at that string data, as I have cleaned up much worse stuff than this.

>>> s = '\"text_with_blanks\":\"<b>Tagesmen\\u00fc im Restaurant<\\\/b><br\\\/>\\u00a0Samstag, 12. August<br\\\/>\\u00a0<b>Suppen<\\\/b><br\\\/>Tomatensuppe \\u00a0 \\u00a0' 
>>> ss = s.replace('\\u00a0', '').replace('\\\\', '').strip()
>>> ss = ss.replace('\\u00fc', '\u00fc')
>>> print(ss)
"text_with_blanks":"<b>Tagesmenü im Restaurant</b><br/>Samstag, 12. August<br/><b>Suppen</b><br/>Tomatensuppe

# Now you need a parser
>>> from bs4 import BeautifulSoup
>>>
>>> soup = BeautifulSoup(ss, 'lxml')
>>> print(soup.prettify())
<html>
 <body>
  <p>
   "text_with_blanks":"
   <b>
    Tagesmenü im Restaurant
   </b>
   <br/>
   Samstag, 12. August
   <br/>
   <b>
    Suppen
   </b>
   <br/>
   Tomatensuppe
  </p>
 </body>
</html>

>>> soup.select_one('p > b')
<b>Tagesmenü im Restaurant</b>
>>> print(soup.select_one('p > b').text)
Tagesmenü im Restaurant
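
If BeautifulSoup is not available, a rough alternative is the standard library's `html.parser`. This is a swapped-in technique, not what the post above uses, and the `TextCollector` class name is made up for illustration. It assumes the escapes have already been decoded:

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the non-empty text pieces found between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Turn no-break spaces into plain spaces, then drop padding.
        cleaned = data.replace("\u00a0", " ").strip()
        if cleaned:
            self.parts.append(cleaned)

# Decoded form of the snippet from the first post.
html = ("<b>Tagesmenü im Restaurant</b><br/>\u00a0Samstag, 12. August<br/>"
        "\u00a0<b>Suppen</b><br/>Tomatensuppe \u00a0 \u00a0")

collector = TextCollector()
collector.feed(html)
collector.close()
print(collector.parts)
```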
