Remove escape characters / Unicode characters from string
#1
I retrieve some JSON from a web page. It is real JSON; however, due to all the backslash escape characters, it doesn't format correctly.
There are two fixes I've thought of, although I'm not sure how to do either.

Here's a snippet of the JSON:
\"text_with_blanks\":\"<b>Tagesmen\\u00fc im Restaurant<\\\/b><br\\\/>\\u00a0Samstag, 12. August<br\\\/>\\u00a0<b>Suppen<\\\/b><br\\\/>Tomatensuppe \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0
As you can see, not only is it full of "\", it's also full of escaped Unicode characters.

The first method is to remove the backslashes. If I did that with .replace() it would of course get rid of every backslash, so I need a way to get rid of only one backslash every time it encounters one.

The other method is to first convert the Unicode escapes to actual characters, and then replace all the backslashes, but I'm not sure how I would go about converting the characters. Sadly, decode("utf-8") doesn't work.

What's the best way to do this?
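
For what it's worth, escapes such as \u00fc and \/ are ordinary JSON string escapes, so a JSON parser undoes them on its own. A minimal sketch on a cut-down piece of the snippet above, quoted here as a JSON string literal purely for illustration:

import json

raw = '"<b>Tagesmen\\u00fc im Restaurant<\\/b><br\\/>\\u00a0Samstag, 12. August"'
print(json.loads(raw))
# <b>Tagesmenü im Restaurant</b><br/> Samstag, 12. August   (\u00a0 comes out as a real no-break space)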
Reply
#2
Please show what you've tried so far.
Reply
#3
(May-14-2020, 06:11 PM)Larz60+ Wrote: Please show what you've tried so far.

As I said, I have already tried combinations of .encode()/.decode(), with no luck.

My next idea was to have a function which loops through the string until it finds a backslash. It would then replace that backslash, and that backslash only (every time it found a backslash, it would just replace a single one). That would mean an escaped Unicode id like this: \\u... would become: \u... after going through the function.
The issue with that is that the number of backslashes is inconsistent: there's one backslash before a quotation mark, but there are three in the HTML line breaks, so my function wouldn't work.

The simplest way, I think, is this:
chars = {
    "\\u00a0": "\u00a0",  # no-break space
    "\\u00fc": "ü",
    "...": "...",
    "...": "..."
}

for char in chars:
    if char in json:
        json = json.replace(char, chars[char])
Just replace the Unicode ids with their respective characters. I'd have to list all the Unicode characters which are likely to show up; this isn't too much of an issue since I know there aren't that many.

The reason I haven't done this is that I'm not a big fan of hard-coding values: if you hard-code and something changes, for instance the json changes, then nine times out of ten it's going to break your program.
This may well have to be the route I take, however.
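
One way around hard-coding every character would be to decode any \uXXXX escape generically in a single pass; a minimal sketch (the helper name is made up for illustration):

import re

def decode_unicode_escapes(text):
    # replace every literal \uXXXX sequence with the character it encodes
    return re.sub(r"\\u([0-9a-fA-F]{4})",
                  lambda m: chr(int(m.group(1), 16)),
                  text)

print(decode_unicode_escapes(r"Tagesmen\u00fc im Restaurant"))   # Tagesmenü im Restaurant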


I'll also give a little more info on where this json comes from:
The json is stored in a JavaScript file, in a variable called 'ACTIVITY_DATA'.
When I use requests to get the whole page, I start by stripping off that variable assignment:
answer_json = requests.get(src).text.replace("ACTIVITY_DATA = ", "")
The json has some weird quotes in it that make it invalid (the quotes wrapping the inner array):

{"1":"[{.........}]"}

So I do some substringing to remove them:
new_answers = answer_json[:5] + answer_json[6:-2] + answer_json[-3:-2]
And then that leaves me with the json I have now, but still full of escape characters.
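
For what it's worth, pulling the object out of the JavaScript assignment with a regular expression might be sturdier than fixed offsets; a minimal sketch, assuming the page really contains something like "ACTIVITY_DATA = {...}" (the URL in src is a placeholder):

import re
import json
import requests

src = "https://example.com/activity_data.js"   # placeholder for the real page URL
page = requests.get(src).text

# capture everything from the first "{" after the assignment to the last "}";
# assumes the ACTIVITY_DATA object is the last object literal in the file
match = re.search(r"ACTIVITY_DATA\s*=\s*(\{.*\})", page, re.DOTALL)
if match:
    data = json.loads(match.group(1))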

Never mind, ignore everything. Turns out I'm an idiot. The json is valid with those quotation marks. The only issue I was having was that it wouldn't format with those in there. But formatting doesn't even matter when you don't see the json anyway.
The reason I was getting errors when parsing the json was that there are some characters in it (maybe some of the no-break spaces?) that the library doesn't like when you try to parse it as json.
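
If those characters turn out to be control characters (tabs, newlines) inside the string values, the standard library parser can be told to tolerate them via strict=False; a minimal sketch with a made-up string:

import json

# a made-up value containing a raw tab, which strict JSON parsing rejects
raw = '{"1": "line one\tline two"}'
data = json.loads(raw, strict=False)   # strict=False allows control characters inside strings
print(data["1"])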
Reply
#4
It looks like you are using text or content to get the data back.
The JSON wouldn't look like that if the website gave back real JSON.
Example:
>>> import requests
>>> 
>>> r = requests.get('http://httpbin.org/get')
>>> r.json()
{'args': {},
 'headers': {'Accept': '*/*',
             'Accept-Encoding': 'gzip, deflate',
             'Host': 'httpbin.org',
             'User-Agent': 'python-requests/2.22.0',
             'X-Amzn-Trace-Id': 'Root=1-5ebe589e-a1eff28e599b122b81950144'},
 'origin': '46.246.118.243',
 'url': 'http://httpbin.org/get'}

# Now we can access the data like this
>>> r.json()['origin']
'46.246.118.243'
Even without .json(), using text or content here, you will not see \ or \\\ like you get.
>>> r.content
(b'{\n  "args": {}, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encodi'
 b'ng": "gzip, deflate", \n    "Host": "httpbin.org", \n    "User-Agent": "py'
 b'thon-requests/2.22.0", \n    "X-Amzn-Trace-Id": "Root=1-5ebe589e-a1eff28e'
 b'599b122b81950144"\n  }, \n  "origin": "46.246.118.243", \n  "url": "http://'
 b'httpbin.org/get"\n}\n')

>>> r.text
('{\n'
 '  "args": {}, \n'
 '  "headers": {\n'
 '    "Accept": "*/*", \n'
 '    "Accept-Encoding": "gzip, deflate", \n'
 '    "Host": "httpbin.org", \n'
 '    "User-Agent": "python-requests/2.22.0", \n'
 '    "X-Amzn-Trace-Id": "Root=1-5ebe589e-a1eff28e599b122b81950144"\n'
 '  }, \n'
 '  "origin": "46.246.118.243", \n'
 '  "url": "http://httpbin.org/get"\n'
 '}\n')
Reply
#5
(May-15-2020, 08:59 AM)snippsat Wrote: It looks like you are using text or content to get the data back. [...]

The .json() method would be very helpful, but unfortunately, as I mentioned, the json data is stored inside a JavaScript variable, so I have to get the data back as a string to allow me to use .replace()
Reply
#6
Maybe there is a better way; I could tell if I had a look at the source, or maybe not.
I can take a quick run at that string data, as I have cleaned up much worse stuff than this ;)
>>> s = '\"text_with_blanks\":\"<b>Tagesmen\\u00fc im Restaurant<\\\/b><br\\\/>\\u00a0Samstag, 12. August<br\\\/>\\u00a0<b>Suppen<\\\/b><br\\\/>Tomatensuppe \\u00a0 \\u00a0' 
>>> ss = s.replace('\\u00a0', '').replace('\\\\', '').strip()
>>> ss = ss.replace('\\u00fc', '\u00fc')
>>> print(ss)
"text_with_blanks":"<b>Tagesmenü im Restaurant</b><br/>Samstag, 12. August<br/><b>Suppen</b><br/>Tomatensuppe

# Now need a parser
>>> from bs4 import BeautifulSoup
>>>
>>> soup = BeautifulSoup(ss, 'lxml')
>>> print(soup.prettify())
<html>
 <body>
  <p>
   "text_with_blanks":"
   <b>
    Tagesmenü im Restaurant
   </b>
   <br/>
   Samstag, 12. August
   <br/>
   <b>
    Suppen
   </b>
   <br/>
   Tomatensuppe
  </p>
 </body>
</html>

>>> soup.select_one('p > b')
<b>Tagesmenü im Restaurant</b>
>>> print(soup.select_one('p > b').text)
Tagesmenü im Restaurant
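
A related option, once the ACTIVITY_DATA string is valid JSON as a whole: let json.loads undo the \uXXXX and \/ escapes first, then hand only the HTML inside the field to BeautifulSoup. A minimal sketch with a cut-down one-field object (the real data has more keys):

import json
from bs4 import BeautifulSoup

raw = '{"text_with_blanks": "<b>Tagesmen\\u00fc im Restaurant<\\/b><br\\/>Tomatensuppe"}'
data = json.loads(raw)                            # json.loads decodes \uXXXX and \/ itself
soup = BeautifulSoup(data["text_with_blanks"], "html.parser")
print(soup.get_text(" ", strip=True))             # Tagesmenü im Restaurant Tomatensuppe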
Reply

