Python Forum
HTML Decoder pandas dataframe column
#1
I am getting HTML-escaped text that I want to decode. Decoding works on a standalone example string, but not on my pandas DataFrame column. Any suggestions?

#!/usr/bin/env python
# coding: utf-8

# import statements
import requests
import pandas as pd
import html

# constants
url = "https://chartexp1.sha.maryland.gov/CHARTExportClientService/getDMSMapDataJSON.do"

# getting response
response = requests.request("GET", url).json()

# converting to dataframe
df = pd.DataFrame(response['data'])

#adding new column/converting msgHTML Encoded to decoded
df['decodedHtml'] = html.unescape(df['msgHTML'])

# saving dataframe to csv
df.to_csv('output/response_python.csv')


##TESTING ONLY##
myHtml = "<body><h1> How to use html.unescape() in Python </h1></body>"
encodedHtml = html.escape(myHtml)
print("Encoded HTML: ", encodedHtml)
decodedHtml = html.unescape(encodedHtml)

print("Decoded HTML: ", decodedHtml)

print(html.unescape('&copy; 2023'))

 
#2
Hi,

The information provided is a bit thin... Is `response['data']` really HTML? I tried to make an API call with the URL from your post, but I get a time-out error...

What do you get instead when exporting your dataframe to CSV?

Regards, noisefloor
#3
Thanks for the reply. I apologize for the lack of information.

The issue is with the following line:

#adding new column/converting msgHTML Encoded to decoded
df['decodedHtml'] = html.unescape(df['msgHTML'])
The issue is that df['msgHTML'] has content similar to the following:

&lt;table class='dmsMsg'&gt;&lt;tr class='dmsMsgRow'&gt;&lt;td class='dmsMsgTextCenter'&gt;I-695       15 MILES&lt;/td&gt;&lt;/tr&gt;&lt;tr class='dmsMsgRow'&gt;&lt;td class='dmsMsgTextCenter'&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr class='dmsMsgRow'&gt;&lt;td class='dmsMsgTextCenter'&gt; 14 MINUTES&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
What I am attempting to do is convert that to the following format:

<table class='dmsMsg'><tr class='dmsMsgRow'><td class='dmsMsgTextCenter'>I-695       15 MILES</td></tr><tr class='dmsMsgRow'><td class='dmsMsgTextCenter'>&nbsp;</td></tr><tr class='dmsMsgRow'><td class='dmsMsgTextCenter'> 14 MINUTES</td></tr></table>
#4
html.unescape(str) works on a single string, not on a whole Series, so it cannot be used in a vectorized way. You have to fall back to Series.apply(func), which calls the function once per value:
df["msgHTML"] = df["msgHTML"].apply(html.unescape)
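To keep the original column and write the result to a new decodedHtml column as in the first post, something along these lines should work. This is only a sketch: the sample rows are made up to resemble the msgHTML content shown above, and unescape_or_keep is a hypothetical helper name that passes missing or non-string values through untouched (plain html.unescape would raise a TypeError on them).

```python
import html

import pandas as pd

# Made-up sample rows standing in for the API's msgHTML field.
df = pd.DataFrame({
    "msgHTML": [
        "&lt;td class='dmsMsgTextCenter'&gt; 14 MINUTES&lt;/td&gt;",
        "&lt;td&gt;&amp;nbsp;&lt;/td&gt;",
        None,  # a sign with no message
    ]
})


def unescape_or_keep(value):
    """Unescape strings; return missing or non-string values unchanged."""
    return html.unescape(value) if isinstance(value, str) else value


# Series.apply calls the helper once per cell.
df["decodedHtml"] = df["msgHTML"].apply(unescape_or_keep)

print(df["decodedHtml"].tolist())
```

Note that html.unescape makes a single pass, so &amp;nbsp; comes out as &nbsp; rather than a non-breaking space character, which matches the desired output in post #3.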

