Python Forum
Data Cleanup Help Needed
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Data Cleanup Help Needed
#1
Hello,

I am pulling CRM entity in our datalake using APIs. Most columns from this particular entity are bringing in the correct information but the column 'description' is actually brining in the complete css/html code. I need to clean up this column to show only the text.

I used
dfObj = pd.DataFrame([sub.split("|") for sub in activity])
to create the pipe delimited file and then pulled in this specific field using
dfObj.iloc[0,15]

Here is the example of what is coming in the description field:


'<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\r\n<meta name="Generator" content="Microsoft Word 15 (filtered medium)">\r\n<!--[if !mso]><style>v\\:* {behavior:url(#default#VML);}\r\no\\:* {behavior:url(#default#VML);}\r\nw\\:* {behavior:url(#default#VML);}\r\n.shape {behavior:url(#default#VML);}\r\n</style><![endif]--><style><!--\r\n/* Font Definitions */\r\n@font-face\r\n\t{font-family:"Cambria Math";\r\n\tpanose-1:2 4 5 3 5 4 6 3 2 4;}\r\n@font-face\r\n\t{font-family:Calibri;\r\n\tpanose-1:2 15 5 2 2 2 4 3 2 4;}\r\n/* Style Definitions */\r\np.MsoNormal, li.MsoNormal, div.MsoNormal\r\n\t{margin:0in;\r\n\tmargin-bottom:.0001pt;\r\n\tfont-size:11.0pt;\r\n\tfont-family:"Calibri",sans-serif;}\r\na:link, span.MsoHyperlink\r\n\t{mso-style-priority:99;\r\n\tcolor:#0563C1;\r\n\ttext-decoration:underline;}\r\na:visited, span.MsoHyperlinkFollowed\r\n\t{mso-style-priority:99;\r\n\tcolor:#954F72;\r\n\ttext-decoration:underline;}\r\nspan.EmailStyle17\r\n\t{mso-style-type:personal-compose;\r\n\tfont-family:"Calibri",sans-serif;\r\n\tcolor:windowtext;}\r\n.MsoChpDefault\r\n\t{mso-style-type:export-only;\r\n\tfont-family:"Calibri",sans-serif;}\r\n@page WordSection1\r\n\t{size:8.5in 11.0in;\r\n\tmargin:1.0in 1.0in 1.0in 1.0in;}\r\ndiv.WordSection1\r\n\t{page:WordSection1;}\r\n--></style><!--[if gte mso 9]><xml>\r\n<o:shapedefaults v:ext="edit" spidmax="1026" />\r\n</xml><![endif]--><!--[if gte mso 9]><xml>\r\n<o:shapelayout v:ext="edit">\r\n<o:idmap v:ext="edit" data="1" />\r\n</o:shapelayout></xml><![endif]-->\r\n</head>\r\n<body lang="EN-US" link="#0563C1" vlink="#954F72">\r\n<div class="WordSection1">\r\n<p class="MsoNormal">Garreth,<o:p></o:p></p>\r\n<p class="MsoNormal"><o:p>&nbsp;</o:p></p>\r\n<p class="MsoNormal">Attached are the W9 and COI for MedEd. Looking forward to our call this afternoon. Let me know if you need any additional documents from MedEd.<o:p></o:p></p>\r\n<p class="MsoNormal"><o:p>&nbsp;</o:p></p>\r\n<p class="MsoNormal">Thanks,<o:p></o:p></p>\r\n<p class="MsoNormal"><o:p>&nbsp;</o:p></p>\r\n<p class="MsoNormal"><o:p>&nbsp;</o:p></p>\r\n<p class="MsoNormal"><b><span style="font-family:&quot;Arial&quot;,sans-serif;color:black">Jason Smith</span></b><span style="font-family:&quot;Arial&quot;,sans-serif;color:black"><br>\r\nDirector, Marketing<o:p></o:p></span></p>\r\n<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black"><br>\r\n<b><img width="256" height="57" style="width:2.6666in;height:.5937in" id="Picture_x0020_1" src="imagepath?AttachmentType=1001&AttachmentId=8da1f035-eb7e-e811-8146-e0071b6a7141" alt="MedEd"></b><br>\r\n<br>\r\n2827 test address<br>\r\ncity, state, zip<br>\r\nOffice: 317-975-7394<br>\r\n<a href="https://www.medxcelfm.com/"><span style="color:#5877A4">www.medxcelfm.com</span></a><o:p></o:p></span></p>\r\n<p class="MsoNormal"><o:p>&nbsp;</o:p></p>\r\n<p class="MsoNormal"><o:p>&nbsp;</o:p></p>\r\n<p class="MsoNormal"><o:p>&nbsp;</o:p></p>\r\n</div>\r\n</body>\r\n</html>\r\n'

I have tried the following to clean up this data:
import re

descriptioncleanup = 
re.search("<[^<]+?>",dfObj.iloc[0,15])
re.sub("<[^<]+?>", "",dfObj.iloc[0,15])
result = re.sub("<[^<]+?>", "",dfObj.iloc[0,15])
result2 = re.sub("\n","",result)
result3 = re.sub("\r","",result2)
result4 = re.sub(";","",result3)
result5 = re.sub("&nbsp","",result4)
result9 = re.sub("behaior:url","",result5)
result10 = re.sub("behairurl","",result9)
result11 = re.sub("#default#VML","",result10)
result12 = re.sub(".shape","", result11)

print(result12)
This cleans up most of things but does not clean all. I would like to show only the bolded text in the column. I am very new to the Python programming and was wondering if there is a better way to clean this data. Any help is appreciated.

Thanks,
Sonali
Reply
#2
I would use BeautifulSoup or lxml for this, e.g.

import bs4

data = """<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\r\n<meta name="Generator" content="Microsoft Word 15 (filtered medium)">\r\n<!--[if !mso]><style>v\\:* {behavior:url(#default#VML);}\r\no\\:* {behavior:url(#default#VML);}\r\nw\\:* {behavior:url(#default#VML);}\r\n.shape {behavior:url(#default#VML);}\r\n</style><![endif]--><style><!--\r\n/* Font Definitions */\r\n@font-face\r\n\t{font-family:"Cambria Math";\r\n\tpanose-1:2 4 5 3 5 4 6 3 2 4;}\r\n@font-face\r\n\t{font-family:Calibri;\r\n\tpanose-1:2 15 5 2 2 2 4 3 2 4;}\r\n/* Style Definitions */\r\np.MsoNormal, li.MsoNormal, div.MsoNormal\r\n\t{margin:0in;\r\n\tmargin-bottom:.0001pt;\r\n\tfont-size:11.0pt;\r\n\tfont-family:"Calibri",sans-serif;}\r\na:link, span.MsoHyperlink\r\n\t{mso-style-priority:99;\r\n\tcolor:#0563C1;\r\n\ttext-decoration:underline;}\r\na:visited, span.MsoHyperlinkFollowed\r\n\t{mso-style-priority:99;\r\n\tcolor:#954F72;\r\n\ttext-decoration:underline;}\r\nspan.EmailStyle17\r\n\t{mso-style-type:personal-compose;\r\n\tfont-family:"Calibri",sans-serif;\r\n\tcolor:windowtext;}\r\n.MsoChpDefault\r\n\t{mso-style-type:export-only;\r\n\tfont-family:"Calibri",sans-serif;}\r\n@page WordSection1\r\n\t{size:8.5in 11.0in;\r\n\tmargin:1.0in 1.0in 1.0in 1.0in;}\r\ndiv.WordSection1\r\n\t{page:WordSection1;}\r\n--></style><!--[if gte mso 9]><xml>\r\n<o:shapedefaults v:ext="edit" spidmax="1026" />\r\n</xml><![endif]--><!--[if gte mso 9]><xml>\r\n<o:shapelayout v:ext="edit">\r\n<o:idmap v:ext="edit" data="1" />\r\n</o:shapelayout></xml><![endif]-->\r\n</head>\r\n<body lang="EN-US" link="#0563C1" vlink="#954F72">\r\n<div class="WordSection1">\r\n<p class="MsoNormal">Garreth,<o:p></o:p></p>\r\n<p class="MsoNormal"><o:p>&nbsp;</o:p></p>\r\n<p class="MsoNormal">Attached are the W9 and COI for MedEd. Looking forward to our call this afternoon. Let me know if you need any additional documents from MedEd.<o:p></o:p></p>\r\n<p class="MsoNormal"><o:p>&nbsp;</o:p></p>\r\n<p class="MsoNormal">Thanks,<o:p></o:p></p>\r\n<p class="MsoNormal"><o:p>&nbsp;</o:p></p>\r\n<p class="MsoNormal"><o:p>&nbsp;</o:p></p>\r\n<p class="MsoNormal"><b><span style="font-family:&quot;Arial&quot;,sans-serif;color:black">Jason Smith</span></b><span style="font-family:&quot;Arial&quot;,sans-serif;color:black"><br>\r\nDirector, Marketing<o:p></o:p></span></p>\r\n<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black"><br>\r\n<b><img width="256" height="57" style="width:2.6666in;height:.5937in" id="Picture_x0020_1" src="imagepath?AttachmentType=1001&AttachmentId=8da1f035-eb7e-e811-8146-e0071b6a7141" alt="MedEd"></b><br>\r\n<br>\r\n2827 test address<br>\r\ncity, state, zip<br>\r\nOffice: 317-975-7394<br>\r\n<a href="https://www.medxcelfm.com/"><span style="color:#5877A4">www.medxcelfm.com</span></a><o:p></o:p></span></p>\r\n<p class="MsoNormal"><o:p>&nbsp;</o:p></p>\r\n<p class="MsoNormal"><o:p>&nbsp;</o:p></p>\r\n<p class="MsoNormal"><o:p>&nbsp;</o:p></p>\r\n</div>\r\n</body>\r\n</html>\r\n"""

ss = bs4.BeautifulSoup(data)
print(ss.find('body').text)
Output:
Garreth, Attached are the W9 and COI for MedEd. Looking forward to our call this afternoon. Let me know if you need any additional documents from MedEd. Thanks, Jason Smith Director, Marketing 2827 test address city, state, zip Office: 317-975-7394 www.medxcelfm.com
Reply


Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020