Python Forum
How to remove multiple tags using regex
#1
Hi all,

Newbie here! I am using Python 3.8.3 and am trying to remove tags from the attached text file https://drive.google.com/file/d/1V3s8w8a...sp=sharing

I want to extract three lists (titles, publication dates, and the main text of the articles) with the tags removed. With the code below I can strip the tags from the titles and publication dates, but I cannot properly remove all tags from the main text. In the text file, the main text of each article starts with the tag <div class="story-element story-element-text"> and ends before the next <h1 class tag.

Any help in extracting this part of the text would be highly appreciated!! The article text is in a non-English script, but all the html tags are in English.


# Open the text file containing the newspaper article HTML scraped from the website with BeautifulSoup
import re
with open('listfile.txt', 'r', encoding='utf8') as my_file:
    text = my_file.read()
    print(text)
# Output not shown here because it is too large


#removing tags and generating list of newspaper article titles    
titles = re.findall('<h1.*?>(.*?)</h1>', text)
print(titles)
Output:
['ধর্ষণ প্রতিরোধে প্রয়োজন আইনের প্রয়োগ: সালমা আলী', 'ধর্ষণ ও যৌন হয়রানির দুটি আলোচিত ঘটনা', 'ধর্ষণ ও পুরুষের সম্মানহানি', '‘স্বীকার করছেন, ধর্ষণ-নিপীড়নে আপনাদের মদদ রয়েছে’', 'ব্যাধি যখন ধর্ষণ !', 'ধর্ষণ অপরাধ, প্রতিবাদ এবং আইন', 'করোনায় থামেনি ধর্ষণ, স্বামীর পীড়ন', 'তাবিজ দেওয়ার কথা বলে গৃহবধূকে ধর্ষণ, কবিরাজ আটক']
#removing tags and generating list of newspaper article publication dates 
dates = re.findall('<div class=\"storyPageMetaData-m__publish-time__19bdV\"><span>(.*?)</span>', text)
print(dates)
Output:
['আপডেট: ১৬ জানুয়ারি ২০২০, ১১: ১২ ', 'আপডেট: ১২ জানুয়ারি ২০২০, ১২: ৫১ ', 'আপডেট: ২৮ জুন ২০২০, ১৪: ৩৭ ']
#removing tags and generating list containing main text of articles
#this is where I am unable to remove the tags
bodytext= re.findall('<div class=\"story-element story-element-text\">(.*?)</div>', text)
print(bodytext)
#2
why have you created three threads for the same topic?
Please use the same thread if you need modifications.
#3
(Mar-10-2021, 02:06 AM)Larz60+ Wrote: why have you created three threads for the same topic?
Please use the same thread if you need modifications.

Sorry, I had a connection issue when posting and ended up duplicating the messages. It seems that I cannot delete any of the threads now.
#4
Don't use regex to parse HTML. Use a parser like Beautiful Soup.
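
To illustrate what a parser gives you that the regex does not, here is a minimal sketch. It uses the standard library's html.parser rather than Beautiful Soup so that it runs without installing anything, and it is not taken from the thread. The class name StoryTextExtractor and the sample HTML are made up for illustration; the real input would be the contents of listfile.txt, whose exact structure is assumed here. The parser counts nested div tags, so the text is collected up to the closing tag that actually matches the opening story-element-text div, which a non-greedy (.*?)</div> pattern cannot do.

```python
from html.parser import HTMLParser


class StoryTextExtractor(HTMLParser):
    """Collect the text of each <div> whose class list has 'story-element-text'."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while inside a target div (counts nested divs)
        self.parts = []   # text fragments of the block currently being read
        self.blocks = []  # finished blocks, one string per target div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1   # a div nested inside a target div
        elif "story-element-text" in classes:
            self.depth = 1    # entering a target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # The target div is closed: join its fragments with spaces.
                self.blocks.append(" ".join(self.parts))
                self.parts = []

    def handle_data(self, data):
        # Keep only non-blank text that sits inside a target div.
        if self.depth and data.strip():
            self.parts.append(data.strip())


# Hypothetical sample input; replace with the text read from listfile.txt.
sample = (
    '<h1 class="headline">Title one</h1>'
    '<div class="story-element story-element-text">'
    '<div><p>First <b>paragraph</b> here.</p><p>Second paragraph.</p></div>'
    '</div>'
    '<h1 class="headline">Title two</h1>'
    '<div class="story-element story-element-text"><p>Other text.</p></div>'
)

parser = StoryTextExtractor()
parser.feed(sample)
parser.close()
bodytext = parser.blocks
print(bodytext)
# → ['First paragraph here. Second paragraph.', 'Other text.']
```

This yields one string per story-element-text div; if an article contains several such divs, they would still need to be grouped per article, for example by starting a new group at each h1. With Beautiful Soup installed, soup.find_all("div", class_="story-element-text") followed by get_text() on each result should be a roughly equivalent and shorter approach.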