Python Forum

Full Version: How to remove multiple tags using regex
Hi all,

Newbie here! I am using Python 3.8.3 and am trying to remove tags from the attached text file https://drive.google.com/file/d/1V3s8w8a...sp=sharing

I want to extract three lists (titles, publication dates, and the main text of the articles) with the tags removed. With the code below, I have been able to remove the tags from the titles and publication dates. However, I am not able to properly remove all the tags from the main texts. In the text file, the main text of each article starts with the tag <div class="story-element story-element-text"> and ends just before the next tag that begins with <h1 class.

Any help in extracting this part of the text would be highly appreciated!! The article text is in a non-English script, but all the html tags are in English.


#opening text file which contains newspaper article information scraped off website using beautifulsoup
import re
with open('listfile.txt', 'r', encoding='utf8') as my_file:
    text = my_file.read()
    print(text)  # output not shown here because it is too large


#removing tags and generating list of newspaper article titles    
titles = re.findall('<h1.*?>(.*?)</h1>', text)
print(titles)
Output:
['ধর্ষণ প্রতিরোধে প্রয়োজন আইনের প্রয়োগ: সালমা আলী', 'ধর্ষণ ও যৌন হয়রানির দুটি আলোচিত ঘটনা', 'ধর্ষণ ও পুরুষের সম্মানহানি', '‘স্বীকার করছেন, ধর্ষণ-নিপীড়নে আপনাদের মদদ রয়েছে’', 'ব্যাধি যখন ধর্ষণ !', 'ধর্ষণ অপরাধ, প্রতিবাদ এবং আইন', 'করোনায় থামেনি ধর্ষণ, স্বামীর পীড়ন', 'তাবিজ দেওয়ার কথা বলে গৃহবধূকে ধর্ষণ, কবিরাজ আটক']
#removing tags and generating list of newspaper article publication dates 
dates = re.findall('<div class=\"storyPageMetaData-m__publish-time__19bdV\"><span>(.*?)</span>', text)
print(dates)
Output:
['আপডেট: ১৬ জানুয়ারি ২০২০, ১১: ১২ ', 'আপডেট: ১২ জানুয়ারি ২০২০, ১২: ৫১ ', 'আপডেট: ২৮ জুন ২০২০, ১৪: ৩৭ ']
#removing tags and generating list containing main text of articles
#this is where I am unable to remove the tags
bodytext = re.findall('<div class="story-element story-element-text">(.*?)</div>', text)
print(bodytext)
Why have you created three threads for the same topic?
Please use the same thread if you need modifications.
(Mar-10-2021, 02:06 AM) Larz60+ wrote: Why have you created three threads for the same topic?
Please use the same thread if you need modifications.

Sorry, I had a connection issue when posting and ended up duplicating the messages. It seems that I cannot delete any of the threads now.
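
On the original question, here is a minimal sketch of one possible approach, not tested against the actual listfile.txt. The bodytext pattern in the first post has two likely problems: without re.DOTALL, (.*?) cannot match across line breaks, and stopping at the first </div> cuts the match short if the body contains nested divs. The sketch instead captures everything from the opening story-element div up to the next <h1 class (or the end of the text), then strips the remaining tags in a second step. The sample string and the function name extract_bodies are made up for illustration; the real markup may differ.

```python
import html
import re

# Hypothetical sample standing in for the contents of listfile.txt.
sample = (
    '<h1 class="headline">First title</h1>'
    '<div class="story-element story-element-text"><div><p>First paragraph.</p>\n'
    '<p>Second &amp; last.</p></div></div>'
    '<h1 class="headline">Second title</h1>'
    '<div class="story-element story-element-text"><p>Other article.</p></div>'
)

START = '<div class="story-element story-element-text">'

def extract_bodies(text):
    """Return one tag-free string per article body found in text."""
    # Capture lazily from the opening div up to (but not including)
    # the next '<h1 class' or the end of the text.
    # re.DOTALL lets '.' match newlines inside the body.
    pattern = re.escape(START) + r'(.*?)(?=<h1 class|\Z)'
    bodies = []
    for raw in re.findall(pattern, text, flags=re.DOTALL):
        no_tags = re.sub(r'<[^>]+>', ' ', raw)   # replace every remaining tag with a space
        decoded = html.unescape(no_tags)         # turn entities such as &amp; into characters
        bodies.append(' '.join(decoded.split())) # collapse runs of whitespace
    return bodies

print(extract_bodies(sample))
# ['First paragraph. Second & last.', 'Other article.']
```

Since the file was scraped with BeautifulSoup in the first place, letting an HTML parser pull the text out is probably more robust than regex if the markup turns out to be irregular. The regex version above only assumes the start and end markers described in the first post.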