Mar-09-2021, 11:55 PM
Hi all,
Newbie here! I am using Python 3.8.3 and am trying to remove tags from the attached text file https://drive.google.com/file/d/1V3s8w8a...sp=sharing
I want to extract 3 lists (titles, publication dates, and main text of the articles) with the tags removed. In the code below, I have been able to remove the tags from the titles and publication dates, but I am not able to properly remove all the tags from the main texts. In the text file, the main text of each article starts with the tag <div class="story-element story-element-text"> and ends just before the next <h1 class=...> tag.
Any help in extracting this part of the text would be highly appreciated!! The article text is in a non-English script, but all the html tags are in English.
# Open the text file, which contains newspaper article markup scraped off the website using BeautifulSoup
import re

with open('listfile.txt', 'r', encoding='utf8') as my_file:
    text = my_file.read()
print(text)  # not printing output here because it's too large
# Remove tags and generate the list of newspaper article titles
titles = re.findall('<h1.*?>(.*?)</h1>', text)
print(titles)
Output:['ধর্ষণ প্রতিরোধে প্রয়োজন আইনের প্রয়োগ: সালমা আলী', 'ধর্ষণ ও যৌন হয়রানির দুটি আলোচিত ঘটনা', 'ধর্ষণ ও পুরুষের সম্মানহানি', '‘স্বীকার করছেন, ধর্ষণ-নিপীড়নে আপনাদের মদদ রয়েছে’', 'ব্যাধি যখন ধর্ষণ !', 'ধর্ষণ অপরাধ, প্রতিবাদ এবং আইন', 'করোনায় থামেনি ধর্ষণ, স্বামীর পীড়ন', 'তাবিজ দেওয়ার কথা বলে গৃহবধূকে ধর্ষণ, কবিরাজ আটক']
# Remove tags and generate the list of newspaper article publication dates
dates = re.findall('<div class="storyPageMetaData-m__publish-time__19bdV"><span>(.*?)</span>', text)
print(dates)
Output:['আপডেট: ১৬ জানুয়ারি ২০২০, ১১: ১২ ', 'আপডেট: ১২ জানুয়ারি ২০২০, ১২: ৫১ ', 'আপডেট: ২৮ জুন ২০২০, ১৪: ৩৭ ']
# Remove tags and generate the list containing the main text of the articles
# This is where I am unable to remove the tags
bodytext = re.findall('<div class="story-element story-element-text">(.*?)</div>', text)
print(bodytext)
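For reference, here is a minimal sketch of one possible way to get the main text, based only on the description above (the body starts at the story-element-text div and runs until the next <h1 tag). Two things are probably going wrong in the last pattern: without re.DOTALL the dot does not match newlines, and the non-greedy match stops at the first </div>, which may belong to a nested div. The function name extract_bodies and the sample string are hypothetical, used only for illustration, and a real HTML parser such as BeautifulSoup would be more robust than regular expressions for this.

```python
import re
from html import unescape

# Hypothetical helper, not from the original post.
# Assumption: each article body starts at the story-element-text div
# and ends right before the next <h1 tag (or at the end of the file).
BODY_PATTERN = re.compile(
    r'<div class="story-element story-element-text">(.*?)(?=<h1\b|\Z)',
    re.DOTALL,  # let "." match newlines inside the article body
)

def extract_bodies(text):
    bodies = []
    for raw in BODY_PATTERN.findall(text):
        # Replace every remaining tag with a space so words do not run together.
        no_tags = re.sub(r'<[^>]+>', ' ', raw)
        # Decode entities such as &amp; and collapse runs of whitespace.
        clean = re.sub(r'\s+', ' ', unescape(no_tags)).strip()
        bodies.append(clean)
    return bodies

# Made-up sample markup that mimics the structure described in the post.
sample = (
    '<h1 class="a">T1</h1>'
    '<div class="story-element story-element-text"><div><p>Hello <b>world</b></p></div></div>'
    '<div class="story-element story-element-text"><p>Second\npara</p></div>'
    '<h1 class="a">T2</h1>'
    '<div class="story-element story-element-text"><p>Bye &amp; thanks</p></div>'
)
print(extract_bodies(sample))
```

With this approach, each list entry holds everything between the first story-element-text div of an article and the next <h1 tag, so any other blocks that sit in between (captions, related links) would also be included and may need extra filtering.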