Jan-02-2022, 11:14 PM
Thanks, that is much better than my attempt, gleaned from the internet!
It was a bit harder to extract the definitions, but once I had the terms, I got them by taking the text that was not in the list of business finance terms.
I used random to shuffle the list of business finance terms, but I kept the definitions in the order they are on the webpage. Didn't want to make it too difficult!
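Roughly, the idea looks like the sketch below. This is only a minimal illustration, not the exact code I ran: the function name split_terms_and_definitions and the sample lines are made up for the example (they are placeholders, not text from the webpage), and it assumes each term and each definition sits on its own line.

```python
import random


def split_terms_and_definitions(lines, terms):
    """Hypothetical helper: return (shuffled_terms, definitions).

    Assumes every term and every definition is on its own line.
    """
    term_set = set(terms)
    # Definitions are the non-empty lines that are not terms; page order is kept.
    definitions = [line for line in lines if line.strip() and line not in term_set]
    # Shuffle a copy so the caller's list of terms is left untouched.
    shuffled_terms = list(terms)
    random.shuffle(shuffled_terms)
    return shuffled_terms, definitions


# Made-up sample data, only to show the shape of the input.
lines = [
    'Accounts Payable',
    'Money a business owes to its suppliers.',
    '',
    'Cash Flow',
    'Money moving in and out of a business.',
]
terms = ['Accounts Payable', 'Cash Flow']

shuffled_terms, definitions = split_terms_and_definitions(lines, terms)
print(shuffled_terms)
print(definitions)
```

The terms come out in a random order, while the definitions stay in the order they appear in the text, so matching them up is not too hard.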
import requests
from bs4 import BeautifulSoup
import random

url = 'https://www.fundera.com/blog/business-finance-terms-and-definitions'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(text=True)

output = ''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head',
    'input',
    'script',
    'style'
    # there may be more elements you don't want, such as "style", etc.
]

for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)

text_list = output.split('\n')
useful_text = '\n'.join(text_list)

savepath = '/home/pedro/temp/'
with open(savepath + 'biz_definitions.txt', 'w') as f:
    f.write(useful_text)

print('All done! Text saved to', savepath + 'biz_definitions.txt')