Thread: Parsing based on variables in the website (Python Forum, Web Scraping & Web Development, https://python-forum.io/thread-23722.html)
Parsing based on variables in the website - nikos48 - Jan-14-2020

Hi, I am a newbie at Python, so bear with me. My code already works for multiple websites with the same setup (example: https://www.dropbox.com/s/uka24w7o5006ole/transcript-86-855.html?dl=0). It currently targets one specific CEO, but I want it to work for all executives named at the top of each individual HTML file. These executives are listed in the HTML as shown in this screenshot: https://i.stack.imgur.com/9zqOb.png

Could someone help me further? Below is my code so far.

    import os
    import textwrap

    from bs4 import BeautifulSoup

    directory = 'C:/Research syntheses - Meta analysis/SeekingAlpha/out'

    # Open the output file once instead of re-opening it for every row.
    with open('output.txt', 'a') as out:
        for filename in os.listdir(directory):
            if not filename.endswith('.html'):
                continue
            fname = os.path.join(directory, filename)
            with open(fname, 'r') as f:
                soup = BeautifulSoup(f.read(), 'html.parser')

            print('{:<30} {:<70}'.format('Name', 'Answer'))
            print('-' * 101)
            # First <p> directly after each <strong> naming the CEO, within the Q&A section.
            for answer in soup.select('p:contains("Question-and-Answer Session") ~ strong:contains("Dror Ben Asher") + p'):
                txt = answer.get_text(strip=True)
                # Append the following paragraphs until the next speaker label.
                s = answer.find_next_sibling()
                while s:
                    if s.name == 'strong' or s.find('strong'):
                        break
                    if s.name == 'p':
                        txt += ' ' + s.get_text(strip=True)
                    s = s.find_next_sibling()
                txt = ('\n' + ' ' * 31).join(textwrap.wrap(txt))
                print('{:<30} {:<70}'.format('Dror Ben Asher - CEO', txt), file=out)

RE: Parsing based on variables in the website - micseydel - Jan-14-2020

I'm not sure about your overall problem, but I suspect there's a bug in your program on line 19 (the `if` that calls `s.find('strong')`). It's going to be true if "strong" is not in the string. It would only be false if s starts with "strong", because that would result in an index of 0, which would be treated as false.
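
The index behaviour described in the reply above is that of Python's `str.find`. In the posted loop, `s` appears to come from `find_next_sibling()`, which would make it a BeautifulSoup `Tag`, and `Tag.find` is a tree search that returns a matching element or `None`. The following is a minimal sketch of the difference; the sample markup and variable names are made up for illustration and are not taken from the transcript file.

```python
from bs4 import BeautifulSoup

# str.find returns an index, or -1 when the substring is absent.
str_hit = "strong text".find("strong")    # 0, which is falsy
str_miss = "plain text".find("strong")    # -1, which is truthy

# Tag.find searches the tag's descendants and returns a Tag or None.
# (Illustrative markup only.)
soup = BeautifulSoup(
    "<div><p>plain</p><p><strong>Name</strong></p></div>", "html.parser"
)
plain, bold = soup.find_all("p")
tag_miss = plain.find("strong")           # None, which is falsy
tag_hit = bold.find("strong")             # the <strong> element, truthy

print(str_hit, str_miss)
print(tag_miss, tag_hit)
```

If `s` really is a `Tag`, the condition stops the loop at the next element that is, or contains, a `<strong>` speaker label. It is worth confirming with `type(s)` in the actual script.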
RE: Parsing based on variables in the website - nikos48 - Jan-16-2020

Thank you! Is there also someone who can help me with my main problem? In principle, I need all the answers of the executives mentioned in the HTML (the executives are identified at the top of the HTML).

RE: Parsing based on variables in the website - nikos48 - Jan-19-2020

Maybe I can clarify my question. At the top of each of my (downloaded) HTML files, the executives are listed (like "Dror Ben Asher" in the code below):

Quote:<DIV id=article_participants class="content_part hid">

Further along in the HTML, each executive's name recurs multiple times, and each time the name is followed by a text element that I want to parse. Example:

Quote:<P>

For now I have code (see my post above) which identifies one executive, "Dror Ben Asher", and grabs all the text that occurs after the name in the P elements. But I would like this to work for all executives, and for multiple HTML files in which different executives are mentioned (different companies). I shared the downloaded HTML file on Dropbox: dropbox.com/s/uka24w7o5006ole/transcript-86-855.html?dl=0
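
One possible way to generalise the single-name approach is to read the executive names from the `article_participants` block first, then collect the paragraphs that follow each matching `<strong>` label in the Q&A section. The sketch below rests on assumptions that the thread does not confirm: that the participants block contains a bold "Executives" heading paragraph followed by paragraphs of the form "Name - Title", and that speaker labels are `<strong>` siblings of the Q&A marker paragraph, as the posted selector suggests. The function names and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup


def executive_names(soup):
    """Return the names listed under 'Executives' in the participants block."""
    names, in_executives = [], False
    block = soup.find(id="article_participants")
    if block is None:
        return names
    for p in block.find_all("p"):
        text = p.get_text(" ", strip=True)
        if p.find("strong"):
            # Assumption: a bold paragraph is a section heading.
            in_executives = text.lower().startswith("executives")
        elif in_executives and text:
            # Assumption: entries look like "Name - Title"; keep the name.
            names.append(text.split(" - ")[0].strip())
    return names


def answers_by_executive(soup, names):
    """Map each executive name to the list of their Q&A answers."""
    answers = {name: [] for name in names}
    marker = soup.find(
        lambda tag: tag.name == "p"
        and "Question-and-Answer Session" in tag.get_text()
    )
    if marker is None:
        return answers
    for label in marker.find_next_siblings("strong"):
        speaker = label.get_text(strip=True)
        name = next((n for n in names if n in speaker), None)
        if name is None:
            continue  # an analyst or the operator, not an executive
        parts = []
        for sib in label.find_next_siblings(True):
            if sib.name == "strong" or sib.find("strong"):
                break  # next speaker label reached
            if sib.name == "p":
                parts.append(sib.get_text(strip=True))
        if parts:
            answers[name].append(" ".join(parts))
    return answers


# Made-up sample that mimics the assumed layout.
SAMPLE_HTML = """
<div id="article_participants" class="content_part hid">
<p><strong>Executives</strong></p>
<p>Dror Ben Asher - CEO</p>
<p>Jane Doe - CFO</p>
<p><strong>Analysts</strong></p>
<p>John Smith - Example Bank</p>
</div>
<div>
<p>Question-and-Answer Session</p>
<strong>John Smith</strong>
<p>What about margins?</p>
<strong>Dror Ben Asher</strong>
<p>They improved.</p>
<p>We expect more.</p>
<strong>Jane Doe</strong>
<p>Costs fell.</p>
</div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
names = executive_names(sample_soup)
result = answers_by_executive(sample_soup, names)
for name, texts in result.items():
    for text in texts:
        print("{:<30} {}".format(name, text))
```

To apply this to the real files, the same two calls could replace the hard-coded selector inside the per-file loop. If the participants block in the downloaded transcripts is laid out differently, `executive_names` is the part that would need adjusting.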