Python Forum

Full Version: Parse data from downloaded html
I want to extract all the "executives" from a directory where I have stored my downloaded HTML files. The directory holds approximately 1,000 HTML files, which should contain the div id=article_participants element (files without it can be ignored):

Quote:
<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A title="" href="http://seekingalpha.com/symbol/rdhl" symbolSlug="RDHL">RDHL</A>)</P>
<P>Q4 2014 <SPAN class=transcript-search-span style="BACKGROUND-COLOR: yellow">Earnings</SPAN> Conference <SPAN class=transcript-search-span style="BACKGROUND-COLOR: #f38686">Call</SPAN></P>
<P>February 26, 2015 9:00 AM ET</P>
<P><STRONG>Executives</STRONG></P>
<P>Dror Ben Asher - CEO</P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P>Guy Goldberg - Chief Business Officer</P>
<P><STRONG>Analysts</STRONG></P>

My output needs to be Name, Function, Period, Symbol:
Example: Ori Shilo | Deputy CEO, Finance and Operations | Q4 2014 | RDHL
I tried the following, but it's not sufficient:
import textwrap
import os
from bs4 import BeautifulSoup

directory ='C:/Research syntheses - Meta analysis/SeekingAlpha/out'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory,filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(),'html.parser')

print('{:<30} {:<70}'.format('Name', 'Answer'))
print('-' * 101)
Can someone help me?
soup.find('div': {'id','article_participants'})
Sorry, I'm on a phone and that was a pain to type.
Sorry, I hope we can finish this discussion on SO.
I don't use SO, so I will never see anything other than what is posted here.

Sorry, my syntax was wrong.
All you have to do is adapt this to search your files:

from bs4 import BeautifulSoup

html = '''<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A title="" href="http://seekingalpha.com/symbol/rdhl" symbolSlug="RDHL">RDHL</A>)</P>
<P>Q4 2014 <SPAN class=transcript-search-span style="BACKGROUND-COLOR: yellow">Earnings</SPAN> Conference <SPAN class=transcript-search-span style="BACKGROUND-COLOR: #f38686">Call</SPAN></P>
<P>February 26, 2015 9:00 AM ET</P>
<P><STRONG>Executives</STRONG></P>
<P>Dror Ben Asher - CEO</P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P>Guy Goldberg - Chief Business Officer</P>
<P><STRONG>Analysts</STRONG></P>
'''

soup = BeautifulSoup(html, 'html.parser')
found = soup.find('div', {'id':'article_participants'})
if found:
    print('found')
else:
    print('not found')
Hmm, but how would this help me with multiple HTML files? The HTML above is only an example, although the structure of the HTML files is the same.
I'm on mobile now so I can't write a full example, but you have already done that in your first code snippet. You just left soup without searching it or reporting whether anything was found.

Loop over the directory and, for each HTML file, run the soup.find search I gave. If that returns data, the ID was found, so log that file as containing it. Then continue the for loop with the next file and repeat.
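A rough sketch of that loop, in case it helps. The function name files_with_participants is made up for illustration, and the encoding='utf-8', errors='ignore' arguments are an assumption about how the files were saved, so adjust them if your downloads use another encoding.

```python
import os
from bs4 import BeautifulSoup


def files_with_participants(directory):
    """Return the paths of .html files that contain div id=article_participants.

    Hypothetical helper, not part of the original thread.
    """
    matches = []
    for filename in sorted(os.listdir(directory)):
        # Skip anything that is not an HTML file.
        if not filename.lower().endswith('.html'):
            continue
        path = os.path.join(directory, filename)
        # Assumption: files are UTF-8; undecodable bytes are dropped.
        with open(path, 'r', encoding='utf-8', errors='ignore') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        # find() returns None when the element is missing.
        if soup.find('div', {'id': 'article_participants'}) is not None:
            matches.append(path)
    return matches
```

Calling files_with_participants(directory) with the directory from the first post should give back only the files worth parsing further; the rest can be ignored.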
But my output is now only "found" or "not found". I want the data mentioned in the first post.
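One possible way to get from "found" to the requested fields, as a sketch only. It assumes every file follows the structure of the example in the first post: the symbol is the text of the first link inside the div, the period looks like "Q4 2014", and each executive is a paragraph of the form "Name - Function" between the "Executives" heading and the next bold heading. The name parse_participants is made up for illustration.

```python
import re
from bs4 import BeautifulSoup


def parse_participants(html):
    """Return a list of (name, function, period, symbol) tuples.

    Hypothetical helper; relies on the layout shown in the first post.
    """
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', {'id': 'article_participants'})
    if div is None:
        return []

    # Assumption: the first link in the div carries the ticker symbol.
    link = div.find('a')
    symbol = link.get_text(strip=True) if link is not None else ''

    period = ''
    executives = []
    in_executives = False
    for p in div.find_all('p'):
        text = p.get_text(' ', strip=True)
        if p.find('strong') is not None:
            # A bold paragraph is a section heading such as Executives or Analysts.
            in_executives = text.lower() == 'executives'
            continue
        if not period:
            # Assumption: the period is written as a quarter plus a year.
            match = re.search(r'\bQ[1-4]\s+\d{4}\b', text)
            if match:
                period = match.group(0)
        if in_executives and ' - ' in text:
            # Split only on the first " - " so functions may contain dashes.
            name, function = text.split(' - ', 1)
            executives.append((name.strip(), function.strip()))

    return [(name, function, period, symbol) for name, function in executives]
```

Each tuple can then be printed with ' | '.join(row) to get lines in the Name | Function | Period | Symbol shape. Files whose markup differs from the example would need extra handling.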