Parse data from downloaded html

nikos48 · Jan-22-2020, 06:58 PM

I want to extract all the "executives" from a directory where I have stored my downloaded HTML files. The directory holds approximately 1,000 HTML files, which should contain the div id=article_participants element (files without it can be ignored):

Quote:
<DIV id=article_participants class="content_part hid">
Redhill Biopharma Ltd. (NASDAQ:<A title="" href="http://seekingalpha.com/symbol/rdhl" symbolSlug="RDHL">RDHL</A>)
Q4 2014 Earnings Conference Call
February 26, 2015 9:00 AM ET
Executives
Dror Ben Asher - CEO
Ori Shilo - Deputy CEO, Finance and Operations
Guy Goldberg - Chief Business Officer
Analysts

My output would need to be Name, Function, Period, Symbol:
Example: Ori Shilo | Deputy CEO,Finance and operations | q4 2014 | RDHL
I tried the following, but it's not sufficient:

import textwrap
import os
from bs4 import BeautifulSoup

directory ='C:/Research syntheses - Meta analysis/SeekingAlpha/out'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory,filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(),'html.parser')

print('{:<30} {:<70}'.format('Name', 'Answer'))
print('-' * 101)

Can someone help me?

***metulburr*** · (This post was last modified: Jan-23-2020, 02:55 PM by buran.)

soup.find('div': {'id','article_participants'})

Sorry, I'm on a phone and that was a pain to type.

**buran** · Jan-23-2020, 02:57 PM

cross-posted on SO

nikos48 · Jan-23-2020, 06:36 PM

Sorry, I hope we can finish this discussion on SO.

***metulburr*** · (This post was last modified: Jan-25-2020, 01:37 AM by metulburr.)

I don't use SO, so I will never see anything other than what is posted here.

Sorry, my syntax was wrong. All you have to do is adapt this to search your files:

from bs4 import BeautifulSoup

html = '''<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A title="" href="http://seekingalpha.com/symbol/rdhl" symbolSlug="RDHL">RDHL</A>)</P>
<P>Q4 2014 <SPAN class=transcript-search-span style="BACKGROUND-COLOR: yellow">Earnings</SPAN> Conference <SPAN class=transcript-search-span style="BACKGROUND-COLOR: #f38686">Call</SPAN></P>
<P>February 26, 2015 9:00 AM ET</P>
<P><STRONG>Executives</STRONG></P>
<P>Dror Ben Asher - CEO</P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P>Guy Goldberg - Chief Business Officer</P>
<P><STRONG>Analysts</STRONG></P>
'''

soup = BeautifulSoup(html, 'html.parser')
found = soup.find('div', {'id':'article_participants'})
if found:
    print('found')
else:
    print('not found')

nikos48 · Jan-25-2020, 11:24 AM

Hmm, but how would this help me with multiple HTML files? The HTML above is only an example, although the structure of every file is the same.

***metulburr*** · Jan-25-2020, 07:36 PM

I'm on mobile now so I can't write a full example, but you have already done that part in your first code snippet. You just left soup without a search and without reporting whether anything was found.

Loop over the directory for each HTML file, then run the soup search I gave on each file. If it returns data, the ID was found, so log that the file contains it. Then move on to the next file in the for loop and repeat.
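
The loop described above could look roughly like the sketch below. It is not code from this thread: the function name files_with_participants is made up for illustration, and opening the files as UTF-8 with errors='replace' is an assumption about how the pages were saved.

```python
import os
from bs4 import BeautifulSoup


def files_with_participants(directory):
    """Return the names of .html files that contain div id=article_participants."""
    matches = []
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith('.html'):
            continue  # skip anything that is not a downloaded page
        path = os.path.join(directory, filename)
        # Assumption: UTF-8 with replacement is good enough for these files.
        with open(path, 'r', encoding='utf-8', errors='replace') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        # find() returns None when the div is missing, so such files are ignored.
        if soup.find('div', {'id': 'article_participants'}) is not None:
            matches.append(filename)
    return matches
```

Calling files_with_participants(directory) with the directory from the first post would then give the list of files worth parsing further.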

nikos48 · Jan-26-2020, 03:35 PM

But my output is now only "found" or "not found". I want the data mentioned in the first post.
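
One possible way to get from "found" to the requested Name | Function | Period | Symbol rows is sketched below. It is only a sketch under assumptions that the thread does not confirm for all files: every entry sits in its own <P> tag as in the sample HTML, the executives are listed between an "Executives" and an "Analysts" paragraph, the period looks like "Q4 2014", and the first link in the div holds the ticker symbol. The function name parse_executives is hypothetical.

```python
import re
from bs4 import BeautifulSoup


def parse_executives(html):
    """Return (name, function, period, symbol) tuples, or [] if the div is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', {'id': 'article_participants'})
    if div is None:
        return []

    # Assumption: the first <a> inside the div contains the ticker symbol.
    link = div.find('a')
    symbol = link.get_text(strip=True) if link else ''

    period = ''
    rows = []
    in_executives = False
    for p in div.find_all('p'):
        text = p.get_text(' ', strip=True)
        # Assumption: the period appears as e.g. "Q4 2014" before the names.
        match = re.search(r'\bQ[1-4]\s+\d{4}\b', text)
        if match and not period:
            period = match.group(0)
        if text == 'Executives':
            in_executives = True
        elif text == 'Analysts':
            in_executives = False
        elif in_executives and ' - ' in text:
            # Split only on the first " - " so functions may contain dashes.
            name, function = text.split(' - ', 1)
            rows.append((name.strip(), function.strip(), period, symbol))
    return rows


sample = '''<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A href="http://seekingalpha.com/symbol/rdhl">RDHL</A>)</P>
<P>Q4 2014 Earnings Conference Call</P>
<P><STRONG>Executives</STRONG></P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P><STRONG>Analysts</STRONG></P>
'''

for row in parse_executives(sample):
    print(' | '.join(row))
```

Inside the directory loop, the file contents would be passed to parse_executives instead of the sample string. Files that deviate from the assumed layout would need extra handling.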
