Python Forum
Parse data from downloaded html
#1
I want to extract all the "executives" from a directory where I have stored my downloaded HTML files. The directory holds approximately 1,000 HTML files, which should contain the div id=article_participants element (files that do not contain it can be ignored):

Quote:<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A title="" href="http://seekingalpha.com/symbol/rdhl" symbolSlug="RDHL">RDHL</A>)</P>
<P>Q4 2014 <SPAN class=transcript-search-span style="BACKGROUND-COLOR: yellow">Earnings</SPAN> Conference <SPAN class=transcript-search-span style="BACKGROUND-COLOR: #f38686">Call</SPAN></P>
<P>February 26, 2015 9:00 AM ET</P>
<P><STRONG>Executives</STRONG></P>
<P>Dror Ben Asher - CEO</P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P>Guy Goldberg - Chief Business Officer</P>
<P><STRONG>Analysts</STRONG></P>

My output would need to be Name, Function, Period, Symbol:
Example: Ori Shilo | Deputy CEO,Finance and operations | q4 2014 | RDHL
I tried the following, but it's not sufficient:
import textwrap
import os
from bs4 import BeautifulSoup

directory ='C:/Research syntheses - Meta analysis/SeekingAlpha/out'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory,filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(),'html.parser')

print('{:<30} {:<70}'.format('Name', 'Answer'))
print('-' * 101)
Can someone help me?
#2
soup.find('div': {'id','article_participants'})
Sorry, I'm on a phone and it was a pain to type that.
#3
cross-posted on SO
#4
Sorry, I hope we can finish this discussion on SO.
#5
I don't use SO, so I will never see anything other than what is posted here.

Sorry, my syntax was wrong. This is the corrected search; all you have to do is adapt it to search your files:

from bs4 import BeautifulSoup

html = '''<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A title="" href="http://seekingalpha.com/symbol/rdhl" symbolSlug="RDHL">RDHL</A>)</P>
<P>Q4 2014 <SPAN class=transcript-search-span style="BACKGROUND-COLOR: yellow">Earnings</SPAN> Conference <SPAN class=transcript-search-span style="BACKGROUND-COLOR: #f38686">Call</SPAN></P>
<P>February 26, 2015 9:00 AM ET</P>
<P><STRONG>Executives</STRONG></P>
<P>Dror Ben Asher - CEO</P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P>Guy Goldberg - Chief Business Officer</P>
<P><STRONG>Analysts</STRONG></P>
'''

soup = BeautifulSoup(html, 'html.parser')
found = soup.find('div', {'id':'article_participants'})
if found:
    print('found')
else:
    print('not found')
#6
Hmm, but how would this help me with multiple HTML files? The HTML above is only an example, although all the files have the same structure.
#7
I'm on mobile now, so I can't write a full example, but you have already done most of it in your first code snippet. You just created soup without searching it or reporting whether anything was found.

Loop over the directory and, for each HTML file, run the soup.find search I gave. If it returns something, the ID was found, so log that file as containing it. Then move on to the next file in the for loop and repeat.
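A rough sketch of that loop, combining the directory walk from the first post with the soup.find call, might look like this. It has not been run against the real files; the function name files_with_participants and the encoding handling are assumptions made for illustration.

```python
import os
from bs4 import BeautifulSoup


def files_with_participants(directory):
    """Return the paths of .html files containing <div id="article_participants">."""
    matches = []
    for filename in sorted(os.listdir(directory)):
        if not filename.lower().endswith('.html'):
            continue
        path = os.path.join(directory, filename)
        # Assumption: the saved pages may not be clean UTF-8, so bad bytes are skipped.
        with open(path, 'r', encoding='utf-8', errors='ignore') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        # find() returns None when the div is absent, so such files are ignored.
        if soup.find('div', {'id': 'article_participants'}) is not None:
            matches.append(path)
    return matches
```

Calling files_with_participants(directory) with the directory from the first post would give the list of files worth parsing further.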
#8
But my output is now only "found" or "not found". I want the data mentioned in the first post.
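One possible way to get from "found" to the actual data is sketched below. It rests on assumptions taken from the single sample in the first post: the first link inside the div holds the symbol, the period is a paragraph starting with something like "Q4 2014", section headings such as Executives and Analysts are wrapped in STRONG, and each executive line has the form "Name - Function". The function name extract_executives is made up for illustration, and files that deviate from this layout (for example an en dash instead of " - ") would need extra handling.

```python
import re
from bs4 import BeautifulSoup


def extract_executives(html):
    """Return (name, function, period, symbol) tuples, or [] if the div is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', {'id': 'article_participants'})
    if div is None:
        return []

    # Assumption: the first link inside the div is the ticker symbol.
    link = div.find('a')
    symbol = link.get_text(strip=True) if link else ''

    period = ''
    people = []
    in_executives = False
    for p in div.find_all('p'):
        text = p.get_text(' ', strip=True)
        if p.find('strong') is not None:
            # A heading paragraph: only collect lines under "Executives".
            in_executives = text.lower() == 'executives'
            continue
        quarter = re.match(r'Q[1-4]\s+\d{4}', text)
        if quarter and not period:
            period = quarter.group(0)
        elif in_executives and ' - ' in text:
            name, function = text.split(' - ', 1)
            people.append((name.strip(), function.strip()))
    return [(name, function, period, symbol) for name, function in people]


# Trimmed version of the sample from the first post.
sample = '''<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A href="http://seekingalpha.com/symbol/rdhl">RDHL</A>)</P>
<P>Q4 2014 <SPAN>Earnings</SPAN> Conference <SPAN>Call</SPAN></P>
<P><STRONG>Executives</STRONG></P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P><STRONG>Analysts</STRONG></P>
</DIV>'''

for row in extract_executives(sample):
    print(' | '.join(row))
```

To run it over the whole directory, call it on the contents of each .html file inside the loop from the first post and collect the returned rows.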