Python Forum
Parse data from downloaded html
#1
I want to extract all the "executives" from a directory where I have stored my downloaded HTML files. The directory holds approximately 1,000 HTML files, which should contain the div id=article_participants element (files without it can be ignored):

Quote:<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A title="" href="http://seekingalpha.com/symbol/rdhl" symbolSlug="RDHL">RDHL</A>)</P>
<P>Q4 2014 <SPAN class=transcript-search-span style="BACKGROUND-COLOR: yellow">Earnings</SPAN> Conference <SPAN class=transcript-search-span style="BACKGROUND-COLOR: #f38686">Call</SPAN></P>
<P>February 26, 2015 9:00 AM ET</P>
<P><STRONG>Executives</STRONG></P>
<P>Dror Ben Asher - CEO</P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P>Guy Goldberg - Chief Business Officer</P>
<P><STRONG>Analysts</STRONG></P>

My output would need to be Name, Function, Period, Symbol:
Example: Ori Shilo | Deputy CEO,Finance and operations | q4 2014 | RDHL
I tried the following, but it's not sufficient:
import textwrap
import os
from bs4 import BeautifulSoup

directory ='C:/Research syntheses - Meta analysis/SeekingAlpha/out'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory,filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(),'html.parser')

print('{:<30} {:<70}'.format('Name', 'Answer'))
print('-' * 101)
Can someone help me?
#2
soup.find('div': {'id','article_participants'})
Sorry, I'm on a phone and that was a pain to type.
#3
cross-posted on SO
#4
Sorry, I hope we can finish this discussion on SO.
#5
I don't use SO, so I will never see anything other than what is posted here.

Sorry, my syntax was wrong. Here is the corrected search.
All you have to do is adapt it to search your files:

from bs4 import BeautifulSoup

html = '''<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A title="" href="http://seekingalpha.com/symbol/rdhl" symbolSlug="RDHL">RDHL</A>)</P>
<P>Q4 2014 <SPAN class=transcript-search-span style="BACKGROUND-COLOR: yellow">Earnings</SPAN> Conference <SPAN class=transcript-search-span style="BACKGROUND-COLOR: #f38686">Call</SPAN></P>
<P>February 26, 2015 9:00 AM ET</P>
<P><STRONG>Executives</STRONG></P>
<P>Dror Ben Asher - CEO</P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P>Guy Goldberg - Chief Business Officer</P>
<P><STRONG>Analysts</STRONG></P>
'''

soup = BeautifulSoup(html, 'html.parser')
found = soup.find('div', {'id':'article_participants'})
if found:
    print('found')
else:
    print('not found')
#6
Hmm, but how would this help me with multiple HTML files? The HTML above is only an example, although the structure of every file is the same.
#7
I'm on mobile now so I can't write a full example, but you have already done most of that in your first code snippet. You just left soup with no search and no check of whether anything was found.

Loop over the directory for each HTML file, then search each file with the soup.find call I gave. If that returns data, the ID was found, so log that file as containing it. Then proceed to the next file in the for loop and repeat.
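
A minimal sketch of that loop, combining the directory walk from post #1 with the corrected find call from post #5. The function name files_with_participants and the utf-8 / errors='ignore' settings are assumptions for illustration; the real encoding of the saved pages is not known from this thread.

```python
import os
from bs4 import BeautifulSoup


def files_with_participants(directory):
    """Yield (path, div) for each .html file that contains the participants div."""
    for filename in sorted(os.listdir(directory)):
        if not filename.lower().endswith('.html'):
            continue  # skip anything that is not a saved HTML page
        path = os.path.join(directory, filename)
        # Assumption: the files are roughly UTF-8; undecodable bytes are dropped.
        with open(path, 'r', encoding='utf-8', errors='ignore') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        found = soup.find('div', {'id': 'article_participants'})
        if found is not None:
            yield path, found  # this file has the ID, hand the div back
```

Calling it as `for path, div in files_with_participants(directory): print(path)` would log every file that has the ID, and files without it are skipped silently.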
#8
But my output is now only "found" or "not found". I want the data mentioned in the first post.
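
One possible way to get from "found" to the actual rows, sketched against the sample HTML from post #1 only. Everything here is an assumption drawn from that single sample: that the symbol is the text of the link whose href contains /symbol/, that the period looks like "Q4 2014", that the executives are the paragraphs between the bold "Executives" heading and the next bold heading, and that name and function are separated by " - ". The function name extract_executives is made up for this example.

```python
import re
from bs4 import BeautifulSoup

SAMPLE = '''<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A title="" href="http://seekingalpha.com/symbol/rdhl" symbolSlug="RDHL">RDHL</A>)</P>
<P>Q4 2014 <SPAN class=transcript-search-span>Earnings</SPAN> Conference <SPAN class=transcript-search-span>Call</SPAN></P>
<P>February 26, 2015 9:00 AM ET</P>
<P><STRONG>Executives</STRONG></P>
<P>Dror Ben Asher - CEO</P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P>Guy Goldberg - Chief Business Officer</P>
<P><STRONG>Analysts</STRONG></P>
'''


def extract_executives(html):
    """Return a list of (name, function, period, symbol) tuples."""
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', {'id': 'article_participants'})
    if div is None:
        return []  # file without the participants div: ignore it

    # Assumption: the ticker is the text of the first link pointing at /symbol/.
    link = div.find('a', href=re.compile(r'/symbol/'))
    symbol = link.get_text(strip=True) if link else ''

    paragraphs = div.find_all('p')

    # Assumption: the period is the first "Q<1-4> <year>" found in a paragraph.
    period = ''
    for p in paragraphs:
        match = re.search(r'\bQ[1-4]\s+\d{4}\b', p.get_text(' ', strip=True))
        if match:
            period = match.group(0)
            break

    rows = []
    in_executives = False
    for p in paragraphs:
        text = p.get_text(' ', strip=True)
        if p.find('strong'):
            # A bold paragraph is a section heading; only "Executives" opens the list.
            in_executives = text.lower() == 'executives'
            continue
        if in_executives and ' - ' in text:
            name, function = text.split(' - ', 1)
            rows.append((name.strip(), function.strip(), period, symbol))
    return rows


if __name__ == '__main__':
    for row in extract_executives(SAMPLE):
        print(' | '.join(row))
```

To run it over the whole directory, read each file as in post #1 and pass its contents to extract_executives; files without the div give an empty list. Pages that deviate from the sample layout (a different separator, or no bold headings) would need their own handling.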