Python Forum

Full Version: Trying to scrape data from HTML with no identifiers
I am using Selenium and BeautifulSoup and trying to scrape data from an HTML structure like this:

Output:
<h2>Education</h2> Entry1 <br> Entry2 <h2>Employment
I cannot figure out how to scrape everything under the Education section. The HTML is causing problems for me. I've tried a ton of different things, but nothing seems to get the data consistently. Does anyone have any idea how I can get around the HTML?
The HTML you posted does not show the enclosing tag, which is what has to be parsed to get Entry1 and Entry2. The example below uses a div for it.
The code iterates over the contents of the div element and checks whether each item is a NavigableString, which means it is a text node,
so the content of the h2 tags is not collected.
from bs4 import BeautifulSoup, NavigableString

html = '''\
<html>
<head>
  <title>Page Title</title>
</head>
<body>
  <div class="something">
    <h2>Education</h2>
    Entry1
    <br>
    Entry2
    <h2>Employment
  </div>
</body>
</html>'''

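soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', class_='something')
entries = []
# Iterating over a tag yields its direct children (tags and text nodes)
for content in div:
    # Keep only non-empty text nodes, this skips <h2> and <br>
    if isinstance(content, NavigableString) and content.strip():
        entries.append(content.strip())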

print(entries)
Output:
['Entry1', 'Entry2']
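Note that the loop above collects every text node in the div, so it would also pick up entries under Employment if that section had any. A minimal sketch of limiting the result to one section is to start at the matching h2 and walk its next_siblings until the next h2. The helper name section_entries, the Job1/Job2 entries and the use of 'html.parser' are assumptions made for illustration, and the heading match assumes the h2 text is exactly 'Education' with no extra whitespace.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Sample markup; Job1/Job2 are made-up entries added for illustration
html = '''\
<div class="something">
  <h2>Education</h2>
  Entry1
  <br>
  Entry2
  <h2>Employment</h2>
  Job1
  <br>
  Job2
</div>'''


def section_entries(soup, heading_text):
    """Return the text nodes between the given <h2> and the next <h2>."""
    heading = soup.find('h2', string=heading_text)
    if heading is None:
        return []
    entries = []
    for sibling in heading.next_siblings:
        # Stop as soon as the next section heading starts
        if isinstance(sibling, Tag) and sibling.name == 'h2':
            break
        # Keep only non-empty text nodes, this skips <br>
        if isinstance(sibling, NavigableString) and sibling.strip():
            entries.append(sibling.strip())
    return entries


soup = BeautifulSoup(html, 'html.parser')
education = section_entries(soup, 'Education')
employment = section_entries(soup, 'Employment')
print(education)
print(employment)
```

With the sample markup this prints the Education entries and the Employment entries as two separate lists. If the real headings carry extra whitespace or nested tags, the string match would need to be loosened.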
The HTML is not well written, which makes it harder to parse. It would be better structured like this, for example.
<div class="something">
    <h2>Education</h2>
    <ul>
      <li>Entry1</li>
      <li>Entry2</li>
    </ul>
    <h2>Employment</h2>
</div>
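To show why that structure is easier to work with, here is a small sketch of parsing it. It assumes the same class name as above and uses 'html.parser' only to avoid an extra dependency. Each entry sits in its own li tag, so no text-node checks are needed.

```python
from bs4 import BeautifulSoup

html = '''\
<div class="something">
    <h2>Education</h2>
    <ul>
      <li>Entry1</li>
      <li>Entry2</li>
    </ul>
    <h2>Employment</h2>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_='something')
# Every entry is a list item, so find_all() is enough
entries = [li.get_text(strip=True) for li in div.find_all('li')]
print(entries)
```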
Thank you. Yes, the HTML is not good. It is a pain, so I had to come up with other solutions.