Python Forum
Trying to scrape data from HTML with no identifiers
#1
I am using Selenium and BeautifulSoup and trying to scrape data from an HTML structure like this:

Output:
<h2>Education</h2> Entry1 <br> Entry2 <h2>Employment
I cannot figure out how to scrape everything under the Education section. The HTML structure is causing problems for me. I've tried many different approaches, but none of them gets the data consistently. Does anyone have an idea how I can work around this HTML?
#2
The snippet does not show the enclosing tag, which has to be parsed to get Entry1 and Entry2. The code below assumes a div with class "something" as that parent.
It iterates over the contents of the div and checks whether each item is a NavigableString, which means a text node,
so the content of the h2 tags is skipped.
from bs4 import BeautifulSoup, NavigableString

html = '''\
<html>
<head>
  <title>Page Title</title>
</head>
<body>
  <div class="something">
    <h2>Education</h2>
    Entry1
    <br>
    Entry2
    <h2>Employment
  </div>
</body>
</html>'''

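soup = BeautifulSoup(html, 'lxml')
# The parent div is an assumption; use whatever tag wraps the section on the real page
div = soup.find('div', class_='something')
entries = []
# Iterating over a tag yields its direct children (tags and text nodes)
for content in div:
    # Keep only non-empty text nodes; tags such as h2 and br are skipped
    if isinstance(content, NavigableString) and content.strip():
        entries.append(content.strip())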

print(entries)
Output:
['Entry1', 'Entry2']
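Note that the loop above collects every bare text node in the div, so it would also pick up entries under Employment if there were any. A minimal sketch of one way to limit it to a single section is to start at the heading and walk its siblings until the next h2. The helper name `section_entries` and the extra `Job1` entry are made up for illustration, and `html.parser` is used here only so the example needs no extra install.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Sample markup; "Job1" is an invented entry to show that it is not picked up
html = '''\
<div class="something">
  <h2>Education</h2>
  Entry1
  <br>
  Entry2
  <h2>Employment</h2>
  Job1
</div>'''


def section_entries(soup, heading_text):
    """Return the bare text nodes between the given h2 and the next h2."""
    heading = soup.find('h2', string=lambda s: s and s.strip() == heading_text)
    if heading is None:
        return []
    entries = []
    for node in heading.next_siblings:
        # Stop as soon as the next section heading starts
        if isinstance(node, Tag) and node.name == 'h2':
            break
        # Keep non-empty text nodes, skip tags such as <br>
        if isinstance(node, NavigableString) and node.strip():
            entries.append(node.strip())
    return entries


soup = BeautifulSoup(html, 'html.parser')
print(section_entries(soup, 'Education'))
```

This relies on the entries being direct siblings of the heading. If the real page nests them differently, the walk has to be adjusted.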
The HTML is not well written, which makes it harder to parse. Something like this would be better:
<div class="something">
    <h2>Education</h2>
    <ul>
      <li>Entry1</li>
      <li>Entry2</li>
    </ul>
    <h2>Employment</h2>
</div>
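With markup structured like that, the parsing becomes much simpler. As a rough sketch, assuming the list always directly follows its heading, something like this would do (again using `html.parser` only to avoid an extra dependency):

```python
from bs4 import BeautifulSoup

html = '''\
<div class="something">
    <h2>Education</h2>
    <ul>
      <li>Entry1</li>
      <li>Entry2</li>
    </ul>
    <h2>Employment</h2>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
# Find the heading by its text, then take the first <ul> sibling after it
heading = soup.find('h2', string='Education')
ul = heading.find_next_sibling('ul')
entries = [li.get_text(strip=True) for li in ul.find_all('li')]
print(entries)
```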
#3
Thank you. Yes, the HTML is not good. It is a pain, so I had to come up with other solutions.