Trying to scrape data from HTML with no identifiers

pythonpaul32

I am using Selenium and BeautifulSoup and trying to scrape data from an HTML structure like this:

<h2>Education</h2>
Entry1
<br>
Entry2
<h2>Employment

I cannot figure out how to scrape everything under the Education section. The HTML is causing problems for me. I've tried a ton of different things, but nothing seems to get the data consistently. Does anyone have any idea how I can get around the HTML?

***snippsat*** · (This post was last modified: Nov-29-2023, 04:27 PM by snippsat.)

The HTML shown does not include the enclosing tag, which is what has to be parsed to get Entry1 and Entry2.
The code below iterates over the contents of the div element and checks whether each item is a NavigableString, which means a text node,
so the content of the h2 tag is skipped.

from bs4 import BeautifulSoup, NavigableString

html = '''\
<html>
<head>
  <title>Page Title</title>
</head>
<body>
  <div class="something">
    <h2>Education</h2>
    Entry1
    <br>
    Entry2
    <h2>Employment
  </div>
</body>
</html>'''

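# Parse the page and locate the container that holds the section
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', class_='something')
entries = []
# Direct children of the div are either tags (h2, br) or bare text nodes
for content in div:
    # Keep only non-empty text nodes; tags such as <h2> and <br> are skipped
    if isinstance(content, NavigableString) and content.strip():
        entries.append(content.strip())

print(entries)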

Output:
['Entry1', 'Entry2']
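
If the Employment section also has bare text entries, the loop above would collect those too, since it takes every text node in the div. A minimal sketch of one way to limit the result to a single section is to start at the matching h2 and walk its following siblings until the next h2. The helper name entries_under, the closed </h2> tag, and the Job1 entry are assumptions made for illustration and are not from the original page; html.parser is used here only so the example needs nothing beyond bs4.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical markup: same shape as the question, plus an assumed
# closing </h2> and an invented "Job1" entry under Employment.
html = '''\
<div class="something">
  <h2>Education</h2>
  Entry1
  <br>
  Entry2
  <h2>Employment</h2>
  Job1
</div>'''


def entries_under(soup, heading_text):
    """Return the text nodes between the matching <h2> and the next <h2>."""
    heading = soup.find('h2', string=lambda s: s and s.strip() == heading_text)
    if heading is None:
        return []
    entries = []
    for sibling in heading.next_siblings:
        # Stop as soon as the next section heading starts
        if isinstance(sibling, Tag) and sibling.name == 'h2':
            break
        # Keep non-empty text nodes; other tags such as <br> are skipped
        if isinstance(sibling, NavigableString) and sibling.strip():
            entries.append(sibling.strip())
    return entries


soup = BeautifulSoup(html, 'html.parser')
education = entries_under(soup, 'Education')
print(education)
```

With Selenium, the same function should work on BeautifulSoup(driver.page_source, 'html.parser'), assuming the page has finished loading.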

The HTML is not well written, which makes it harder to parse. It would be better structured like this, for example:

<div class="something">
    <h2>Education</h2>
    <ul>
      <li>Entry1</li>
      <li>Entry2</li>
    </ul>
    <h2>Employment</h2>
</div>
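
For comparison, here is a rough sketch of how the better-structured markup could be parsed. Each entry sits in its own li tag, so there is no need to inspect text nodes; html.parser is an assumption here, and lxml should behave the same way.

```python
from bs4 import BeautifulSoup

html = '''\
<div class="something">
    <h2>Education</h2>
    <ul>
      <li>Entry1</li>
      <li>Entry2</li>
    </ul>
    <h2>Employment</h2>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
# Find the section heading, then the list that directly follows it
heading = soup.find('h2', string='Education')
ul = heading.find_next_sibling('ul')
items = [li.get_text(strip=True) for li in ul.find_all('li')]
print(items)
```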

pythonpaul32 · Dec-02-2023, 03:42 AM

Thank you. Yes, the HTML is not good. It is a pain, so I had to come up with other solutions.
