Posts: 68
Threads: 21
Joined: May 2021
Hi all,
I thought I would have a go at extracting all the H2 and H3 headers from a webpage, and used the following code:
import requests
from bs4 import BeautifulSoup
import pandas
url = 'https://www.brides.com/story/how-to-write-the-perfect-best-man-speech'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, features="html.parser")
body_find = soup.find('body')
for heading in body_find:
    h2 = heading.find('h2')
    print(h2)
When running this simple code, I get:
Output:
-1
None
-1
None
-1
None
-1
None
-1
-1
-1
None
-1
-1
-1
<h2 class="comp mntl-sc-block beauty-sc-block-heading mntl-sc-block-heading" id="mntl-sc-block_1-0-10"> <span class="mntl-sc-block-heading__text"> Best Man Speech Template </span> </h2>
-1
None
-1
None
-1
None
-1
None
-1
I've not seen this before in my short web scraping practice and wasn't sure what I was doing wrong, or what the -1 represented. I haven't tried to extract any H3s yet; I presume I'd have a similar issue.
Ideally, I'd like to extract the text from the H2s and H3s into a Word document, so it looks something like:
H2 (text)
    H3 (text)
    H3 (text)
H2 (text)
    H3 (text)
But at the moment I'm stuck on just getting the H2 and H3 text returned. I presume that once I can get that, I'd have to do an extraction into Excel and then somehow create a template for Microsoft Word?
I'd appreciate any advice, please, as I've been stuck on this for a little while now. Thank you.
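For anyone puzzled by the -1 values: a likely explanation is that looping directly over the body yields its direct children, and some of those are plain text nodes (often just whitespace between tags) rather than tags. Text nodes in BeautifulSoup are string subclasses, so calling .find('h2') on one appears to fall through to Python's ordinary str.find(), which returns -1 when the substring is absent. On a real tag, .find('h2') returns the first matching descendant or None. The sketch below uses a small made-up HTML string (not the real page) so it runs offline:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical stand-in for the real page, so the sketch needs no network access.
html = (
    "<body>\n"
    "<div><h2>Best Man Speech Template</h2></div>\n"
    "<p>No heading here</p>\n"
    "</body>"
)
soup = BeautifulSoup(html, features="html.parser")

results = []
for child in soup.find("body"):
    if isinstance(child, NavigableString):
        # Text node: a str subclass, so .find() behaves like str.find()
        # and gives -1 when "h2" is not a substring of the text.
        results.append(("text", child.find("h2")))
    elif isinstance(child, Tag):
        # Real tag: .find() searches descendants and gives a Tag or None.
        results.append(("tag", child.find("h2")))

for kind, value in results:
    print(kind, value)
```

If that reading is right, the mix of -1, None and the one h2 in the output above is just the body's direct children being a mix of whitespace text, tags without an h2 inside, and one tag that contains an h2.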
Posts: 7,313
Threads: 123
Joined: Sep 2016
Here are some changes; the same approach works for h3.
import requests
from bs4 import BeautifulSoup
import pandas
url = 'https://www.brides.com/story/how-to-write-the-perfect-best-man-speech'
response = requests.get(url)
soup = BeautifulSoup(response.content, features="html.parser")
for tag in soup.find_all('h2'):
    print(tag.text.strip())
Output:
Best Man Speech Template
Best Man Speech Tips
A Best Man Speech Example to Make Your Own
Best Man Speech Openers
Related Stories
Posts: 1,027
Threads: 16
Joined: Dec 2016
Mar-02-2023, 10:08 AM
(This post was last modified: Mar-02-2023, 10:08 AM by Axel_Erfurt.)
maybe this
import requests
from bs4 import BeautifulSoup
import pandas
url = 'https://www.brides.com/story/how-to-write-the-perfect-best-man-speech'
r = requests.get(url)
soup = BeautifulSoup(r.text, features="lxml")
body_find = soup.find('body')
print("h2:")
for heading in body_find.find_all('h2'):
    print(heading.text)
print("\nh3:")
for heading in body_find.find_all('h3'):
    print(heading.text)
Output:
h2:
Best Man Speech Template
Best Man Speech Tips
A Best Man Speech Example to Make Your Own
Best Man Speech Openers
Related Stories
h3:
Introduce Yourself With a Twist
Crack a Joke, Even a Corny One
Be Hilarious With a Straight Face
Introduce a Recurring Theme
Ask a Question to Answer Throughout
Rhyme-Master Flex
Read a Definition From a Dictionary
Tell a Story of How You Met
Begin With a Quote
Read Something in a Different Language
A Guide to Wedding Reception Toasts
Posts: 68
Threads: 21
Joined: May 2021
(Mar-02-2023, 10:00 AM)snippsat Wrote: Here are some changes; the same approach works for h3.
Awesome. Thank you Snippsat. I forgot to try the find_all within the loop. D'oh!
Posts: 68
Threads: 21
Joined: May 2021
(Mar-02-2023, 10:08 AM)Axel_Erfurt Wrote: maybe this
Thank you, Axel. That output looks nice and neat. I might run with this.
Posts: 68
Threads: 21
Joined: May 2021
(Mar-03-2023, 09:57 AM)knight2000 Wrote: (Mar-02-2023, 10:08 AM)Axel_Erfurt Wrote: maybe this
Hi Axel,
Just wondering, with the above code, is there any way of identifying which H2 each H3 belongs to?
For example, let's assume the following H3s sit under "Best Man Template":
Introduce yourself with a twist
Crack a joke, even a corny one
etc etc
I know I've only used "body" rather than something more specific to drill into. I did that because I thought that if I wanted to scrape another wedding speech site, for example, I could easily apply the same code to get that site's H2 and H3 headers.
Is that possible with generic code like this, please?
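One generic way to approach this, assuming the page lists each H3 after the H2 it belongs to: find_all(['h2', 'h3']) returns the tags in document order, so each H3 can be attached to the most recent H2 seen. The sketch below is only an illustration; the function name group_headings and the sample HTML are made up for the example, and it does not check whether an H3 is actually nested in the same section as its H2.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; a real page would come from requests.get(url).text.
html = """
<body>
  <h2>Best Man Speech Tips</h2>
  <h2>Best Man Speech Openers</h2>
  <h3>Introduce Yourself With a Twist</h3>
  <h3>Crack a Joke, Even a Corny One</h3>
  <h2>Related Stories</h2>
</body>
"""

def group_headings(markup):
    """Return [(h2_text, [h3_text, ...]), ...] in page order."""
    soup = BeautifulSoup(markup, features="html.parser")
    outline = []
    for tag in soup.find_all(["h2", "h3"]):
        text = tag.get_text(strip=True)
        if tag.name == "h2":
            # Start a new group for every H2.
            outline.append((text, []))
        elif outline:
            # Attach the H3 to the most recent H2.
            outline[-1][1].append(text)
        else:
            # An H3 that appears before any H2 gets a placeholder parent.
            outline.append((None, [text]))
    return outline

outline = group_headings(html)
for h2_text, h3_texts in outline:
    print(h2_text)
    for h3_text in h3_texts:
        print("    " + h3_text)
```

Because it relies only on tag names and document order, the same function should carry over to other sites, as long as they use H2 and H3 in the usual nested sense.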
Posts: 68
Threads: 21
Joined: May 2021
I'm pretty close to what I wanted with:
import requests
from bs4 import BeautifulSoup
import pandas
base_url = 'https://www.brides.com/story/how-to-write-the-perfect-best-man-speech'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, features="html.parser")
for headings in soup.find_all(['h2', 'h3']):
    print(headings, headings.text)
A sample of that output is:
Output:
<h2 class="comp mntl-sc-block beauty-sc-block-heading mntl-sc-block-heading" id="mntl-sc-block_1-0-10"> <span class="mntl-sc-block-heading__text"> Best Man Speech Template </span> </h2> Best Man Speech Template
<h2 class="comp mntl-sc-block beauty-sc-block-heading mntl-sc-block-heading" id="mntl-sc-block_1-0-22"> <span class="mntl-sc-block-heading__text"> Best Man Speech Tips </span> </h2> Best Man Speech Tips
<h2 class="comp mntl-sc-block beauty-sc-block-heading mntl-sc-block-heading" id="mntl-sc-block_1-0-46"> <span class="mntl-sc-block-heading__text"> A Best Man Speech Example to Make Your Own </span> </h2> A Best Man Speech Example to Make Your Own
<h2 class="comp mntl-sc-block beauty-sc-block-heading mntl-sc-block-heading" id="mntl-sc-block_1-0-51"> <span class="mntl-sc-block-heading__text"> Best Man Speech Openers </span> </h2> Best Man Speech Openers
<h3 class="comp mntl-sc-block beauty-sc-block-subheading mntl-sc-block-subheading" id="mntl-sc-block_1-0-54"> <span class="mntl-sc-block-subheading__text"> Introduce Yourself With a Twist </span> </h3> Introduce Yourself With a Twist
<h3 class="comp mntl-sc-block beauty-sc-block-subheading mntl-sc-block-subheading" id="mntl-sc-block_1-0-61"> <span class="mntl-sc-block-subheading__text"> Crack a Joke, Even a Corny One </span> </h3> Crack a Joke, Even a Corny One
<h3 class="comp mntl-sc-block beauty-sc-block-subheading mntl-sc-block-subheading" id="mntl-sc-block_1-0-68"> <span class="mntl-sc-block-subheading__text"> Be Hilarious With a Straight Face </span> </h3> Be Hilarious With a Straight Face
<h3 class="comp mntl-sc-block beauty-sc-block-subheading mntl-sc-block-subheading" id="mntl-sc-block_1-0-75"> <span class="mntl-sc-block-subheading__text"> Introduce a Recurring Theme </span> </h3> Introduce a Recurring Theme
Instead of the whole HTML line plus the heading text, I wanted the output to look something like:
H2, Best Man Speech Template
H2, Best Man Speech Tips
H2, Best Man Speech Openers
H3, Introduce Yourself With a Twist
etc.
In other words, I would like the tag type (H2 or H3) followed by the text. I'm not sure how to print just the tag name rather than the whole tag object. Could someone enlighten me, please?
Thank you
Posts: 1,027
Threads: 16
Joined: Dec 2016
try this
for headings in soup.find_all(['h2', 'h3']):
    if str(headings).startswith("<h2"):
        print(f"H2, {headings.text.strip()}")
    else:
        print(f"H3, {headings.text.strip()}")
Output:
H2, Best Man Speech Template
H2, Best Man Speech Tips
H2, A Best Man Speech Example to Make Your Own
H2, Best Man Speech Openers
H3, Introduce Yourself With a Twist
H3, Crack a Joke, Even a Corny One
H3, Be Hilarious With a Straight Face
H3, Introduce a Recurring Theme
H3, Ask a Question to Answer Throughout
H3, Rhyme-Master Flex
H3, Read a Definition From a Dictionary
H3, Tell a Story of How You Met
H3, Begin With a Quote
H3, Read Something in a Different Language
H3, A Guide to Wedding Reception Toasts
H2, Related Stories
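As a possible alternative to checking the string form of the tag: every BeautifulSoup Tag has a .name attribute holding the tag name ('h2', 'h3', and so on), so the label can be built from that directly. A minimal sketch, again using a made-up HTML snippet in place of the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet shaped like the headings on the page.
html = (
    '<h2 class="x"> <span> Best Man Speech Template </span> </h2>'
    "<h3> <span> Begin With a Quote </span> </h3>"
)
soup = BeautifulSoup(html, features="html.parser")

# tag.name is the lowercase tag name; get_text(strip=True) drops the padding.
lines = [
    f"{tag.name.upper()}, {tag.get_text(strip=True)}"
    for tag in soup.find_all(["h2", "h3"])
]

for line in lines:
    print(line)
```

This keeps working unchanged if more heading levels are added to the find_all list.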
Posts: 68
Threads: 21
Joined: May 2021
(Mar-06-2023, 10:23 AM)Axel_Erfurt Wrote: try this
Thank you for taking the time to show me how this is done, Axel_Erfurt. I looked through the BeautifulSoup documentation for something like a 'left' or 'starts with' function but couldn't find one. Maybe I'm blind. Now I've learnt 'startswith'. Thank you.