Python Forum

Full Version: Using BeautifulSoup And Getting -1 Results
Hi all,

I thought I would have a go at extracting all the H2 and H3 headers from a webpage, and used the following code:

import requests
from bs4 import BeautifulSoup
import pandas

url = 'https://www.brides.com/story/how-to-write-the-perfect-best-man-speech'

r = requests.get(url, headers=headers)  # the headers dict is defined elsewhere (not shown in the post)
soup = BeautifulSoup(r.text, features="html.parser")
body_find = soup.find('body')



for heading in body_find:
    h2= heading.find('h2')
    print(h2)
When running this simple code, I get:

Output:
-1 None -1 None -1 None -1 None -1 -1 -1 None -1 -1 -1 <h2 class="comp mntl-sc-block beauty-sc-block-heading mntl-sc-block-heading" id="mntl-sc-block_1-0-10"> <span class="mntl-sc-block-heading__text"> Best Man Speech Template  </span> </h2> -1 None -1 None -1 None -1 None -1
I've not seen this before in my short web-scraping practice, and I'm not sure what I'm doing wrong or what the -1 represents. I haven't tried to extract any H3s yet; I presume I'd have a similar issue.

Ideally, I'd like to be able to extract the text from the H2 and H3 into a word document, so it looked something like:

H2 (Text)
h3 (text)
H3 (text)

H2 (text)
h3 (text)

But at the moment I'm stuck on just getting the H2 and H3 text returned. I presume that once I can get that, I'd have to do an extraction into Excel and then somehow create a template for Microsoft Word?

I'd appreciate any advice, as I've been stuck on this for a little while now. Thank you :)
Here are some changes; the same works for h3.
import requests
from bs4 import BeautifulSoup
import pandas

url = 'https://www.brides.com/story/how-to-write-the-perfect-best-man-speech'
response = requests.get(url)
soup = BeautifulSoup(response.content, features="html.parser")
for tag in soup.find_all('h2'):
    print(tag.text.strip())
Output:
Best Man Speech Template
Best Man Speech Tips
A Best Man Speech Example to Make Your Own
Best Man Speech Openers
Related Stories
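For anyone wondering where the -1 values in the original output came from: as far as I understand it, looping directly over `body_find` yields its direct children, and those include the whitespace strings between tags as well as the tags themselves. The strings are `NavigableString` objects, which behave like Python `str`, so calling `.find('h2')` on one is an ordinary substring search and returns -1 when nothing is found. Calling `.find('h2')` on a real tag returns the matching tag or `None`. Here is a minimal sketch using a made-up HTML snippet (not the real page) to show the three cases:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Made-up HTML for illustration only; the real page is much larger.
html = "<body>\n<div><h2>Title</h2></div>\n<p>No heading here</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.find("body")

results = []
for child in body:
    if isinstance(child, NavigableString):
        # A text node: .find() here is a substring search, -1 means "not found".
        results.append(("text", child.find("h2")))
    elif isinstance(child, Tag):
        # A real tag: .find() returns the first matching descendant, or None.
        results.append(("tag", child.find("h2")))

for kind, value in results:
    print(kind, value)
```

Using `soup.find_all('h2')` as above avoids the problem, because it searches all descendants and only ever returns tags.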
maybe this

import requests
from bs4 import BeautifulSoup
import pandas
 
url = 'https://www.brides.com/story/how-to-write-the-perfect-best-man-speech'
 
r = requests.get(url)
soup = BeautifulSoup(r.text, features="lxml")
body_find = soup.find('body')
 
print("h2:")
for heading in body_find.find_all('h2'):
    print(heading.text)

print("\nh3:")    
for heading in body_find.find_all('h3'):
    print(heading.text)
Output:
h2:
Best Man Speech Template
Best Man Speech Tips
A Best Man Speech Example to Make Your Own
Best Man Speech Openers
Related Stories

h3:
Introduce Yourself With a Twist
Crack a Joke, Even a Corny One
Be Hilarious With a Straight Face
Introduce a Recurring Theme
Ask a Question to Answer Throughout
Rhyme-Master Flex
Read a Definition From a Dictionary
Tell a Story of How You Met
Begin With a Quote
Read Something in a Different Language
A Guide to Wedding Reception Toasts
(Mar-02-2023, 10:00 AM) snippsat wrote: Here are some changes; the same works for h3.

Awesome. Thank you Snippsat. I forgot to try the find_all within the loop. D'oh!
(Mar-02-2023, 10:08 AM) Axel_Erfurt wrote: maybe this

Thank you Axel- that output looks nice and neat. I might run with this.
(Mar-03-2023, 09:57 AM) knight2000 wrote: Thank you Axel- that output looks nice and neat. I might run with this.

Hi Axel,

Just wondering, with the above code, is there any way of identifying which H2 each H3 belongs to?

For example, let's assume under "Best Man Template", the following H3's reside:

Introduce yourself with a twist
Crack a joke, even a corny one

etc etc

I know I've only used "body" and not something more specific to drill into. I did that because I thought that if I wanted to scrape another wedding speech site, for example, I could easily apply the same code to get the next site's H2 and H3 headers.

Is that possible with generic code like this please?
I'm pretty close to what I wanted with:

import requests
from bs4 import BeautifulSoup
import pandas


base_url = 'https://www.brides.com/story/how-to-write-the-perfect-best-man-speech'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, features="html.parser")

for headings in soup.find_all(['h2', 'h3']):
   print(headings, headings.text)
A sample of that output is:
Output:
<h2 class="comp mntl-sc-block beauty-sc-block-heading mntl-sc-block-heading" id="mntl-sc-block_1-0-10"> <span class="mntl-sc-block-heading__text"> Best Man Speech Template  </span> </h2> Best Man Speech Template  <h2 class="comp mntl-sc-block beauty-sc-block-heading mntl-sc-block-heading" id="mntl-sc-block_1-0-22"> <span class="mntl-sc-block-heading__text"> Best Man Speech Tips </span> </h2> Best Man Speech Tips <h2 class="comp mntl-sc-block beauty-sc-block-heading mntl-sc-block-heading" id="mntl-sc-block_1-0-46"> <span class="mntl-sc-block-heading__text"> A Best Man Speech Example to Make Your Own </span> </h2> A Best Man Speech Example to Make Your Own <h2 class="comp mntl-sc-block beauty-sc-block-heading mntl-sc-block-heading" id="mntl-sc-block_1-0-51"> <span class="mntl-sc-block-heading__text"> Best Man Speech Openers </span> </h2> Best Man Speech Openers <h3 class="comp mntl-sc-block beauty-sc-block-subheading mntl-sc-block-subheading" id="mntl-sc-block_1-0-54"> <span class="mntl-sc-block-subheading__text"> Introduce Yourself With a Twist </span> </h3> Introduce Yourself With a Twist <h3 class="comp mntl-sc-block beauty-sc-block-subheading mntl-sc-block-subheading" id="mntl-sc-block_1-0-61"> <span class="mntl-sc-block-subheading__text"> Crack a Joke, Even a Corny One </span> </h3> Crack a Joke, Even a Corny One <h3 class="comp mntl-sc-block beauty-sc-block-subheading mntl-sc-block-subheading" id="mntl-sc-block_1-0-68"> <span class="mntl-sc-block-subheading__text"> Be Hilarious With a Straight Face </span> </h3> Be Hilarious With a Straight Face <h3 class="comp mntl-sc-block beauty-sc-block-subheading mntl-sc-block-subheading" id="mntl-sc-block_1-0-75"> <span class="mntl-sc-block-subheading__text"> Introduce a Recurring Theme </span> </h3> Introduce a Recurring Theme
Instead of having the whole html line + the heading text, I was wanting to have it look something like:

H2, Best Man Speech Template
H2, Best Man Speech Tips
H2, Best Man Speech Openers
H3, Introduce Yourself With a Twist etc etc

In other words, I would like to show whether it is an H2 or an H3, followed by the text. I'm not sure how to print just the tag name rather than the whole tag object. Could someone enlighten me please?

Thank you
try this

for headings in soup.find_all(['h2', 'h3']):
    if str(headings).startswith("<h2"):
        print(f"H2, {headings.text.strip()}")
    else:
        print(f"H3, {headings.text.strip()}")
Output:
H2, Best Man Speech Template
H2, Best Man Speech Tips
H2, A Best Man Speech Example to Make Your Own
H2, Best Man Speech Openers
H3, Introduce Yourself With a Twist
H3, Crack a Joke, Even a Corny One
H3, Be Hilarious With a Straight Face
H3, Introduce a Recurring Theme
H3, Ask a Question to Answer Throughout
H3, Rhyme-Master Flex
H3, Read a Definition From a Dictionary
H3, Tell a Story of How You Met
H3, Begin With a Quote
H3, Read Something in a Different Language
H3, A Guide to Wedding Reception Toasts
H2, Related Stories
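An alternative to checking the string form of the tag: every BeautifulSoup tag has a `.name` attribute holding the tag name, so the label can be read from it directly. A small sketch, using a made-up two-heading snippet rather than the real page:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the structure mimics the headings on the page.
html = """
<h2 class="x"><span> Best Man Speech Template </span></h2>
<h3 class="y"><span> Introduce Yourself With a Twist </span></h3>
"""
soup = BeautifulSoup(html, "html.parser")

lines = []
for tag in soup.find_all(["h2", "h3"]):
    # tag.name is "h2" or "h3"; get_text(strip=True) drops the surrounding whitespace.
    lines.append(f"{tag.name.upper()}, {tag.get_text(strip=True)}")

for line in lines:
    print(line)
```

This also keeps working unchanged if more heading levels (say 'h4') are added to the `find_all` list, since there is no if/else to extend.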
(Mar-06-2023, 10:23 AM) Axel_Erfurt wrote: try this

Thank you for taking the time to show me how this is done, Axel_Erfurt. I tried looking through the BeautifulSoup documentation for something like a 'left' or 'starts with' but couldn't find it. Maybe I'm blind. Now I've learnt 'startswith', thank you.
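On the earlier question of which H2 each H3 belongs to: since `find_all(['h2', 'h3'])` returns the headings in document order, one possible approach is to attach each H3 to the most recent H2 seen before it. The sketch below is only that, a sketch: the function name `group_headings` and the sample HTML are made up for illustration, and it assumes the page really does list each H3 after its parent H2, which may not hold on every site.

```python
from bs4 import BeautifulSoup


def group_headings(html):
    """Return a list of (h2_text, [h3_text, ...]) pairs in document order.

    Hypothetical helper: an h3 that appears before any h2 is grouped under None.
    """
    soup = BeautifulSoup(html, "html.parser")
    outline = []
    for tag in soup.find_all(["h2", "h3"]):
        text = tag.get_text(strip=True)
        if tag.name == "h2":
            outline.append((text, []))      # start a new group
        elif outline:
            outline[-1][1].append(text)     # attach to the latest h2
        else:
            outline.append((None, [text]))  # h3 with no preceding h2
    return outline


# Made-up sample page, not the real site.
sample = """
<body>
  <h2>Best Man Speech Tips</h2>
  <h2>Best Man Speech Openers</h2>
  <h3>Introduce Yourself With a Twist</h3>
  <h3>Crack a Joke, Even a Corny One</h3>
  <h2>Related Stories</h2>
</body>
"""

outline = group_headings(sample)
for h2_text, h3_texts in outline:
    print(f"H2 ({h2_text})")
    for h3_text in h3_texts:
        print(f"  H3 ({h3_text})")
    print()
```

The nested list it returns matches the H2/H3 layout described in the first post, so it could be a starting point for writing the headings out to a document afterwards.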