Python Forum
web scraping extract particular Div section
#1
In my HTML code I have a div section, and multiple div sections share the same class name.


<div class="_2GiuhO">Specifications</div>
<div>
<div class="_3Rrcbo V39ti-">
<div class="_2RngUh">
<div class="_2lzn0o">General</div>
<table class="_3ENrHu">
.
.
<div class="_2RngUh">
<div class="_2lzn0o">Processor And Memory Features</div>
<table class="_3ENrHu">
..
As seen above, <div class="_2RngUh"> is repeated.

I used Beautiful Soup's soup.find(class_="_2RngUh"), but it always gives the first occurrence.
I want to get the occurrence based on the child name (General, Processor And Memory Features). How do I do this?
Reply
#2
You need:
results = soup.find('div', {'class': '_2RngUh'})
Also, place your HTML in python tags. Even though it's not Python, they will maintain indentation.
Reply
#3
Thanks for your reply,

results = soup.find('div', {'class': '_2RngUh'})
even this gives only the first occurrence of the class.

But I want to fetch the 2nd or 3rd occurrence based on the child name (General, Processor And Memory Features).
Reply
#4
Change find to find_all, and select the wanted item.
Suppose it's the third item:
results = soup.find_all('div', {'class': '_2RngUh'})
desired_result = results[2]
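A minimal runnable version of this approach, using a simplified copy of the thread's markup (tables and other children trimmed for brevity):

```python
from bs4 import BeautifulSoup

# Simplified version of the page's markup from the thread
html = '''
<div class="_2RngUh"><div class="_2lzn0o">General</div></div>
<div class="_2RngUh"><div class="_2lzn0o">Processor And Memory Features</div></div>
'''

soup = BeautifulSoup(html, 'html.parser')
# find_all returns every matching tag; index into the list to pick one
results = soup.find_all('div', {'class': '_2RngUh'})
desired_result = results[1]  # second occurrence (0-based index)
print(desired_result.find(class_='_2lzn0o').text)  # Processor And Memory Features
```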
Reply
#5
Thank you so much, I can get it based on the index.
But can we get it based on its child tag (General, Processor And Memory Features)?

<div class="_2GiuhO">Specifications</div>
<div>
<div class="_3Rrcbo V39ti-">
<div class="_2RngUh">
<div class="_2lzn0o">General</div>
<table class="_3ENrHu">
.
.
<div class="_2RngUh">
<div class="_2lzn0o">Processor And Memory Features</div>
<table class="_3ENrHu">
Reply
#6
(May-12-2020, 09:02 AM)AjayBachu Wrote: But can we get index based on its child tag General, Processor And Memory Features...?
from bs4 import BeautifulSoup

html = '''\
<div class="_2GiuhO">Specifications</div>
<div>
<div class="_3Rrcbo V39ti-">
<div class="_2RngUh">
<div class="_2lzn0o">General</div>
<table class="_3ENrHu">
<div class="_2RngUh">
<div class="_2lzn0o">Processor And Memory Features</div>
<table class="_3ENrHu">'''

soup = BeautifulSoup(html, 'lxml')
tags = soup.find_all(class_="_2RngUh")
>>> t = tags[1]
>>> t
<div class="_2RngUh">
<div class="_2lzn0o">Processor And Memory Features</div>
<table class="_3ENrHu"></table></div>

>>> t.findChild()
<div class="_2lzn0o">Processor And Memory Features</div>
>>> t.findChild().text
'Processor And Memory Features'
So this is an example of how you can test things out.
There are many functions/methods; use dir() to list them all.
A good editor or REPL will show you these options via autocomplete.
>>> dir(t)
['HTML_FORMATTERS',
 'XML_FORMATTERS',
 '__bool__',
 '__call__',
 '__class__',
 '__contains__',
 '__copy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_all_strings',
 '_find_all',
 '_find_one',
 '_formatter_for_name',
 '_is_xml',
 '_lastRecursiveChild',
 '_last_descendant',
 '_should_pretty_print',
 'append',
 'attrs',
 'can_be_empty_element',
 'childGenerator',
 'children',
 'clear',
 'contents',
 'decode',
 'decode_contents',
 'decompose',
 'descendants',
 'encode',
 'encode_contents',
 'extend',
 'extract',
 'fetchNextSiblings',
 'fetchParents',
 'fetchPrevious',
 'fetchPreviousSiblings',
 'find',
 'findAll',
 'findAllNext',
 'findAllPrevious',
 'findChild',
 'findChildren',
 'findNext',
 'findNextSibling',
 'findNextSiblings',
 'findParent',
 'findParents',
 'findPrevious',
 'findPreviousSibling',
 'findPreviousSiblings',
 'find_all',
 'find_all_next',
 'find_all_previous',
 'find_next',
 'find_next_sibling',
 'find_next_siblings',
 'find_parent',
 'find_parents',
 'find_previous',
 'find_previous_sibling',
 'find_previous_siblings',
 'format_string',
 'get',
 'getText',
 'get_attribute_list',
 'get_text',
 'has_attr',
 'has_key',
 'hidden',
 'index',
 'insert',
 'insert_after',
 'insert_before',
 'isSelfClosing',
 'is_empty_element',
 'known_xml',
 'name',
 'namespace',
 'next',
 'nextGenerator',
 'nextSibling',
 'nextSiblingGenerator',
 'next_element',
 'next_elements',
 'next_sibling',
 'next_siblings',
 'parent',
 'parentGenerator',
 'parents',
 'parserClass',
 'parser_class',
 'prefix',
 'preserve_whitespace_tags',
 'prettify',
 'previous',
 'previousGenerator',
 'previousSibling',
 'previousSiblingGenerator',
 'previous_element',
 'previous_elements',
 'previous_sibling',
 'previous_siblings',
 'recursiveChildGenerator',
 'renderContents',
 'replaceWith',
 'replaceWithChildren',
 'replace_with',
 'replace_with_children',
 'select',
 'select_one',
 'setup',
 'string',
 'strings',
 'stripped_strings',
 'text',
 'unwrap',
 'wrap']
So, would e.g. find_next() work?
>>> t.find_next()
<div class="_2lzn0o">Processor And Memory Features</div>
>>> t.find_next().text
'Processor And Memory Features'
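You can also go the other way around: find the heading by its text, then climb to the enclosing section. A sketch, assuming the same (simplified) markup from the thread:

```python
from bs4 import BeautifulSoup

# Simplified version of the page's markup from the thread
html = '''
<div class="_2RngUh"><div class="_2lzn0o">General</div><table class="_3ENrHu"></table></div>
<div class="_2RngUh"><div class="_2lzn0o">Processor And Memory Features</div><table class="_3ENrHu"></table></div>
'''

soup = BeautifulSoup(html, 'html.parser')
# Match the heading div by its text content ...
heading = soup.find('div', class_='_2lzn0o', string='General')
# ... then walk up to the section div that contains it
section = heading.find_parent('div', class_='_2RngUh')
print(section.find('table')['class'])  # ['_3ENrHu']
```

Note that the string argument only matches tags whose entire content is that one string, which holds for these heading divs.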
Reply
#7
Larz60+ Wrote:you need:
results = soup.find('div', {'class': '_2RngUh'})
You don't need that @Larz60+; I don't use the dictionary-call style anymore,
because you can just copy the class name straight from the source code and add class_ to make it work.
Example:
from bs4 import BeautifulSoup

html = '<div class="cities">London</div>'
soup = BeautifulSoup(html, 'lxml')
Usage:
# Only add _
>>> tag = soup.find(class_="cities")
>>> tag.text
'London'

>>> # A dictionary call needs more changes from what's originally there, and also needs the div tag
>>> tag = soup.find('div', {'class': 'cities'})
>>> tag.text
'London'
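For completeness, a third option is a CSS selector via select_one(), which also avoids the dictionary syntax (a sketch, reusing the same example):

```python
from bs4 import BeautifulSoup

html = '<div class="cities">London</div>'
soup = BeautifulSoup(html, 'html.parser')

# select_one takes a CSS selector; a leading dot matches by class
tag = soup.select_one('.cities')
print(tag.text)  # London
```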
Reply
#8
Thank you so much.. I will use this.
Reply

