Python Forum

Full Version: web scraping extract particular Div section
My HTML has several div sections, and many of them share the same class name.


<div class="_2GiuhO">Specifications</div>
<div>
<div class="_3Rrcbo V39ti-">
<div class="_2RngUh">
<div class="_2lzn0o">General</div>
<table class="_3ENrHu">
.
.
<div class="_2RngUh">
<div class="_2lzn0o">Processor And Memory Features</div>
<table class="_3ENrHu">
..
In the above, <div class="_2RngUh"> is repeated.

I used Beautiful Soup, soup.find(class_="_2RngUh"), but it always gives the first occurrence.
I want to get a particular occurrence based on its child's text (General, Processor And Memory Features). How can I do this?
you need:
results = soup.find('div', {'class': '_2RngUh'})
also, place your html in python tags. Even though it's not python, it will maintain indentation.
Thanks for your reply,

results = soup.find('div', {'class': '_2RngUh'})
Even this gives only the first occurrence of the class.

But I want to fetch the 2nd or 3rd occurrence based on the child's text (General, Processor And Memory Features).
Change find to find_all, and select the wanted item.
Suppose it's the third item:
results = soup.find_all('div', {'class': '_2RngUh'})
desired_result = results[2]
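To see which index holds which section before picking one, a small sketch like the following could list the heading text of every match. It assumes the markup from the question with the elided parts closed off as empty tables, uses the built-in 'html.parser' only to avoid an extra dependency, and the name titles is just illustrative.

```python
from bs4 import BeautifulSoup

# Markup from the question; the closing tags are an assumption,
# since the original snippet was truncated.
html = '''\
<div class="_2GiuhO">Specifications</div>
<div>
<div class="_3Rrcbo V39ti-">
<div class="_2RngUh">
<div class="_2lzn0o">General</div>
<table class="_3ENrHu"></table>
</div>
<div class="_2RngUh">
<div class="_2lzn0o">Processor And Memory Features</div>
<table class="_3ENrHu"></table>
</div>
</div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
results = soup.find_all('div', {'class': '_2RngUh'})

# Collect the heading text of each section, in the same order as results.
titles = []
for section in results:
    heading = section.find('div', {'class': '_2lzn0o'})
    titles.append(heading.get_text(strip=True) if heading else None)

for index, title in enumerate(titles):
    print(index, title)

# Look the index up by heading text instead of hard-coding it.
desired_result = results[titles.index('Processor And Memory Features')]
```

Note that titles.index() raises ValueError when the heading is not present, so real pages may need a check first.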
Thank you so much, I can get it based on index.
But can we get index based on its child tag General, Processor And Memory Features...?

<div class="_2GiuhO">Specifications</div>
<div>
<div class="_3Rrcbo V39ti-">
<div class="_2RngUh">
<div class="_2lzn0o">General</div>
<table class="_3ENrHu">
.
.
<div class="_2RngUh">
<div class="_2lzn0o">Processor And Memory Features</div>
<table class="_3ENrHu">
(May-12-2020, 09:02 AM) AjayBachu Wrote: But can we get index based on its child tag General, Processor And Memory Features...?
from bs4 import BeautifulSoup

html = '''\
<div class="_2GiuhO">Specifications</div>
<div>
<div class="_3Rrcbo V39ti-">
<div class="_2RngUh">
<div class="_2lzn0o">General</div>
<table class="_3ENrHu">
<div class="_2RngUh">
<div class="_2lzn0o">Processor And Memory Features</div>
<table class="_3ENrHu">'''

soup = BeautifulSoup(html, 'lxml')
tags = soup.find_all(class_="_2RngUh")
>>> t = tags[1]
>>> t
<div class="_2RngUh">
<div class="_2lzn0o">Processor And Memory Features</div>
<table class="_3ENrHu"></table></div>

>>> t.findChild()
<div class="_2lzn0o">Processor And Memory Features</div>
>>> t.findChild().text
'Processor And Memory Features'
So this is an example of how you can test things out.
There are many functions/methods available; you can use dir() to show them all.
A good editor or REPL will show you these options through autocomplete.
>>> dir(t)
['HTML_FORMATTERS',
 'XML_FORMATTERS',
 '__bool__',
 '__call__',
 '__class__',
 '__contains__',
 '__copy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_all_strings',
 '_find_all',
 '_find_one',
 '_formatter_for_name',
 '_is_xml',
 '_lastRecursiveChild',
 '_last_descendant',
 '_should_pretty_print',
 'append',
 'attrs',
 'can_be_empty_element',
 'childGenerator',
 'children',
 'clear',
 'contents',
 'decode',
 'decode_contents',
 'decompose',
 'descendants',
 'encode',
 'encode_contents',
 'extend',
 'extract',
 'fetchNextSiblings',
 'fetchParents',
 'fetchPrevious',
 'fetchPreviousSiblings',
 'find',
 'findAll',
 'findAllNext',
 'findAllPrevious',
 'findChild',
 'findChildren',
 'findNext',
 'findNextSibling',
 'findNextSiblings',
 'findParent',
 'findParents',
 'findPrevious',
 'findPreviousSibling',
 'findPreviousSiblings',
 'find_all',
 'find_all_next',
 'find_all_previous',
 'find_next',
 'find_next_sibling',
 'find_next_siblings',
 'find_parent',
 'find_parents',
 'find_previous',
 'find_previous_sibling',
 'find_previous_siblings',
 'format_string',
 'get',
 'getText',
 'get_attribute_list',
 'get_text',
 'has_attr',
 'has_key',
 'hidden',
 'index',
 'insert',
 'insert_after',
 'insert_before',
 'isSelfClosing',
 'is_empty_element',
 'known_xml',
 'name',
 'namespace',
 'next',
 'nextGenerator',
 'nextSibling',
 'nextSiblingGenerator',
 'next_element',
 'next_elements',
 'next_sibling',
 'next_siblings',
 'parent',
 'parentGenerator',
 'parents',
 'parserClass',
 'parser_class',
 'prefix',
 'preserve_whitespace_tags',
 'prettify',
 'previous',
 'previousGenerator',
 'previousSibling',
 'previousSiblingGenerator',
 'previous_element',
 'previous_elements',
 'previous_sibling',
 'previous_siblings',
 'recursiveChildGenerator',
 'renderContents',
 'replaceWith',
 'replaceWithChildren',
 'replace_with',
 'replace_with_children',
 'select',
 'select_one',
 'setup',
 'string',
 'strings',
 'stripped_strings',
 'text',
 'unwrap',
 'wrap']
So, for example, find_next() would also work:
>>> t.find_next()
<div class="_2lzn0o">Processor And Memory Features</div>
>>> t.find_next().text
'Processor And Memory Features'
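Going the other way, from the heading text to its enclosing section, might look like the sketch below. It searches for the heading div by its exact text with the string argument and then walks up with find_parent(). The closed-off markup and the 'html.parser' choice are assumptions made so the snippet runs on its own; an exact string match also assumes the heading contains nothing but that text.

```python
from bs4 import BeautifulSoup

# Markup from the question; closing tags added as an assumption.
html = '''\
<div class="_2GiuhO">Specifications</div>
<div>
<div class="_3Rrcbo V39ti-">
<div class="_2RngUh">
<div class="_2lzn0o">General</div>
<table class="_3ENrHu"></table>
</div>
<div class="_2RngUh">
<div class="_2lzn0o">Processor And Memory Features</div>
<table class="_3ENrHu"></table>
</div>
</div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Find the heading div whose text is exactly the wanted title.
heading = soup.find('div', class_='_2lzn0o',
                    string='Processor And Memory Features')

# Walk up to the enclosing section; guard against a missing heading.
section = heading.find_parent('div', class_='_2RngUh') if heading else None

# The table that belongs to this section.
table = section.find('table') if section else None
print(table)
```

If the text on the real page carries extra whitespace, comparing get_text(strip=True) in a loop over find_all() is a more forgiving alternative.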
Larz60+ Wrote: you need:
results = soup.find('div', {'class': '_2RngUh'})
You don't need that, @Larz60+. I do not use the dictionary form anymore,
because you can copy the class name directly from the source code and just add the underscore (class_) to make it work.
Example:
from bs4 import BeautifulSoup

html = '<div class="cities">London</div>'
soup = BeautifulSoup(html, 'lxml')
Usage:
# Only add _
>>> tag = soup.find(class_="cities")
>>> tag.text
'London'

>>> # The dictionary form needs more changes to the original attribute text, and here it also names the div tag
>>> tag = soup.find('div', {'class': 'cities'})
>>> tag.text
'London'
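As a side-by-side check, the sketch below runs both forms from this thread plus a CSS selector through select_one() (which appears in the dir() listing earlier) and confirms they return the same tag. The 'html.parser' choice is only an assumption to keep the snippet dependency-free.

```python
from bs4 import BeautifulSoup

html = '<div class="cities">London</div>'
soup = BeautifulSoup(html, 'html.parser')

# Keyword form: class_ avoids the clash with Python's class keyword.
by_keyword = soup.find(class_="cities")

# Dictionary form: attribute names and values passed as a dict.
by_dict = soup.find('div', {'class': 'cities'})

# CSS selector form: the same syntax as in a stylesheet.
by_css = soup.select_one('div.cities')

print(by_keyword.text, by_dict.text, by_css.text)
```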
Thank you so much.. I will use this.