Python Forum
Web scraping read particular section
#1
I am trying to read the contents of a particular section of a web page.

I can already do this with BeautifulSoup by using find/find_all with the class name, e.g. soup.find(class_="_2RngUh").

Now I want to go to a particular section based on an input string, e.g. "product specification".
For that I need to get the class name (class_="_2RngUh") of that section so that I can read all the data in it.
Reply
#2
Can you post sample html and expected output?
Reply
#3
There is other HTML above this. The section heading is <div class="_2GiuhO">Specifications</div>. I have the text "Specifications" and I want to get the class name of that element, class="_2GiuhO".
Below is the sub-block I took from the HTML.

<div class="bhgxx2 col-12-12">
<div class="MocXoX">
<div class="_2GiuhO">Specifications</div>
<div>
<div class="_3Rrcbo V39ti-">
<div class="_2RngUh">
<div class="_2lzn0o">General</div>
<table class="_3ENrHu">
<tbody>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Sales Package</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Laptop, Power Adaptor, User Guide, Warranty Documents</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Model Number</td>
<td class="_2k4JXJ col col-9-12">
<ul>
Reply
#4
>>> import bs4
>>> html_string='''<div class="bhgxx2 col-12-12">
... <div class="MocXoX">
... <div class="_2GiuhO">Specifications</div>
... <div>
... <div class="_3Rrcbo V39ti-">
... <div class="_2RngUh">
... <div class="_2lzn0o">General</div>
... <table class="_3ENrHu">
... <tbody>
... <tr class="_3_6Uyw row">
... <td class="_3-wDH3 col col-3-12">Sales Package</td>
... <td class="_2k4JXJ col col-9-12">
... <ul>
... <li class="_3YhLQA">Laptop, Power Adaptor, User Guide, Warranty Documents</li>
... </ul>
... </td>
... </tr>
... <tr class="_3_6Uyw row">
... <td class="_3-wDH3 col col-3-12">Model Number</td>
... <td class="_2k4JXJ col col-9-12">
... <ul>'''
>>>
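>>> # Name the parser explicitly so the result does not depend on which parsers are installed
>>> soup = bs4.BeautifulSoup(html_string, 'html.parser')
>>>
>>> # Print the first class of every element whose text is exactly 'Specifications'
>>> for element in soup.find_all(class_=True):
...     if element.get_text(strip=True) == 'Specifications':
...         print(element['class'][0])
...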
_2GiuhO
Reply
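As a follow-up to the snippet above, here is a minimal sketch of going from the heading text to the block that holds the section's data, without hard-coding the generated class names. The helper name find_heading and the trimmed sample markup are assumptions made for illustration; the structure (heading div followed by a sibling div containing one table per sub-section) is taken from the HTML posted in this thread and may differ on the live page.

```python
from bs4 import BeautifulSoup

# Trimmed sample based on the HTML posted in this thread (rows shortened).
html = """
<div class="MocXoX">
<div class="_2GiuhO">Specifications</div>
<div>
<div class="_3Rrcbo V39ti-">
<div class="_2RngUh">
<div class="_2lzn0o">General</div>
<table class="_3ENrHu"><tbody><tr><td>Type</td><td>Laptop</td></tr></tbody></table>
</div>
<div class="_2RngUh">
<div class="_2lzn0o">Warranty</div>
<table class="_3ENrHu"><tbody><tr><td>Domestic Warranty</td><td>1 Year</td></tr></tbody></table>
</div>
</div>
</div>
</div>
"""


def find_heading(soup, text):
    """Hypothetical helper: return the tag whose text node equals `text`, or None."""
    node = soup.find(string=lambda s: s is not None and s.strip() == text)
    return node.parent if node is not None else None


soup = BeautifulSoup(html, "html.parser")

heading = find_heading(soup, "Specifications")
print(heading["class"])  # class list of the heading element

# In the posted HTML the section body is the <div> right after the heading.
body = heading.find_next_sibling("div")

# Each sub-section title is the <div> just before its <table>.
titles = [
    table.find_previous_sibling("div").get_text(strip=True)
    for table in body.find_all("table")
]
print(titles)
```

Navigating by structure (next sibling, previous sibling) rather than by class name may hold up better here, since class names like _2GiuhO look auto-generated and could change between page versions.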
#5
Thank you so much. The code above works and I can get the class name.

Now the main question: inside this block there are many sub-sections, and I want to extract each one, selected by an input string, as name/value pairs.

https://www.flipkart.com/dell-inspiron-5...86b0f7d7f1

On this page, after going to the Specifications section, the input would be a sub-section name such as "General" or "Processor And Memory Features".

I want the output as name/value pairs, or as two lists, namelist and valuelist.
E.g. for the General section: namelist [Model Number, Part Number] and valuelist [5584, SLV-C568123WIN9].

Below is the HTML for the Specifications section.

<div class="bhgxx2 col-12-12">
<div class="MocXoX">
<div class="_2GiuhO">Specifications</div>
<div>
<div class="_3Rrcbo V39ti-">
<div class="_2RngUh">
<div class="_2lzn0o">General</div>
<table class="_3ENrHu">
<tbody>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Sales Package</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Laptop, Power Adaptor, User Guide, Warranty Documents</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Model Number</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">5584</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Part Number</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">SLV-C568123WIN9</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Series</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Inspiron 5000</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Color</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Silver</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Type</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Laptop</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Suitable For</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Processing &amp; Multitasking</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Battery Backup</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Upto 6 hours</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Battery Cell</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">3 cell</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">MS Office Provided</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Yes</li>
</ul>
</td>
</tr>
</tbody>
</table>
</div>
<div class="_2RngUh">
<div class="_2lzn0o">Processor And Memory Features</div>
<table class="_3ENrHu">
<tbody>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Dedicated Graphic Memory Type</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">GDDR5</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Dedicated Graphic Memory Capacity</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">2 GB</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Processor Brand</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Intel</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Processor Name</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Core i5</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Processor Generation</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">8th Gen</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">SSD</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Yes</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">SSD Capacity</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">512 GB</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">RAM</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">8 GB</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">RAM Type</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">DDR4</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">HDD Capacity</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">1 TB</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Processor Variant</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">8265U</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Clock Speed</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">1.6 GHz with Turbo Boost Upto 3.9 GHz</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Memory Slots</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">2 Slots</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Expandable Memory</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Upto 32 GB</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">RAM Frequency</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">2666 MHz</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Cache</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">6 MB</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Graphic Processor</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">NVIDIA Geforce MX130</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Number of Cores</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">4</li>
</ul>
</td>
</tr>
</tbody>
</table>
</div>
<div class="_2RngUh">
<div class="_2lzn0o">Operating System</div>
<table class="_3ENrHu">
<tbody>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">OS Architecture</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">64 bit</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Operating System</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Windows 10 Home</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">System Architecture</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">64 bit</li>
</ul>
</td>
</tr>
</tbody>
</table>
</div>
<div class="_2RngUh">
<div class="_2lzn0o">Port And Slot Features</div>
<table class="_3ENrHu">
<tbody>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Mic In</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Yes</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">RJ45</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Yes</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">USB Port</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">1 x USB 3.1 Type C (1st Gen), 3 x USB 3.1 (1st Gen)</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">HDMI Port</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">1 x HDMI Port (v1.4b)</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Multi Card Slot</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">3-in-1 Card Reader (SD, SDHC, SDXC)</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Hardware Interface</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">SATA</li>
</ul>
</td>
</tr>
</tbody>
</table>
</div>
<div class="_2RngUh">
<div class="_2lzn0o">Display And Audio Features</div>
<table class="_3ENrHu">
<tbody>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Touchscreen</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">No</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Screen Size</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">39.62 cm (15.6 inch)</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Screen Resolution</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">1920 x 1080 Pixel</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Screen Type</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Full HD LED Backlit Anti-glare IPS Display</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Speakers</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Built-in Dual Speakers</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Internal Mic</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Built-in Microphones</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Sound Properties</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Waves Maxx Audio Pro</li>
</ul>
</td>
</tr>
</tbody>
</table>
</div>
<div class="_2RngUh">
<div class="_2lzn0o">Connectivity Features</div>
<table class="_3ENrHu">
<tbody>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Wireless LAN</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">IEEE 802.11ac</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Bluetooth</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">v4.1</li>
</ul>
</td>
</tr>
</tbody>
</table>
</div>
<div class="_2RngUh">
<div class="_2lzn0o">Dimensions</div>
<table class="_3ENrHu">
<tbody>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Dimensions</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">364 x 248 x 22 mm</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Weight</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">1.95 kg</li>
</ul>
</td>
</tr>
</tbody>
</table>
</div>
<div class="_2RngUh">
<div class="_2lzn0o">Additional Features</div>
<table class="_3ENrHu">
<tbody>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Disk Drive</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Not Available</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Web Camera</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">HD 720P Webcam</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Lock Port</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Kensington Lock Slot</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Antivirus</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">McAfee Multi Device Security 15 Months Subscription</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Keyboard</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">English International Keyboard</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Pointer Device</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Touchpad</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Included Software</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Microsoft Office Home and Student 2019</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Additional Features</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Li-ion Battery</li>
</ul>
</td>
</tr>
</tbody>
</table>
</div>
<div class="_2RngUh">
<div class="_2lzn0o">Warranty</div>
<table class="_3ENrHu">
<tbody>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Warranty Summary</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">1 Year Limited Hardware Warranty, In Home Service After Remote Diagnosis - Retail</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Warranty Service Type</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Onsite</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Covered in Warranty</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Manufacturing Defects</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Not Covered in Warranty</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">Physical Damage</li>
</ul>
</td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Domestic Warranty</td>
<td class="_2k4JXJ col col-9-12">
<ul>
<li class="_3YhLQA">1 Year</li>
</ul>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<button class="_2AkmmA uSQV49">Read More</button>
</div>
</div>
</div>
Reply
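One possible way to get the name/value lists asked for above is sketched here. It assumes the structure shown in the posted HTML: each sub-section is a div holding a title div followed by a table whose rows have a name cell and a value cell. The function name section_pairs is made up for this example, and the sample markup is a shortened copy of the HTML from this thread, so treat it as a starting point rather than a tested scraper for the live site.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML posted in this thread.
html = """
<div class="_3Rrcbo V39ti-">
<div class="_2RngUh">
<div class="_2lzn0o">General</div>
<table class="_3ENrHu"><tbody>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Model Number</td>
<td class="_2k4JXJ col col-9-12"><ul><li class="_3YhLQA">5584</li></ul></td>
</tr>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Part Number</td>
<td class="_2k4JXJ col col-9-12"><ul><li class="_3YhLQA">SLV-C568123WIN9</li></ul></td>
</tr>
</tbody></table>
</div>
<div class="_2RngUh">
<div class="_2lzn0o">Connectivity Features</div>
<table class="_3ENrHu"><tbody>
<tr class="_3_6Uyw row">
<td class="_3-wDH3 col col-3-12">Bluetooth</td>
<td class="_2k4JXJ col col-9-12"><ul><li class="_3YhLQA">v4.1</li></ul></td>
</tr>
</tbody></table>
</div>
</div>
"""


def section_pairs(soup, section_name):
    """Hypothetical helper: return (names, values) for the named sub-section."""
    names, values = [], []
    for div in soup.find_all("div"):
        # Only a title div has text that equals the section name exactly;
        # container divs include the text of everything inside them.
        if div.get_text(strip=True) != section_name:
            continue
        table = div.find_next_sibling("table")
        if table is None:
            continue
        for row in table.find_all("tr"):
            cells = row.find_all("td")
            if len(cells) < 2:
                continue  # skip rows without a name cell and a value cell
            names.append(cells[0].get_text(strip=True))
            values.append(cells[1].get_text(" ", strip=True))
        break  # stop at the first matching section
    return names, values


soup = BeautifulSoup(html, "html.parser")
names, values = section_pairs(soup, "General")
print(names)   # ['Model Number', 'Part Number']
print(values)  # ['5584', 'SLV-C568123WIN9']
print(dict(zip(names, values)))
```

Matching only div elements avoids picking up table cells that share a name with a section (in the full HTML, "Dimensions" and "Operating System" appear both as section titles and as row names). An unknown section name returns two empty lists.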