Web scraping read particular section - Python Forum (https://python-forum.io), Web Scraping & Web Development, thread /thread-26586.html
Web scraping read particular section - AjayBachu - May-06-2020

I am trying to read the contents of a particular section of a URL/web page. I can already do this with BeautifulSoup by searching on the class name, e.g. soup.find(class_="_2RngUh"). But now I want to go to a particular section based on an input string, e.g. "product specification". I then need to get the class name (class_="_2RngUh") for that section so that I can read all the data in it.

RE: Web scraping read particular section - anbu23 - May-06-2020

Can you post sample HTML and the expected output?

RE: Web scraping read particular section - AjayBachu - May-07-2020

There is other HTML code above this. The heading is <div class="_2GiuhO">Specifications</div>. I have the text "Specifications" and I want to get the class name for that heading, class="_2GiuhO". Below is the sub-block I took from the HTML.

<div class="bhgxx2 col-12-12"> <div class="MocXoX"> <div class="_2GiuhO">Specifications</div> <div> <div class="_3Rrcbo V39ti-"> <div class="_2RngUh"> <div class="_2lzn0o">General</div> <table class="_3ENrHu"> <tbody> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Sales Package</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Laptop, Power Adaptor, User Guide, Warranty Documents</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Model Number</td> <td class="_2k4JXJ col col-9-12"> <ul>

RE: Web scraping read particular section - anbu23 - May-07-2020

>>> import bs4
>>> html_string='''<div class="bhgxx2 col-12-12">
... <div class="MocXoX">
... <div class="_2GiuhO">Specifications</div>
... <div>
... <div class="_3Rrcbo V39ti-">
... <div class="_2RngUh">
... <div class="_2lzn0o">General</div>
... <table class="_3ENrHu">
... <tbody>
... <tr class="_3_6Uyw row">
... <td class="_3-wDH3 col col-3-12">Sales Package</td>
... <td class="_2k4JXJ col col-9-12">
... <ul>
... <li class="_3YhLQA">Laptop, Power Adaptor, User Guide, Warranty Documents</li>
... </ul>
... </td>
... </tr>
... <tr class="_3_6Uyw row">
... <td class="_3-wDH3 col col-3-12">Model Number</td>
... <td class="_2k4JXJ col col-9-12">
... <ul>'''
>>>
>>> soup = bs4.BeautifulSoup(html_string, 'html.parser')
>>>
>>> for element in soup.find_all(class_=True):
...     # Match on the element's own text, then report its first class name.
...     if element.get_text(strip=True) == 'Specifications':
...         print(element['class'][0])
...
_2GiuhO

RE: Web scraping read particular section - AjayBachu - May-08-2020

Thank you so much. The above code works and I can get the class name.

Now the major question: inside this class there are a lot of sub-sections, and I want to extract each one, as name/value pairs, based on an input string.

https://www.flipkart.com/dell-inspiron-5000-core-i5-8th-gen-8-gb-1-tb-hdd-512-gb-ssd-windows-10-home-2-graphics-5584-laptop/p/itm75586b0f7d7f1

On this page, after going to the Specifications section, and based on an input such as "General" or "Processor And Memory Features", I want the output as name/value pairs, or as two lists, namelist and valuelist. E.g. for the General section: namelist [Model Number, Part Number] and valuelist [5584, SLV-C568123WIN9].

Below is the HTML for the Specifications section.

<div class="bhgxx2 col-12-12"> <div class="MocXoX"> <div class="_2GiuhO">Specifications</div> <div> <div class="_3Rrcbo V39ti-"> <div class="_2RngUh"> <div class="_2lzn0o">General</div> <table class="_3ENrHu"> <tbody> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Sales Package</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Laptop, Power Adaptor, User Guide, Warranty Documents</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Model Number</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">5584</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Part Number</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">SLV-C568123WIN9</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Series</td> <td class="_2k4JXJ col col-9-12"> <ul>
<li class="_3YhLQA">Inspiron 5000</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Color</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Silver</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Type</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Laptop</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Suitable For</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Processing & Multitasking</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Battery Backup</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Upto 6 hours</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Battery Cell</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">3 cell</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">MS Office Provided</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Yes</li> </ul> </td> </tr> </tbody> </table> </div> <div class="_2RngUh"> <div class="_2lzn0o">Processor And Memory Features</div> <table class="_3ENrHu"> <tbody> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Dedicated Graphic Memory Type</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">GDDR5</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Dedicated Graphic Memory Capacity</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">2 GB</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Processor Brand</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Intel</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Processor Name</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Core i5</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Processor Generation</td> <td 
class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">8th Gen</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">SSD</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Yes</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">SSD Capacity</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">512 GB</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">RAM</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">8 GB</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">RAM Type</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">DDR4</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">HDD Capacity</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">1 TB</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Processor Variant</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">8265U</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Clock Speed</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">1.6 GHz with Turbo Boost Upto 3.9 GHz</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Memory Slots</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">2 Slots</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Expandable Memory</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Upto 32 GB</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">RAM Frequency</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">2666 MHz</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Cache</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">6 MB</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Graphic Processor</td> <td 
class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">NVIDIA Geforce MX130</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Number of Cores</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">4</li> </ul> </td> </tr> </tbody> </table> </div> <div class="_2RngUh"> <div class="_2lzn0o">Operating System</div> <table class="_3ENrHu"> <tbody> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">OS Architecture</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">64 bit</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Operating System</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Windows 10 Home</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">System Architecture</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">64 bit</li> </ul> </td> </tr> </tbody> </table> </div> <div class="_2RngUh"> <div class="_2lzn0o">Port And Slot Features</div> <table class="_3ENrHu"> <tbody> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Mic In</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Yes</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">RJ45</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Yes</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">USB Port</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">1 x USB 3.1 Type C (1st Gen), 3 x USB 3.1 (1st Gen)</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">HDMI Port</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">1 x HDMI Port (v1.4b)</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Multi Card Slot</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">3-in-1 Card Reader (SD, SDHC, SDXC)</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Hardware 
Interface</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">SATA</li> </ul> </td> </tr> </tbody> </table> </div> <div class="_2RngUh"> <div class="_2lzn0o">Display And Audio Features</div> <table class="_3ENrHu"> <tbody> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Touchscreen</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">No</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Screen Size</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">39.62 cm (15.6 inch)</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Screen Resolution</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">1920 x 1080 Pixel</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Screen Type</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Full HD LED Backlit Anti-glare IPS Display</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Speakers</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Built-in Dual Speakers</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Internal Mic</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Built-in Microphones</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Sound Properties</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Waves Maxx Audio Pro</li> </ul> </td> </tr> </tbody> </table> </div> <div class="_2RngUh"> <div class="_2lzn0o">Connectivity Features</div> <table class="_3ENrHu"> <tbody> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Wireless LAN</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">IEEE 802.11ac</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Bluetooth</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">v4.1</li> </ul> </td> </tr> </tbody> </table> </div> <div class="_2RngUh"> 
<div class="_2lzn0o">Dimensions</div> <table class="_3ENrHu"> <tbody> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Dimensions</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">364 x 248 x 22 mm</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Weight</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">1.95 kg</li> </ul> </td> </tr> </tbody> </table> </div> <div class="_2RngUh"> <div class="_2lzn0o">Additional Features</div> <table class="_3ENrHu"> <tbody> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Disk Drive</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Not Available</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Web Camera</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">HD 720P Webcam</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Lock Port</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Kensington Lock Slot</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Antivirus</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">McAfee Multi Device Security 15 Months Subscription</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Keyboard</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">English International Keyboard</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Pointer Device</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Touchpad</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Included Software</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Microsoft Office Home and Student 2019</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Additional Features</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Li-ion Battery</li> </ul> </td> </tr> </tbody> 
</table> </div> <div class="_2RngUh"> <div class="_2lzn0o">Warranty</div> <table class="_3ENrHu"> <tbody> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Warranty Summary</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">1 Year Limited Hardware Warranty, In Home Service After Remote Diagnosis - Retail</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Warranty Service Type</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Onsite</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Covered in Warranty</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Manufacturing Defects</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Not Covered in Warranty</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">Physical Damage</li> </ul> </td> </tr> <tr class="_3_6Uyw row"> <td class="_3-wDH3 col col-3-12">Domestic Warranty</td> <td class="_2k4JXJ col col-9-12"> <ul> <li class="_3YhLQA">1 Year</li> </ul> </td> </tr> </tbody> </table> </div> </div> <button class="_2AkmmA uSQV49">Read More</button> </div> </div> </div>
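
One possible way to get the namelist/valuelist output asked for above is sketched below. It is only a sketch under assumptions: the function name section_pairs and the trimmed HTML sample are made up for illustration, and it assumes each sub-section is a table whose title is the div sitting directly before it, as in the posted markup. It matches on that structure rather than on the generated class names, since those may differ on the live page. Matching the title div rather than any text also avoids confusing the "Dimensions" section with the "Dimensions" row inside it.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup posted in the thread (illustrative sample only).
HTML = """
<div class="MocXoX">
  <div class="_2GiuhO">Specifications</div>
  <div>
    <div class="_2RngUh">
      <div class="_2lzn0o">General</div>
      <table class="_3ENrHu"><tbody>
        <tr class="_3_6Uyw row">
          <td class="_3-wDH3 col col-3-12">Model Number</td>
          <td class="_2k4JXJ col col-9-12"><ul><li class="_3YhLQA">5584</li></ul></td>
        </tr>
        <tr class="_3_6Uyw row">
          <td class="_3-wDH3 col col-3-12">Part Number</td>
          <td class="_2k4JXJ col col-9-12"><ul><li class="_3YhLQA">SLV-C568123WIN9</li></ul></td>
        </tr>
      </tbody></table>
    </div>
    <div class="_2RngUh">
      <div class="_2lzn0o">Dimensions</div>
      <table class="_3ENrHu"><tbody>
        <tr class="_3_6Uyw row">
          <td class="_3-wDH3 col col-3-12">Dimensions</td>
          <td class="_2k4JXJ col col-9-12"><ul><li class="_3YhLQA">364 x 248 x 22 mm</li></ul></td>
        </tr>
        <tr class="_3_6Uyw row">
          <td class="_3-wDH3 col col-3-12">Weight</td>
          <td class="_2k4JXJ col col-9-12"><ul><li class="_3YhLQA">1.95 kg</li></ul></td>
        </tr>
      </tbody></table>
    </div>
  </div>
</div>
"""


def section_pairs(soup, section_title):
    """Return (namelist, valuelist) for the sub-section titled section_title.

    Hypothetical helper: assumes the title is the <div> immediately
    preceding the sub-section's <table>, and each row has two <td> cells.
    """
    namelist, valuelist = [], []
    for table in soup.find_all("table"):
        heading = table.find_previous_sibling("div")
        if heading is None or heading.get_text(strip=True) != section_title:
            continue
        for row in table.find_all("tr"):
            cells = row.find_all("td")
            if len(cells) != 2:
                continue  # skip rows that are not name/value pairs
            namelist.append(cells[0].get_text(strip=True))
            # A value cell may hold several <li> items; join them into one string.
            items = [li.get_text(strip=True) for li in cells[1].find_all("li")]
            valuelist.append(", ".join(items) or cells[1].get_text(strip=True))
    return namelist, valuelist


soup = BeautifulSoup(HTML, "html.parser")
names, values = section_pairs(soup, "General")
print(names)   # ['Model Number', 'Part Number']
print(values)  # ['5584', 'SLV-C568123WIN9']
```

If a single mapping is more convenient than two lists, dict(zip(names, values)) gives the name/value pairs. An unknown section title simply returns two empty lists.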