Python Forum

Full Version: Parsing bs4 Resultset
I'm having trouble understanding the intricacies of BeautifulSoup. I searched for a specific 'select' tag using 'find(id=...)'. The returned result was the correct 'select' along with its options. Now I'm stuck on how to extract data from that result set. I want to parse out the value and text of each option, but I can't find a method for doing that. Do I have to use string functions to brute-force the extraction, or are there bs4 methods that simplify it? TIA.
Post a sample of the HTML and what you want to parse out; then it is easier to give advice.
(Nov-08-2021, 04:54 PM)snippsat Wrote: Post a sample of the HTML and what you want to parse out; then it is easier to give advice.

Thanks for the reply.

<select id="TimeOfCallDropDownList" name="TimeOfCallDropDownList" tabindex="4"><option selected="selected" value="">Hour</option><option value="00">12:00 AM</option><option value="01">01:00 AM</option><option value="02">02:00 AM</option><option value="03">03:00 AM</option><option value="04">04:00 AM</option><option value="05">05:00 AM</option><option value="06">06:00 AM</option><option value="07">07:00 AM</option><option value="08">08:00 AM</option><option value="09">09:00 AM</option><option value="10">10:00 AM</option><option value="11">11:00 AM</option><option value="12">12:00 PM</option><option value="13">01:00 PM</option><option value="14">02:00 PM</option><option value="15">03:00 PM</option><option value="16">04:00 PM</option><option value="17">05:00 PM</option><option value="18">06:00 PM</option><option value="19">07:00 PM</option><option value="20">08:00 PM</option><option value="21">09:00 PM</option><option value="22">10:00 PM</option><option value="23">11:00 PM</option></select>

I need to parse out the option values and text.
I think I figured it out unless someone has a better idea. I converted the resultset to a string and ran it through BeautifulSoup again. Now I can 'find_all' options and process the result.
(Nov-08-2021, 07:42 PM)gw1500se Wrote: I think I figured it out unless someone has a better idea. I converted the resultset to a string and ran it through BeautifulSoup again.
You should not convert it to a string; just pass the HTML to BeautifulSoup, which converts it to Unicode for you.
Here is an example of one way to do it.
from bs4 import BeautifulSoup

html = '''\
<select id="TimeOfCallDropDownList" name="TimeOfCallDropDownList" tabindex="4">
  <option selected="selected" value="">Hour</option>
  <option value="00">12:00 AM</option>
  <option value="01">01:00 AM</option>
  <option value="02">02:00 AM</option>
  <option value="03">03:00 AM</option>
  <option value="04">04:00 AM</option>
  <option value="05">05:00 AM</option>
  <option value="06">06:00 AM</option>
  <option value="07">07:00 AM</option>
  <option value="08">08:00 AM</option>
  <option value="09">09:00 AM</option>
  <option value="10">10:00 AM</option>
  <option value="11">11:00 AM</option>
  <option value="12">12:00 PM</option>
  <option value="13">01:00 PM</option>
  <option value="14">02:00 PM</option>
  <option value="15">03:00 PM</option>
  <option value="16">04:00 PM</option>
  <option value="17">05:00 PM</option>
  <option value="18">06:00 PM</option>
  <option value="19">07:00 PM</option>
  <option value="20">08:00 PM</option>
  <option value="21">09:00 PM</option>
  <option value="22">10:00 PM</option>
  <option value="23">11:00 PM</option>
</select>'''

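# Parse with the lxml parser (requires the lxml package to be installed).
soup = BeautifulSoup(html, 'lxml')
# Select every element that has a value attribute (here, the options).
op_values = soup.select('[value]')
# Skip the first option, which is the empty "Hour" placeholder.
for val in op_values[1:]:
    print(f"{val.attrs.get('value')} --> {val.text}")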
Output:
00 --> 12:00 AM
01 --> 01:00 AM
02 --> 02:00 AM
03 --> 03:00 AM
04 --> 04:00 AM
05 --> 05:00 AM
06 --> 06:00 AM
07 --> 07:00 AM
08 --> 08:00 AM
09 --> 09:00 AM
10 --> 10:00 AM
11 --> 11:00 AM
12 --> 12:00 PM
13 --> 01:00 PM
14 --> 02:00 PM
15 --> 03:00 PM
16 --> 04:00 PM
17 --> 05:00 PM
18 --> 06:00 PM
19 --> 07:00 PM
20 --> 08:00 PM
21 --> 09:00 PM
22 --> 10:00 PM
23 --> 11:00 PM
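Since the original question started from 'find(id=...)', here is a minimal sketch of that route as well. 'find()' returns a single Tag (or None) rather than a ResultSet, and a Tag can be searched again with 'find_all()', so no string conversion or second parse should be needed. The shortened HTML sample, the use of the built-in 'html.parser', and the names 'select_tag' and 'options' are my own assumptions for illustration, not part of the posts above.

```python
from bs4 import BeautifulSoup

# Shortened sample of the posted HTML (assumption: a few options are enough to illustrate).
html = '''\
<select id="TimeOfCallDropDownList" name="TimeOfCallDropDownList" tabindex="4">
  <option selected="selected" value="">Hour</option>
  <option value="00">12:00 AM</option>
  <option value="01">01:00 AM</option>
  <option value="13">01:00 PM</option>
</select>'''

# html.parser ships with Python, so no extra parser package is required here.
soup = BeautifulSoup(html, 'html.parser')

# find() returns one Tag (or None if nothing matches), not a ResultSet.
select_tag = soup.find(id='TimeOfCallDropDownList')

# Map each option's value attribute to its visible text.
options = {}
if select_tag is not None:
    # The Tag can be searched directly; there is no need to re-parse it.
    for option in select_tag.find_all('option'):
        value = option.get('value')
        if value:  # skip the placeholder option, whose value is empty
            options[value] = option.get_text(strip=True)

print(options)
```

Testing for an empty value instead of slicing with [1:] means the placeholder is skipped wherever it appears in the list, which may be a little more robust if the page changes.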