Extracting html data using attributes - Printable Version
+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: Extracting html data using attributes (/thread-26332.html)

Extracting html data using attributes - WiPi - Apr-28-2020

Hi guys, I am trying to extract specific data from a block of html using BeautifulSoup and attributes and can only get so far. The first 3 lines of the html block are:

    <div class="pagearrange__layout-column pagearrange__layout-column--full">
    <a class="anchor" name="explorers"></a>
    <div id="explorer_shell_159760" data-tag="41" data-hash-tag="41" data-id="159760" data-aggregating="" data-owner="0" data-userid="0" data-validate="" class="trade_explorer flexBox noflex explorer explorer--demo explorer--loaded">

My code is:

    import requests
    from bs4 import BeautifulSoup as bs

    response = requests.get(url)
    content = response.content
    soup = bs(content,'lxml')
    ids = soup.find('div', class_ = 'pagearrange__layout-column pagearrange__layout-column--full')
    print(ids)

This finds the correct block of code but I now need to extract the value of the 'data-id' attribute. Any ideas please? Thanks.

RE: Extracting html data using attributes - anbu23 - Apr-28-2020

    ids = soup.find('div', id = 'explorer_shell_159760')['data-id']

RE: Extracting html data using attributes - WiPi - Apr-28-2020

There are a couple of problems with this. If I run it as-is the output is:

Also, the id cannot be matched as one fixed value: the number at the end is the same number as data-id, which changes depending on the URL.

RE: Extracting html data using attributes - snippsat - Apr-28-2020

    from bs4 import BeautifulSoup

    html = '''\
    <div class="pagearrange__layout-column pagearrange__layout-column--full">
    <a class="anchor" name="explorers"></a>
    <div id="explorer_shell_159760" data-tag="41" data-hash-tag="41" data-id="159760" data-aggregating="" data-owner="0" data-userid="0" data-validate="" class="trade_explorer flexBox noflex explorer explorer--demo explorer--loaded">'''

    soup = BeautifulSoup(html, 'lxml')

Now we can test this out. First find the div tag; here I just use the id to find it. Then attrs returns all the attributes of that tag, and as shown you can take out the one you want.

    >>> tag = soup.find(id="explorer_shell_159760")
    >>> tag
    <div class="trade_explorer flexBox noflex explorer explorer--demo explorer--loaded" data-aggregating="" data-hash-tag="41" data-id="159760" data-owner="0" data-tag="41" data-userid="0" data-validate="" id="explorer_shell_159760"></div>
    >>>
    >>> tag.attrs
    {'class': ['trade_explorer', 'flexBox', 'noflex', 'explorer', 'explorer--demo', 'explorer--loaded'], 'data-aggregating': '', 'data-hash-tag': '41', 'data-id': '159760', 'data-owner': '0', 'data-tag': '41', 'data-userid': '0', 'data-validate': '', 'id': 'explorer_shell_159760'}
    >>>
    >>> tag.attrs.get('data-id')
    '159760'

Testing @anbu23's code: it does work with the test code. This is more the way it should be done; using attrs was more of a demo of what it does.

    >>> soup.find('div', id = 'explorer_shell_159760')['data-id']
    '159760'

RE: Extracting html data using attributes - WiPi - Apr-28-2020

Guys, thanks for your replies. I think I have the solution. As I mentioned, the numbers change depending on the URL I am looking at, i.e. <div id="explorer_shell_159760" might be <div id="explorer_shell_187462" in another URL, so we have to search non-uniquely.
With your help and for completeness I believe this code works:

    from bs4 import BeautifulSoup as bs
    import re
    import requests

    response = requests.get(url)
    html = response.content
    soup = bs(html,'lxml')
    # match any id that starts with the fixed prefix, whatever number follows
    tag = soup.find_all(id = re.compile('explorer_shell_.*'))
    for data in tag:
        d=data.get('data-id')
        print(d)

and for the particular URL I was looking at the output:
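
For reference, a minimal self-contained sketch of the same non-unique id search that runs without a network request. The two HTML strings are made up to stand in for two different URLs, and the names PAGE_A, PAGE_B, SHELL_ID and extract_data_ids are illustrative, not from the thread. It also assumes the suffix is always digits, which the posts suggest but do not confirm, and uses the standard-library html.parser instead of lxml.

```python
import re
from bs4 import BeautifulSoup

# Made-up pages standing in for two different URLs; only the number differs.
PAGE_A = '<div><div id="explorer_shell_159760" data-id="159760" class="explorer"></div></div>'
PAGE_B = '<div><div id="explorer_shell_187462" data-id="187462" class="explorer"></div></div>'

# Assumption: the id is the fixed prefix followed by digits only.
SHELL_ID = re.compile(r"^explorer_shell_\d+$")


def extract_data_ids(html):
    """Return the data-id of every div whose id matches SHELL_ID."""
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("div", id=SHELL_ID)
    # .get() returns None instead of raising KeyError if data-id is missing,
    # so tags without the attribute are simply skipped.
    return [tag.get("data-id") for tag in tags if tag.get("data-id") is not None]


print(extract_data_ids(PAGE_A))
print(extract_data_ids(PAGE_B))
```

Anchoring the pattern with ^ and $ keeps it from matching unrelated ids that merely contain the prefix; the unanchored 'explorer_shell_.*' in the post above matches the same tags on the posted html.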

RE: Extracting html data using attributes - WiPi - May-04-2020

Hi guys, I'm sorry to re-ignite this thread but I am really struggling to extract data from another set of html! Here's the block I am interested in:

    <tbody class="explorer_tradeslist__tbody">
    <tr id="trade_349236564" data-ticket="349236564" class="explorer_tradeslist__row ">
    <td class="slidetable__cell slidetable__cell--fixed" style="width: 63px; min-width: 63px;">
    <a id="snap_180400_trade_349236564" class="explorer__anchor explorer__anchor--trade"></a>
    NZD/CAD
    </td>
    <td style="width: 20px; min-width: 20px;"></td>
    <td style="width: 103px; min-width: 103px;">

I am trying to extract the text 'NZD/CAD'. These are all the variants I have tried so far, all unsuccessful:

    tag=soup.find('a',class_ = 'explorer__anchor explorer__anchor--trade')
    tag=soup.find_all('td',class_ = 'slidetable__cell slidetable__cell--fixed')
    table=soup.find('table',class_='explorer_tradeslist__table alternating slidetable__table')
    table=soup.find('tbody',class_='explorer_tradeslist__tbody')
    tag=soup.find(class_='explorer_tradeslist__tbody',attrs={'id':'snap_180400_trade_349236564'})
    tag=soup.find(class_='slidetable__cell slidetable__cell--fixed',attrs={'id':'snap_180400_trade_349236564'})
    tag=soup.find_all('a',attrs={'id':'snap_180400_trade_349236564'})

Are you able to put me out of my misery please?

RE: Extracting html data using attributes - anbu23 - May-04-2020

    import bs4

    html_string='''<tbody class="explorer_tradeslist__tbody">
    <tr id="trade_349236564" data-ticket="349236564" class="explorer_tradeslist__row ">
    <td class="slidetable__cell slidetable__cell--fixed" style="width: 63px; min-width: 63px;">
    <a id="snap_180400_trade_349236564" class="explorer__anchor explorer__anchor--trade"></a>
    NZD/CAD
    </td>
    <td style="width: 20px; min-width: 20px;"></td>
    <td style="width: 103px; min-width: 103px;">
    '''
    # name the parser explicitly so bs4 does not have to guess one
    soup = bs4.BeautifulSoup(html_string, 'html.parser')
    print(soup.find("td",class_="slidetable__cell slidetable__cell--fixed").text)

RE: Extracting html data using attributes - WiPi - May-04-2020

I tried this and the output was 'Develop', which is weird as this is in the full html but under different tag names:

    <li class="left noborder nolink explorer__headerli explorer__headerli--title">
    <strong class="explorer__titlesegment explorer__titlesegment--title">
    <span class="icon icon--explorer-demo"></span>
    Develop
    </strong>
    <span class="explorer__titlesegment explorer__titlesegment--pipe">|</span>

RE: Extracting html data using attributes - anbu23 - May-04-2020

It's weird. Can you post the code you tried?

RE: Extracting html data using attributes - WiPi - May-04-2020

    from bs4 import BeautifulSoup as bs
    import requests

    url = 'url'
    response = requests.get(url)
    html = response.content
    soup = bs(html,'lxml')
    tag=soup.find("td",class_="slidetable__cell slidetable__cell--fixed").text

Actually I don't know what I did first time around but this returns 'None' now, so I guess it didn't find anything.
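
A minimal sketch for checking the td lookup against the posted fragment alone, independent of the live page. It assumes the fragment is representative of the real markup; FRAGMENT and first_fixed_cell_text are illustrative names, not from the thread. It parses with the standard-library html.parser, which keeps the tbody/tr/td tags as written even without an enclosing table, and it guards against find() returning None before reading the text.

```python
from bs4 import BeautifulSoup

# The fragment from the post above, with the open tags closed for tidiness.
FRAGMENT = '''<tbody class="explorer_tradeslist__tbody">
<tr id="trade_349236564" data-ticket="349236564" class="explorer_tradeslist__row ">
<td class="slidetable__cell slidetable__cell--fixed" style="width: 63px; min-width: 63px;">
<a id="snap_180400_trade_349236564" class="explorer__anchor explorer__anchor--trade"></a>
NZD/CAD
</td>
<td style="width: 20px; min-width: 20px;"></td>
</tr>
</tbody>'''


def first_fixed_cell_text(html):
    """Return the stripped text of the first 'fixed' cell, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ with a single class name matches tags that carry it among others.
    cell = soup.find("td", class_="slidetable__cell--fixed")
    if cell is None:
        # find() returns None when nothing matches; calling .text on it
        # would raise AttributeError, so report the miss explicitly.
        return None
    return cell.get_text(strip=True)


print(first_fixed_cell_text(FRAGMENT))
```

If the same lookup returns None on response.content, one thing worth checking (an assumption, not something confirmed in the thread) is whether the table rows are present in the downloaded html at all, for example by searching response.text for 'explorer_tradeslist__tbody'.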