BeautifulSoup : how to have a html5 attribut searched for in a regular expression ?
Python Forum (https://python-forum.io)
Forum: Python Coding (https://python-forum.io/forum-7.html)
Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
Thread: BeautifulSoup : how to have a html5 attribut searched for in a regular expression ? (/thread-26568.html)

BeautifulSoup : how to have a html5 attribut searched for in a regular expression ? - arbiel - May-05-2020

Hi,

The children of a node in the XML file I'm using each carry two Ancient Greek words as attributes, named «data-alpha» and «data-omega». Given another Greek word, I want to find the node whose «data-omega» comes alphabetically before this word and whose next sibling's «data-alpha» comes alphabetically after it.

I have managed to search for nodes by attribute with regular expressions, for example:

    import re

    def ltg1(classe):
        # Exact match on one specific class string.
        return classe is not None and classe == 'js-char-popup-item symbols-grid__item u0000 '

    def ltg2(classe):
        # The class string contains the two base classes and does not end with "disabled".
        return (classe is not None
                and re.search('js-char-popup-item symbols-grid__item', classe)
                and not re.search('^.*disabled$', classe))

    def les_lettres(doc_html, choix):
        if choix == 'ltg1':
            lst = doc_html.find_all(class_=ltg1)
        elif choix == 'ltg2':
            lst = doc_html.find_all(class_=ltg2)
        else:
            lst = []  # unknown choice: nothing to print
        for lt in lst:
            # data-template is evaluated as the arguments of dict(); eval is only safe on trusted input.
            dt = eval('dict(' + lt.attrs['data-template'] + ')')
            lettre, symbol, unicode = dt['title'], dt['symbol'], dt['number']
            print(lettre, symbol, unicode)

How can I do that?

Arbiel

RE: BeautifulSoup : how to have a html5 attribut searched for in a regular expression ? - anbu23 - May-07-2020

Can you post a sample HTML file?

RE: BeautifulSoup : how to have a html5 attribut searched for in a regular expression ?
- arbiel - May-09-2020

Hi anbu23,

Here is an extract of my XHTML file:

    <html>
      <head>
        <meta charset="UTF-8"/>
        <!--<base href="./pages/"/>-->
        <!--<base href="http://www.tabularium.be/bailly/"/>>-->
        <link rel="stylesheet" type="text/css" href="file:///home/grec/communs/grec.css"/>
      </head>
      <body>
        <section id="bailly">
          <article data-dic="bailly" data-page="0165" data-alpha="Ἀνθώ" data-omega="ἀνίημι" id="bailly-0165" class="sommaire">
            <a href="/home/grec/dictionnaires/bailly/pages/0165">Ἀνθώ - ἀνίημι </a> (bailly 0165)
          </article>
          <article data-dic="bailly" data-page="0177" data-alpha="ἀντιάνειρα" data-omega="Ἀντιγονίς" id="bailly-0177" class="sommaire">
            <a href="/home/grec/dictionnaires/bailly/pages/0177">ἀντιάνειρα - Ἀντιγονίς </a> (bailly 0177)
          </article>
          <article data-dic="bailly" data-page="0183" data-alpha="Ἀντιοδημίς" data-omega="ἀντιπαράκλησις" id="bailly-0183" class="sommaire">
            <a href="/home/grec/dictionnaires/bailly/pages/0183">Ἀντιοδημίς - ἀντιπαράκλησις </a> (bailly 0183)
          </article>
        </section>
      </body>
    </html>

My question was badly put: I do not need a regular expression, because I am not looking for a known value but for the highest value that is less than or equal to a given string.

To be more specific, suppose I'm looking for the word «ἄνοδος», which falls between the «data-alpha» of page 165 (Ἀνθώ) and the «data-alpha» of page 177 (ἀντιάνειρα). I want to find page 165. The script will then ask the user for the data-alpha and data-omega of the intermediate pages and insert the corresponding pages into the XHTML file. Obviously, if the script were looking for «ἀνιαρός», which lies between the «data-alpha» and «data-omega» of page 165, it would stop looking and inform the user that the searched-for page is page 165.

Although I am still interested in how to use HTML5-compliant attributes named «data-…» to filter XML/HTML nodes, I worked around my problem by coding a loop. I also did so because collating Ancient Greek words requires applying a function of my own that strips the diacritics, as I have not found any existing function that correctly collates «diacriticized» Greek words.
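
For readers landing on this thread, the loop described above might look roughly like the sketch below. It is not arbiel's actual code. The names `collation_key`, `locate` and `PAGES` are hypothetical, the page list is copied by hand from the extract in the post, and the sketch assumes that once diacritics are stripped, plain string comparison of lowercase Greek letters is good enough for dictionary order.

```python
import unicodedata

# Hypothetical data: (data-page, data-alpha, data-omega), taken from the
# extract in the post and assumed to be in dictionary order already.
PAGES = [
    ("0165", "Ἀνθώ", "ἀνίημι"),
    ("0177", "ἀντιάνειρα", "Ἀντιγονίς"),
    ("0183", "Ἀντιοδημίς", "ἀντιπαράκλησις"),
]


def collation_key(word):
    """Hypothetical helper: strip diacritics and case so words compare alphabetically."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Fold final sigma into regular sigma so both forms sort together.
    return stripped.lower().replace("ς", "σ")


def locate(word, pages):
    """Return (page, inside) for the last page starting at or before word.

    inside is True when word lies between that page's data-alpha and
    data-omega, and False when it falls in the gap after the page.
    Returns (None, False) when word sorts before the first page.
    """
    key = collation_key(word)
    previous = None
    for page, alpha, omega in pages:
        if key < collation_key(alpha):
            return previous, False  # word belongs to the gap before this page
        if key <= collation_key(omega):
            return page, True       # word is within this page's range
        previous = page
    return previous, False          # word sorts after the last known page


print(locate("ἄνοδος", PAGES))   # word in the gap after a page
print(locate("ἀνιαρός", PAGES))  # word inside a page's range
```

On the original question of filtering on «data-…» attributes: such names cannot be passed as keyword arguments because of the hyphen, but BeautifulSoup's `find_all` accepts them through the `attrs` dictionary, for example `find_all('article', attrs={'data-omega': True})`, and the value may also be a compiled regular expression or a function. The tuples in `PAGES` could be built from the result with `article['data-page']`, `article['data-alpha']` and `article['data-omega']`.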