BeautifulSoup : how to have a html5 attribut searched for in a regular expression ?

BeautifulSoup : how to have a html5 attribut searched for in a regular expression ? - arbiel - May-05-2020

Hi

In the XML file I'm working with, every child of a given node carries two Ancient Greek words as attributes, named «data-alpha» and «data-omega». Given another Greek word, I want to find the node whose «data-omega» sorts alphabetically before that word and whose next sibling's «data-alpha» sorts alphabetically after it.

I have managed to search for nodes by attribute using regular expressions, for example:

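import re

# Patterns for the class attribute, compiled once
RE_ITEM = re.compile('js-char-popup-item symbols-grid__item')
RE_DISABLED = re.compile('^.*disabled$')

def ltg1(classe):
	# Exact match on the whole class string
	return classe is not None and classe == 'js-char-popup-item symbols-grid__item u0000 '

def ltg2(classe):
	# Item class present, and the class string does not end with "disabled"
	return classe is not None and bool(RE_ITEM.search(classe)) and not RE_DISABLED.search(classe)


def les_lettres(doc_html, choix):
	# Select the filter function by name ('ltg1' or 'ltg2')
	filtres = {'ltg1': ltg1, 'ltg2': ltg2}
	for lt in doc_html.find_all(class_=filtres[choix]):
		# data-template is assumed to hold the arguments of a dict(...) call;
		# eval is unsafe on untrusted input
		dt = eval('dict(' + lt.attrs['data-template'] + ')')
		lettre, symbol, unicode = dt['title'], dt['symbol'], dt['number']
		print(lettre, symbol, unicode)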

How can I do the same with the «data-alpha» and «data-omega» attributes?

Arbiel

RE: BeautifulSoup : how to have a html5 attribut searched for in a regular expression ? - anbu23 - May-07-2020

Can you post a sample HTML file?

RE: BeautifulSoup : how to have a html5 attribut searched for in a regular expression ? - arbiel - May-09-2020

Hi anbu23

Below is an extract of my XHTML file:

<html>
<head>
<meta charset="UTF-8"/>
<!--<base href="./pages/"/>-->
<!--<base href="http://www.tabularium.be/bailly/"/>>-->
<link rel="stylesheet" type="text/css" href="file:///home/grec/communs/grec.css"/>
</head>
<body>
<section id="bailly">
<article data-dic="bailly" data-page="0165" data-alpha="Ἀνθώ" data-omega="ἀνίημι" id="bailly-0165" class="sommaire">
<a href="/home/grec/dictionnaires/bailly/pages/0165">Ἀνθώ - ἀνίημι </a>
(bailly 0165)
</article>
<article data-dic="bailly" data-page="0177" data-alpha="ἀντιάνειρα" data-omega="Ἀντιγονίς" id="bailly-0177" class="sommaire">
<a href="/home/grec/dictionnaires/bailly/pages/0177">ἀντιάνειρα - Ἀντιγονίς </a>
(bailly 0177)
</article>
<article data-dic="bailly" data-page="0183" data-alpha="Ἀντιοδημίς" data-omega="ἀντιπαράκλησις" id="bailly-0183" class="sommaire">
<a href="/home/grec/dictionnaires/bailly/pages/0183">Ἀντιοδημίς - ἀντιπαράκλησις </a>
(bailly 0183)
</article>
</section>
</body>
</html>

My question was badly put: I do not actually need a regular expression, since I am not looking for a known value but for the highest value that is less than or equal to a given string.

To be more specific: suppose I'm looking for the word «ἄνοδος», which falls between the «data-alpha» of page 165 (Ἀνθώ) and the «data-alpha» of page 177 (ἀντιάνειρα). I want to find page 165. The script will then ask the user for the data-alpha and data-omega of the intermediate pages and insert those pages into the XHTML file.

Of course, if the script were looking for «ἀνιαρός», which falls between the «data-alpha» and «data-omega» of page 165, it would stop searching and tell the user that the page sought is page 165.
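
The lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the names `collation_key`, `locate` and `PAGES` are hypothetical, the page ranges are copied by hand from the extract rather than read from the file, and words are compared by their bare letters after removing diacritics and case, which is a simplification and not a full Ancient Greek collation.

```python
import unicodedata

def collation_key(word):
    """Hypothetical key: compare words by their letters only."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    letters = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Ignore case and treat final sigma like a regular sigma.
    return letters.casefold().replace("ς", "σ")

# Assumed input: (data-page, data-alpha, data-omega) in document order.
PAGES = [
    ("0165", "Ἀνθώ", "ἀνίημι"),
    ("0177", "ἀντιάνειρα", "Ἀντιγονίς"),
    ("0183", "Ἀντιοδημίς", "ἀντιπαράκλησις"),
]

def locate(word, pages):
    """Return (page, found).

    found is True when the word lies within the page's alpha..omega range,
    and False when it falls in the gap after that page (page is None when
    the word sorts before the first known page).
    """
    target = collation_key(word)
    previous = None
    for page, alpha, omega in pages:
        if target < collation_key(alpha):
            return previous, False   # in the gap before this page
        if target <= collation_key(omega):
            return page, True        # inside this page's range
        previous = page
    return previous, False           # after the last known page

print(locate("ἄνοδος", PAGES))
print(locate("ἀνιαρός", PAGES))
```

With the three pages of the extract, the first call reports page 0165 with `found` set to False (intermediate pages are missing), and the second reports page 0165 with `found` set to True.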

I am still interested in how to use HTML5-compliant attributes, named «data-…», to filter XML/HTML nodes, but I worked around my problem by coding a loop. I did so also because collating Ancient Greek words requires a function of my own that strips the diacritics, as I haven't found any existing function that correctly collates «diacriticized» Greek words.
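
On the remaining question of filtering nodes on «data-…» attributes, here is a minimal sketch, not a definitive answer. It relies on the `attrs` dictionary of `find_all`, which accepts a string, a compiled regular expression or a function as the value to match. The sample markup is a shortened copy of the extract, the choice of the `html.parser` backend is an assumption, and the two filters on «data-page» are made up for the illustration.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the extract (links and text removed).
XHTML = """
<section id="bailly">
<article data-dic="bailly" data-page="0165" data-alpha="Ἀνθώ" data-omega="ἀνίημι" id="bailly-0165" class="sommaire"></article>
<article data-dic="bailly" data-page="0177" data-alpha="ἀντιάνειρα" data-omega="Ἀντιγονίς" id="bailly-0177" class="sommaire"></article>
<article data-dic="bailly" data-page="0183" data-alpha="Ἀντιοδημίς" data-omega="ἀντιπαράκλησις" id="bailly-0183" class="sommaire"></article>
</section>
"""

soup = BeautifulSoup(XHTML, "html.parser")

# A hyphenated name such as data-page is not a valid Python keyword argument,
# so it is passed through the attrs dictionary instead.
by_regex = soup.find_all("article", attrs={"data-page": re.compile(r"^01[67]\d$")})
pages_regex = [a["data-page"] for a in by_regex]

# A function receives the attribute value; guard against None in case
# the attribute is absent on some tag.
def after_page_170(value):
    return value is not None and int(value) > 170

by_function = soup.find_all("article", attrs={"data-page": after_page_170})
pages_function = [a["data-page"] for a in by_function]

print(pages_regex)
print(pages_function)
```

The same function-valued filter could carry a comparison on «data-alpha» or «data-omega», provided the function applies whatever collation the Greek words require.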