Python Forum
BeautifulSoup: how to search for an HTML5 attribute with a regular expression?
#1
Hi,

The children of one node in the XML file I'm using each carry two ancient Greek words as attributes, named «data-alpha» and «data-omega». Given another Greek word, I want to find the node whose «data-omega» sorts alphabetically before that word and whose next sibling's «data-alpha» sorts alphabetically after it.

I have managed to search for nodes by attribute with regular expressions, for example:
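import re

# Precompiled patterns used by the class filters below.
ITEM_RE = re.compile('js-char-popup-item symbols-grid__item')
DISABLED_RE = re.compile('^.*disabled$')

def ltg1(classe):
    # Exact match on the whole class string (note the trailing space).
    return classe is not None and classe == 'js-char-popup-item symbols-grid__item u0000 '

def ltg2(classe):
    # The class contains the item classes and does not end with "disabled".
    return (classe is not None
            and ITEM_RE.search(classe) is not None
            and DISABLED_RE.search(classe) is None)

def les_lettres(doc_html, choix):
    # Map the choice to its filter; an unknown choice raises KeyError.
    filtre = {'ltg1': ltg1, 'ltg2': ltg2}[choix]
    for lt in doc_html.find_all(class_=filtre):
        # eval() executes arbitrary code: only use it on trusted data-template values.
        dt = eval('dict(' + lt.attrs['data-template'] + ')')
        lettre, symbol, unicode = dt['title'], dt['symbol'], dt['number']
        print(lettre, symbol, unicode)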
How can I do that?

Arbiel
Using Ubuntu 18.04.4 LTS, Python 3.8
Having replaced Google with «https://www.lilo.org/fr/», any other unsafe mail service with «https://protonmail.com/», and AZERTY with BÉPO (French keyboard layouts)
#2
Can you post a sample HTML file?
#3
Hi anbu23,

Here is an extract from my XHTML file:
<html>
<head>
<meta charset="UTF-8"/>
<!--<base href="./pages/"/>-->
<!--<base href="http://www.tabularium.be/bailly/"/>>-->
<link rel="stylesheet" type="text/css" href="file:///home/grec/communs/grec.css"/>
</head>
<body>
<section id="bailly">
<article data-dic="bailly" data-page="0165" data-alpha="Ἀνθώ" data-omega="ἀνίημι" id="bailly-0165" class="sommaire">
<a href="/home/grec/dictionnaires/bailly/pages/0165">Ἀνθώ - ἀνίημι </a>
(bailly 0165)
</article>
<article data-dic="bailly" data-page="0177" data-alpha="ἀντιάνειρα" data-omega="Ἀντιγονίς" id="bailly-0177" class="sommaire">
<a href="/home/grec/dictionnaires/bailly/pages/0177">ἀντιάνειρα - Ἀντιγονίς </a>
(bailly 0177)
</article>
<article data-dic="bailly" data-page="0183" data-alpha="Ἀντιοδημίς" data-omega="ἀντιπαράκλησις" id="bailly-0183" class="sommaire">
<a href="/home/grec/dictionnaires/bailly/pages/0183">Ἀντιοδημίς - ἀντιπαράκλησις </a>
(bailly 0183)
</article>
</section>
</body>
</html>
My question was poorly framed: I do not need a regular expression, because I am not looking for a known value but for the highest value that is less than or equal to a given string.

To be more specific: suppose I'm looking for the word «ἄνοδος», which falls between the «data-alpha» of page 165 (Ἀνθώ) and the «data-alpha» of page 177 (ἀντιάνειρα). I want to find page 165. The script will then ask the user for the «data-alpha» and «data-omega» of the intermediate pages and insert the corresponding pages into the XHTML file.

Of course, if the script were looking for «ἀνιαρός», which falls between the «data-alpha» and «data-omega» of page 165, it would stop searching and tell the user that the page sought is page 165.
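
The lookup described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the poster's actual loop: `collation_key`, `locate`, and `PAGES` are hypothetical names, the page list is copied by hand from the extract, and the key simply drops every combining mark (including the iota subscript) before comparing code points, which may be too crude for real dictionary order.

```python
import bisect
import unicodedata

# (page, data-alpha, data-omega), copied from the XHTML extract, sorted by data-alpha.
PAGES = [
    ("0165", "Ἀνθώ", "ἀνίημι"),
    ("0177", "ἀντιάνειρα", "Ἀντιγονίς"),
    ("0183", "Ἀντιοδημίς", "ἀντιπαράκλησις"),
]

def collation_key(word):
    """Hypothetical key: decompose, drop combining marks, then case-fold."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()  # casefold also maps final sigma to sigma

def locate(word, pages):
    """Return (status, page) for the last page whose data-alpha <= word."""
    key = collation_key(word)
    alphas = [collation_key(alpha) for _, alpha, _ in pages]
    i = bisect.bisect_right(alphas, key) - 1
    if i < 0:
        return ("before_first", None)   # word sorts before every known page
    page, _, omega = pages[i]
    if key <= collation_key(omega):
        return ("in_page", page)        # word lies within this page
    return ("after_page", page)         # word lies in the gap after this page

print(locate("ἀνιαρός", PAGES))
print(locate("ἄνοδος", PAGES))
```

With the three sample pages, «ἀνιαρός» should be reported as lying in page 0165 and «ἄνοδος» as falling in the gap after page 0165, which is where the intermediate pages would be inserted.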

I am still interested in how to filter XML/HTML nodes on HTML5-compliant attributes named «data-…», but I worked around my problem by coding a loop. One reason is that collating ancient Greek words requires applying a function of my own that strips the diacritics, as I haven't found an existing function that correctly collates Greek words with diacritics.
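
On the open question of filtering on «data-…» attributes: a hyphenated name cannot be written as a Python keyword argument, but Beautiful Soup's `find_all` accepts an `attrs` dictionary whose values may be a string, a compiled regular expression, or a function. The sketch below assumes the `bs4` package is installed and uses the standard-library `html.parser`; the `XHTML` string is a shortened copy of the extract and the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the extract (assumption: the real file has the same shape).
XHTML = """<section id="bailly">
<article data-dic="bailly" data-page="0165" data-alpha="Ἀνθώ" data-omega="ἀνίημι" id="bailly-0165" class="sommaire"></article>
<article data-dic="bailly" data-page="0177" data-alpha="ἀντιάνειρα" data-omega="Ἀντιγονίς" id="bailly-0177" class="sommaire"></article>
<article data-dic="bailly" data-page="0183" data-alpha="Ἀντιοδημίς" data-omega="ἀντιπαράκλησις" id="bailly-0183" class="sommaire"></article>
</section>"""

soup = BeautifulSoup(XHTML, "html.parser")

# Exact value: pass the attribute name as a dictionary key.
exact = soup.find("article", attrs={"data-page": "0165"})

# Regular expression: articles whose data-omega ends with a final sigma.
by_regex = soup.find_all("article", attrs={"data-omega": re.compile("ς$")})

# Function: called with the attribute value, or None when the attribute is missing.
by_function = soup.find_all(
    "article",
    attrs={"data-page": lambda value: value is not None and int(value) < 180},
)

print(exact["id"])
print([tag["data-page"] for tag in by_regex])
print([tag["data-page"] for tag in by_function])
```

A function filter of this kind could also call a diacritic-stripping key, although an ordered "highest value not above the word" search still needs a loop or a bisection over the matched nodes.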