Python Forum
BeautifulSoup: how to search on an HTML5 attribute with a regular expression?
#1
Hi

The children of one node in the XML file I'm using each carry two Ancient Greek words as attributes, named «data-alpha» and «data-omega». Given another Greek word, I want to find the node whose «data-omega» comes alphabetically before that word and whose next sibling's «data-alpha» comes alphabetically after it.

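I have managed to search for nodes by attribute with regular expressions, for example:
import re

# Patterns compiled once instead of on every call.
ITEM_RE = re.compile('js-char-popup-item symbols-grid__item')
DISABLED_RE = re.compile('^.*disabled$')

def ltg1(classe):
	# Exact match on the whole class string (note the trailing space).
	return classe is not None and classe == 'js-char-popup-item symbols-grid__item u0000 '

def ltg2(classe):
	# Item classes that do not end in "disabled".
	return classe is not None and bool(ITEM_RE.search(classe)) and not DISABLED_RE.search(classe)


def les_lettres(doc_html, choix):
	filtres = {'ltg1': ltg1, 'ltg2': ltg2}
	if choix not in filtres:
		# Previously an unknown choice left lst undefined (NameError).
		raise ValueError('unknown choice: ' + repr(choix))
	lst = doc_html.find_all(class_=filtres[choix])
	for lt in lst:
		# eval() executes arbitrary code: only acceptable on trusted input.
		dt = eval('dict(' + lt.attrs['data-template'] + ')')
		lettre, symbol, unicode = dt['title'], dt['symbol'], dt['number']
		print(lettre, symbol, unicode)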
How can I do that?
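For reference, a minimal sketch of filtering on «data-…» attributes with BeautifulSoup. It relies on the `attrs` dictionary of `find_all`, whose values may be a string, a compiled pattern, a function, or `True`. The `SAMPLE` fragment, the pattern and the helper `starts_with_capital` are assumptions made up for the illustration, not part of the file discussed in this thread.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment, reduced to the attributes that matter here.
SAMPLE = """
<section id="bailly">
<article data-page="0165" data-alpha="Ἀνθώ" data-omega="ἀνίημι">p. 0165</article>
<article data-page="0177" data-alpha="ἀντιάνειρα" data-omega="Ἀντιγονίς">p. 0177</article>
</section>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# An attribute name containing a hyphen cannot be a keyword argument,
# so it goes into the attrs dictionary. The value may be a compiled pattern.
by_regex = soup.find_all("article", attrs={"data-omega": re.compile(r"μι$")})

# A function receives the attribute value, or None when the attribute is absent.
def starts_with_capital(value):
    return value is not None and value[:1].isupper()

by_function = soup.find_all("article", attrs={"data-omega": starts_with_capital})

# True matches every node that carries the attribute, whatever its value.
with_alpha = soup.find_all(attrs={"data-alpha": True})

print([a["data-page"] for a in by_regex])
print([a["data-page"] for a in by_function])
print(len(with_alpha))
```

A pattern or a function only answers "does this one value qualify?", so it cannot by itself compare a node with its next sibling.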

Arbiel
using Ubuntu 18.04.4 LTS, Python 3.8
having replaced Google with «https://www.lilo.org/fr/», any other unsafe mail service with «https://protonmail.com/», and azerty with bépo (French keyboard layouts)
#2
Can you post a sample HTML file?
#3
Hi anbu23

Here is an extract of my XHTML file:
<html>
<head>
<meta charset="UTF-8"/>
<!--<base href="./pages/"/>-->
<!--<base href="http://www.tabularium.be/bailly/"/>>-->
<link rel="stylesheet" type="text/css" href="file:///home/grec/communs/grec.css"/>
</head>
<body>
<section id="bailly">
<article data-dic="bailly" data-page="0165" data-alpha="Ἀνθώ" data-omega="ἀνίημι" id="bailly-0165" class="sommaire">
<a href="/home/grec/dictionnaires/bailly/pages/0165">Ἀνθώ - ἀνίημι </a>
(bailly 0165)
</article>
<article data-dic="bailly" data-page="0177" data-alpha="ἀντιάνειρα" data-omega="Ἀντιγονίς" id="bailly-0177" class="sommaire">
<a href="/home/grec/dictionnaires/bailly/pages/0177">ἀντιάνειρα - Ἀντιγονίς </a>
(bailly 0177)
</article>
<article data-dic="bailly" data-page="0183" data-alpha="Ἀντιοδημίς" data-omega="ἀντιπαράκλησις" id="bailly-0183" class="sommaire">
<a href="/home/grec/dictionnaires/bailly/pages/0183">Ἀντιοδημίς - ἀντιπαράκλησις </a>
(bailly 0183)
</article>
</section>
</body>
</html>
My question was badly put: I do not actually need a regular expression, because I am not looking for a known value but for the highest value that is less than or equal to a given string.

To be more specific: suppose I'm looking for the word «ἄνοδος», which falls between the «data-alpha» of page 165 (Ἀνθώ) and the «data-alpha» of page 177 (ἀντιάνειρα). I want to find page 165. The script will then ask the user for the «data-alpha» and «data-omega» of the intermediate pages and insert those pages into the XHTML file.

Obviously, if the script were looking for «ἀνιαρός», which lies between the «data-alpha» and the «data-omega» of page 165, it would stop there and tell the user that the page sought is page 165.
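The two cases above can be sketched as a binary search for the last page whose «data-alpha» is less than or equal to the word. This is only an illustration under assumptions: the list `PAGES`, the function `locate` and the status strings are hypothetical names, and the words are assumed to be already reduced to bare lowercase letters so that plain string comparison orders them.

```python
from bisect import bisect_right

# (page, data-alpha, data-omega), in page order, diacritics and case removed.
PAGES = [
    ("0165", "ανθω", "ανιημι"),
    ("0177", "αντιανειρα", "αντιγονις"),
    ("0183", "αντιοδημις", "αντιπαρακλησις"),
]

def locate(word, pages=PAGES):
    """Return (page, status) for the last page whose data-alpha <= word."""
    alphas = [alpha for _, alpha, _ in pages]
    i = bisect_right(alphas, word) - 1
    if i < 0:
        return None, "before first page"
    page, _, omega = pages[i]
    if word <= omega:
        return page, "on this page"
    # The word sorts after data-omega: it belongs to a missing page
    # somewhere between this page and the next one.
    return page, "after this page"

print(locate("ανοδος"))   # bare form of ἄνοδος
print(locate("ανιαρος"))  # bare form of ἀνιαρός
```

With the three sample pages, the first call reports the gap after page 0165 and the second reports page 0165 itself.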

I am still interested in how to filter XML/HTML nodes on HTML5-compliant «data-…» attributes, but I worked around my problem by coding a loop. I did so also because collating Ancient Greek words requires a function of my own that strips the diacritics, as I have not found any existing function that correctly collates Greek words with diacritics.
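One common way to build such a collation key with the standard library is to decompose each word and drop the combining marks. This is a sketch, not necessarily the function used here; the name `greek_key` is hypothetical, and it assumes that ignoring breathings, accents, iota subscript and case is the desired ordering.

```python
import unicodedata

def greek_key(word):
    """Sort key that ignores diacritics and case."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    # Keep only the base letters (combining class 0).
    bare = "".join(c for c in decomposed if not unicodedata.combining(c))
    # casefold() lowercases and also maps final sigma ς to σ.
    return bare.casefold()

words = ["ἀντιάνειρα", "Ἀνθώ", "ἄνοδος", "ἀνίημι"]
print(sorted(words, key=greek_key))
```

The same key can be applied to each «data-alpha» and «data-omega» value before comparing it with the word being looked up.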