Python Forum
[BeautifulSoup] Comparing attributes with a given value
#1
Hi

I'm creating an HTML file to record the first and last entry of each page of my Greek dictionary. To find the page number of a given word, for instance «ἔπος», I have to find the page whose first entry is less than or equal to «ἔπος» and whose last entry is greater than or equal to «ἔπος».
Among the BeautifulSoup options for filtering an XML tree with the «.find_all» method, there is a function taking exactly one parameter; as far as I understand, that parameter is each XML element that «.find_all» processes in turn.
To be able to compare the entries with the word being looked up, I have to store that word either in:
- a nonlocal or global variable
- an element of the tree, for example the parent node of the inspected element

I have tested this second way with:
def sel_laPage(article):
#	article['α'] is the first entry of the page
#	article['ω'], the last one
	return article.has_attr('α') and article.parent['ἔπος'] >= article['α'] and article.parent['ἔπος'] <= article['ω']
What do you think of my approach?
How would you solve this issue?

Arbiel

P.S : I actually found here another supposedly better solution
#2
or create a function which can be reused:

def isbetween(x, y, z):
    if y <= x <= z:
        print(f"{x} is between {y} and {z}")
    else:
        print(f"{x} is not between {y} and {z}")

isbetween(15, 10, 20)
isbetween(9, 10, 20)
Output:
15 is between 10 and 20
9 is not between 10 and 20
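
For use inside a filter, a variant that returns a boolean instead of printing may be more convenient. The sketch below is only an illustration (the name is_between is not from the thread). It relies on Python's chained comparison and should work for any mutually comparable values, including strings, which Python compares by code point.

```python
def is_between(value, low, high):
    """Return True when low <= value <= high (both bounds inclusive)."""
    return low <= value <= high


print(is_between(15, 10, 20))
print(is_between(9, 10, 20))
# Strings are compared character by character, by Unicode code point.
print(is_between("αβλαυτω", "αβλαυτος", "αβροτονινος"))
```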
#3
(Dec-13-2020, 11:19 AM)arbiel Wrote: What do you think of the way I'm processing ?
That it will not work ;)
There is a mix of types: has_attr() returns a bool and parent(...) returns a list, so comparing them with >= makes little sense.
>>> soup.tag2.has_attr('name')
True
>>> 
>>> soup.tag1.parent('tag1')
[<tag1>
<tag2 name="tag 2">
<tag3 name="korea"></tag3>
<tag4 name="china"></tag4>
<tag5 name="japan"></tag5>
</tag2>
</tag1>]
If you use >= on strings, there are some basics you need to know.
Python compares strings lexicographically, character by character, using each character's Unicode code point.
>>> a = 'α'
>>> b = 'a'
>>> a < b
False
>>> 
>>> ord(a)
945
>>> ord(b)
97
>>> 945 < 97
False 
With whole words, it soon becomes difficult to reason about < and >.
>>> 'abc' > 'bac'
False
The result is False as soon as 'a' is found to be less than 'b'.
No further characters are compared.

For your word ἔπος the Unicode code points are:
>>> s = 'ἔπος'
>>> [ord(i) for i in s]
[7956, 960, 959, 962]
#4
Hi Larz60+ and snippsat

Thanks to both of you for taking an interest in my issue. Obviously, I should have been more precise.

@Larz60+
Filtering an XML tree with a function is only possible with a single-parameter function.

If it had been possible, I would have coded:
def sel_laPage(article, atonique):
#   article['α'] is the first entry of the page
#   article['ω'], the last one
    return article.has_attr('α') and atonique >= article['α'] and atonique <= article['ω']
…
grec.find('body').find_all(sel_laPage)
or, as you suggest,
def sel_laPage(article, atonique):
#   article['α'] is the first entry of the page
#   article['ω'], the last one
    return article.has_attr('α') and article['α'] <= atonique <= article['ω']
…
grec.find('body').find_all(sel_laPage)
However, with a one-parameter-only function, and a nonlocal or global «atonique» variable, I can code

def sel_laPage(article):
#   article['α'] is the first entry of the page
#   article['ω'], the last one
    nonlocal atonique
    return article.has_attr('α') and article['α'] <= atonique <= article['ω']
…
grec.find('body').find_all(sel_laPage)
but I don't like nonlocal or global declarations, even though that could be the easiest way.

@snippsat

I should have written that all the words I compare with each other are lowercase and free of diacritics, so that comparing them makes sense. The «ἔπος» I gave as an example is the name of an attribute on the parent of all the «article» elements, not its value.
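
For readers who want to reproduce this, here is a minimal sketch of how a word might be reduced to such a lowercase, diacritic-free form. The script that does this is not shown in the thread, so the function name strip_diacritics and the choice of NFKD normalisation are assumptions. Note that every combining mark is dropped, including an iota subscript.

```python
import unicodedata


def strip_diacritics(word):
    """Lowercase a word and drop its combining marks (accents, breathings)."""
    # NFKD splits precomposed letters into base letter + combining marks and
    # also replaces compatibility variants such as ϐ (U+03D0) with β.
    decomposed = unicodedata.normalize("NFKD", word.lower())
    # combining() is non-zero for combining marks, so those are filtered out.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("ἔπος"))
print(strip_diacritics("ἄϐλαυτω"))
```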

My test file is as follows

<html><head/>
<body>
<section id="Bailly" ἔπος="αβλαυτω">
<article id="0001" α="α" ω="αβαξ" ἄ="α" ὤ="ἄϐαξ"><a href="0001"><a href="0002" suivante="oui"></a></a></article>
<article id="0002" α="αβαπτιστος" ω="αβλαστος" ἄ="ἀϐάπτιστος" ὤ="ἄϐλαστος"><a href="0002"><a href="0003" suivante="oui"></a></a></article>
<article id="0003" α="αβλαυτος" ω="αβροτονινος" ἄ="ἄϐλαυτος" ὤ="ἀϐροτόνινος"><a href="0003"><a href="0003" suivante="non"></a></a></article>
<article id="0004" α="αβροτονιτης" ω="αγαθος" ἄ="ἀϐροτονίτης" ὤ="ἀγαθός"><a href="0004"><a href="0005" suivante="oui"></a></a></article>
<article id="0100" α="αμιλλημα" ω="αμμωνιοι" ἄ="ἁμίλλημα" ὤ="Ἀμμώνιοι"><a href="0100"><a href="0100" suivante="non"></a></a></article>
<article id="0101" α="αμμωνιος" ω="αμορβευω" ἄ="Ἀμμώνιος" ὤ="ἀμορϐεύω"><a href="0101"><a href="0101" suivante="non"></a></a></article>
</section>
</body>
</html>

As you can see, each «article» element has four attributes besides its id: α, ω, ἄ and ὤ. The first two are the words stripped of diacritics; the last two are the actual first and last entries of the page.

Running my test script on this test file gives the following result:
python filtre.py pour_filtre.html ἄϐλαυτω
[<article id="0003" α="αβλαυτος" ω="αβροτονινος" ἄ="ἄϐλαυτος" ὤ="ἀϐροτόνινος"><a href="0003"><a href="0003" suivante="non"></a></a></article>]
Arbiel
#5
Hi

With a lambda expression, it is possible to solve this issue without oddities such as nonlocal or global, and without storing in the XML (HTML) tree the value that the tag's attributes are to be compared with.

Example :

# function to search for the tag
def sel_laPage(article, atonique):
	return article.has_attr('α') and atonique >= article['α'] and atonique <= article['ω']

# function to set the value to be compared with the tag's arguments
def contient(atonique):
	return lambda article : sel_laPage(article, atonique)

#function to read the xml file and find the tag into it
def test(fichier, atonique):
	grec=lire_xml(fichier)
	lapage=contient(atonique)
	return grec.find('body').find_all(lapage)
Arbiel
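
A possible alternative to the hand-written lambda wrapper is functools.partial from the standard library, which binds the extra argument and leaves a one-parameter callable for find_all. The sketch below is an illustration under assumptions: it inlines a reduced copy of the test data from post #4, uses the built-in html.parser in place of the lire_xml helper (which is not shown in the thread), and the name pages_for is hypothetical.

```python
from functools import partial

from bs4 import BeautifulSoup

# Reduced copy of the test file from post #4 (an assumption for this sketch).
HTML = """
<html><body><section id="Bailly">
<article id="0002" α="αβαπτιστος" ω="αβλαστος"></article>
<article id="0003" α="αβλαυτος" ω="αβροτονινος"></article>
<article id="0004" α="αβροτονιτης" ω="αγαθος"></article>
</section></body></html>
"""


def sel_laPage(article, atonique):
    # True when the word lies between the page's first (α) and last (ω) entry.
    return article.has_attr('α') and article['α'] <= atonique <= article['ω']


def pages_for(soup, atonique):
    # partial() fixes atonique, so find_all receives a one-parameter callable.
    return soup.find('body').find_all(partial(sel_laPage, atonique=atonique))


soup = BeautifulSoup(HTML, 'html.parser')
print([page['id'] for page in pages_for(soup, 'αβλαυτω')])
```

Whether partial reads better than the lambda is a matter of taste; both avoid nonlocal and global, and both keep the searched word out of the tree.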