Python Forum
Peculiar pattern from printing of sets
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Peculiar pattern from printing of sets
#1
Hey everyone.

This is my first post here. I tried searching a bit on google and couldn't find a post related to this. I'm assuming it's been asked before, but I had no lucking finding anything about it.

There's this peculiar pattern when printing out sets. I'm aware sets don't preserve insertion order like dicts do, and I assume that it's moreso the print function that causes this than the internal implementation of sets? I don't know much C and so I didn't check out setobject.c.

For instance, this code:
sets = [set(range(i, i + 10)) for i in range(0, 100, 10)]

for s in sets:
    print(s)
yield this result
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
{10, 11, 12, 13, 14, 15, 16, 17, 18, 19}
{20, 21, 22, 23, 24, 25, 26, 27, 28, 29}
{32, 33, 34, 35, 36, 37, 38, 39, 30, 31}
{40, 41, 42, 43, 44, 45, 46, 47, 48, 49}
{50, 51, 52, 53, 54, 55, 56, 57, 58, 59}
{64, 65, 66, 67, 68, 69, 60, 61, 62, 63}
{70, 71, 72, 73, 74, 75, 76, 77, 78, 79}
{80, 81, 82, 83, 84, 85, 86, 87, 88, 89}
{96, 97, 98, 99, 90, 91, 92, 93, 94, 95}
Since set's don't preserve order I find it weird how most of these are printed in order, and those who aren't still follow a pattern; They are shifted to the left by two positions for each row that is out of order. E.g: the 3x's are shifted 2 positions to the lft. The 6x's are shifted 4 positions (2 more than 3xs)

I was thinking that maybe the print function prints the items in ascending order based on hash-value, but as far as I'm aware small int's hash is equal to their value. So if that were the case all rows should be in order.

Could anyone please shed some light into what causes this weird pattern?
Reply
#2
Sets are not guaranteed to preserve order, as you point out. Any observations you have about an order may change in the next version, or may be different on different platforms. Would not spend time on this (though agree it is interesting)
Reply
#3
Sets may not preserve order, but they have order. Items in the set are ordered the same regardless of the order they appear in the list used to make the set.
import random
x = list(range(60,70))
print(x, set(x))
random.shuffle(x)
print(x, set(x))
Output:
[60, 61, 62, 63, 64, 65, 66, 67, 68, 69] {64, 65, 66, 67, 68, 69, 60, 61, 62, 63} [67, 61, 69, 68, 65, 66, 60, 63, 64, 62] {64, 65, 66, 67, 68, 69, 60, 61, 62, 63}
Order is not imposed by the print command. When printing a set, the order is the same order the set values are returned if you use "for".
x = list(range(60,70))
y = set(x)
print(y, list(y))
Output:
{64, 65, 66, 67, 68, 69, 60, 61, 62, 63} [64, 65, 66, 67, 68, 69, 60, 61, 62, 63]
Because sets exist to do set operations, it makes sense that elements are stored in a way to facilitate this. Hash tables are used when you need a way to quickly access items in large collections, so it would make sense for the set values to be organized using their hash values.
print([hash(y) for y in set(range(60, 70))])
Output:
[64, 65, 66, 67, 68, 69, 60, 61, 62, 63]
It appears that the has value of an in is the int value, so I decided to find the first place where the set order did not match the hash order.
x = list(range(1000))
y = set(x)
for a, b in zip(x, y):
    if a != b:
        print(a, b)
        break
Output:
Hmmm, the order of the values/hash values was the same as the order of the set values. I printed the set to verify and noticed that the set was ordered 0, 1....999 in numerical order. So it would appear that the set order depends not only on the hash value but also on the set size. This is not surprising since all hash table algorithms take the table size into account.

Our set size is 10 so a reasonable has table size is 16 giving us a mask of 15 (b1111). Let's see what order we get if we use the hash value and a mask of 15 to get the hash table index.
mask = 15
x = [(y & mask, y) for y in range(60, 70)]
x.sort()
print(x)
print(set(range(60, 70)))
Output:
[(0, 64), (1, 65), (2, 66), (3, 67), (4, 68), (5, 69), (12, 60), (13, 61), (14, 62), (15, 63)] {64, 65, 66, 67, 68, 69, 60, 61, 62, 63}
Woohoo! Nailed it on the first try! If we mask the last 4 bits of the hash value, the order of the resulting index matches the order of the values in the set.

Does this matter? No. I think the details of the Python set ordering algorithm are unimportant as long as it is efficient and correct. I cannot see any reason for trying to predict the order of values in a set since as defined in Python a set is unordered. The ordering might also change from one Python implementation to another. This worked for C Python, but J Python or Iron Python might use different algorithms for ordering sets.
SahandJ likes this post
Reply
#4
(Dec-29-2021, 04:25 PM)deanhystad Wrote: Sets may not preserve order, but they have order. Items in the set are ordered the same regardless of the order they appear in the list used to make the set.
import random
x = list(range(60,70))
print(x, set(x))
random.shuffle(x)
print(x, set(x))
Output:
[60, 61, 62, 63, 64, 65, 66, 67, 68, 69] {64, 65, 66, 67, 68, 69, 60, 61, 62, 63} [67, 61, 69, 68, 65, 66, 60, 63, 64, 62] {64, 65, 66, 67, 68, 69, 60, 61, 62, 63}
Order is not imposed by the print command. When printing a set, the order is the same order the set values are returned if you use "for".
x = list(range(60,70))
y = set(x)
print(y, list(y))
Output:
{64, 65, 66, 67, 68, 69, 60, 61, 62, 63} [64, 65, 66, 67, 68, 69, 60, 61, 62, 63]
Because sets exist to do set operations, it makes sense that elements are stored in a way to facilitate this. Hash tables are used when you need a way to quickly access items in large collections, so it would make sense for the set values to be organized using their hash values.
print([hash(y) for y in set(range(60, 70))])
Output:
[64, 65, 66, 67, 68, 69, 60, 61, 62, 63]
It appears that the has value of an in is the int value, so I decided to find the first place where the set order did not match the hash order.
x = list(range(1000))
y = set(x)
for a, b in zip(x, y):
    if a != b:
        print(a, b)
        break
Output:
Hmmm, the order of the values/hash values was the same as the order of the set values. I printed the set to verify and noticed that the set was ordered 0, 1....999 in numerical order. So it would appear that the set order depends not only on the hash value but also on the set size. This is not surprising since all hash table algorithms take the table size into account.

Our set size is 10 so a reasonable has table size is 16 giving us a mask of 15 (b1111). Let's see what order we get if we use the hash value and a mask of 15 to get the hash table index.
mask = 15
x = [(y & mask, y) for y in range(60, 70)]
x.sort()
print(x)
print(set(range(60, 70)))
Output:
[(0, 64), (1, 65), (2, 66), (3, 67), (4, 68), (5, 69), (12, 60), (13, 61), (14, 62), (15, 63)] {64, 65, 66, 67, 68, 69, 60, 61, 62, 63}
Woohoo! Nailed it on the first try! If we mask the last 4 bits of the hash value, the order of the resulting index matches the order of the values in the set.

Does this matter? No. I think the details of the Python set ordering algorithm are unimportant as long as it is efficient and correct. I cannot see any reason for trying to predict the order of values in a set since as defined in Python a set is unordered. The ordering might also change from one Python implementation to another. This worked for C Python, but J Python or Iron Python might use different algorithms for ordering sets.


This is great!

Yeah I didn't really intend to use this knowledge to predict the order of any item. I just found it very interesting and was curious on why it behaved like this.
Thanks for figuring it out!
Reply
#5
(Dec-29-2021, 04:25 PM)deanhystad Wrote: Sets may not preserve order, but they have order. Items in the set are ordered the same regardless of the order they appear in the list used to make the set.
import random
x = list(range(60,70))
print(x, set(x))
random.shuffle(x)
print(x, set(x))
Output:
[60, 61, 62, 63, 64, 65, 66, 67, 68, 69] {64, 65, 66, 67, 68, 69, 60, 61, 62, 63} [67, 61, 69, 68, 65, 66, 60, 63, 64, 62] {64, 65, 66, 67, 68, 69, 60, 61, 62, 63}
Order is not imposed by the print command. When printing a set, the order is the same order the set values are returned if you use "for".
x = list(range(60,70))
y = set(x)
print(y, list(y))
Output:
{64, 65, 66, 67, 68, 69, 60, 61, 62, 63} [64, 65, 66, 67, 68, 69, 60, 61, 62, 63]
Because sets exist to do set operations, it makes sense that elements are stored in a way to facilitate this. Hash tables are used when you need a way to quickly access items in large collections, so it would make sense for the set values to be organized using their hash values.
print([hash(y) for y in set(range(60, 70))])
Output:
[64, 65, 66, 67, 68, 69, 60, 61, 62, 63]
It appears that the has value of an in is the int value, so I decided to find the first place where the set order did not match the hash order.
x = list(range(1000))
y = set(x)
for a, b in zip(x, y):
    if a != b:
        print(a, b)
        break
Output:
Hmmm, the order of the values/hash values was the same as the order of the set values. I printed the set to verify and noticed that the set was ordered 0, 1....999 in numerical order. So it would appear that the set order depends not only on the hash value but also on the set size. This is not surprising since all hash table algorithms take the table size into account.

Our set size is 10 so a reasonable has table size is 16 giving us a mask of 15 (b1111). Let's see what order we get if we use the hash value and a mask of 15 to get the hash table index.
mask = 15
x = [(y & mask, y) for y in range(60, 70)]
x.sort()
print(x)
print(set(range(60, 70)))
Output:
[(0, 64), (1, 65), (2, 66), (3, 67), (4, 68), (5, 69), (12, 60), (13, 61), (14, 62), (15, 63)] {64, 65, 66, 67, 68, 69, 60, 61, 62, 63}
Woohoo! Nailed it on the first try! If we mask the last 4 bits of the hash value, the order of the resulting index matches the order of the values in the set.

Does this matter? No. I think the details of the Python set ordering algorithm are unimportant as long as it is efficient and correct. I cannot see any reason for trying to predict the order of values in a set since as defined in Python a set is unordered. The ordering might also change from one Python implementation to another. This worked for C Python, but J Python or Iron Python might use different algorithms for ordering sets.

Sorry for the double reply. But I wonder why this wouldn't then make the set {10...19} show up in the wrong order?
printing out that set shows them in correct order.

Using a mask of 15, the values 10..15 will have the same last 4 bits before and after. 16, 17, 18, and 19 will have 0, 1, 2, and 3 (which makes sense). But shouldn't they then be printed first? I.e {16, 17, 18, 19, 10, 11, 12, 13, 14, 15} ?
Reply
#6
Because the table size isn't 16, it is something else and 16 happened to work for 60..70. Maybe the table size is 32? That works for 10..20 and 60..70. The easiest way to know exactly what the algorithm does is download the CPython source and read setobject.c. I didn't use that approach because your post was more along the lines of "I wonder what might be happening here?" as opposed to "What is the algorithm?" I worked through the steps of how someone might derive the algorithm from "black box" type testing and experimentation.
Reply
#7
Oh I see. I thought the table size would always be the first power of 2 that fits all elements. So 16 when the size is 10, 32 if it was 20, etc.
I must have misunderstood. Sorry.
Reply
#8
For instance, here the set {17, 15} is created and then new items are added one at a time. When other items are added you can see the order changes.

Output:
{17, 15} {0, 17, 15} {0, 17, 1, 15} # 3 bit hash selection, plus insertion order {0, 1, 2, 15, 17} # 4 bit hash selection {0, 1, 2, 3, 15, 17} {0, 1, 2, 3, 4, 15, 17}
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  How does one combine 2 data sets ? detlefschmitt 2 1,679 Sep-03-2021, 03:38 AM
Last Post: detlefschmitt
  Looping Through Large Data Sets JoeDainton123 10 4,362 Oct-18-2020, 02:58 PM
Last Post: buran
  comprehension for sets Skaperen 2 1,857 Aug-07-2020, 10:12 PM
Last Post: Skaperen
  Sort sets by item values Sergey 4 67,359 Apr-19-2019, 10:50 AM
Last Post: Sergey
  Problem with character sets Pedroski55 4 3,710 Mar-04-2019, 02:35 AM
Last Post: snippsat
  merge 3 sql data sets to 1 librairy brecht83 0 2,107 Sep-26-2018, 10:13 PM
Last Post: brecht83
  Sets and Lists mp3909 2 2,411 Feb-21-2018, 11:54 AM
Last Post: mp3909

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020