Grouping Data based on 30% bracket

purnima1 · (This post was last modified: Mar-09-2023, 05:43 PM by purnima1.)

I want to form groups in which all values are within 30% of each other.

My working code is as follows:

from itertools import combinations

def pctDiff(A, B):
    # Symmetric percent difference: |A - B| relative to the mean of A and B
    return abs(A - B) * 200 / (A + B)

def main():
    # Named "account" rather than "dict" to avoid shadowing the built-in
    account = {'acct_number':10202,'acct_name':'abc','v1_rev':3000,'v2_rev':4444,'v4_rev':234534,'v5_rev':5665,'v6_rev':66,'v7_rev':66,'v3_rev':66}
    vendors_revenue_list = ['v1_rev','v2_rev','v3_rev','v4_rev','v5_rev','v6_rev','v7_rev','v8_rev']
    # Keep only the vendor revenue fields that are actually present
    dict2 = {k: account[k] for k in vendors_revenue_list if k in account}

    print(dict2)
    # Collect every pair of vendors whose revenues are within 30% of each other
    groups = [(a, b) for a, b in combinations(dict2, 2) if pctDiff(dict2[a], dict2[b]) <= 30]

    print(groups)

main()

Output:
{'v1_rev': 3000, 'v2_rev': 4444, 'v3_rev': 66, 'v4_rev': 234534, 'v5_rev': 5665, 'v6_rev': 66, 'v7_rev': 66}
[('v2_rev', 'v5_rev'), ('v3_rev', 'v6_rev'), ('v3_rev', 'v7_rev'), ('v6_rev', 'v7_rev')]

Desired output:

Output:
[('v2_rev', 'v5_rev'), ('v3_rev', 'v6_rev','v7_rev')]

**deanhystad** · Mar-09-2023, 06:26 PM

What do you want done if a, b, and c are within 30% of each other, and b, c, and d are within 30% of each other? Should the groups be:
(a, b, c), (d)
(a, b, c), (b, c, d)
(a), (b), (c), (d), (a, b), (a, c), (b, c), (b, d), (c, d), (a, b, c), (b, c, d)
something else?

purnima1 · Mar-09-2023, 06:28 PM

(Mar-09-2023, 06:26 PM)deanhystad Wrote: What do you want done if a, b, and c are within 30% of each other, and b, c, and d are within 30% of each other? Should the groups be:
(a, b, c), (d)
(a, b, c), (b, c, d)
(a), (b), (c), (d), (a, b), (a, c), (b, c), (b, d), (c, d), (a, b, c), (b, c, d)
something else?

purnima1 · Mar-09-2023, 06:31 PM

I want the result to be

(a,b,c) and (b,c,d)

With the current implementation I think we will get

(a,b) (b,c) (c,a) (c,d) (b,d)

**deanhystad** · (This post was last modified: Mar-10-2023, 07:38 PM by deanhystad.)

data = {
    "v1_rev": 3000,
    "v2_rev": 4444,
    "v4_rev": 234534,
    "v5_rev": 5665,
    "v6_rev": 66,
    "v7_rev": 66,
    "v3_rev": 66,
}

# Sort items in increasing order of their value
items = iter(sorted(data.items(), key=lambda x: x[1]))
threshold = 30

bins = []
bin_ = [next(items)]
for item in items:
    # What is the smallest value that can be in bin with item?
    start = item[1] * (200 - threshold) / (200 + threshold)
    if bin_[0][1] < start:
        bins.append(bin_.copy())
    bin_.append(item)
    while bin_[0][1] < start:
        bin_.pop(0)

if bin_:
    bins.append(bin_)

bins = [dict(bin_) for bin_ in bins]
print(bins)

Output:
[{'v6_rev': 66, 'v7_rev': 66, 'v3_rev': 66}, {'v1_rev': 3000}, {'v2_rev': 4444, 'v5_rev': 5665}, {'v4_rev': 234534}]
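
If the tuple format from the original question is what is needed, the bins can be post-processed. This is only a minimal sketch: `as_name_tuples` and its `min_size` parameter are hypothetical names, not part of the code above, and the `bins` list is copied from the output shown. It drops single-item bins and keeps just the vendor names.

```python
# Bins copied from the output above
bins = [
    {"v6_rev": 66, "v7_rev": 66, "v3_rev": 66},
    {"v1_rev": 3000},
    {"v2_rev": 4444, "v5_rev": 5665},
    {"v4_rev": 234534},
]


def as_name_tuples(groups, min_size=2):
    """Hypothetical helper: keep groups with at least min_size members,
    as sorted tuples of their keys, in sorted order."""
    return sorted(tuple(sorted(g)) for g in groups if len(g) >= min_size)


print(as_name_tuples(bins))
# → [('v2_rev', 'v5_rev'), ('v3_rev', 'v6_rev', 'v7_rev')]
```

Setting `min_size=1` would keep the single-vendor groups as one-element tuples.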

The same idea as a generator, tested with overlapping bins:

from typing import Any, Generator


def dict_grouper(
    value_dict: dict[Any, float], percent: float = 30
) -> Generator[dict[Any, float], None, None]:
    """Group values where each value in group is within "percent" of others."""
    scale = (200 - percent) / (200 + percent)
    items = sorted(value_dict.items(), key=lambda x: x[1])
    grp = []  # start empty so an empty value_dict yields nothing instead of raising
    for item in items:
        start = item[1] * scale
        if grp and grp[0][1] < start:
            yield dict(grp)
            grp = [x for x in grp[1:] if x[1] >= start]
        grp.append(item)

    if grp:
        yield dict(grp)


print(*dict_grouper(dict(zip(("ABCDEFG"), range(30, 100, 10)))), sep="\n")

Output:
{'A': 30, 'B': 40}
{'B': 40, 'C': 50}
{'C': 50, 'D': 60}
{'D': 60, 'E': 70, 'F': 80}
{'E': 70, 'F': 80, 'G': 90}
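
Where the scale factor comes from: for a <= b, the condition (b - a) * 200 / (a + b) <= percent rearranges to a >= b * (200 - percent) / (200 + percent). The sketch below checks that numerically. `pct_diff` mirrors the original `pctDiff`, and `min_partner` is a hypothetical helper name used only for illustration.

```python
def pct_diff(a, b):
    """Symmetric percent difference, same formula as pctDiff in the question."""
    return abs(a - b) * 200 / (a + b)


def min_partner(b, percent=30):
    """Hypothetical helper: smallest value that is still within
    `percent` of b (for values not larger than b)."""
    return b * (200 - percent) / (200 + percent)


for b in (66, 3000, 4444, 5665):
    low = min_partner(b)
    # pct_diff(low, b) should sit right at the 30% limit, up to float rounding
    print(b, round(low, 2), round(pct_diff(low, b), 6))
```

Because the threshold depends only on the larger value, a sorted list needs one threshold per item rather than one comparison per pair.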

This should be very fast and stay fast. With combinations and 7 items there are 127 possible non-empty groups, and testing every pair inside every group takes hundreds of pctDiff calls (672 by my count: each of the 21 pairs appears in 2^5 = 32 groups). Both numbers grow rapidly as the number of items increases. My algorithm never calls pctDiff at all; it computes one threshold per item, so the number of threshold calculations grows linearly with the item count (plus one sort up front).
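
One way to sanity-check the sliding-window approach on small inputs is to compare it with a brute-force search over every combination. The following is a rough sketch, not a benchmark: `brute_force_groups` and `window_groups` are hypothetical names, and `window_groups` is a condensed restatement of the grouping logic that returns sets of keys instead of dicts.

```python
from itertools import combinations


def pct_diff(a, b):
    return abs(a - b) * 200 / (a + b)


def brute_force_groups(values, percent=30):
    """Check every subset; keep the valid ones that no larger valid group contains."""
    names = list(values)
    valid = []
    for size in range(1, len(names) + 1):
        for group in combinations(names, size):
            if all(pct_diff(values[a], values[b]) <= percent
                   for a, b in combinations(group, 2)):
                valid.append(set(group))
    return [g for g in valid if not any(g < other for other in valid)]


def window_groups(values, percent=30):
    """Sliding window over the sorted values (one threshold per item)."""
    scale = (200 - percent) / (200 + percent)
    groups, grp = [], []
    for item in sorted(values.items(), key=lambda kv: kv[1]):
        start = item[1] * scale
        if grp and grp[0][1] < start:
            groups.append({k for k, _ in grp})
            grp = [x for x in grp if x[1] >= start]
        grp.append(item)
    if grp:
        groups.append({k for k, _ in grp})
    return groups


def normalized(groups):
    """Order-independent form so the two results can be compared."""
    return sorted(sorted(g) for g in groups)


sample = dict(zip("ABCDEFG", range(30, 100, 10)))
print(normalized(brute_force_groups(sample)) == normalized(window_groups(sample)))
```

The brute-force version visits all 2^n - 1 subsets, so it is only practical as a check on a handful of items.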
