Python Forum
Grouping Data based on 30% bracket
#1
I want to build groups in which all values are within 30% of each other.

My working code is as follows:

from itertools import combinations

def pctDiff(A, B):
    # Symmetric percent difference: |A - B| relative to the mean of A and B
    return abs(A - B) * 200 / (A + B)

def main():
    # Named "account" rather than "dict" to avoid shadowing the built-in type
    account = {'acct_number':10202,'acct_name':'abc','v1_rev':3000,'v2_rev':4444,'v4_rev':234534,'v5_rev':5665,'v6_rev':66,'v7_rev':66,'v3_rev':66}
    vendors_revenue_list = ['v1_rev','v2_rev','v3_rev','v4_rev','v5_rev','v6_rev','v7_rev','v8_rev']
    # Keep only the vendor revenue fields that are present in the record
    dict2 = {k: account[k] for k in vendors_revenue_list if k in account}

    print(dict2)
    # Collect every pair of vendors whose revenues are within 30% of each other
    groups = [(a, b) for a, b in combinations(dict2, 2) if pctDiff(dict2[a], dict2[b]) <= 30]

    print(groups)

main()
Output:
{'v1_rev': 3000, 'v2_rev': 4444, 'v3_rev': 66, 'v4_rev': 234534, 'v5_rev': 5665, 'v6_rev': 66, 'v7_rev': 66}
[('v2_rev', 'v5_rev'), ('v3_rev', 'v6_rev'), ('v3_rev', 'v7_rev'), ('v6_rev', 'v7_rev')]
Desired output:

Output:
[('v2_rev', 'v5_rev'), ('v3_rev', 'v6_rev','v7_rev')]
#2
What do you want to happen if a, b, and c are within 30% of each other, and b, c, and d are also within 30% of each other? Should the groups be:
(a, b, c), (d)
(a, b, c), (b, c, d)
(a), (b), (c), (d), (a, b), (a, c), (b, c), (b, d), (c, d), (a, b, c), (b, c, d)
something else?
#3
(Mar-09-2023, 06:26 PM)deanhystad Wrote: What do you want to happen if a, b, and c are within 30% of each other, and b, c, and d are also within 30% of each other? Should the groups be:
(a, b, c), (d)
(a, b, c), (b, c, d)
(a), (b), (c), (d), (a, b), (a, c), (b, c), (b, d), (c, d), (a, b, c), (b, c, d)
something else?
#4
I want the result to be:

(a, b, c) and (b, c, d)

With the current implementation, I think we would get:

(a, b), (b, c), (c, a), (c, d), (b, d)
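
To make that target concrete, here is a brute-force reference sketch of "all groups where every member is within 30% of every other member, keeping only the largest such groups". The function name maximal_groups and the sample values for a, b, c and d are made up for illustration, and singletons are left out to match the desired output in post #1. It checks every subset, so it is only practical for a handful of items.

```python
from itertools import combinations


def pct_diff(a, b):
    # Same formula as pctDiff in post #1; assumes positive values
    return abs(a - b) * 200 / (a + b)


def maximal_groups(values, limit=30):
    """Brute force: keep subsets that are pairwise within `limit`, then drop
    any subset that is strictly contained in a larger valid subset."""
    names = list(values)
    valid = [
        set(combo)
        for size in range(2, len(names) + 1)
        for combo in combinations(names, size)
        if all(
            pct_diff(values[x], values[y]) <= limit
            for x, y in combinations(combo, 2)
        )
    ]
    return [g for g in valid if not any(g < other for other in valid)]


# Hypothetical values: a and d are more than 30% apart, all other pairs are not
sample = {"a": 80, "b": 90, "c": 100, "d": 110}
print(maximal_groups(sample))
```

With these sample values the result contains exactly the two groups {a, b, c} and {b, c, d}.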
#5
data = {
    "v1_rev": 3000,
    "v2_rev": 4444,
    "v4_rev": 234534,
    "v5_rev": 5665,
    "v6_rev": 66,
    "v7_rev": 66,
    "v3_rev": 66,
}

# Sort items in increasing order of their value
items = iter(sorted(data.items(), key=lambda x: x[1]))
threshold = 30

bins = []
bin_ = [next(items)]
for item in items:
    # What is the smallest value that can be in bin with item?
    start = item[1] * (200 - threshold) / (200 + threshold)
    if bin_[0][1] < start:
        bins.append(bin_.copy())
    bin_.append(item)
    while bin_[0][1] < start:
        bin_.pop(0)

if bin_:
    bins.append(bin_)

bins = [dict(bin_) for bin_ in bins]
print(bins)
Output:
[{'v6_rev': 66, 'v7_rev': 66, 'v3_rev': 66}, {'v1_rev': 3000}, {'v2_rev': 4444, 'v5_rev': 5665}, {'v4_rev': 234534}]
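
In case the start line looks like magic: for positive values with a <= b, the condition (b - a) * 200 / (a + b) <= t rearranges to a >= b * (200 - t) / (200 + t). The snippet below is only a numeric sanity check of that rearrangement; lower_bound is a name invented for this illustration, and the check assumes all values are positive.

```python
def pct_diff(a, b):
    # Same formula as pctDiff in post #1
    return abs(a - b) * 200 / (a + b)


def lower_bound(b, threshold=30):
    # Smallest a (with a <= b) that still satisfies pct_diff(a, b) <= threshold
    return b * (200 - threshold) / (200 + threshold)


b = 5665
edge = lower_bound(b)
print(edge)
print(pct_diff(edge, b))         # about 30, up to floating-point rounding
print(pct_diff(edge * 0.99, b))  # slightly below the bound: more than 30
print(pct_diff(edge * 1.01, b))  # slightly above the bound: less than 30
```

Because the items are sorted, comparing the first item in the bin against this bound is enough to know whether the whole bin is still within range of the new item.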
The same approach written as a generator, tested with overlapping bins:
from typing import Any, Generator


def dict_grouper(
    value_dict: dict[Any, float], percent: float = 30
) -> Generator[dict[Any, float], None, None]:
    """Group values where each value in group is within "percent" of others."""
    scale = (200 - percent) / (200 + percent)
    items = iter(sorted(value_dict.items(), key=lambda x: x[1]))
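    first = next(items, None)
    if first is None:
        return  # Empty input: a bare next() here would raise RuntimeError inside a generator
    grp = [first]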
    for item in items:
        start = item[1] * scale
        if grp[0][1] < start:
            yield dict(grp)
            grp = [x for x in grp[1:] if x[1] >= start]
        grp.append(item)

    if grp:
        yield dict(grp)


print(*dict_grouper(dict(zip(("ABCDEFG"), range(30, 100, 10)))), sep="\n")
Output:
{'A': 30, 'B': 40}
{'B': 40, 'C': 50}
{'C': 50, 'D': 60}
{'D': 60, 'E': 70, 'F': 80}
{'E': 70, 'F': 80, 'G': 90}
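
If you want the exact shape of the desired output from post #1 (tuples of keys, no single-item groups), something along these lines should work. This is a sketch: the function is a condensed, untyped copy of the generator above, and dropping singletons and sorting the keys are my assumptions about what is wanted.

```python
def dict_grouper(value_dict, percent=30):
    """Condensed copy of the generator above; assumes positive values."""
    scale = (200 - percent) / (200 + percent)
    grp = []
    for item in sorted(value_dict.items(), key=lambda x: x[1]):
        start = item[1] * scale
        if grp and grp[0][1] < start:
            yield dict(grp)
            grp = [x for x in grp[1:] if x[1] >= start]
        grp.append(item)
    if grp:
        yield dict(grp)


data = {
    "v1_rev": 3000,
    "v2_rev": 4444,
    "v4_rev": 234534,
    "v5_rev": 5665,
    "v6_rev": 66,
    "v7_rev": 66,
    "v3_rev": 66,
}

# Keep only groups with at least two members, as sorted tuples of keys
groups = sorted(tuple(sorted(g)) for g in dict_grouper(data) if len(g) > 1)
print(groups)
# [('v2_rev', 'v5_rev'), ('v3_rev', 'v6_rev', 'v7_rev')]
```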
This should be very fast and stay fast. Using combinations with 7 items, there are 127 non-empty subsets to consider, and checking every pair inside every subset takes 672 pctDiff calls. Both numbers grow rapidly as the item count increases. My algorithm computes at most one threshold per item (7 here), so after the initial sort the number of calculations grows linearly with the item count.
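
One way to count the brute-force work, assuming every non-empty subset is a candidate group and every pair inside a subset is compared once. The helper name brute_force_cost is made up for this illustration, and math.comb needs Python 3.8 or later.

```python
from math import comb


def brute_force_cost(n):
    """Return (non-empty subsets, pair checks) for n items under the
    assumptions stated above."""
    subsets = sum(comb(n, k) for k in range(1, n + 1))
    pair_checks = sum(comb(n, k) * comb(k, 2) for k in range(2, n + 1))
    return subsets, pair_checks


for n in (7, 10, 20):
    print(n, brute_force_cost(n))
```

For 7 items this gives 127 subsets and 672 pair checks, and both figures roughly double with every item added.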