Python Forum
Grouping Data based on 30% bracket
#1
I want to form groups in which all values are within 30% of each other.

My working code is as follows:

from itertools import combinations

def pctDiff(A, B):
    # Symmetric percent difference: |A - B| relative to the mean of A and B
    return abs(A - B) * 200 / (A + B)

def main():
    # Renamed from "dict" so the built-in type is not shadowed
    record = {'acct_number':10202,'acct_name':'abc','v1_rev':3000,'v2_rev':4444,'v4_rev':234534,'v5_rev':5665,'v6_rev':66,'v7_rev':66,'v3_rev':66}
    vendors_revenue_list = ['v1_rev','v2_rev','v3_rev','v4_rev','v5_rev','v6_rev','v7_rev','v8_rev']
    # Keep only the vendor revenue fields that are present in the record
    dict2 = {k: record[k] for k in vendors_revenue_list if k in record}

    print(dict2)
    # Collect every pair of vendors whose revenues are within 30% of each other
    groups = [(a, b) for a, b in combinations(dict2, 2) if pctDiff(dict2[a], dict2[b]) <= 30]

    print(groups)

main()
Output:
{'v1_rev': 3000, 'v2_rev': 4444, 'v3_rev': 66, 'v4_rev': 234534, 'v5_rev': 5665, 'v6_rev': 66, 'v7_rev': 66}
[('v2_rev', 'v5_rev'), ('v3_rev', 'v6_rev'), ('v3_rev', 'v7_rev'), ('v6_rev', 'v7_rev')]
Desired output:

Output:
[('v2_rev', 'v5_rev'), ('v3_rev', 'v6_rev','v7_rev')]
#2
What do you want done if a, b, and c are within 30% of each other, and b, c, and d are within 30% of each other? Should the groups be:
(a, b, c), (d)
(a, b, c), (b, c, d)
(a), (b), (c), (d), (a, b), (a, c), (b, c), (b, d), (c, d), (a, b, c), (b, c, d)
something else?
#3
(Mar-09-2023, 06:26 PM)deanhystad Wrote: What do you want done if a, b, and c are within 30% of each other, and b, c, and d are within 30% of each other? Should the groups be:
(a, b, c), (d)
(a, b, c), (b, c, d)
(a), (b), (c), (d), (a, b), (a, c), (b, c), (b, d), (c, d), (a, b, c), (b, c, d)
something else?
#4
I want the result to be

(a, b, c) and (b, c, d)

With the current implementation I think we will get

(a, b), (b, c), (c, a), (c, d), (b, d)
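To make the a, b, c, d case concrete, here is a minimal sketch that runs the pairwise approach on made-up numbers. The values 60, 70, 80, and 90 are hypothetical, chosen only so that a, b, c are mutually within 30% and b, c, d are mutually within 30%, while a and d are not. `pct_diff` is just a renamed copy of `pctDiff` from the first post.

```python
from itertools import combinations


def pct_diff(a, b):
    # Same formula as pctDiff in the first post
    return abs(a - b) * 200 / (a + b)


# Hypothetical values: a..c and b..d are each mutually within 30%, a and d are not
values = {"a": 60, "b": 70, "c": 80, "d": 90}

pairs = [
    (x, y)
    for x, y in combinations(values, 2)
    if pct_diff(values[x], values[y]) <= 30
]
print(pairs)
# [('a', 'b'), ('a', 'c'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
```

The pairs are all there, but nothing merges them into (a, b, c) and (b, c, d), which is the missing step.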
#5
data = {
    "v1_rev": 3000,
    "v2_rev": 4444,
    "v4_rev": 234534,
    "v5_rev": 5665,
    "v6_rev": 66,
    "v7_rev": 66,
    "v3_rev": 66,
}

# Sort items in increasing order of their value
items = iter(sorted(data.items(), key=lambda x: x[1]))
threshold = 30

bins = []
bin_ = [next(items)]
for item in items:
    # What is the smallest value that can be in bin with item?
    start = item[1] * (200 - threshold) / (200 + threshold)
    if bin_[0][1] < start:
        bins.append(bin_.copy())
    bin_.append(item)
    while bin_[0][1] < start:
        bin_.pop(0)

if bin_:
    bins.append(bin_)

bins = [dict(bin_) for bin_ in bins]
print(bins)
Output:
[{'v6_rev': 66, 'v7_rev': 66, 'v3_rev': 66}, {'v1_rev': 3000}, {'v2_rev': 4444, 'v5_rev': 5665}, {'v4_rev': 234534}]
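For anyone wondering where the `start` expression in the loop comes from, it appears to be the pctDiff condition solved for the smaller value. The notation here is mine, not from the code: a is the smaller value, b is the larger (the current item), t is the threshold in percent, and both values are assumed positive.

```latex
\begin{align*}
\frac{200\,(b - a)}{a + b} \le t
&\iff 200\,b - 200\,a \le t\,a + t\,b \\
&\iff (200 - t)\,b \le (200 + t)\,a \\
&\iff a \ge b \cdot \frac{200 - t}{200 + t}
\end{align*}
```

With t = 30 the factor is 170/230, roughly 0.739, so any value smaller than about 73.9% of the current item cannot share a bin with it.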
Here is the same algorithm as a generator, tested with overlapping bins:
from typing import Any, Generator


def dict_grouper(
    value_dict: dict[Any, float], percent: float = 30
) -> Generator[dict[Any, float], None, None]:
    """Yield groups in which every value is within "percent" of all the others."""
    scale = (200 - percent) / (200 + percent)
    grp = []
    # Walk the items in increasing order of value
    for item in sorted(value_dict.items(), key=lambda x: x[1]):
        # Smallest value that can share a group with item
        start = item[1] * scale
        if grp and grp[0][1] < start:
            # item cannot join the whole group, so the group is complete
            yield dict(grp)
            grp = [x for x in grp[1:] if x[1] >= start]
        grp.append(item)

    if grp:
        yield dict(grp)


print(*dict_grouper(dict(zip(("ABCDEFG"), range(30, 100, 10)))), sep="\n")
Output:
{'A': 30, 'B': 40}
{'B': 40, 'C': 50}
{'C': 50, 'D': 60}
{'D': 60, 'E': 70, 'F': 80}
{'E': 70, 'F': 80, 'G': 90}
This should be very fast and stay fast as the data grows. Using combinations with 7 items, there are 127 potential groups (every non-empty subset) and you would compute 742 pctDiff values. As the number of items increases, both of these numbers increase rapidly. My algorithm computes only one threshold per item (7 here), so apart from the initial sort, the number of calculations grows linearly with the item count.
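One reason a sorted sweep can get away with so few comparisons, assuming all values are positive: within a group, the smallest and largest values are the pair that is furthest apart, so if those two are within the threshold, every other pair is too. The sketch below compares a brute-force pairwise check against a min/max check on a few sample groups. The helper names `pct_diff`, `all_pairs_within`, and `extremes_within` are made up for this illustration.

```python
from itertools import combinations


def pct_diff(a, b):
    # Same formula as pctDiff in the first post
    return abs(a - b) * 200 / (a + b)


def all_pairs_within(values, percent=30):
    # Brute force: test every pair in the group
    return all(pct_diff(a, b) <= percent for a, b in combinations(values, 2))


def extremes_within(values, percent=30):
    # Shortcut for positive values: only the smallest and largest matter
    return pct_diff(min(values), max(values)) <= percent


for group in ([66, 66, 66], [4444, 5665], [60, 70, 80], [50, 60, 70]):
    # Both checks should agree on every group
    assert all_pairs_within(group) == extremes_within(group)
    print(group, extremes_within(group))
```

The first three groups pass and [50, 60, 70] fails, because 50 and 70 differ by about 33%.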