Python Forum
Grouping Data based on 30% bracket
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Grouping Data based on 30% bracket
#1
I want to one group where all values are with 30% of each other.

working code is follows :

from itertools import combinations

def pctDiff(A,B):
    return abs(A-B)*200/(A+B)

def main():
    dict2={}
    dict ={'acct_number':10202,'acct_name':'abc','v1_rev':3000,'v2_rev':4444,'v4_rev':234534,'v5_rev':5665,'v6_rev':66,'v7_rev':66,'v3_rev':66}
    vendors_revenue_list =['v1_rev','v2_rev','v3_rev','v4_rev','v5_rev','v6_rev','v7_rev','v8_rev']
    #prepared list of vendors
    for k in vendors_revenue_list:
        if k in dict.keys():
            dict2.update({k: dict[k]})

    print(dict2)
    #provides all possible combination
    for a, b in combinations(dict2, 2):
        groups = [(a,b) for a,b in combinations(dict2,2) if pctDiff(dict2[a],dict2[b]) <= 30]

    print(groups)
Output:
{'v1_rev': 3000, 'v2_rev': 4444, 'v3_rev': 66, 'v4_rev': 234534, 'v5_rev': 5665, 'v6_rev': 66, 'v7_rev': 66} [('v2_rev', 'v5_rev'), ('v3_rev', 'v6_rev'), ('v3_rev', 'v7_rev'), ('v6_rev', 'v7_rev')]
desired output

Output:
[('v2_rev', 'v5_rev'), ('v3_rev', 'v6_rev','v7_rev')]
Reply
#2
What do you want done if a, b, and c are within 30% of each other, and b, c and d are within 30% of each other. Should the groups be:
(a, b, c), (d)
(a, b. c), (b, c, d)
(a), (b), ©, (d), (a, b), (a, c), (b, c), (b, d), (c, d), (a, b, c), (b, c, d)
something else?
Reply
#3
(Mar-09-2023, 06:26 PM)deanhystad Wrote: What do you want done if a, b, and c are within 30% of each other, and b, c and d are within 30% of each other. Should the groups be:
(a, b, c), (d)
(a, b. c), (b, c, d)
(a), (b), ©, (d), (a, b), (a, c), (b, c), (b, d), (c, d), (a, b, c), (b, c, d)
something else?
Reply
#4
I want result as

(a,b,c) and (b,c,d)

With current implementation I think we will get

(a,b) (b,c) (c,a) (c,d)(b,d)
Reply
#5
data = {
    "v1_rev": 3000,
    "v2_rev": 4444,
    "v4_rev": 234534,
    "v5_rev": 5665,
    "v6_rev": 66,
    "v7_rev": 66,
    "v3_rev": 66,
}

# Sort items in increasing order of their value
items = iter(sorted(data.items(), key=lambda x: x[1]))
threshold = 30

bins = []
bin_ = [next(items)]
for item in items:
    # What is the smallest value that can be in bin with item?
    start = item[1] * (200 - threshold) / (200 + threshold)
    if bin_[0][1] < start:
        bins.append(bin_.copy())
    bin_.append(item)
    while bin_[0][1] < start:
        bin_.pop(0)

if bin_:
    bins.append(bin_)

bins = [dict(bin_) for bin_ in bins]
print(bins)
Output:
[{'v6_rev': 66, 'v7_rev': 66, 'v3_rev': 66}, {'v1_rev': 3000}, {'v2_rev': 4444, 'v5_rev': 5665}, {'v4_rev': 234534}]
As a generator. Testing with overlapping bins:
from typing import Any, Generator


def dict_grouper(
    value_dict: dict[Any, float], percent: float = 30
) -> Generator[dict[Any, float], None, None]:
    """Group values where each value in group is within "percent" of others."""
    scale = (200 - percent) / (200 + percent)
    items = iter(sorted(value_dict.items(), key=lambda x: x[1]))
    grp = [next(items)]
    for item in items:
        start = item[1] * scale
        if grp[0][1] < start:
            yield dict(grp)
            grp = [x for x in grp[1:] if x[1] >= start]
        grp.append(item)

    if grp:
        yield dict(grp)


print(*dict_grouper(dict(zip(("ABCDEFG"), range(30, 100, 10)))), sep="\n")
Output:
{'A': 30, 'B': 40} {'B': 40, 'C': 50} {'C': 50, 'D': 60} {'D': 60, 'E': 70, 'F': 80} {'E': 70, 'F': 80, 'G': 90}
This should be very fast and stay fast. Using combinations with 7 items there are 127 potential groups and you would compute 742 pctDiff's. As the number of items increases, both these numbers increase rapidly. My algorithm only has to compute pctDiff 7 times, and the number of calculations grows linearly with the item count.
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Changing client.get() method type based on size of data... dl0dth 1 815 Jan-02-2025, 08:30 PM
Last Post: dl0dth
  conditionals based on data frame mbrown009 1 1,654 Aug-12-2022, 08:18 AM
Last Post: Larz60+
  I have written a program that outputs data based on GPS signal kalle 1 2,194 Jul-22-2022, 12:10 AM
Last Post: mcmxl22
Question Change elements of array based on position of input data Cola_Reb 6 3,637 May-13-2022, 12:57 PM
Last Post: Cola_Reb
  How to map two data frames based on multiple condition SriRajesh 0 2,427 Oct-27-2021, 02:43 PM
Last Post: SriRajesh
  Grouping and sum of a list of objects Otbredbaron 1 5,657 Oct-23-2021, 01:42 PM
Last Post: Gribouillis
  Extracting unique pairs from a data set based on another value rybina 2 3,156 Feb-12-2021, 08:36 AM
Last Post: rybina
  Data extraction from a table based on column and row names tgottsc1 1 3,228 Jan-09-2021, 10:04 PM
Last Post: buran
  Grouping and summing of dataset jef 0 2,273 Oct-04-2020, 11:03 PM
Last Post: jef
  Extracting data based on specific patterns in a text file K11 1 2,979 Aug-28-2020, 09:00 AM
Last Post: Gribouillis

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020