Filter data into new dataframe as main dataframe is being populated

cubangt · Oct-21-2023, 11:31 PM

Is this possible, i am populating a dataframe with a few fields and thousands of rows that will be saved into a CSV. For a separate reason, i need certain records to be inserted into a different dataframe, this will also be saved in a CSV, but for different audience and different purpose, so id like to see if during the "addition" of records into the main df, i can filter and add certain records that meet a criteria into a different df.

Or would it be easier to create the master df, then apply a filter to that and build the new df based on the filter. This 2nd df will require some calculations and other things before saving to CSV, so thats the reason for asking if its best to do it while df#1 is being populated or wait until df#1 is built, then build df#2

**deanhystad** · Oct-22-2023, 02:15 AM

What have you tried?

cubangt · (This post was last modified: Oct-22-2023, 04:08 AM by cubangt.)

These are just the tests i have been playing around with to get familiar with the methods of getting the data and calculations.
This is not in a script, just the things i have been trying so i can try to incorporate into my script.

This was done prior to the other help you provided.

# THIS IS COMPLETE STORE REGISTER BREAKDOWN LIST (NEED TO LOOK AT UPDATING AUTOMATICALLY WHEN NEW STORES ARE ADDED)
# SOURCE DATA TO BE UPDATED WITH EVERY NEW STORE AND LIST OF REGISTERS
registerBreakdown = pd.read_csv("RegisterList1.csv",names=['IP','Store','Register'])

# THIS RETURNS REGISTER COUNT BY STORE LOCATION, USED FOR CALCULATIONS FOR DASHBOARD
#registercntbystore = registerBreakdown.groupby(['Store'])['Register'].count()
registercntbystore = registerBreakdown.groupby(['Store'])['Register'].count()

# RETURNS THE TOTAL COUNTS FOR THE 2 STATUS (UP=0 AND DOWN=1)
#statuscnt = statusResults[["Status"]].value_counts()
statuscnt = statusResults[["Status"]].value_counts()
dfstatuscnt = pd.DataFrame(statusResults[["Status"]].value_counts(), columns=['Count'])

# RETURNS THE TOTAL COUNTS FOR THE DOWN STATUS
statuscntDown = statusResults[statusResults["Status"] == True]

# RETURNS THE TOTAL COUNTS FOR THE UP STATUS
statuscntUp = statusResults[statusResults["Status"] == False]

# RETURNS THE TOTAL COUNT FROM THE REGISTER LIST USED TO PING(ALL REGISTERS BEING PINGED/CHECKED)
loccnt = len(registerBreakdown.index)

# RETURNS THE COUNT AS NUMBER FOR DOWN
ttlDown = statuscnt[1] #len(statuscntDown.index)

# RETURNS THE COUNT AS NUMBER FOR UP
ttlUp = statuscnt[0] #len(statuscntUp.index)

cubangt · Oct-22-2023, 04:25 AM

What im trying to accomplish is to get a few calculation values returned to me.

I need to get a count of registers down by store and if the "%" is greater than "x%" then save that off into a new df

example:

Store 7 has a total of 8 registers, only 1 is down, that is 13% Not required in the df
Store 13 has a total of 15 registers, only 3 are down, but that is 20% which is over the % threshold so this would be added to the df
and so on.

Since the master result set that is saved into a csv contained the IP, Store, Register, Status and Datetime, the master df has everything needed to calculate the % by store.

My test from above using this line:

registercntbystore = registerBreakdown.groupby(['Store'])['Register'].count()

Gives me a df with all stores and the total count by store.

Cant get my head wrapped around how to use everything together in order to get a df with ONLY records that are over the % threshold. The purpose is that this will need to also be saved into a CSV for our support team to work on or contact the stores to find out what is wrong.

cubangt · (This post was last modified: Oct-22-2023, 05:49 AM by cubangt.)

Id like a dataframe with this information

Output:Store	Down	Up	Grand Total	Percentage
7	1	7	8	      13%
9	1	10	11	       9%
10	0	11	11	       0%
12	2	11	13	       15%

**deanhystad** · (This post was last modified: Oct-22-2023, 03:30 PM by deanhystad.)

Your choice. This example does the secondary report processing in the same script, but it is all post processing after gathering the data. You could just as easily move the post processing to a different script that reads in the ping results file.

import random
import pandas as pd
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor


pType = "Register"


def ping(register_info):
    """Check if register is online.
    register_info contains ip. store, register.
    Returns dictionary {ip, store, register, register online status, datetime}.
    """
    ip, store, register = register_info
    return {
        "IP": ip,
        "Store": store,
        "Register": register,
        "Status": random.choice((0, 0, 0, 0, 0, 1)),
        "Datetime": datetime.now().strftime("%m/%d/%Y,%H:%M:%S")
    }


def scan_files(directory="."):
    """Make fake stores"""
    for store in range(1, 10):
        for register in range(1, random.randint(8, 15)):
            yield (f"123.456.{store}.{register}", store,  register)

# Get all the ping info
with ThreadPoolExecutor(125) as executor:
    results = pd.DataFrame(executor.map(ping, scan_files()))

# Compute what % of registers are offline
by_store = results.groupby("Store")
status = by_store["Status"]
report = pd.concat([status.size(), status.sum()], axis=1).reset_index()
report.columns = ["Store", "Register Count", "Down"]
report["Percentage"] = report["Down"] / report["Register Count"] * 100
print("Register Status Report", report, sep="\n")

# Get list of stores with more than 20% of registers offline
bad_stores = report[report["Percentage"] > 20]["Store"].values
print("\nStores where Registers Offline > 20%", report[report["Store"].isin(bad_stores)], sep="\n")
for store in bad_stores:
    print(results.loc[by_store.groups[store]].reset_index(drop=True))

cubangt · Oct-22-2023, 08:53 PM

I want to better understand the example you provided.. without providing more examples.. Why am i not able to get the calculations of the registers that are up? What portion of the logic should i focus on within the example portion of this "# Compute what % of registers are offline"

I tried adding a "Up" column and it complained about "Length mismatch: Expected axis has 3 elements, new values have 4 elements"
I tried to create a new set of logic and just renamed the variables and status to Up, but no luck.

I appreciate all the help, but id like to see if i can figure it out, but need a little push on where to focus in this logic:

# Compute what % of registers are online
by_store = statusResults.groupby("Store")
statusUp = by_store["Status"]
report = pd.concat([statusUp.size(), statusUp.sum()], axis=1).reset_index()
report.columns = ["Store", "Register Count", "Up"]
report["Up"] = report["Up"] / report["Register Count"] * 100
print("Register Status Report Up", report, sep="\n")

cubangt · Oct-22-2023, 09:23 PM

Disregard my above question, i was overthinking it and figured out it will work as provided. thank you again..

Trying to incorporate into my script now, to see what i can complete this weekend.

cubangt · Oct-23-2023, 12:43 AM

Here is my test script updated and working as expected. Just need to replace the "statusResults" dataframe with the dynamically generated one in the main script, but i feel good about this working in production.

# THIS IS TEMPORARY UNTIL MOVED TO REAL SCRIPT(THIS WILL BE THE PING RESULTS BEFORE WRITING TO CSV)
statusResults = pd.read_csv("Register_ip_output_20231021_233512.csv",names=['IP', 'Store', 'Register','Status', 'Datetime'])

# Compute what % of registers are offline
by_store = statusResults.groupby("Store")
status = by_store["Status"]
report = pd.concat([status.size(), status.sum()], axis=1).reset_index()
report.columns = ["Store", "Register Count", "Percentage"]
report["Percentage"] = report["Percentage"] / report["Register Count"] * 100
statuscntUp = len(statusResults[statusResults["Status"] == False])

# USED TO CALCULATE NUMBER OF STORES WITH A % OF REGISTERS DOWN
percentDown = len(report[report["Percentage"] != 0].index)
# RETURNS THE TOTAL NUMBER OF REGISTERS CHECKED
RegisterCount = report["Register Count"].sum()
# TOTAL CHAIN % OF REGISTERS UP
TotalPercentUp = str(int(statuscntUp / RegisterCount * 100)) + "%"

# LIST OF STORES FOR DISPLAYING ON SCREEN(FORMATTED)
TwentyStores = ','.join(str(v) for v in report[(report["Percentage"] > 20) & (report["Percentage"] < 50)]["Store"].values)
# GETS COUNT OF STORES WITH MORE THAN 20%
TwentyPercent = len(report[(report["Percentage"] > 20) & (report["Percentage"] < 50)]["Store"].values)

# GETS COUNT OF STORES WITH MORE THAN 50%
FiftyPercent = len(report[report["Percentage"] >= 50]["Store"].values)
# LIST OF STORES FOR DISPLAYING ON SCREEN(FORMATTED)
FiftyStores = ','.join(str(v) for v in report[report["Percentage"] >= 50]["Store"].values)

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	renaming a column without a name in a dataframe	Carbonpony	2	797	Jan-23-2025, 08:20 AM Last Post: Carbonpony
	dataframe merge	gunther	2	562	Jan-22-2025, 05:23 PM Last Post: Pedroski55
	Running search/replace across Polars dataframe columns efficiently	hobbycoder	3	2,113	Oct-28-2024, 03:18 AM Last Post: hobbycoder
	Most efficient way to roll through a pandas dataframe?	sawtooth500	2	1,075	Aug-28-2024, 10:08 AM Last Post: Alice12
	Confused by the different ways of extracting data in DataFrame	leea2024	1	646	Aug-17-2024, 01:34 PM Last Post: deanhystad
	docx file to pandas dataframe/excel	iitip92	1	2,419	Jun-27-2024, 05:28 AM Last Post: Pedroski55
	FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries	sawtooth500	14	7,683	Apr-24-2024, 01:42 AM Last Post: sawtooth500
	Elegant way to apply each element of an array to a dataframe?	sawtooth500	7	2,527	Mar-29-2024, 05:51 PM Last Post: deanhystad
	Dataframe copy warning	sawtooth500	4	2,116	Mar-25-2024, 11:38 PM Last Post: sawtooth500
	FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries i	sawtooth500	3	16,887	Mar-22-2024, 03:08 AM Last Post: deanhystad

Filter data into new dataframe as main dataframe is being populated

User Panel Messages

Announcements