Python Forum
Filter data into new dataframe as main dataframe is being populated
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Filter data into new dataframe as main dataframe is being populated
#1
Is this possible, i am populating a dataframe with a few fields and thousands of rows that will be saved into a CSV. For a separate reason, i need certain records to be inserted into a different dataframe, this will also be saved in a CSV, but for different audience and different purpose, so id like to see if during the "addition" of records into the main df, i can filter and add certain records that meet a criteria into a different df.

Or would it be easier to create the master df, then apply a filter to that and build the new df based on the filter. This 2nd df will require some calculations and other things before saving to CSV, so thats the reason for asking if its best to do it while df#1 is being populated or wait until df#1 is built, then build df#2
Reply
#2
What have you tried?
Reply
#3
These are just the tests i have been playing around with to get familiar with the methods of getting the data and calculations.
This is not in a script, just the things i have been trying so i can try to incorporate into my script.

This was done prior to the other help you provided.

# THIS IS COMPLETE STORE REGISTER BREAKDOWN LIST (NEED TO LOOK AT UPDATING AUTOMATICALLY WHEN NEW STORES ARE ADDED)
# SOURCE DATA TO BE UPDATED WITH EVERY NEW STORE AND LIST OF REGISTERS
registerBreakdown = pd.read_csv("RegisterList1.csv",names=['IP','Store','Register'])

# THIS RETURNS REGISTER COUNT BY STORE LOCATION, USED FOR CALCULATIONS FOR DASHBOARD
#registercntbystore = registerBreakdown.groupby(['Store'])['Register'].count()
registercntbystore = registerBreakdown.groupby(['Store'])['Register'].count()

# RETURNS THE TOTAL COUNTS FOR THE 2 STATUS (UP=0 AND DOWN=1)
#statuscnt = statusResults[["Status"]].value_counts()
statuscnt = statusResults[["Status"]].value_counts()
dfstatuscnt = pd.DataFrame(statusResults[["Status"]].value_counts(), columns=['Count'])

# RETURNS THE TOTAL COUNTS FOR THE DOWN STATUS
statuscntDown = statusResults[statusResults["Status"] == True]

# RETURNS THE TOTAL COUNTS FOR THE UP STATUS
statuscntUp = statusResults[statusResults["Status"] == False]

# RETURNS THE TOTAL COUNT FROM THE REGISTER LIST USED TO PING(ALL REGISTERS BEING PINGED/CHECKED)
loccnt = len(registerBreakdown.index)

# RETURNS THE COUNT AS NUMBER FOR DOWN
ttlDown = statuscnt[1] #len(statuscntDown.index)

# RETURNS THE COUNT AS NUMBER FOR UP
ttlUp = statuscnt[0] #len(statuscntUp.index)
Reply
#4
What im trying to accomplish is to get a few calculation values returned to me.

I need to get a count of registers down by store and if the "%" is greater than "x%" then save that off into a new df

example:

Store 7 has a total of 8 registers, only 1 is down, that is 13% Not required in the df
Store 13 has a total of 15 registers, only 3 are down, but that is 20% which is over the % threshold so this would be added to the df
and so on.

Since the master result set that is saved into a csv contained the IP, Store, Register, Status and Datetime, the master df has everything needed to calculate the % by store.

My test from above using this line:
registercntbystore = registerBreakdown.groupby(['Store'])['Register'].count()
Gives me a df with all stores and the total count by store.

Cant get my head wrapped around how to use everything together in order to get a df with ONLY records that are over the % threshold. The purpose is that this will need to also be saved into a CSV for our support team to work on or contact the stores to find out what is wrong.
Reply
#5
Id like a dataframe with this information

Output:
Store Down Up Grand Total Percentage 7 1 7 8 13% 9 1 10 11 9% 10 0 11 11 0% 12 2 11 13 15%
Reply
#6
Your choice. This example does the secondary report processing in the same script, but it is all post processing after gathering the data. You could just as easily move the post processing to a different script that reads in the ping results file.
import random
import pandas as pd
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor


pType = "Register"


def ping(register_info):
    """Check if register is online.
    register_info contains ip. store, register.
    Returns dictionary {ip, store, register, register online status, datetime}.
    """
    ip, store, register = register_info
    return {
        "IP": ip,
        "Store": store,
        "Register": register,
        "Status": random.choice((0, 0, 0, 0, 0, 1)),
        "Datetime": datetime.now().strftime("%m/%d/%Y,%H:%M:%S")
    }


def scan_files(directory="."):
    """Make fake stores"""
    for store in range(1, 10):
        for register in range(1, random.randint(8, 15)):
            yield (f"123.456.{store}.{register}", store,  register)

# Get all the ping info
with ThreadPoolExecutor(125) as executor:
    results = pd.DataFrame(executor.map(ping, scan_files()))

# Compute what % of registers are offline
by_store = results.groupby("Store")
status = by_store["Status"]
report = pd.concat([status.size(), status.sum()], axis=1).reset_index()
report.columns = ["Store", "Register Count", "Down"]
report["Percentage"] = report["Down"] / report["Register Count"] * 100
print("Register Status Report", report, sep="\n")

# Get list of stores with more than 20% of registers offline
bad_stores = report[report["Percentage"] > 20]["Store"].values
print("\nStores where Registers Offline > 20%", report[report["Store"].isin(bad_stores)], sep="\n")
for store in bad_stores:
    print(results.loc[by_store.groups[store]].reset_index(drop=True))
Reply
#7
I want to better understand the example you provided.. without providing more examples.. Why am i not able to get the calculations of the registers that are up? What portion of the logic should i focus on within the example portion of this "# Compute what % of registers are offline"

I tried adding a "Up" column and it complained about "Length mismatch: Expected axis has 3 elements, new values have 4 elements"
I tried to create a new set of logic and just renamed the variables and status to Up, but no luck.

I appreciate all the help, but id like to see if i can figure it out, but need a little push on where to focus in this logic:

# Compute what % of registers are online
by_store = statusResults.groupby("Store")
statusUp = by_store["Status"]
report = pd.concat([statusUp.size(), statusUp.sum()], axis=1).reset_index()
report.columns = ["Store", "Register Count", "Up"]
report["Up"] = report["Up"] / report["Register Count"] * 100
print("Register Status Report Up", report, sep="\n")
Reply
#8
Disregard my above question, i was overthinking it and figured out it will work as provided. thank you again..

Trying to incorporate into my script now, to see what i can complete this weekend.
Reply
#9
Here is my test script updated and working as expected. Just need to replace the "statusResults" dataframe with the dynamically generated one in the main script, but i feel good about this working in production.

# THIS IS TEMPORARY UNTIL MOVED TO REAL SCRIPT(THIS WILL BE THE PING RESULTS BEFORE WRITING TO CSV)
statusResults = pd.read_csv("Register_ip_output_20231021_233512.csv",names=['IP', 'Store', 'Register','Status', 'Datetime'])

# Compute what % of registers are offline
by_store = statusResults.groupby("Store")
status = by_store["Status"]
report = pd.concat([status.size(), status.sum()], axis=1).reset_index()
report.columns = ["Store", "Register Count", "Percentage"]
report["Percentage"] = report["Percentage"] / report["Register Count"] * 100
statuscntUp = len(statusResults[statusResults["Status"] == False])

# USED TO CALCULATE NUMBER OF STORES WITH A % OF REGISTERS DOWN
percentDown = len(report[report["Percentage"] != 0].index)
# RETURNS THE TOTAL NUMBER OF REGISTERS CHECKED
RegisterCount = report["Register Count"].sum()
# TOTAL CHAIN % OF REGISTERS UP
TotalPercentUp = str(int(statuscntUp / RegisterCount * 100)) + "%"

# LIST OF STORES FOR DISPLAYING ON SCREEN(FORMATTED)
TwentyStores = ','.join(str(v) for v in report[(report["Percentage"] > 20) & (report["Percentage"] < 50)]["Store"].values)
# GETS COUNT OF STORES WITH MORE THAN 20%
TwentyPercent = len(report[(report["Percentage"] > 20) & (report["Percentage"] < 50)]["Store"].values)

# GETS COUNT OF STORES WITH MORE THAN 50%
FiftyPercent = len(report[report["Percentage"] >= 50]["Store"].values)
# LIST OF STORES FOR DISPLAYING ON SCREEN(FORMATTED)
FiftyStores = ','.join(str(v) for v in report[report["Percentage"] >= 50]["Store"].values)
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries sawtooth500 14 451 Apr-24-2024, 01:42 AM
Last Post: sawtooth500
  Elegant way to apply each element of an array to a dataframe? sawtooth500 7 428 Mar-29-2024, 05:51 PM
Last Post: deanhystad
  Dataframe copy warning sawtooth500 4 361 Mar-25-2024, 11:38 PM
Last Post: sawtooth500
  FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries i sawtooth500 3 1,217 Mar-22-2024, 03:08 AM
Last Post: deanhystad
  Adding PD DataFrame column bsben 2 323 Mar-08-2024, 10:46 PM
Last Post: deanhystad
  Python Alteryx QS-Passing pandas dataframe column inside SQL query where condition sanky1990 0 747 Dec-04-2023, 09:48 PM
Last Post: sanky1990
  Comparing Dataframe to String? RockBlok 2 418 Nov-24-2023, 04:55 PM
Last Post: RockBlok
  DataFRame.concat() nafshar 3 791 Jul-14-2023, 04:41 PM
Last Post: nafshar
  Convert dataframe from str back to datafarme Creepy 1 637 Jul-07-2023, 02:13 PM
Last Post: snippsat
  Question on pandas.dataframe merging two colums shomikc 4 842 Jun-29-2023, 11:30 AM
Last Post: snippsat

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020