Python Forum
Help Refining DataFrame
#1
I have a DataFrame that I want to refine so that only the relevant data stays at the top. The only way I could figure out to do this was to split the DataFrame into two lists (I will also need to scatter plot the data later) and pass them through a loop. This is not working: the PyCharm console always gets stuck, either not running the code to the end or not returning k as intended. Please help me improve this code and make it work.
Here is the code:

import pandas as pd
from matplotlib import pyplot as plt


df = pd.read_csv("C:/Users/thech/OneDrive/Documents/Rocky/Workshop2A/200mm/DataXY/ExtractedCellsData/CurveE_RR0.csv")


x = df.iloc[:, 0]
y = df.iloc[:, 1]

#plt.scatter(x, y)
#plt.show(block=True)


xmax = x.iloc[-1]
ymax = y.max()
xmin = x.iloc[0]
ymin = y.min()

tgbeta = 2*(ymax-ymin)/(xmax-xmin)

xl = x.to_list()
yl = y.to_list()


def refine_data(xn: list[float], yn: list[float], tangent: float) -> int:
    k = 0  # number of deleted points/Cells
    for i in range(len(yn)):      # loop
        if abs(xn[i])/yn[i] < 0.8/tangent:
            yn[i] = yn[k]
            xn[i] = xn[k]
            k += 1
        return k

    refine_data(xl, yl, tgbeta)
Larz60+ wrote Nov-04-2024, 10:17 AM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
Tags have been added this time. Please use BBCode tags on future posts.
#2
Are you trying to select df rows where x / y < some threshold?
import pandas as pd
import matplotlib.pyplot as plt
import random


df = pd.DataFrame({"x": (random.random() for _ in range(100)), "y": (random.random() for _ in range(100))})
xrange = df.x.max() - df.x.min()
yrange = df.y.max() - df.y.min()
threshold = 0.8 / (2 * yrange / xrange)
df2 = df.loc[(df.x / df.y).abs() < threshold]

df2.plot.scatter(x="x", y="y")
plt.show()
You could extract the values and do everything in plain Python. This function needs a few changes.
def refine_data(xn: list[float], yn: list[float], tangent: float) -> int:
    k = 0  # number of deleted points/Cells
    for i in range(len(yn)):      # loop
        if abs(xn[i])/yn[i] < 0.8/tangent:
            yn[i] = yn[k]
            xn[i] = xn[k]
            k += 1
        return k
This part moves the points in the wrong direction.
            yn[i] = yn[k]
            xn[i] = xn[k]
            k += 1
It should be:
            yn[k] = yn[i]
            xn[k] = xn[i]
            k += 1
And the return is indented incorrectly. The function should not return until the loop has completed.
    for i in range(len(yn)):      # loop
        if abs(xn[i])/yn[i] < 0.8/tangent:
            yn[k] = yn[i]
            xn[k] = xn[i]
            k += 1
    return k
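Putting the two fixes together, here is a minimal sketch of the whole in-place function. Two things are my assumptions, not part of the original code: k is treated as the number of points kept (that is what the loop actually counts, despite the comment), and the leftover tail of each list is deleted so the lists contain only the kept points. The sample values are made up. Note that a y value of 0 would still raise ZeroDivisionError.

```python
def refine_data(xn: list[float], yn: list[float], tangent: float) -> int:
    """Keep only the points with abs(x) / y < 0.8 / tangent, in place.

    Returns the number of points kept.
    """
    threshold = 0.8 / tangent
    k = 0  # number of points kept so far; also the next write position
    for i in range(len(yn)):
        if abs(xn[i]) / yn[i] < threshold:
            # Copy the kept point forward to position k.
            xn[k] = xn[i]
            yn[k] = yn[i]
            k += 1
    # Assumption: drop the stale tail so only the kept points remain.
    del xn[k:]
    del yn[k:]
    return k


# Made-up sample data; tangent=0.8 gives a threshold of 1.0.
xs = [1.0, 4.0, 2.0, 9.0]
ys = [2.0, 1.0, 8.0, 3.0]
kept = refine_data(xs, ys, tangent=0.8)
print(kept, xs, ys)  # 2 [1.0, 2.0] [2.0, 8.0]
```

If you need the number of removed points instead, take the list length before the call and subtract the returned value.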
I think it would be better to have the function create new lists containing the refined points.
def refine_data(x: list[float], y: list[float], tangent: float) -> tuple[list[float], list[float]]:
    xn = []
    yn = []
    threshold = 0.8 / tangent
    for px, py in zip(x, y):
        if abs(px / py) < threshold:
            xn.append(px)
            yn.append(py)
    return xn, yn
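There is also an indentation error in the code calling the function: the call to refine_data is indented, so it is parsed as part of the function body instead of running at module level.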
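As a rough sketch of how the call could look once it is moved out of the function, here is the new-list version called at module level. The sample lists are made up and stand in for the two CSV columns; xr and yr are names I picked for the refined lists.

```python
def refine_data(
    x: list[float], y: list[float], tangent: float
) -> tuple[list[float], list[float]]:
    """Return new lists holding only the points with abs(x / y) < 0.8 / tangent."""
    xn = []
    yn = []
    threshold = 0.8 / tangent
    for px, py in zip(x, y):
        if abs(px / py) < threshold:
            xn.append(px)
            yn.append(py)
    return xn, yn


# Made-up values standing in for the two columns read from the CSV file.
xl = [0.0, 1.0, 2.0, 3.0, 4.0]
yl = [1.0, 3.0, 1.0, 5.0, 2.0]

# Same formula as the original post: x range from first/last value, y range from min/max.
tgbeta = 2 * (max(yl) - min(yl)) / (xl[-1] - xl[0])

# The call starts in column 0, so it runs when the script runs
# instead of being part of the function body.
xr, yr = refine_data(xl, yl, tgbeta)
print(len(xl) - len(xr), "points removed")
```

Because the function returns new lists, xl and yl are left untouched, which makes it easy to plot the original and the refined data side by side later.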
#3
This confuses me:
Quote: PyCharm console always gets stuck, not running the code until the end or not returning the k as it was intended to.
Getting stuck where? I notice there are references to plotting. Does it get stuck after drawing a plot? That might be because you use pyplot.show() to draw the plot windows. pyplot.show() blocks your code from running until the plot window is closed.

For example, this program draws two plots sequentially. It waits for you to close the first plot before drawing the second.
import pandas as pd
import matplotlib.pyplot as plt
import random


df = pd.DataFrame({"x": (random.random() for _ in range(100)), "y": (random.random() for _ in range(100))})
xrange = df.x.max() - df.x.min()
yrange = df.y.max() - df.y.min()

threshold = 0.8 / (2 * yrange / xrange)
df2 = df.loc[(df.x / df.y).abs() < threshold]
df2.plot.scatter(x="x", y="y", title="0.8")
plt.show()

threshold = 0.5 / (2 * yrange / xrange)
df2 = df.loc[(df.x / df.y).abs() < threshold]
df2.plot.scatter(x="x", y="y", title="0.5")
plt.show()
Is this the kind of behavior you are talking about?
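If that is the problem, one possible workaround is to create all the figures first and call plt.show() only once at the very end, so nothing waits on a plot window in the middle of the script. This is only a sketch: draw_filtered_plots and its parameters are names I made up, the data is random, and plain lists are used instead of a DataFrame to keep it short.

```python
import random

import matplotlib.pyplot as plt


def draw_filtered_plots(xs, ys, factors):
    """Create one scatter figure per factor without showing any of them."""
    xrange = max(xs) - min(xs)
    yrange = max(ys) - min(ys)
    figures = []
    for factor in factors:
        threshold = factor / (2 * yrange / xrange)
        kept = [(px, py) for px, py in zip(xs, ys) if abs(px / py) < threshold]
        fig, ax = plt.subplots()
        ax.scatter([p[0] for p in kept], [p[1] for p in kept])
        ax.set_title(str(factor))
        figures.append(fig)
    return figures


# Random sample data; the lower bound of 0.1 avoids dividing by zero.
xs = [random.uniform(0.1, 1.0) for _ in range(100)]
ys = [random.uniform(0.1, 1.0) for _ in range(100)]
figures = draw_filtered_plots(xs, ys, factors=(0.8, 0.5))

# Uncomment when running interactively: a single blocking call at the end
# opens both windows together, after all other code has already run.
# plt.show()
```

The rest of the script (including anything that prints k) finishes before the one blocking call, so the console should no longer look stuck halfway through.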