Python Forum
How to detect abnormal data in big database python - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: How to detect abnormal data in big database python (/thread-40207.html)



How to detect abnormal data in big database python - vanphuht91 - Jun-19-2023

I have a database with 10,000 rows and 3,000 columns. My task is detect abnormal data when compare line by line, item by item.

import matplotlib.pyplot as plt
import pandas as pd

file = "E:/Document/python/Book1.xlsx"
df = pd.read_excel(file)

outliers = dict( marker ="o",mfc= "w",mec = "r",mew=1.5)
medianprops = dict(lw=1.5,c = "b")
meanprops = dict(marker="+",ms=10, mew=3, c = "g")
whiskerprops = dict(ls="--",lw=1.5,c = "k")
capprops = dict(lw=1.5,c = "orange")
boxprops = dict(lw=1.5)
cols = df.columns[3:]
for col in cols:
    df.boxplot(col,by="Tester",
                grid= False,
                showfliers=True, flierprops = outliers,
                boxprops=boxprops,medianprops=medianprops,
                showmeans=True, meanprops=meanprops,
                whiskerprops= whiskerprops, capprops=capprops)
    plt.suptitle("")
    plt.xlabel("")
plt.show()
My code showed abnormal data with a small data like this:
[attachment=2427] [attachment=2428]
We can't use this to a very big data. My idea is detect abnormal first then plot figure to showed it.
But, I don't know how to do this. Think Think Think
Who has any good ideas that can help me so much.
Thanks for reading!


RE: How to detect abnormal data in big database python - deanhystad - Jun-22-2023

You need to define "abnormal data". Once you do that, it is simply a matter of using pandas (or scipy) to perform some computations to identify the abnormal data.

So what is abnormal data? From your examples it appears abnormal data is a "Tester" that greatly differs from the mean, but by how much? 1 standard deviation? 2?


RE: How to detect abnormal data in big database python - Pedroski55 - Jun-22-2023

3000 columns? Is that really necessary?

1. Like deanhystad says, define abnormal
2. Connect to your database.
3. Make an SQL query which does what you want for 1 row.
4. Repeat for all rows

The problem as I see it is getting all the column names. You don't really want to write 3000 column names in your SQL query!

You can find all the column names for any given table in the database information_schema, in the table COLUMNS and in the column COLUMN_NAME

In my database that looks like this:

Quote:SELECT COLUMN_NAME FROM information_schema.COLUMNS WHERE TABLE_SCHEMA = 'allstudentsdb' AND TABLE_NAME = 'allstudentsAnswers21BECW' AND DATA_TYPE = 'text'

From the command line you can use mysqldump piped to grep to display what you want. That is mentioned in the link below.

Check out this link for help making such an SQL query.

I use the module pymysql to connect to my little cloud server database. Works great!


RE: How to detect abnormal data in big database python - Skaperen - Jun-23-2023

(Jun-22-2023, 11:09 PM)Pedroski55 Wrote: 3000 columns? Is that really necessary?
sounds like the definition of an abnormal DBA.


RE: How to detect abnormal data in big database python - vanphuht91 - Jun-23-2023

(Jun-22-2023, 08:41 PM)deanhystad Wrote: You need to define "abnormal data". Once you do that, it is simply a matter of using pandas (or scipy) to perform some computations to identify the abnormal data.

So what is abnormal data? From your examples it appears abnormal data is a "Tester" that greatly differs from the mean, but by how much? 1 standard deviation? 2?
Thanks mod for your answer. Currently, I compare mean bw testers, by 3 deviation.
But, still compare manual, I want to use python code to detect abnormal data automatic then show this.


RE: How to detect abnormal data in big database python - Skaperen - Jun-27-2023

maybe "abnormal data" means some string in a row/column that has a value that cannot be interpreted as the expected type, such as "The day before Christmas" when a date is expected. or maybe just "31 June 2023".