Python Forum
How to detect abnormal data in big database python
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
How to detect abnormal data in big database python
#1
I have a database with 10,000 rows and 3,000 columns. My task is detect abnormal data when compare line by line, item by item.

import matplotlib.pyplot as plt
import pandas as pd

file = "E:/Document/python/Book1.xlsx"
df = pd.read_excel(file)

outliers = dict( marker ="o",mfc= "w",mec = "r",mew=1.5)
medianprops = dict(lw=1.5,c = "b")
meanprops = dict(marker="+",ms=10, mew=3, c = "g")
whiskerprops = dict(ls="--",lw=1.5,c = "k")
capprops = dict(lw=1.5,c = "orange")
boxprops = dict(lw=1.5)
cols = df.columns[3:]
for col in cols:
    df.boxplot(col,by="Tester",
                grid= False,
                showfliers=True, flierprops = outliers,
                boxprops=boxprops,medianprops=medianprops,
                showmeans=True, meanprops=meanprops,
                whiskerprops= whiskerprops, capprops=capprops)
    plt.suptitle("")
    plt.xlabel("")
plt.show()
My code showed abnormal data with a small data like this:
       
We can't use this to a very big data. My idea is detect abnormal first then plot figure to showed it.
But, I don't know how to do this. Think Think Think
Who has any good ideas that can help me so much.
Thanks for reading!

Attached Files

.xlsx   Book1.xlsx (Size: 16.29 KB / Downloads: 78)
Reply
#2
You need to define "abnormal data". Once you do that, it is simply a matter of using pandas (or scipy) to perform some computations to identify the abnormal data.

So what is abnormal data? From your examples it appears abnormal data is a "Tester" that greatly differs from the mean, but by how much? 1 standard deviation? 2?
vanphuht91 likes this post
Reply
#3
3000 columns? Is that really necessary?

1. Like deanhystad says, define abnormal
2. Connect to your database.
3. Make an SQL query which does what you want for 1 row.
4. Repeat for all rows

The problem as I see it is getting all the column names. You don't really want to write 3000 column names in your SQL query!

You can find all the column names for any given table in the database information_schema, in the table COLUMNS and in the column COLUMN_NAME

In my database that looks like this:

Quote:SELECT COLUMN_NAME FROM information_schema.COLUMNS WHERE TABLE_SCHEMA = 'allstudentsdb' AND TABLE_NAME = 'allstudentsAnswers21BECW' AND DATA_TYPE = 'text'

From the command line you can use mysqldump piped to grep to display what you want. That is mentioned in the link below.

Check out this link for help making such an SQL query.

I use the module pymysql to connect to my little cloud server database. Works great!
Reply
#4
(Jun-22-2023, 11:09 PM)Pedroski55 Wrote: 3000 columns? Is that really necessary?
sounds like the definition of an abnormal DBA.
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
Reply
#5
(Jun-22-2023, 08:41 PM)deanhystad Wrote: You need to define "abnormal data". Once you do that, it is simply a matter of using pandas (or scipy) to perform some computations to identify the abnormal data.

So what is abnormal data? From your examples it appears abnormal data is a "Tester" that greatly differs from the mean, but by how much? 1 standard deviation? 2?
Thanks mod for your answer. Currently, I compare mean bw testers, by 3 deviation.
But, still compare manual, I want to use python code to detect abnormal data automatic then show this.
Reply
#6
maybe "abnormal data" means some string in a row/column that has a value that cannot be interpreted as the expected type, such as "The day before Christmas" when a date is expected. or maybe just "31 June 2023".
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  When does Python detect Errors? stamp1t 1 448 Oct-21-2023, 05:53 PM
Last Post: deanhystad
  Database that can compress a column, or all data, automatically? Calab 3 1,197 May-22-2023, 03:25 AM
Last Post: Calab
  Basic SQL query using Py: Inserting or querying sqlite3 database not returning data marlonbown 3 1,392 Nov-08-2022, 07:16 PM
Last Post: marlonbown
  I need help parsing through data and creating a database using beautiful soup username369 1 1,719 Sep-22-2021, 08:45 PM
Last Post: Larz60+
  SaltStack: MySQL returner save less data into Database table columns xtc14 2 2,183 Jul-02-2021, 02:19 PM
Last Post: xtc14
  Python chess game to detect winner ddddd 1 2,013 Dec-13-2020, 10:24 PM
Last Post: michael1789
  Looping to read data in database CEC68 1 1,721 Sep-24-2020, 08:54 PM
Last Post: scidam
  Get database used data space from pyodbc susja 1 2,261 Aug-14-2020, 02:01 PM
Last Post: susja
  sqlite3 database does not save data across restarting the program SheeppOSU 1 3,457 Jul-24-2020, 05:53 AM
Last Post: SheeppOSU
  Importing data from a text file into an SQLite database with Python macieju1974 7 4,145 Jun-29-2020, 08:51 PM
Last Post: buran

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020