Python Forum

Machine learning SQL injection detection
Good day. I am a postgraduate student working on "Detecting and preventing SQL injection attack on a database using machine learning approach". My major challenge right now is generating the dataset and writing the appropriate code in Python. I would be very grateful for any help you can offer. Thanks a lot.
I moved the thread to the News and Discussions sub-forum because it looks more appropriate for a general discussion of possible approaches. My understanding is that you don't have code or specific questions yet.
First off, I hope you are using the latest version of Python (3.6.3). You might use Python's built-in sqlite3 module to create your test database. Once it is created, make a backup copy so you always have one pristine copy and one you can attack. I've seen people working with databases of thousands of entries when all you really need is a minimal amount. In your case, 2-5 entries would probably be enough to initially test the actual code for injection and detection. Once satisfied, you can always increase the size of the database or even try it against other databases.
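
As a rough sketch of that setup: the file names, the users table and its three rows below are made up for illustration, and the two lookup functions are hypothetical helpers, not part of anyone's project. It creates a tiny sqlite3 database, keeps a pristine copy next to the one being attacked, and shows the same input run through a deliberately vulnerable query and a parameterized one.

```python
import os
import shutil
import sqlite3
import tempfile

# Hypothetical file names and schema, chosen only for this example.
workdir = tempfile.mkdtemp()
pristine_path = os.path.join(workdir, "pristine.db")
target_path = os.path.join(workdir, "target.db")

# Create a minimal test database with a handful of rows.
conn = sqlite3.connect(pristine_path)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, password TEXT)")
conn.executemany(
    "INSERT INTO users (name, password) VALUES (?, ?)",
    [("alice", "wonderland"), ("bob", "builder"), ("carol", "singer")],
)
conn.commit()
conn.close()

# Keep the pristine file untouched and attack only the copy.
shutil.copyfile(pristine_path, target_path)


def lookup_unsafe(db_path, name):
    """Vulnerable on purpose: user input is pasted into the SQL text."""
    conn = sqlite3.connect(db_path)
    try:
        query = "SELECT name FROM users WHERE name = '" + name + "'"
        return conn.execute(query).fetchall()
    finally:
        conn.close()


def lookup_safe(db_path, name):
    """Parameterized query: the input is passed as data, never as SQL."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            "SELECT name FROM users WHERE name = ?", (name,)
        ).fetchall()
    finally:
        conn.close()


payload = "nobody' OR '1'='1"
unsafe_rows = lookup_unsafe(target_path, payload)  # the OR clause matches every row
safe_rows = lookup_safe(target_path, payload)      # no user has that literal name
print(len(unsafe_rows), len(safe_rows))
```

Inputs that behave differently in the two functions, like the payload here, are the kind of strings you would eventually want labelled as attacks in a dataset.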

If you run into problems, either with the database or the program, we are here to help. Be sure to read the section of our Help document on BBCode before you post your code, errors, and output.
I have code for malicious URL detection, but I want it rewritten for SQL injection detection. Please, can anyone help? The code is below. Thanks.

import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the labelled dataset; data.csv must contain "url" and "label" columns.
urls_data = pd.read_csv("data.csv")
print(urls_data.head())


def makeTokens(f):
    """Split a URL into unique tokens on '/', '-' and '.'."""
    # Work on the text itself: under Python 3, str(f.encode('utf-8'))
    # would leave a b'...' wrapper inside the tokens.
    tkns_BySlash = str(f).split('/')
    total_Tokens = []
    for i in tkns_BySlash:
        tokens = i.split('-')
        tkns_ByDot = []
        for token in tokens:
            tkns_ByDot += token.split('.')
        total_Tokens += tokens + tkns_ByDot
    # Drop duplicates, then remove 'com', which appears in most URLs.
    total_Tokens = list(set(total_Tokens))
    if 'com' in total_Tokens:
        total_Tokens.remove('com')
    return total_Tokens


y = urls_data["label"]
url_list = urls_data["url"]

# Turn each URL into a TF-IDF vector using the custom tokenizer.
vectorizer = TfidfVectorizer(tokenizer=makeTokens)
x = vectorizer.fit_transform(url_list)

# Hold out 20% of the rows for testing.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

logit = LogisticRegression()
logit.fit(x_train, y_train)
print("Accuracy", logit.score(x_test, y_test))

# Classify a few unseen URLs with the trained model.
x_predict = ["http://www.psn.com.pk/",
             "google.com/search=faizanahmad",
             "www.radsport-voggel.de/wp-admin/includes/log.exe",
             "www.radsport-voggel.de/wp-admin/includes/an/log.exe",
             "www.google.com",
             "www.google-scholar.com/wp-good"]
x_predict = vectorizer.transform(x_predict)
New_predict = logit.predict(x_predict)
print(New_predict)
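
One possible way to adapt the script above to SQL injection, offered as a sketch rather than a tested solution: keep TF-IDF plus logistic regression, but swap the URL tokenizer for one that splits on SQL punctuation. The sql_tokens helper, its regular expression and the eight hand-written training strings are all assumptions made for illustration. A real experiment would need a properly collected and labelled dataset, and the predictions from a toy set this small should not be trusted.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical tokenizer: runs of word characters become tokens, and every
# other non-space character (quote, '=', ';', '-', ...) is its own token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def sql_tokens(text):
    return TOKEN_RE.findall(str(text).lower())


# Toy, hand-written inputs purely for illustration (1 = injection attempt).
queries = [
    "alice",
    "bob@example.com",
    "carol smith",
    "42",
    "' OR '1'='1",
    "admin' --",
    "1; DROP TABLE users",
    "' UNION SELECT name, password FROM users --",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# token_pattern=None signals that the default pattern is unused on purpose,
# because the custom tokenizer replaces it.
vectorizer = TfidfVectorizer(tokenizer=sql_tokens, token_pattern=None)
x = vectorizer.fit_transform(queries)

model = LogisticRegression()
model.fit(x, labels)

new_inputs = ["dave", "' OR 'a'='a"]
predictions = model.predict(vectorizer.transform(new_inputs))
print(predictions)
```

The main design choice is keeping quotes, semicolons and comment dashes as tokens, since those characters are what distinguish many injection strings from ordinary input; the URL tokenizer would throw most of them away.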