Tuesday, 30 August 2016

Random String Python Text Classifier Example

In this post I'm going to explain how to write a simple Naive Bayes text classifier in Python and provide some example code.

Machine Learning!? Awesome!!1!

My original goal was to tell the difference between regular dictionary words and random strings. I looked at manual ngram frequency analysis, which partially worked, but I wanted to try out an ML solution for comparison.
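For context, a manual ngram frequency check might look something like the sketch below. This is not the analysis I actually ran; the tiny reference list and the function names (bigrams, familiarity) are made up for illustration. The idea is to score a string by how many of its character pairs also appear in known dictionary words.

```python
from collections import Counter

def bigrams(word):
    """Return the list of adjacent character pairs in a word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

# Tiny illustrative reference list (an assumption); a real one would be a full dictionary
reference_words = ["apple", "banana", "orange", "grape", "lemon", "melon"]
known = Counter(bg for w in reference_words for bg in bigrams(w))

def familiarity(word):
    """Fraction of the word's bigrams that also occur in the reference words."""
    grams = bigrams(word)
    if not grams:
        return 0.0
    return sum(1 for bg in grams if bg in known) / float(len(grams))

print(familiarity("apple"))   # every bigram is in the reference list
print(familiarity("xgrdqw"))  # only a few bigrams are familiar
```

A threshold on this score gives a crude normal/random decision, but picking the threshold by hand is exactly the part a trained classifier takes over.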

I don't have much ML experience but it was easy to build a working script using the scikit-learn library. This library abstracts away much of the mathematical complexity and offers a quick, high-level way to implement ML concepts. In just a few lines of Python I was able to build a classifier with 93% accuracy.

It's worth mentioning I did not use the "bag of words" approach, as I wanted to analyse the structure of individual words rather than sentences. By changing the CountVectorizer parameters you could look at sentences or groups of words instead.
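As a rough illustration of that difference, the sketch below builds two vectorizers over the same made-up sentences: one using character ngrams inside word boundaries (the settings used later in this post) and one using whole words and word pairs. The example sentences and variable names are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up example sentences
sentences = ["the quick brown fox", "the lazy dog"]

# Character ngrams within word boundaries: looks at the structure of each word
char_vec = CountVectorizer(analyzer='char_wb', ngram_range=(2, 4))
char_vec.fit(sentences)

# Word unigrams and bigrams: features are words and pairs of adjacent words
word_vec = CountVectorizer(analyzer='word', ngram_range=(1, 2))
word_vec.fit(sentences)

# vocabulary_ maps each learned feature to a column index
print(sorted(char_vec.vocabulary_))
print(sorted(word_vec.vocabulary_))
```

The first vocabulary contains fragments such as "qu" and "row", while the second contains entries such as "quick" and "quick brown".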

Building a Classifier

Building a classifier is quite simple: you collect your data, format it, vectorize it, then train your model. My script below pretty much follows that process. First up I read in some data using pandas. I have two csv data sets: one file has normal dictionary words, the other random words. Each row contains the type, 0 or 1 (normal or random), and then the data, which is just a word. For example:

normal.csv    random.csv
0,apple       1,fdsgsdgfdg
0,banana      1,plicawq
0,orange      1,mncdlppl
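To show how rows in this format load with pandas, here is a minimal sketch that reads the three sample rows from in-memory text instead of real files. The io.StringIO buffers stand in for normal.csv and random.csv and are an assumption made so the snippet runs on its own.

```python
import io
import pandas as pd

# In-memory stand-ins for normal.csv and random.csv (illustrative only)
normal_csv = io.StringIO(u"0,apple\n0,banana\n0,orange\n")
random_csv = io.StringIO(u"1,fdsgsdgfdg\n1,plicawq\n1,mncdlppl\n")

# The files have no header row, so the column names are supplied here
frames = [pd.read_csv(f, names=["type", "word"]) for f in (normal_csv, random_csv)]
data = pd.concat(frames, ignore_index=True)

print(data)
```

The result is a single table with a "type" column of 0s and 1s and a "word" column holding the strings.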

In each file I used the first 5000 words for training and the following 5000 for testing. To vectorize the words I used the CountVectorizer with its ngram_range parameter, which breaks each word up into character ngrams and converts the counts into numbers.
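To make the ngram step more concrete, the sketch below vectorizes the single word "apple" with the same CountVectorizer settings as my script and prints the ngrams it produces. It is only an illustration on one word, not part of the classifier itself.

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(analyzer='char_wb', ngram_range=(2, 4), min_df=1)
counts = vec.fit_transform(["apple"])

# vocabulary_ maps each ngram to its column index in the count matrix
print(sorted(vec.vocabulary_))

# One row per input word, one column per ngram
print(counts.toarray())
```

With the 'char_wb' analyzer each word is padded with a space on both sides, so ngrams such as " a" and "e " mark the start and end of the word.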

With the data ready I used the "fit" function to train the classifier with the training data set. To measure the accuracy of the model I used the "score" function with the test data set. Finally, to manually test some values I used the "predict" function. In the end my classifier reached 93% accuracy, which I thought was pretty good considering I made hardly any customisations.
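The fit, score and predict steps can be tried on a handful of words without any csv files. The sketch below uses a made-up toy data set, so the accuracy it prints says nothing about the real 93% figure; it only shows how the three calls fit together.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy data: 0 = normal word, 1 = random string
train_words = ["apple", "banana", "orange", "fdsgsdgfdg", "plicawq", "mncdlppl"]
train_types = [0, 0, 0, 1, 1, 1]
test_words = ["grape", "xkqzvjw"]
test_types = [0, 1]

vec = CountVectorizer(analyzer='char_wb', ngram_range=(2, 4))
clf = MultinomialNB(alpha=0.1)

# fit_transform learns the ngram vocabulary; transform reuses it for new words
clf.fit(vec.fit_transform(train_words), train_types)

# score returns the fraction of test words classified correctly
accuracy = clf.score(vec.transform(test_words), test_types)

# predict returns one label (0 or 1) per input word
predictions = clf.predict(vec.transform(["lemon", "qzxwvk"]))

print(accuracy)
print(predictions)
```

Note that the test words go through transform, not fit_transform, so they are described using only ngrams seen during training.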

I used the Multinomial Naive Bayes classifier as this was recommended, however other algorithms may work more effectively. The classifier and vectorizer also support a number of additional parameters that can be adjusted to improve the accuracy. I modified them only slightly, so further improvements could likely be made here as well.
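One way to explore this is to train a few candidate models on the same vectorized data and compare their scores. The sketch below is only an assumption about how such a comparison might look: the toy data is made up, and the alternative models (BernoulliNB and LogisticRegression) are ones I did not test for this post.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Made-up toy data: 0 = normal word, 1 = random string
train_words = ["apple", "banana", "orange", "lemon",
               "fdsgsdgfdg", "plicawq", "mncdlppl", "zxqvkw"]
train_types = [0, 0, 0, 0, 1, 1, 1, 1]
test_words = ["grape", "melon", "xkqzvjw", "qwrtpl"]
test_types = [0, 0, 1, 1]

vec = CountVectorizer(analyzer='char_wb', ngram_range=(2, 4))
X_train = vec.fit_transform(train_words)
X_test = vec.transform(test_words)

# Candidate models, including two smoothing values for MultinomialNB
candidates = {
    "MultinomialNB alpha=0.1": MultinomialNB(alpha=0.1),
    "MultinomialNB alpha=1.0": MultinomialNB(alpha=1.0),
    "BernoulliNB": BernoulliNB(),
    "LogisticRegression": LogisticRegression(),
}

# Train each model on the same features and record its test accuracy
scores = {}
for name, model in candidates.items():
    model.fit(X_train, train_types)
    scores[name] = model.score(X_test, test_types)

for name in sorted(scores):
    print("%s: %0.2f" % (name, scores[name]))
```

On a data set this small the numbers are meaningless, but the same loop run against the real training and test sets would show whether another algorithm or alpha value beats the default choice.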

The Code

The following was written for Python 2.7 and requires scikit-learn, pandas and the two csv files containing data as described above.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
from time import time

#Start timer
t0 = time()

#Create classifier and vectorizer
clf = MultinomialNB(alpha=0.1)
vec = CountVectorizer(analyzer='char_wb', ngram_range=(2, 4), min_df=1)

#Read in wordset and vectorize words
#Training data
train_set = pd.concat(pd.read_csv(f, names=["type","word"], nrows=5000) for f in ["normal.csv","random.csv"])
train_types = train_set.type.tolist()
train_words = vec.fit_transform(train_set.word.tolist())

#Test data
test_set = pd.concat(pd.read_csv(f, names=["type","word"], skiprows=5000, nrows=5000) for f in ["normal.csv","random.csv"])
test_types = test_set.type.tolist()
test_words = vec.transform(test_set.word.tolist())

#Train classifier
clf.fit(train_words, train_types)
train_time = time() - t0
print("Training time: %0.3fs" % train_time)

#Use test data to evaluate classifier
print("Accuracy is " + str(clf.score(test_words, test_types)))
test_time = time() - train_time - t0
print("Testing time: %0.3fs" % test_time)

#Classify words
testdata = ['xgrdqwlpfrr','apple']
print(testdata)
print(clf.predict(vec.transform(testdata)))

predict_time = time() - test_time - train_time - t0
print("Predict time: %0.3fs" % predict_time)

Running the script prints the training time, the accuracy score, the testing time, the two test words with their predicted types, and the prediction time.

Scikit Tips

If you're trying to install scikit-learn on Windows you'll need to install the relevant .whl package. On Linux I had to upgrade pip before it would install.

Final Thoughts

I was amazed how quick and easy it was to write a simple classifier. Machine learning has definitely gone mainstream. I focused on finding the difference between normal and random strings, however classifiers can be used to tell the difference between all kinds of data sets.

Obviously, as this is a security blog, you may be wondering why I'd be looking into text classifiers. Well, when analysing data to detect attackers you'll often want to classify various kinds of activity. Performing analysis with a classifier can give some interesting results :)