anomaly

Published: May 25, 2018 License: BSD-3-Clause

README

Anomaly detection for JSON documents

Organization

  • cmd/anomaly_bench - Anomaly detection for JSON documents prototype code
  • cmd/anomaly_image - Anomaly detection for images
  • cmd/lstm - LSTM test code
  • cmd/search_lfsr - Code for finding maximal-length LFSRs
  • images - Test images for anomaly_image
  • lstm - LSTM implementation
  • gru - GRU implementation

Abstract

Standard statistical methods can be used for anomaly detection on one dimensional real valued data. The multidimensional nature of JSON documents makes anomaly detection more difficult. Firstly, this README proposes a two stage algorithm for the anomaly detection of JSON documents. The first stage of the algorithm uses random matrix dimensionality reduction to vectorize a JSON document into a fixed-length vector (the JSON document vector). The second stage of the algorithm uses one of three methods: average cosine similarity, a single neuron, or an autoencoder to determine how surprising the JSON document vector is. Secondly, this README proposes using an LSTM or a GRU recurrent neural network for anomaly detection. Thirdly, a probabilistic model based on models for adaptive arithmetic coding and Kolmogorov complexity is proposed. Fourthly, a probabilistic model with resampling is proposed. Simple statistical analysis can then be used to determine which JSON documents the user should be alerted to.

Background

Two stage algorithm

Vectorization (first stage)

Converting a JSON document into a vector

The typical approach for converting a document into a vector is to count words. Each entry in the vector corresponds to the count for a particular word. In order to capture more document structure, word pairs could be counted instead. JSON documents have explicit structure that should be captured in the computed vector. For example, the below JSON:

{
 "a": [
  {"a": "aa"},
  {"b": "bb"}
 ],
 "b": [
  {"a": "aa"},
  {"b": "bb"}
 ]
}

would be converted into the below 12 "words" using a recursive algorithm:

  1. a a aa
  2. a aa
  3. aa
  4. a b bb
  5. b bb
  6. bb
  7. b a aa
  8. a aa
  9. aa
  10. b b bb
  11. b bb
  12. bb

The 8 unique words are ["a a aa", "a aa", "aa", "a b bb", "b bb", "bb", "b a aa", "b b bb"], and the corresponding count vector is [1, 2, 2, 1, 2, 2, 1, 1]. Across all possible JSON documents this vector would be very large, so an algorithm is needed for compressing it.
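Below is a minimal Go sketch of this recursive word extraction. The function name and the handling of non-string leaves are illustrative; the package's actual vectorizer may differ in detail:

package sketch

import (
	"fmt"
	"strings"
)

// extractWords recursively walks a decoded JSON value. For every leaf
// it emits one word per suffix of the key path ending at that leaf,
// e.g. path [a, a] with leaf "aa" yields "a a aa", "a aa", and "aa".
func extractWords(value interface{}, path []string, out *[]string) {
	switch v := value.(type) {
	case map[string]interface{}:
		for key, child := range v {
			next := append(append([]string{}, path...), key)
			extractWords(child, next, out)
		}
	case []interface{}:
		for _, child := range v {
			extractWords(child, path, out)
		}
	default:
		leaf := fmt.Sprint(v)
		for i := 0; i <= len(path); i++ {
			word := append(append([]string{}, path[i:]...), leaf)
			*out = append(*out, strings.Join(word, " "))
		}
	}
}

Because Go map iteration order is randomized, the words may be produced in a different order than the numbered list above, but the resulting counts are the same.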

Random matrix dimensionality reduction

Dimensionality reduction compresses a large vector into a smaller vector. This is done by multiplying the vector by a matrix. Principal component analysis (PCA) is the algorithm typically used, but it doesn't scale well to larger vectors. For very large vectors a random matrix can be used instead of a matrix generated by PCA. The random matrix is filled with (-1, 1, 0) with respective odds (1/6, 1/6, 4/6). For the vector from the previous section the matrix could look like: [0 1 0 0 -1 0 0 1; 1 0 0 -1 0 1 0 0]. This matrix compresses the 8 entry vector down to a 2 entry vector. As an optimization, the matrix columns can be generated on the fly using a random number generator seeded with the hash of a particular word; addition can then be used to replace multiplication.

The code for the vectorizer can be found here.
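A minimal sketch of the on-the-fly column trick follows. The hash function, RNG, and names are illustrative:

package sketch

import (
	"hash/fnv"
	"math/rand"
)

// addMatrixColumn regenerates the random matrix column for word from a
// random number generator seeded with the word's hash and adds it into
// vec. A word that occurs count times just adds the same column count
// times, so the full matrix multiplication reduces to addition and the
// matrix is never stored.
func addMatrixColumn(vec []int64, word string, count int64) {
	h := fnv.New64a()
	h.Write([]byte(word))
	rnd := rand.New(rand.NewSource(int64(h.Sum64())))
	for i := range vec {
		switch rnd.Intn(6) { // odds 1/6 for +1, 1/6 for -1, 4/6 for 0
		case 0:
			vec[i] += count
		case 1:
			vec[i] -= count
		}
	}
}

The package exports this operation as Vectorizer.AddMatrixColumn (see the documentation below), with an optional cache of generated columns.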

Anomaly detection (second stage)

Three anomaly detection methods are described below. Each method computes a surprise metric from a JSON document vector.

With average cosine similarity

The average cosine similarity method works by computing the cosine similarity between a given JSON document vector and each member of a JSON document vector database. The average is then computed. This metric represents how close, on average, the given JSON document vector is to the database of document vectors. After computing the average cosine similarity, the JSON document vector is added to the database. Because the database grows with every document, the algorithm gets slower over time.

The code for the average cosine similarity algorithm can be found here.
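A sketch of the method (names are illustrative; the package's AverageSimilarity type implements the same idea):

package sketch

import "math"

// cosine computes the cosine similarity of two vectors.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// averageSimilarity returns the mean cosine similarity between vec and
// every vector in db, then appends vec to db. The cost grows linearly
// with the database, which is why the method slows down over time.
func averageSimilarity(db *[][]float32, vec []float32) float64 {
	sum := 0.0
	for _, v := range *db {
		sum += cosine(v, vec)
	}
	avg := 0.0
	if n := len(*db); n > 0 {
		avg = sum / float64(n)
	}
	*db = append(*db, vec)
	return avg
}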

With a single neuron

A single neuron implemented with the cosine similarity formula can be used for anomaly detection. The single valued output of the neuron represents how surprising the input JSON document vector is. The single neuron is trained with a JSON document vector as input and 1 as the target output.

The code for the single neuron algorithm can be found here.
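A simplified sketch of one training step follows, using gradient descent on the squared error (1 - cos(w, x))^2; the learning rate and the exact update rule are assumptions (the package's Neuron type builds the same computation as a computation graph):

package sketch

import "math"

// neuronStep returns the pre-update surprise 1 - cos(w, x), then
// nudges the weights toward the input by gradient descent so that
// similar vectors are less surprising in the future.
func neuronStep(w, x []float64, rate float64) float64 {
	var dot, nw, nx float64
	for i := range w {
		dot += w[i] * x[i]
		nw += w[i] * w[i]
		nx += x[i] * x[i]
	}
	nw, nx = math.Sqrt(nw), math.Sqrt(nx)
	cos := dot / (nw * nx)
	for i := range w {
		// d cos / d w_i = x_i/(|w||x|) - cos*w_i/|w|^2
		grad := x[i]/(nw*nx) - cos*w[i]/(nw*nw)
		w[i] += rate * 2 * (1 - cos) * grad
	}
	return 1 - cos
}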

With autoencoders

An autoencoding neural network isn't trained with labeled data; instead, it is trained to reproduce its input vector at the output. The standard autoencoder has three layers. The top and bottom layers are the same size, and the middle layer is typically narrower than the top and bottom layers. The narrow middle layer creates an information bottleneck. It is possible to compute an autoencoder error metric for a particular JSON document vector. This "surprise" metric is computed by feeding the JSON document vector into the neural network and then computing the mean squared error at the output. The neural network can then be trained on the JSON document vector, so the neural network isn't surprised by similar JSON document vectors in the future.

The code for the autoencoder algorithm can be found here.
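A sketch of the surprise computation only (forward pass plus mean squared error; the backpropagation training step is omitted, and the tanh activation is an assumption):

package sketch

import "math"

// autoencoderSurprise pushes x through a three-layer autoencoder with
// weights w1 (wide input to narrow hidden) and w2 (narrow hidden back
// to wide output) and returns the mean squared reconstruction error.
func autoencoderSurprise(w1, w2 [][]float64, x []float64) float64 {
	hidden := make([]float64, len(w1))
	for j, row := range w1 {
		sum := 0.0
		for i, v := range x {
			sum += row[i] * v
		}
		hidden[j] = math.Tanh(sum)
	}
	mse := 0.0
	for i, row := range w2 {
		sum := 0.0
		for j, h := range hidden {
			sum += row[j] * h
		}
		mse += (sum - x[i]) * (sum - x[i])
	}
	return mse / float64(len(x))
}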

Recurrent neural networks

LSTM algorithm

The LSTM takes a series of bytes as input and outputs a predicted next byte. The LSTM algorithm works by training an LSTM on JSON data. The cost of training is then used as a surprise metric for the JSON data. Unlike the above algorithms, the LSTM based solution is capable of anomaly detection for non-JSON binary protocols. The LSTM has a state which could be used as a JSON document vector, as in the above algorithm.

The code for the LSTM algorithm can be found here.

GRU algorithm

A GRU is similar to an LSTM, but it requires fewer tensor operations. The cost of training is still used as a surprise metric.

The code for the GRU algorithm can be found here.

Probabilistic models

Complexity

The complexity algorithm works by computing the compressed bits per symbol of a JSON document given the previous JSON documents. The JSON document isn't actually compressed; instead, the compressed bits per symbol are computed from the negative log2 of the symbol probabilities generated by the model for adaptive arithmetic coding.

The code for the complexity algorithm can be found here.
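A sketch of the computation, assuming a predictive model that returns the probability it assigned to each symbol and then updates itself (the function-valued interface is illustrative; the package uses a context-tree CDF16 model):

package sketch

import "math"

// bitsPerSymbol computes the size the document would compress to under
// an adaptive model without producing any compressed output: each
// symbol contributes -log2(p) bits, where p is the probability the
// model assigned to that symbol before updating on it.
func bitsPerSymbol(model func(symbol byte) float64, document []byte) float64 {
	bits := 0.0
	for _, s := range document {
		bits += -math.Log2(model(s))
	}
	return bits / float64(len(document))
}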

Complexity with resampling

The complexity with resampling algorithm is implemented with multiple copies of the above complexity model. Each copy sees a subset of the training data set. This allows the uncertainty in surprise to be computed.

The code for the complexity with resampling algorithm can be found here.
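A sketch using the package's Network interface; the subset selection and the use of the standard deviation as the uncertainty are assumptions about the details:

package sketch

import (
	"math"
	"math/rand"
)

// Network matches the interface exported by the package.
type Network interface {
	Train(input []byte) (surprise, uncertainty float32)
}

// metaTrain drives several copies of the complexity model. Each copy
// trains on a random subset of the documents, so the copies diverge,
// and the spread of their surprise values estimates the uncertainty.
func metaTrain(models []Network, rnd *rand.Rand, document []byte) (mean, stddev float32) {
	surprises := make([]float32, 0, len(models))
	for _, m := range models {
		if rnd.Intn(2) == 0 { // each copy sees roughly half the data
			s, _ := m.Train(document)
			surprises = append(surprises, s)
		}
	}
	if len(surprises) == 0 {
		return 0, 0
	}
	for _, s := range surprises {
		mean += s
	}
	mean /= float32(len(surprises))
	for _, s := range surprises {
		stddev += (s - mean) * (s - mean)
	}
	stddev = float32(math.Sqrt(float64(stddev) / float64(len(surprises))))
	return mean, stddev
}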

Benchmarks

The benchmarks are executed with:

go test -bench=.
BenchmarkLFSR-4                    	2000000000	         2.02 ns/op
BenchmarkSource-4                  	500000000	         3.47 ns/op
BenchmarkVectorizer-4              	    2000	    526639 ns/op
BenchmarkVectorizerLFSR-4          	   10000	    103305 ns/op
BenchmarkVectorizerNoCache-4       	    2000	   1069329 ns/op
BenchmarkVectorizerLFSRNoCache-4   	    5000	    200034 ns/op
BenchmarkAverageSimilarity-4       	   10000	   1268110 ns/op
BenchmarkNeuron-4                  	    5000	    231500 ns/op
BenchmarkAutoencoder-4             	     200	   7310656 ns/op
BenchmarkLSTM-4                    	      10	 117217241 ns/op
BenchmarkGRU-4                     	      20	  80982837 ns/op
BenchmarkComplexity-4              	   20000	     96472 ns/op

As can be seen, the LSTM based algorithm is much slower than the two stage algorithm. The complexity based approach is the fastest solution.

Verification

Two stage algorithm

Verification is accomplished by generating random JSON documents from a Gaussian random variable and feeding them into the anomaly detection algorithm. The below graph shows the distribution resulting from the average cosine similarity method being fed random JSON documents:

Graph 1 average cosine similarity distribution

For single neuron:

Graph 3 neuron distribution

For autoencoder:

Graph 6 autoencoder error distribution

As should be expected, the graphs appear to be Gaussian. Another test feeds the below two JSON documents into each of the above three anomaly detection algorithms after they have been trained on 1000 random JSON documents:

First test JSON document:

{
 "alfa": [
  {"alfa": "1"},
  {"bravo": "2"}
 ],
 "bravo": [
  {"alfa": "3"},
  {"bravo": "4"}
 ]
}

Second test JSON document:

{
 "a": [
  {"a": "aa"},
  {"b": "bb"}
 ],
 "b": [
  {"a": "aa"},
  {"b": "bb"}
 ]
}

The second JSON document is more similar to the randomly generated JSON documents than the first JSON document. This idea is tested 100 times by changing the random seed used to generate the 1000 random JSON documents. All of the methods pass the test with a score near 100 out of 100, which shows that the anomaly detection algorithm isn't just a random number generator. One final test is performed by graphing the output of the single neuron method and the autoencoder method against the average cosine similarity method. The below graph 5 shows that the output of the single neuron correlates with average cosine similarity:

Graph 5 neuron vs average similarity

The below graph 8 shows that the autoencoder method correlates with average cosine similarity:

Graph 8 autoencoder error vs average similarity

The below three graphs show average cosine similarity, single neuron, and autoencoder are not correlated through time:

Graph 2 average similarity vs time

Graph 4 neuron vs time

Graph 7 autoencoder error vs time

Recurrent neural networks

LSTM algorithm

The distribution of the LSTM surprise metrics for random JSON documents is wide:

Graph 9 LSTM distribution

The LSTM over time can be seen in the below graph:

Graph 10 LSTM vs time

The LSTM algorithm correlates with the average similarity approach:

Graph 11 LSTM vs average similarity

The LSTM correctly determines the surprise metrics of the above two test JSON documents. The surprise metric of the first test JSON document is greater than the surprise metric of the second test JSON document.

GRU algorithm

The distribution of the GRU surprise metrics is different from the LSTM distribution:

Graph 12 GRU distribution

The GRU over time can be seen in the below graph:

Graph 13 GRU vs time

The LSTM and GRU algorithms are correlated. This means that they are probably not computing a random process:

Graph 14 LSTM vs GRU

The GRU correctly determines the surprise metrics of the above two test JSON documents. The surprise metric of the first test JSON document is greater than the surprise metric of the second test JSON document.

Complexity

The distribution for the complexity surprise metrics is normal:

Graph 15 Complexity distribution

It can be seen that the complexity model learns over time:

Graph 16 Complexity vs time

The complexity algorithm correctly determines the surprise metrics of the above two test JSON documents. The surprise metric of the first test JSON document is greater than the surprise metric of the second test JSON document.

Complexity with resampling

The below graph shows the complexity over time with error bars representing uncertainty:

Graph 17 Complexity with uncertainty vs time

Conclusion

An anomaly detection engine has been demonstrated. The first algorithm, the two stage algorithm, is made up of two components: a vectorizer and an algorithm for computing surprise. After vectorization, the single neuron and autoencoder algorithms have a fixed cost for determining whether a JSON document is an anomaly, which makes them suitable for a real time learning approach. The single neuron method is faster than the other two methods. The LSTM based algorithm works, but it is slower than the other approaches; a GRU based algorithm can be used instead for faster performance. The complexity algorithm is the fastest solution and supports any type of data.

Future work

  • The LSTM algorithm could be used to replace the vectorizer of the two stage algorithm. In theory this should result in better vectorization.
  • Instead of a LSTM based recurrent neural network, a GRU based recurrent neural network could be used. The GRU would be faster than the LSTM.
  • Use a recurrent neural network to create a heat map of the JSON document. This would show the parts that are anomalies.
  • Ensemble learning could be used to combine multiple algorithms.
  • Use models for adaptive arithmetic coding for anomaly detection.

Documentation


Constants

const (
	// CDF16Fixed is the shift for 16 bit coders
	CDF16Fixed = 16 - 3
	// CDF16Scale is the scale for 16 bit coder
	CDF16Scale = 1 << CDF16Fixed
	// CDF16Rate is the damping factor for 16 bit coder
	CDF16Rate = 5
	// CDF16Size is the size of the cdf
	CDF16Size = 256
	// CDF16Depth is the depth of the context tree
	CDF16Depth = 2
)

const (
	// One is odds for 1
	One = math.MaxUint32 / 6
	// MinusOne is odds for -1
	MinusOne = 2 * One
)
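
The One and MinusOne constants read as thresholds on a uniform 32 bit random value, giving the (1/6, 1/6, 4/6) odds used by the vectorizer's random matrix. A sketch of the likely decision rule (an assumption about the implementation, not confirmed from the source):

// intFromUint32 maps a uniform 32 bit random value to 1, -1, or 0
// with odds 1/6, 1/6, and 4/6 respectively, matching the README.
func intFromUint32(r uint32) int8 {
	switch {
	case r < One:
		return 1
	case r < MinusOne:
		return -1
	default:
		return 0
	}
}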

Variables

This section is empty.

Functions

func Adapt

func Adapt(a []float32) []float32

Adapt prepares a vector for input into a neural network

func GenerateRandomJSON

func GenerateRandomJSON(rnd *rand.Rand) map[string]interface{}

GenerateRandomJSON generates random JSON

func Normalize

func Normalize(a []int64) []float32

Normalize converts a vector to a unit vector

func Similarity

func Similarity(a, b []float32) float64

Similarity computes the cosine similarity between two vectors https://en.wikipedia.org/wiki/Cosine_similarity

Types

type Autoencoder

type Autoencoder struct {
	*neural.Neural32
	*Vectorizer
}

Autoencoder is an autoencoding neural network

func (*Autoencoder) Train

func (a *Autoencoder) Train(input []byte) (surprise, uncertainty float32)

Train calculates the surprise with the autoencoder

type AverageSimilarity

type AverageSimilarity struct {
	*Vectorizer
	// contains filtered or unexported fields
}

AverageSimilarity computes surprise by calculating the average cosine similarity across all Vectors

func (*AverageSimilarity) Train

func (a *AverageSimilarity) Train(input []byte) (surprise, uncertainty float32)

Train computes the surprise with average similarity

type CDF16

type CDF16 struct {
	Root    *Node16
	Context []uint16
	First   int
	Mixin   [][]uint16
}

CDF16 is a context based cumulative distribution function model

func NewCDF16

func NewCDF16() *CDF16

NewCDF16 creates a new CDF16 with a given context depth

func (*CDF16) AddContext

func (c *CDF16) AddContext(s uint16)

AddContext adds a symbol to the context

func (*CDF16) Model

func (c *CDF16) Model() []uint16

Model gets the model for the current context

func (*CDF16) ResetContext

func (c *CDF16) ResetContext()

ResetContext resets the context

func (*CDF16) Update

func (c *CDF16) Update(s uint16)

Update updates the model

type Complexity

type Complexity struct {
	*CDF16
}

Complexity is an entropy based anomaly detector

func (*Complexity) Train

func (c *Complexity) Train(input []byte) (surprise, uncertainty float32)

Train trains the Complexity

type LFSR32

type LFSR32 uint32

LFSR32 is a 32 bit linear feedback shift register

func (*LFSR32) Int

func (l *LFSR32) Int() int8

Int randomly returns 1, -1 or 0

func (*LFSR32) Uint64

func (l *LFSR32) Uint64() uint64

Uint64 generates a 32 bit random number

type Meta

type Meta struct {
	Models []*CDF16
	Rand   *rand.Rand
}

Meta is a meta anomaly detection engine that uses other engines

func (*Meta) Train

func (m *Meta) Train(input []byte) (surprise, uncertainty float32)

Train trains the meta engine

type Network

type Network interface {
	Train(input []byte) (surprise, uncertainty float32)
}

Network is a network for calculating surprise
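
All of the engines below are constructed through this interface, so they can be swapped freely. A hypothetical usage sketch (the import path and vector size are assumptions):

package main

import (
	"fmt"
	"math/rand"

	anomaly "example.com/anomaly" // import path is an assumption
)

func main() {
	rnd := rand.New(rand.NewSource(1))
	// NewVectorizer takes the reduced vector size, whether to cache
	// generated matrix columns, and a random source factory.
	vectorizer := anomaly.NewVectorizer(1024, true, anomaly.NewLFSR32Source)
	network := anomaly.NewComplexity(rnd, vectorizer)
	surprise, uncertainty := network.Train([]byte(`{"a": [{"a": "aa"}]}`))
	fmt.Println(surprise, uncertainty)
}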

func NewAutoencoder

func NewAutoencoder(rnd *rand.Rand, vectorizer *Vectorizer) Network

NewAutoencoder creates an autoencoder

func NewAverageSimilarity

func NewAverageSimilarity(rnd *rand.Rand, vectorizer *Vectorizer) Network

NewAverageSimilarity creates a new average similarity surprise engine

func NewComplexity

func NewComplexity(rnd *rand.Rand, vectorizer *Vectorizer) Network

NewComplexity creates a new entropy based model

func NewGRU

func NewGRU(rnd *rand.Rand, vectorizer *Vectorizer) Network

NewGRU creates a new GRU network

func NewLSTM

func NewLSTM(rnd *rand.Rand, vectorizer *Vectorizer) Network

NewLSTM creates a new LSTM network

func NewMeta

func NewMeta(rnd *rand.Rand, vectorizer *Vectorizer) Network

NewMeta creates a new meta engine

func NewNeuron

func NewNeuron(rnd *rand.Rand, vectorizer *Vectorizer) Network

NewNeuron creates a new neuron

type NetworkFactory

type NetworkFactory func(rnd *rand.Rand, vectorizer *Vectorizer) Network

NetworkFactory produces new networks

type Neuron

type Neuron struct {
	I, W *tensor.Dense
	CS   *gg.Node
	*gg.ExprGraph
	gg.VM
	gg.Nodes
	*gg.VanillaSolver
	*Vectorizer
}

Neuron is a single neuron

func (*Neuron) Train

func (n *Neuron) Train(input []byte) (surprise, uncertainty float32)

Train trains the neuron

type Node16

type Node16 struct {
	Model    []uint16
	Children map[uint16]*Node16
}

Node16 is a context node

func NewNode16

func NewNode16() *Node16

NewNode16 creates a new context node

type Rand

type Rand struct {
	*rand.Rand
}

Rand is the golang random number generator

func (Rand) Int

func (r Rand) Int() int8

Int randomly returns 1, -1 or 0

type Source

type Source interface {
	Int() int8
}

Source is a random source of 1, -1, and 0

func NewLFSR32Source

func NewLFSR32Source(seed uint64) Source

NewLFSR32Source creates a new LFSR32 based source

func NewRandSource

func NewRandSource(seed uint64) Source

NewRandSource creates a new Rand based source

type SourceFactory

type SourceFactory func(seed uint64) Source

SourceFactory generates new random number sources

type Vectorizer

type Vectorizer struct {
	Size              int
	UseCache          bool
	MatrixColumnCache map[uint64][]int8
	Source            SourceFactory
	sync.RWMutex
}

Vectorizer converts JSON documents to vectors

func NewVectorizer

func NewVectorizer(size int, useCache bool, source SourceFactory) *Vectorizer

NewVectorizer creates a new vectorizer https://en.wikipedia.org/wiki/Random_projection

func (*Vectorizer) AddMatrixColumn

func (v *Vectorizer) AddMatrixColumn(a []string, b []int64)

AddMatrixColumn finds or generates a matrix column and adds it to a vector

func (*Vectorizer) Vectorize

func (v *Vectorizer) Vectorize(object map[string]interface{}) []int64

Vectorize produces a vector from a JSON object

Directories

Path - Synopsis
cmd
 anomaly_bench - command
 anomaly_image - command
 lstm - command
 search_lfsr - command
