Overview

Performing similarity queries on billions of time series is challenging and requires both efficient indexing techniques and parallelization. This library contains algorithms and optimizations for indexing and querying billions of time series. This page accompanies the paper on DPiSAX: Massively Distributed Partitioned iSAX. It gives links to the code and documentation.

You may freely use this code for research purposes, provided that you contact us here.

Requirements

Massively Distributed Partitioned iSAX works with Apache Spark. In order to run DPiSAX you must download and install Apache Spark 1.6.2.

Building

The code is written in Java and built with Maven. Use the provided pom.xml file to build an executable jar containing all the dependencies.

Configuration & Usage

DPiSAX settings can be configured through a configuration file. For more information, run:

$SPARK_HOME/bin/spark-submit --class fr.inria.sparkisax.PiSAX  DP-iSAX-1.jar -h
Usage: PDiSAX [OPTION]...

Options:
  -f		  Input file
  -g		  RandomWalk Time Series Generator, you must give number of Time Series and Number of elements in each Time Series.
  -G		  RandomWalk Query Time Series Generator, you must give number of Time Series.
  -o		  Output directory.
  -q		  Comma-separated list of input query file.
  -c		  Path to a file from which to load extra properties. If not specified, this will use defaults propertie.
  --config	  Change given configuration variables, NAME=VALUE (e.g fraction, partitions, normalization, wordLen, timeSeriesLength, threshold, k).
  -a		  Use 1 for DPiSAX, use 3 for DbasicPiSAX.
  -A		  Use 1 for (DPiSAX and DbasicPiSAX), use 2 for DiSAX.
  --ls		  Parallel Linear Search.
  -p		  Preprocessing. you must give input file.
  -h		  Show this help message and exit.  

Full documentation at: <http://djameledine-yagoubi.info/projects/DPiSAX/>

Datasets

We carried out our experiments on synthetic datasets produced by a random walk data series generator, with each data series consisting of 256 points. At each time point, the generator draws a random number from a Gaussian distribution N(0,1) and adds it to the value of the previous point. You can use the Random Walk Time Series Generator (the -g option) to produce a set of random walk datasets.
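The random walk described above can be sketched in a few lines of Java. This is a minimal illustration, not the generator shipped with DPiSAX: the class name RandomWalkSketch, the generate method, the seed parameter, and the starting value of 0 are all assumptions made for the example.

```java
import java.util.Random;

// Illustrative sketch only; names and defaults here are hypothetical
// and do not come from the DPiSAX code base.
class RandomWalkSketch {

    // Generates `count` random-walk series of `length` points each.
    // The seed is an assumption added so that runs are reproducible.
    static double[][] generate(int count, int length, long seed) {
        Random rng = new Random(seed);
        double[][] series = new double[count][length];
        for (int i = 0; i < count; i++) {
            double value = 0.0; // assumed starting value of the walk
            for (int t = 0; t < length; t++) {
                // Draw a step from N(0,1) and add it to the previous value.
                value += rng.nextGaussian();
                series[i][t] = value;
            }
        }
        return series;
    }

    public static void main(String[] args) {
        // Two series of 256 points, matching the length used in the experiments.
        double[][] data = generate(2, 256, 42L);
        System.out.println("series: " + data.length
                + ", points per series: " + data[0].length);
    }
}
```

Each point is the running sum of the Gaussian steps drawn so far, so consecutive points differ by exactly one N(0,1) draw.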

The real-world data consists of seismic time series collected from the IRIS Seismic Data Access repository.