Abstract

A growing number of domains (finance, seismology, the internet of things, etc.) collect massive time series. When the number of series grows to the hundreds of millions or even billions, similarity queries become intractable on a single machine. Further, naive (quadratic) parallelization does not scale, so we need both efficient indexing and parallelization. We propose a sketch/random-projection-based approach that scales nearly linearly in parallel environments and provides high-quality answers.

Architecture

The main components of our approach (RadiusSketch), namely Index Construction and Query Processing, are developed within Spark. Spark is deployed on top of the Hadoop Distributed File System (HDFS) in order to efficiently read the input and query time series, and to store intermediate data and final results, thus avoiding the bottleneck of centralized data storage. Spark-parSketch stores the grids (the resulting indexes) in a distributed relational store (RDB Store), set up as a number of PostgreSQL instances. Each Spark worker connects to the databases through JDBC and persists the contents of each grid cell as soon as it is identified. This pipelined procedure avoids keeping large intermediate datasets in memory, reducing memory consumption during grid construction and preventing memory overflow at the Spark workers. Moreover, the built-in indexing facilities of the RDBMS make query processing more efficient.
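The fan-out from Spark workers to the PostgreSQL instances can be pictured with a simple routing rule. The sketch below is hypothetical (the class and method names are not from the source, and the actual JDBC persistence logic is in the repository): each (grid, cell) pair is deterministically hashed to one of the N database instances, so every worker resolves the same cell to the same PostgreSQL node without coordination.

```java
public class CellRouter {
    private final int numInstances;

    CellRouter(int numInstances) {
        this.numInstances = numInstances;
    }

    // Deterministically map a (gridId, cellKey) pair to a database instance index,
    // so all Spark workers persist and look up a given cell at the same node.
    int instanceFor(int gridId, String cellKey) {
        int h = 31 * gridId + cellKey.hashCode();
        return Math.floorMod(h, numInstances); // floorMod keeps the result non-negative
    }
}
```

In a real deployment each instance index would correspond to one JDBC connection URL; the routing rule only has to be stable across workers, not balanced perfectly.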

Parallel index construction

Time-series index construction on the input dataset D within distributed data processing frameworks proceeds as follows:

  1. At the sketch computation stage, the dot product of each time series with a random transformation matrix R, whose elements are random variables in {+1, -1}, yields a vector of much lower dimension. At each node, the sketch vectors SD of the input time series D are built locally and then partitioned into equal-sized subvectors. Each subvector position corresponds to a grid, so each sketch is assigned to one grid cell in each of the grids.
  2. At the next stage, grouping by grid cell is performed, which requires data shuffling between computation nodes. As a result, each grid cell is mapped to the list of time series identifiers assigned to that cell.
  3. We use a set of relational database instances, previously created and distributed across the nodes, to store the resulting grids GD.

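The sketching and cell-assignment logic of step 1 can be sketched as follows. This is a minimal, single-node illustration with hypothetical names (the real implementation runs these operations as Spark transformations over the distributed dataset): a series is projected onto seeded {+1, -1} random vectors, the sketch is split into subvectors, and each subvector is discretized into a grid-cell key.

```java
import java.util.*;

public class SketchDemo {
    // Project a time series onto k random {+1,-1} vectors.
    // The seed fixes the random matrix R so all nodes use the same projection.
    static double[] sketch(double[] series, int k, long seed) {
        Random rnd = new Random(seed);
        double[] s = new double[k];
        for (int i = 0; i < k; i++) {
            double dot = 0;
            for (double v : series) dot += (rnd.nextBoolean() ? 1 : -1) * v;
            s[i] = dot;
        }
        return s;
    }

    // Split the sketch into subvectors of the given size; each subvector
    // position corresponds to one grid.
    static List<double[]> partition(double[] sketch, int subSize) {
        List<double[]> subs = new ArrayList<>();
        for (int i = 0; i + subSize <= sketch.length; i += subSize)
            subs.add(Arrays.copyOfRange(sketch, i, i + subSize));
        return subs;
    }

    // Discretize a subvector into a grid-cell key by snapping each
    // coordinate to a cell of width cellSize.
    static String cellKey(double[] sub, double cellSize) {
        StringBuilder key = new StringBuilder();
        for (double v : sub) key.append((long) Math.floor(v / cellSize)).append(':');
        return key.toString();
    }
}
```

Grouping by the resulting cell keys (step 2) then corresponds to a shuffle keyed on (grid, cellKey).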
Parallel query processing

Given a collection of queries Q, in the form of a distributed time series dataset, and a previously constructed index on a dataset D, we consider the problem of finding the time series in D that are similar to those in Q. We perform such a search in the following steps:

  1. The sketches SQ of the query time series Q are computed in parallel, using the same random matrix R as for grid construction. The resulting sketches are then partitioned into subvectors in parallel, and the subvectors are sent to their appropriate grids and placed in grid cells.
  2. The contents of the corresponding grid cells in the index, previously stored as a collection of grids GD, are then retrieved from all the database instances in parallel on each node; this step requires global communication. If a subvector of a query time series q lands in the same grid cell as a database time series d, then d is a possible match for q. Two time series (a query and an indexed one) are considered similar if they are assigned to the same grid cell in a large, user-tunable fraction of the grids.
  3. Because sketching is approximate, each candidate match between a query q and a data series d is verified by computing their actual correlation.
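The final verification step can be sketched as follows (a minimal, non-Spark illustration; the helper names are hypothetical): the Pearson correlation between the query and each surviving candidate is computed, and the candidate is kept only if the correlation exceeds a user-chosen threshold.

```java
public class Verify {
    // Pearson correlation between two equal-length series.
    static double correlation(double[] a, double[] b) {
        int n = a.length;
        double ma = 0, mb = 0;
        for (int i = 0; i < n; i++) { ma += a[i]; mb += b[i]; }
        ma /= n; mb /= n;
        double cov = 0, va = 0, vb = 0;
        for (int i = 0; i < n; i++) {
            double da = a[i] - ma, db = b[i] - mb;
            cov += da * db; va += da * da; vb += db * db;
        }
        return cov / Math.sqrt(va * vb);
    }

    // A candidate d survives only if its correlation with the query q
    // reaches the threshold; false positives from sketching are dropped here.
    static boolean isMatch(double[] q, double[] d, double threshold) {
        return correlation(q, d) >= threshold;
    }
}
```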

Requirements

F-RadiusSketch works with Apache Spark. In order to run F-RadiusSketch and RadiusSketch, you must download and install Apache Spark 1.6.2.

Building

The code is written in Java, and we use Maven to build it. Use the given pom.xml file to build an executable jar containing all the dependencies.

Usage


for F-ParSketch, use the FParSketch class:
$SPARK_HOME/bin/spark-submit --class fr.inria.parallelSketchApproach.FParSketch --master $MASTER ParallelSketchApproach-0.jar <inputDataSetFile> <inputQueryFile> <outputFiles> [<configFile>]

for ParSketch, use the ParSketch class:
$SPARK_HOME/bin/spark-submit --class fr.inria.parallelSketchApproach.ParSketch --master $MASTER ParallelSketchApproach-0.jar <inputDataSetFile> <inputQueryFile> <outputFiles> [<configFile>]

Demo

The interactive demo takes as parameters the input time series dataset, the batch size (number of queries), and the grid cell size. For each query, it reports the response time, quality ratio, and precision of spark-parSketch compared to a parallel linear search.