15-418 Final Project

Parallellized Sequential Pattern Mining

Anna Tan

Summary

We are going to implement two algorithms for sequential pattern mining on multi-core CPU platforms. We would like to parallellize two common data mining algorithms, PrefixSpan and the GSP algorithm. We will evaluate the performance between the sequential implementations and the parallel implementations, as well as among the parallel implementations.

Background

Sequential pattern mining is used to discover statistically relevant patterns in large data sets where the data is assumed to be delivered in sequence. It is used in a wide variety of applications, including consumer behavior discovery, DNA pattern discovery, and natural language categorization.

A sequence in a sequence database is an ordered list of sets of items - it would look something like <{a, b}, {c}, {a, c, d}>. In this sequence, a sequential pattern can be <{a}, {c}>, <{c}, {c, d}>, or <{a, b}, {c}, {d}>. A sequential pattern mining algorithm takes as input a sequence database and a minimum support threshold (minsup), to output all sequential patterns having a support at least minsup.

PrefixSpan (Pei et al.) finds these sequential patterns by looking at the prefixes of each sequence and creating pattern-projected database. It appends elements to these sequential patterns one by one, and recursively call the subroutine.

The GSP algorithm(Srikant & Agrawal), on the other hand, generates all possible k-sequences from frequent k-1 sequences by "joining" any two sequences, and then checks how much support each of these sequences has.

Because these algorithms may all potentially explore a large number of possible subsequence patterns across a large number of sequences, they can benefit significantly from parallelism. Often, these data mining algorithms run for very long time and bottlenecks the training and prediction phases of the actual machine learning algorithm. Since for each sequential pattern and each data point, we are doing roughly the same amount of work, we hope to exploit the data parallelism and model parallelism in these two algorithms.

Challenges

Memory Limitation: Because the sequence database is typically very large and we are generating large sets as the algorithms run, we may have inefficient memory utilization when running on large datasets. We may want to consider compression techniques to reduce the amount of data we need to read into memory.
In both algorithms, we need to combine the support from different threads to filter out the correct frequent patterns, and then produce a final list of frequent patterns. Depending on how we distribute the workload across threads, this will require communication across threads and usage of shared memory.
The algorithms can become computationally heavy if the dataset is large, as generating possible k-sequences and projecting sequences are both done on the dataset. Joining any two sequences can also become a very large task if the number of sequences is large.

For PrefixSpan, there is divergence in that not all sequences are included when creating a projected database based on some prefix sequence. The projected database also vary in size. Therefore, we need to consider balancing the workload.

For the GSP algorithm, we have less divergence in execution. The locality is high for memory access.

Resources

I will be using the two papers I referenced in Background as my main source of starting code. Each paper describes the sequential implementation of the algorithm. These algorithms and their parallel versions will be implemented on Gates machines, and run on lateday clusters.

In addition, I will be using this list of sequence datasets to test my algorithms for correctness and performance. This list provides a nice variety of databases in sizes, ranging from 800 sequences to 990 000 sequences.

I will need to look into compression techniques related to sequential sequence datasets if memory limitation becomes an issue in my implementations.

Goals and Deliverables

Plan to Achieve:

Create parallel implementations on multiple cores that have speedup of at least 4x, in order to have a reasonable mining time on big datasets
Present results and benchmarks on our parallel implementations' correctness and speed compared to the sequential versions on a few datasets

Hope to Achieve:

Implement a compression technique for large datasets
Create parallel implementations that use a cluster of machines or GPU and compare these implementations with the CPU implementations
Create parallel applications that use the data mining algorithms we have created to see how much speedup we can get for real life applications

Platform of Choice

We will be using C++ as our main programming language, since we want to use OpenMP to parallelize our program. Our code will be run on the latedays cluster to take advantage of the multithreaded CPUs.

Schedule

April 11 - 18: Implement the sequential versions of the algorithms and identify parallelism
April 18 - 25: Implement the parallel versions of the algorithms
April 25 - May 2: Test the parallel algorithms on different datasets and optimize parallelism
May 2 - 9: Finalize Project Report and update site; finish analysis on data

shangtee.github.io

Under construction