View on GitHub

shangtee.github.io

Under construction

15-418 Final Project

Parallellized Sequential Pattern Mining

Anna Tan

Proposal | Checkpoint | Final Report

Summary

Over the past two weeks, I worked on implementing the sequential versions of the two algorithms. In particular, I followed the pseudocode as described in the papers linked in the proposal. I have done some basic testing with simple test cases to confirm the correctness the code, although I have not run a full dataset on my implementations. I also started thinking about how to parallelize the algorithms while I am working on the sequential versions, and have written up skeleton code for the parallel implementations.

I have also worked on figuring out how to read the datasets from here In particular, I plan to use the synthetic datasets,data.slen_10.tlen_1.seq.patlen_6.lit.patlen_8.nitems_5000_spmf.txt and data.slen_8.tlen_1.seq.patlen_4.lit.patlen_8.nitems_5000_spmf.txt, as well as Kosarak (10000) to compare the performance of my implementations. Interestingly, most of the datasets listed on the website only have one item per set - all sequences look like <{a}, {b}, {c}, {e}, {d}>, which is not fully using these two algorithms. I can't seem to find any real life datasets that would fully benefit from using these sequential pattern mining algorithms, although I could spend some more time searching around.

Goals and Deliverables

I think I'll be on pace to completing my deliverables, but I'm not sure if I'll get to the "nice to haves." Since I will be complete with all my school work other than this project by May 1, I should have a lot of time to focus on my project and complete all the implementations I wish to do. Depending on how quickly I can complete all the implementations, I may look into the "nice to have" goals on the week of the presentations.

Plan to Achieve:

  • Create parallel implementations on multiple cores that have speedup of at least 4x, in order to have a reasonable mining time on big datasets
  • Present results and benchmarks on our parallel implementations' correctness and speed compared to the sequential versions on a few datasets
  • Create parallel applications that use the data mining algorithms we have created to see how much speedup we can get for real life applications

Parallelism Competition

I plan to show a graph (or multiple graphs) comparing the sequential versions and parallel versions of the two algorithms running on different datasets. In addition, I think it would also be useful to explain what the algorithms are and maybe demo a simple question detection machine learning model using one of the two algorithms.

Concerns

One of my biggest concerns is that I am not sure where to find good real life datasets that are standard for testing these sequential pattern mining algorithms. I have read a few papers that refer to some datasets, but I can't seem to find them online anywhere.

Other than datasets, I think most of the work is just coding and getting my algorithms to work. I can see that there are few different ways of parallelizing, so I may implement all the ways I can think of and compare their speed.

Schedule

  • April 11 - 18: Implement the sequential versions of the algorithms and identify parallelism
  • (Completed)
  • April 18 - 25: Implement the parallel versions of the algorithms
  • (In Progress)
  • April 25 - April 29: Finish implementing the parallel versions of the algorithms and start testing
  • April 30 - May 2: Continue testing and work on optimization
  • May 3 - 9: Finalize Project Report, update site, upload code to Github; finish analysis on data