KeystoneML ships with a number of example ML pipelines that reproduce recent academic results in several fields. Once you’ve built the project, you can run the following examples locally or on a Spark cluster.
KeystoneML includes several example pipelines for image classification. These pipelines are designed to scale well across a cluster of commodity machines.
The MNIST dataset contains 60,000 training images of handwritten digits along with manually created labels. As of 2015, this benchmark is considered small by image classification standards, and a good classification pipeline can be trained on it on a laptop in under a minute.
This example runs the MnistRandomFFT pipeline, which uses random feature maps and a linear model to build an optical character recognition pipeline for handwritten digits.
This example should achieve about 4% test error in less than a minute. Increasing the block size to 4096 and the number of FFTs to 20 or more generally reduces the error further.
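To make "random feature maps" concrete, here is a minimal Python sketch of one way such features can be built. It assumes each map is a random sign flip of the pixels, followed by an FFT and a rectifier, with the outputs of several maps concatenated. The text above says only that random feature maps are used, so that structure, the function names, and the toy 8-pixel image are illustrative assumptions, not KeystoneML's actual code.

```python
import cmath
import random


def random_signs(dim, rng):
    """Draw a vector of +1/-1 signs used to randomize the input."""
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]


def dft_real(x):
    """Naive O(n^2) discrete Fourier transform; returns only the real parts."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)).real
        for k in range(n)
    ]


def random_fft_features(pixels, sign_vectors):
    """Concatenate rectified FFT outputs, one group per random sign vector."""
    features = []
    for signs in sign_vectors:
        flipped = [p * s for p, s in zip(pixels, signs)]
        # Rectifier: keep positive responses, zero out the rest.
        features.extend(max(0.0, v) for v in dft_real(flipped))
    return features


rng = random.Random(0)
pixels = [0.0, 0.5, 1.0, 0.25, 0.0, 0.75, 0.1, 0.9]  # toy 8-pixel "image"
sign_vectors = [random_signs(len(pixels), rng) for _ in range(4)]  # 4 random maps
features = random_fft_features(pixels, sign_vectors)
print(len(features))  # 32: 4 maps x 8 outputs each
```

Under this assumption, raising the number of FFTs simply widens the feature vector that the linear model is trained on, which is consistent with the accuracy gain described above.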
KeystoneML ships with several pipelines for the CIFAR-10 dataset. The most advanced pipeline can be trained with the following:
This example runs the RandomPatchCifar pipeline, which builds a 10-class image classification pipeline over a training set of 50,000 tiny images. It uses convolution filters that have been randomly sampled from the training data as feature extractors.
This will take a bit longer to run, and you may want to launch a small (8-node) cluster to finish it. It should achieve close to state-of-the-art performance of 15% error.
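The idea of using randomly sampled training patches as convolution filters can be sketched in a few lines of Python. The helper names, the 6x6 grayscale toy images, and the 3x3 patch size are hypothetical choices for illustration. The real pipeline works on color CIFAR-10 images and may add steps (such as normalization or pooling) that are not shown here.

```python
import random


def sample_patches(images, patch_size, num_patches, rng):
    """Cut square patches at random positions from randomly chosen images."""
    patches = []
    for _ in range(num_patches):
        img = rng.choice(images)
        r = rng.randrange(len(img) - patch_size + 1)
        c = rng.randrange(len(img[0]) - patch_size + 1)
        patches.append([row[c:c + patch_size] for row in img[r:r + patch_size]])
    return patches


def convolve_valid(image, filt):
    """Slide the filter over the image (cross-correlation, no padding)."""
    k = len(filt)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[i + a][j + b] * filt[a][b] for a in range(k) for b in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


rng = random.Random(0)
# Five toy 6x6 grayscale "training images" with random pixel values.
images = [[[rng.random() for _ in range(6)] for _ in range(6)] for _ in range(5)]
filters = sample_patches(images, patch_size=3, num_patches=4, rng=rng)
feature_maps = [convolve_valid(images[0], f) for f in filters]
print(len(feature_maps), len(feature_maps[0]), len(feature_maps[0][0]))  # 4 4 4
```

Because the filters are taken from the data rather than learned, the only trained component in a setup like this is the classifier that consumes the resulting feature maps.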
This example runs the VOCSIFTFisher pipeline, which builds a multi-class image classification pipeline over a training set of 5,000 full-sized images. It uses SIFT feature extraction and Fisher Vectors to featurize the training data.
You should see a mean average precision (MAP) of around 58% for this 20-class classification problem, and the pipeline will run in about 15 minutes on a cluster of 16 cc2.8xlarge machines on Amazon EC2.
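For readers unfamiliar with the MAP metric, the following Python sketch computes it on toy data: average precision is calculated per class from ranked scores, then averaged over classes. It uses the plain, non-interpolated definition, which is an assumption. The pipeline's own evaluator may use a different variant, such as an interpolated one, and the scores and labels below are made up for illustration.

```python
def average_precision(scores, labels):
    """Average of the precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(scores_by_class, labels_by_class):
    """Mean of the per-class average precision values."""
    aps = [
        average_precision(scores, labels)
        for scores, labels in zip(scores_by_class, labels_by_class)
    ]
    return sum(aps) / len(aps)


# Two toy classes; each list holds one classifier score and one 0/1 label per image.
scores_by_class = [[0.9, 0.8, 0.7, 0.6], [0.2, 0.9, 0.4]]
labels_by_class = [[1, 0, 1, 0], [0, 1, 0]]
map_value = mean_average_precision(scores_by_class, labels_by_class)
print(round(map_value, 4))  # 0.9167
```

In the first toy class the positives sit at ranks 1 and 3, giving (1/1 + 2/3) / 2, and in the second the single positive is ranked first, giving 1.0.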
This example runs the NewsgroupsPipeline pipeline, which builds a 20-class text classification pipeline based on n-gram features and TF-IDF smoothing on a dataset of thousands of newsgroup posts.
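As a rough illustration of n-gram features with TF-IDF weighting, the Python sketch below extracts unigrams and bigrams from two toy documents and weights each by term count times log(N / document frequency). The whitespace tokenizer, the choice of bigrams as the maximum n-gram size, and this particular weighting formula are all assumptions. The pipeline itself may tokenize, select features, and weight them differently.

```python
import math
from collections import Counter


def ngrams(tokens, max_n):
    """Return all n-grams of length 1..max_n as space-joined strings."""
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


def tfidf(docs, max_n=2):
    """Weight each n-gram by its count in the document times log(N / doc frequency)."""
    doc_grams = [Counter(ngrams(doc.lower().split(), max_n)) for doc in docs]
    # Iterating over a Counter yields each distinct n-gram once per document.
    doc_freq = Counter(gram for grams in doc_grams for gram in grams)
    n_docs = len(docs)
    return [
        {gram: count * math.log(n_docs / doc_freq[gram]) for gram, count in grams.items()}
        for grams in doc_grams
    ]


docs = ["the game was great", "the engine was loud"]
vectors = tfidf(docs)
print(vectors[0]["the"])             # 0.0: appears in every document
print(round(vectors[0]["game"], 4))  # 0.6931: log(2), appears in one of two documents
```

Terms that occur in every document get zero weight, so the classifier sees mostly the words and phrases that distinguish one newsgroup from another.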