KeystoneML

KeystoneML is a software framework, written in Scala, from the UC Berkeley AMPLab, designed to simplify the construction of large-scale, end-to-end machine learning pipelines with Apache Spark.

We contributed to the design of spark.ml during the development of KeystoneML, so if you’re familiar with spark.ml you’ll recognize some shared concepts. There are a few important differences, however, particularly around type safety and chaining, which lead to pipelines that are easier to construct and more robust.
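
For example, each pipeline stage in KeystoneML is typed on both its input and its output, so two stages can only be chained when the types line up. Below is a minimal, illustrative sketch of that idea; the Node trait here is hypothetical and is not KeystoneML’s actual API.

// Illustrative sketch only: the Node trait below is hypothetical,
// not KeystoneML's actual class hierarchy.
trait Node[A, B] { self =>
  def apply(in: A): B

  // Chaining compiles only when this node's output type matches the
  // next node's input type, so mismatched pipelines fail at compile time.
  def andThen[C](next: Node[B, C]): Node[A, C] = new Node[A, C] {
    def apply(in: A): C = next(self(in))
  }
}

// Example: tokenize a document, then count its tokens.
val tokenize = new Node[String, Seq[String]] {
  def apply(in: String): Seq[String] = in.split("\\s+").toSeq
}
val countTokens = new Node[Seq[String], Int] {
  def apply(in: Seq[String]): Int = in.length
}

val wordCount: Node[String, Int] = tokenize andThen countTokens
// wordCount("KeystoneML pipelines are typed end to end") == 7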

KeystoneML also presents a richer set of operators than those present in spark.ml, including featurizers for images, text, and speech, and it provides several example pipelines that reproduce state-of-the-art academic results on public data sets.

News

What is KeystoneML for?

KeystoneML makes constructing even complicated machine learning pipelines easy. Here’s an example text categorization pipeline that builds bigram features and trains a Naive Bayes model on the 100,000 most common features.

// Load the newsgroups training data.
val trainData = NewsGroupsDataLoader(sc, trainingDir)

// Chain featurization steps, fitting each estimator on the training data where needed.
val predictor = Trim andThen
    LowerCase() andThen
    Tokenizer() andThen
    NGramsFeaturizer(1 to conf.nGrams) andThen
    TermFrequency(x => 1) andThen
    (CommonSparseFeatures(conf.commonFeatures), trainData.data) andThen
    (NaiveBayesEstimator(numClasses), trainData.data, trainData.labels) andThen
    MaxClassifier

Parallelization of the pipeline fitting process is handled automatically and pipeline nodes are designed to scale horizontally.

Once the pipeline has been defined you can apply it to test data and evaluate its effectiveness.

val test = NewsGroupsDataLoader(sc, testingDir)
val predictions = predictor(test.data)
val eval = MulticlassClassifierEvaluator(predictions, test.labels, numClasses)

println(eval.summary(trainData.classes))

The result of this code will contain the following:

Avg Accuracy:    0.980
Macro Precision: 0.816
Macro Recall:    0.797
Macro F1:        0.797
Total Accuracy:  0.804
Micro Precision: 0.804
Micro Recall:    0.804
Micro F1:        0.804

This relatively simple pipeline predicts the right document category over 80% of the time on the test set.

Of course, you can use the pipeline in another system on new samples of text, just like any other function.

println(trainData.classes(predictor("The Philadelphia Phillies win the World Series!")))

This prints the following:

rec.sport.baseball

KeystoneML works with much more than just text. Have a look at our examples to see pipelines in the domains of computer vision and speech.

KeystoneML is alpha software, in a very early public release (v0.3). The project is still very young, but we feel that it has grown to the point where it is viable for general use.

Linking

KeystoneML is available from Maven Central. You can use it in your applications by adding the following lines to your SBT project definition:

libraryDependencies += "edu.berkeley.cs.amplab" % "keystoneml_2.10" % "0.3.0"
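
In a full SBT build definition this might look like the following minimal sketch; the project name and the Spark version shown are illustrative placeholders, so match them to your own application and the Spark release on your cluster.

// Minimal build.sbt sketch. The Spark version below is illustrative;
// use the release that matches your cluster.
name := "my-keystoneml-app"

scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  "edu.berkeley.cs.amplab" % "keystoneml_2.10" % "0.3.0",
  "org.apache.spark" %% "spark-core" % "1.6.1" % "provided"
)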

See here for an example application which uses KeystoneML (and has scripts for launching a cluster configured with KeystoneML).

Developing

KeystoneML is available on GitHub.

$ git clone https://github.com/amplab/keystone.git

Building

Once downloaded, you can build KeystoneML with the following commands:

$ cd keystone
$ git checkout branch-v0.3
$ sbt/sbt assembly
$ make

This will automatically resolve dependencies and package a jar file in target/keystone/scala-2.10/keystone-assembly-0.3.jar.

You can then run example pipelines with the included bin/run-pipeline.sh script, or pass the assembly jar as an argument to spark-submit.

Running an Example

Once you’ve built KeystoneML, you can run many of the example pipelines locally. However, to run the larger examples, you’ll want access to a Spark cluster.

Here’s an example of running a handwriting recognition pipeline on the popular MNIST dataset. You should be able to run this on a single machine in under a minute.

# Get the data from S3
wget http://mnist-data.s3.amazonaws.com/train-mnist-dense-with-labels.data
wget http://mnist-data.s3.amazonaws.com/test-mnist-dense-with-labels.data

KEYSTONE_MEM=4g ./bin/run-pipeline.sh \
  pipelines.images.mnist.MnistRandomFFT \
  --trainLocation ./train-mnist-dense-with-labels.data \
  --testLocation ./test-mnist-dense-with-labels.data \
  --numFFTs 4 \
  --blockSize 2048

To run on a cluster, we recommend using spark-ec2 to launch a cluster and provision it with the correct versions of BLAS and the native C libraries used by KeystoneML.

We’ve provided some scripts to set up a well-configured cluster automatically in bin/pipelines-ec2.sh. You can read more about using them here.

Building New Pipelines

Now that you’ve seen an example pipeline, have a look at the programming guide.

After that, head over to the API documentation.

People

KeystoneML is under active development in the UC Berkeley AMPLab. Development is led by Evan Sparks, Shivaram Venkataraman, Tomer Kaftan, Michael Franklin and Benjamin Recht.

For more information please contact Evan Sparks and Shivaram Venkataraman.

Getting Help and Contributing

For help using the software please join and send mail to the KeystoneML users list.

KeystoneML is an Apache-licensed open-source project and we welcome contributions. Have a look at our GitHub Issues page if you’d like to contribute, and feel free to fork the repo and submit a pull request!

Citing

If you use KeystoneML in academic work, please cite the following paper:

Sparks, E. R., Venkataraman, S., Kaftan, T., Franklin, M. J., Recht, B. “KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics” Data Engineering (ICDE), 2017.

Acknowledgements

Research on KeystoneML is a part of the AMPLab at UC Berkeley. This research is supported in part by NSF CISE Expeditions Award CCF-1139158, DOE Award SN10040 DE-SC0012463, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, IBM, SAP, The Thomas and Stacey Siebel Foundation, Adatao, Adobe, Apple, Inc., Blue Goji, Bosch, C3Energy, Cisco, Cray, Cloudera, EMC2, Ericsson, Facebook, Guavus, HP, Huawei, Informatica, Intel, Microsoft, NetApp, Pivotal, Samsung, Schlumberger, Splunk, Virdata and VMware.