We contributed to the design of spark.ml during the development of KeystoneML, so if you’re familiar with
spark.ml then you’ll recognize some shared concepts, but there are a few important differences, particularly around type safety and chaining, which lead to pipelines that are easier to construct and more robust.
KeystoneML also offers a richer set of operators than
spark.ml, including featurizers for images, text, and speech, and provides several example pipelines that reproduce state-of-the-art academic results on public data sets.
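To illustrate what type-safe chaining buys you, here is a minimal plain-Scala sketch. It is not KeystoneML's actual API: the `Node` trait, the `node` helper, and the four example nodes are hypothetical names invented for this illustration. The idea is only that each node declares its input and output types, so mismatched chains are rejected by the compiler rather than failing at runtime.

```scala
// Hypothetical stand-in for a pipeline node; NOT KeystoneML's real classes.
trait Node[A, B] { self =>
  def apply(in: A): B

  // Chaining compiles only when this node's output type is the next node's input type.
  def andThen[C](next: Node[B, C]): Node[A, C] = new Node[A, C] {
    def apply(in: A): C = next(self(in))
  }
}

object ChainingSketch {
  // Helper (hypothetical) that wraps a plain function as a node.
  def node[A, B](f: A => B): Node[A, B] = new Node[A, B] {
    def apply(in: A): B = f(in)
  }

  val trim: Node[String, String] = node[String, String](s => s.trim)
  val lowerCase: Node[String, String] = node[String, String](s => s.toLowerCase)
  val tokenize: Node[String, Seq[String]] = node[String, Seq[String]](s => s.split("\\s+").toList)
  val count: Node[Seq[String], Int] = node[Seq[String], Int](tokens => tokens.size)

  // A String => Int pipeline whose every step is checked by the compiler.
  val wordCount: Node[String, Int] = trim andThen lowerCase andThen tokenize andThen count

  // This line would not compile, because `count` expects Seq[String], not String:
  // val broken = trim andThen count

  def main(args: Array[String]): Unit =
    println(wordCount("  The Quick brown fox  "))
}
```

Because the composed value is itself a `Node`, it can be applied to an input like any other function, which is the property the text categorization example below relies on.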
KeystoneML makes constructing even complicated machine learning pipelines easy. Here’s an example text categorization pipeline that builds bigram features and trains a Naive Bayes model on the 100,000 most common features.
val trainData = NewsGroupsDataLoader(sc, trainingDir)

val predictor = Trim andThen
    LowerCase() andThen
    Tokenizer() andThen
    NGramsFeaturizer(1 to conf.nGrams) andThen
    TermFrequency(x => 1) andThen
    (CommonSparseFeatures(conf.commonFeatures), trainData.data) andThen
    (NaiveBayesEstimator(numClasses), trainData.data, trainData.labels) andThen
    MaxClassifier
Parallelization of the pipeline fitting process is handled automatically and pipeline nodes are designed to scale horizontally.
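To make the featurization stages of the pipeline above more concrete, here is a rough single-machine sketch of what they compute for one document. It uses only the Scala standard library and does not call KeystoneML. The object and function names are hypothetical, and the tokenization rule (split on non-alphanumeric characters) is an assumption chosen for simplicity rather than the library's actual behavior.

```scala
object FeaturizeSketch {
  // Trim, lower-case, and split on runs of non-alphanumeric characters (assumed rule).
  def tokenize(doc: String): Seq[String] =
    doc.trim.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toList

  // All contiguous n-grams for each order in `orders`, e.g. 1 to 2 for unigrams and bigrams.
  def ngrams(tokens: Seq[String], orders: Range): Seq[Seq[String]] =
    orders.flatMap(n => tokens.sliding(n).filter(_.size == n))

  // Binary term frequency: each n-gram present in the document gets weight 1,
  // mirroring the role of TermFrequency(x => 1) in the pipeline.
  def binaryTermFrequency(grams: Seq[Seq[String]]): Map[Seq[String], Int] =
    grams.distinct.map(g => g -> 1).toMap

  // Keep the k n-grams that appear in the most documents (ties broken alphabetically).
  def commonFeatures(docs: Seq[Map[Seq[String], Int]], k: Int): Set[Seq[String]] =
    docs.flatMap(_.keys).groupBy(identity).toSeq
      .sortBy { case (gram, occurrences) => (-occurrences.size, gram.mkString(" ")) }
      .take(k).map(_._1).toSet

  def main(args: Array[String]): Unit = {
    val docs = Seq("The Phillies win!", "The Phillies lose.")
    val features = docs.map(d => binaryTermFrequency(ngrams(tokenize(d), 1 to 2)))
    println(commonFeatures(features, 3))
  }
}
```

In the real pipeline these steps run over a distributed collection of documents, and the surviving features feed the Naive Bayes estimator. The sketch only shows the per-document logic.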
Once the pipeline has been defined you can apply it to test data and evaluate its effectiveness.
val test = NewsGroupsDataLoader(sc, testingDir)
val predictions = predictor(test.data)
val eval = MulticlassClassifierEvaluator(predictions, test.labels, numClasses)

println(eval.summary(newsgroupsData.classes))
The result of this code will contain the following:
Avg Accuracy:    0.980
Macro Precision: 0.816
Macro Recall:    0.797
Macro F1:        0.797
Total Accuracy:  0.804
Micro Precision: 0.804
Micro Recall:    0.804
Micro F1:        0.804
This relatively simple pipeline predicts the right document category over 80% of the time on the test set.
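If the summary numbers above look puzzling, the following sketch shows one common way such metrics are derived from a confusion matrix. It assumes the standard textbook definitions, which may differ in detail from what the evaluator computes internally. The object name, function names, and the 3-class matrix are made up for illustration.

```scala
object MetricsSketch {
  // confusion(i)(j) counts examples whose true class is i and predicted class is j.
  type Confusion = Vector[Vector[Int]]

  private def total(c: Confusion): Int = c.map(_.sum).sum
  private def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Fraction of all examples that were classified correctly.
  def totalAccuracy(c: Confusion): Double =
    c.indices.map(k => c(k)(k)).sum.toDouble / total(c)

  // Correct predictions of class k divided by all predictions of class k.
  def precision(c: Confusion, k: Int): Double = {
    val predictedAsK = c.map(row => row(k)).sum
    if (predictedAsK == 0) 0.0 else c(k)(k).toDouble / predictedAsK
  }

  // Correct predictions of class k divided by all examples that truly are class k.
  def recall(c: Confusion, k: Int): Double = {
    val actuallyK = c(k).sum
    if (actuallyK == 0) 0.0 else c(k)(k).toDouble / actuallyK
  }

  // Macro averages weight every class equally, regardless of its size.
  def macroPrecision(c: Confusion): Double = mean(c.indices.map(k => precision(c, k)))
  def macroRecall(c: Confusion): Double = mean(c.indices.map(k => recall(c, k)))

  // Mean of the one-vs-rest accuracy of each class.
  def avgAccuracy(c: Confusion): Double = {
    val n = total(c)
    mean(c.indices.map { k =>
      val falseNegatives = c(k).sum - c(k)(k)
      val falsePositives = c.map(row => row(k)).sum - c(k)(k)
      (n - falseNegatives - falsePositives).toDouble / n
    })
  }

  def main(args: Array[String]): Unit = {
    // Made-up 3-class confusion matrix, for illustration only.
    val c: Confusion = Vector(Vector(5, 1, 0), Vector(2, 3, 1), Vector(0, 0, 8))
    println(f"Avg Accuracy:    ${avgAccuracy(c)}%.3f")
    println(f"Macro Precision: ${macroPrecision(c)}%.3f")
    println(f"Macro Recall:    ${macroRecall(c)}%.3f")
    println(f"Total Accuracy:  ${totalAccuracy(c)}%.3f")
  }
}
```

Under these definitions, each misclassified example counts as one false negative and one false positive, so with K classes the average per-class accuracy equals 1 - 2 × (1 - total accuracy) / K. Assuming 20 classes, as in the 20 Newsgroups dataset, a total accuracy of 0.804 gives 1 - 0.392 / 20 ≈ 0.980, which is consistent with the summary above. For the same reason, micro-averaged precision and recall coincide with total accuracy in single-label classification, which would explain the three identical 0.804 values.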
Of course, you can use the pipeline in another system on new samples of text, just like any other function.
println(newsgroupsData.classes(predictor("The Philadelphia Phillies win the World Series!")))
Which prints the following:
KeystoneML works with much more than just text. Have a look at our examples to see pipelines in the domains of computer vision and speech.
KeystoneML is alpha software, in a very early public release (v0.3). The project is still very young, but we feel that it has grown to the point where it is viable for general use.
KeystoneML is available from Maven Central. You can use it in your applications by adding the following lines to your SBT project definition:
libraryDependencies += "edu.berkeley.cs.amplab" % "keystoneml_2.10" % "0.3.0"
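For context, a minimal `build.sbt` containing that line might look like the sketch below. The project name and the exact Scala patch version are assumptions. The only firm requirement implied by the `keystoneml_2.10` artifact name is a Scala 2.10.x build.

```scala
// build.sbt (sketch): everything except the KeystoneML dependency line is an assumption.
name := "keystoneml-example"   // hypothetical project name

scalaVersion := "2.10.6"       // assumed patch version; must be 2.10.x to match keystoneml_2.10

libraryDependencies += "edu.berkeley.cs.amplab" % "keystoneml_2.10" % "0.3.0"
```

With a 2.10.x `scalaVersion`, sbt's `%%` operator would append the `_2.10` suffix automatically, so `"edu.berkeley.cs.amplab" %% "keystoneml" % "0.3.0"` should resolve to the same artifact.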
See here for an example application which uses KeystoneML (and has scripts for launching a cluster configured with KeystoneML).
KeystoneML is available on GitHub.
$ git clone https://github.com/amplab/keystone.git
Once downloaded, you can build KeystoneML with the following commands:
$ cd keystone
$ git checkout branch-v0.3
$ sbt/sbt assembly
$ make
This will automatically resolve dependencies and package a jar file in
You can then run example pipelines with the included
bin/run-pipeline.sh script, or pass as an argument to
Once you’ve built KeystoneML, you can run many of the example pipelines locally. However, to run the larger examples, you’ll want access to a Spark cluster.
Here’s an example of running a handwriting recognition pipeline on the popular MNIST dataset. You should be able to run this on a single machine in under a minute.
# Get the data from S3
wget http://mnist-data.s3.amazonaws.com/train-mnist-dense-with-labels.data
wget http://mnist-data.s3.amazonaws.com/test-mnist-dense-with-labels.data

KEYSTONE_MEM=4g ./bin/run-pipeline.sh \
  pipelines.images.mnist.MnistRandomFFT \
  --trainLocation ./train-mnist-dense-with-labels.data \
  --testLocation ./test-mnist-dense-with-labels.data \
  --numFFTs 4 \
  --blockSize 2048
To run on a cluster, we recommend using
spark-ec2 to launch a cluster and provision it with the correct versions of BLAS and the native C libraries used by KeystoneML.
We’ve provided some scripts to set up a well-configured cluster automatically in
bin/pipelines-ec2.sh. You can read more about using them here.
Now that you’ve seen an example pipeline, have a look at the programming guide.
After that, head over to the API documentation.
KeystoneML is under active development in the UC Berkeley AMPLab. Development is led by Evan Sparks, Shivaram Venkataraman, Tomer Kaftan, Michael Franklin and Benjamin Recht.
For more information please contact Evan Sparks and Shivaram Venkataraman.
For help using the software please join and send mail to the KeystoneML users list.
KeystoneML is an Apache Licensed open-source project and we welcome contributions. Have a look at our GitHub Issues page if you’d like to contribute, and feel free to fork the repo and submit a pull request!
If you use KeystoneML in academic work, please cite the following paper:
Sparks, E. R., Venkataraman, S., Kaftan, T., Franklin, M. J., Recht, B. “KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics” Data Engineering (ICDE), 2017.
Research on KeystoneML is a part of the AMPLab at UC Berkeley. This research is supported in part by NSF CISE Expeditions Award CCF-1139158, DOE Award SN10040 DE-SC0012463, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, IBM, SAP, The Thomas and Stacey Siebel Foundation, Adatao, Adobe, Apple, Inc., Blue Goji, Bosch, C3Energy, Cisco, Cray, Cloudera, EMC2, Ericsson, Facebook, Guavus, HP, Huawei, Informatica, Intel, Microsoft, NetApp, Pivotal, Samsung, Schlumberger, Splunk, Virdata and VMware.