This document covers key API concepts present in KeystoneML, and presents an overview of its components.
KeystoneML is a software framework designed to make building and deploying large-scale machine learning pipelines easier. To assist developers in this task, we have created an API that simplifies common tasks and presents a unified interface for all stages of the pipeline.
Additionally, we've included a rich library of example pipelines and the operators (or nodes) that support them.
KeystoneML is built on several design principles: supporting end-to-end workflows, type safety, horizontal scalability, and composability.
By focusing on these principles, KeystoneML allows for the construction of complete, robust, large-scale pipelines built from reusable, understandable parts.
We’ve done our best to adhere to these principles throughout the development of KeystoneML, and we hope that this translates to better applications that use it!
At the center of KeystoneML are a handful of core API concepts that allow us to build complex machine learning pipelines out of simple parts: pipelines, nodes, transformers, and estimators.
A Pipeline is a dataflow that takes some input data and maps it to some output data through a series of nodes.
By design, these nodes can operate on a single data item (for point lookups) or on many data items (for batch model evaluation).
In a sense, a pipeline is just a function that is composed of simpler functions. Here's part of the Pipeline definition:
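(The snippet below is an abridged, illustrative sketch; the actual trait contains more members.)

```scala
import org.apache.spark.rdd.RDD

trait Pipeline[A, B] {
  // Apply the pipeline to a single data item.
  def apply(in: A): B

  // Apply the pipeline to a batch of data items.
  def apply(in: RDD[A]): RDD[B]

  // ... plus composition methods such as andThen, shown later in this guide ...
}
```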
From this we can see that a Pipeline has two type parameters: its input and output types. We can also see that it has methods to operate on just a single input data item, or on a batch RDD of data items.
Nodes come in two flavors: Transformers and Estimators.
Transformers are nodes which provide a unary function interface for both single items and RDDs of items of the same type, while an Estimator produces a Transformer based on some training data.
As already mentioned, a Transformer is the simplest type of node: it takes an input and deterministically transforms it into an output.
Here's an abridged definition of the Transformer class:
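Again, this is an abridged and lightly paraphrased sketch rather than the exact source:

```scala
import org.apache.spark.rdd.RDD

abstract class Transformer[A, B] extends TransformerNode[B] with Pipeline[A, B] {
  // Transform a single data item; implementors must provide this.
  def apply(in: A): B

  // Default batch implementation: run the single-item version on every element.
  def apply(in: RDD[A]): RDD[B] = in.map(x => apply(x))

  // ...
}
```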
There are a few things going on in this class definition:

- First, a Transformer has two type parameters: its input and output types.
- Second, every Transformer extends TransformerNode, which is used internally by Keystone for Pipeline construction and execution. In turn, TransformerNode extends Serializable, which means it can be written out and shipped over the network to run on any machine in a Spark cluster.
- Third, it extends Pipeline, because every Transformer can be treated as a full pipeline in its own right.
- Fourth, it is abstract because it has an apply method which needs to be filled out by the implementor.
- Fifth, it provides a default implementation of apply(in: RDD[A]): RDD[B] which simply runs the single-item version on each item in an RDD.

Developers worried about the performance of their transformers on bulk datasets are welcome to override this method, and we do so in KeystoneML with some frequency.
While transformers are unary functions, they themselves may be parameterized by more than just their input. To handle this case, transformers can take additional state as constructor parameters. Here's a simple transformer which will add a fixed vector to any vector it is fed as input. (Note: we make use of the Breeze library for all local linear algebra operations.)
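A minimal sketch of such a transformer (Adder is the name we'll keep using below):

```scala
import breeze.linalg.Vector

// Adds a fixed vector, supplied at construction time, to each input vector.
case class Adder(vec: Vector[Double]) extends Transformer[Vector[Double], Vector[Double]] {
  def apply(in: Vector[Double]): Vector[Double] = in + vec
}
```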
We can then create a new Adder and apply it to a Vector or RDD[Vector] just as you'd expect:
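For example, assuming sc is an available SparkContext:

```scala
import breeze.linalg.Vector

val adder = Adder(Vector(1.0, 1.0, 1.0))

// Apply to a single vector.
val one = adder(Vector(1.0, 2.0, 3.0))                        // Vector(2.0, 3.0, 4.0)

// Apply to an RDD of vectors.
val many = adder(sc.parallelize(Seq(Vector(1.0, 2.0, 3.0))))  // RDD[Vector[Double]]
```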
If you want to play around with defining new Transformers, you can do so at the Scala console by typing sbt/sbt console in the KeystoneML project directory.
Estimators are what puts the ML in KeystoneML.
An abridged Estimator interface looks like this:
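As before, this is an abridged, illustrative sketch:

```scala
import org.apache.spark.rdd.RDD

abstract class Estimator[A, B] {
  // Learn from training data and produce a Transformer that can be applied to new data.
  def fit(data: RDD[A]): Transformer[A, B]

  // ...
}
```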
That is, an Estimator takes in training data as an RDD to its fit() method, and outputs a Transformer.
This may sound like abstract functional programming nonsense, but as we'll see, this idea is pretty powerful.
Let's consider a concrete example.
Suppose you have a big list of vectors and you want to subtract off the mean of each coordinate across all the vectors (and new ones that come from the same distribution).
You could create an Estimator to do this like so:
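Here's one possible sketch, reusing the Adder transformer from above (the exact Breeze and Spark calls shown are one way of doing it, not the only way):

```scala
import breeze.linalg.Vector
import org.apache.spark.rdd.RDD

case class ScalerEstimator() extends Estimator[Vector[Double], Vector[Double]] {
  def fit(data: RDD[Vector[Double]]): Adder = {
    // Sum the training vectors and divide by their count to get the per-coordinate mean.
    val mean = data.reduce(_ + _) * (1.0 / data.count.toDouble)
    // Subtracting the mean is the same as adding its negation.
    Adder(mean * -1.0)
  }
}
```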
A couple of things to notice about this example:

- fit takes an RDD and computes the mean of each coordinate using familiar Spark and Breeze operations.
- Adder satisfies the Transformer[Vector[Double],Vector[Double]] interface, so we can return an adder from our ScalerEstimator estimator.
- Because the mean is multiplied by -1.0, we can reuse the Adder code we already wrote and it will work as expected.

Of course, KeystoneML already includes this functionality out of the box via the StandardScaler class, so you don't have to write it yourself!
In most cases, Estimators are things that estimate machine learning models, like a LinearMapEstimator which learns a standard linear model on training data.
Pipelines are created by chaining transformers and estimators with the andThen methods. Going back to a different part of the Transformer interface:
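An illustrative sketch of the relevant method (the implementation is omitted):

```scala
abstract class Transformer[A, B] extends TransformerNode[B] with Pipeline[A, B] {
  // ...
  // Chain another pipeline onto this one, yielding a pipeline from A to C.
  def andThen[C](next: Pipeline[B, C]): Pipeline[A, C] = ???  // implementation omitted
  // ...
}
```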
Ignoring the implementation, andThen allows you to take a pipeline and add another onto it, yielding a new Pipeline[A,C] which works by first applying the first pipeline (A => B) and then applying the next pipeline (B => C).
This is where type safety comes in to ensure robustness. As your pipelines get more complicated, you may end up trying to chain together nodes that are incompatible, but the compiler won’t let you. This is powerful, because it means that if your pipeline compiles, it is more likely to work when you go to run it at scale. Here’s an example of a simple two stage pipeline that adds 4.0 to every coordinate of a 3-dimensional vector:
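A sketch using the Adder from earlier, split into two stages that add 1.0 and then 3.0 to every coordinate:

```scala
import breeze.linalg.Vector

val pipeline = Adder(Vector(1.0, 1.0, 1.0)) andThen Adder(Vector(3.0, 3.0, 3.0))

val out = pipeline(Vector(0.0, 0.0, 0.0))   // Vector(4.0, 4.0, 4.0)
```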
Since sometimes transformers are just simple unary functions, you can also inline a Transformer definition. Here’s a three-stage pipeline that adds 2.0 to each element of a vector, computes its sum, and then translates that to a string:
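One way to sketch this is with anonymous Transformer subclasses for the two trailing stages (here we assume 3-dimensional input vectors so the Adder knows its size):

```scala
import breeze.linalg.{Vector, sum}

// Sum the elements of a vector.
val vecSum = new Transformer[Vector[Double], Double] {
  def apply(in: Vector[Double]): Double = sum(in)
}

// Turn a Double into a String.
val stringify = new Transformer[Double, String] {
  def apply(in: Double): String = in.toString
}

val pipeline = Adder(Vector(2.0, 2.0, 2.0)) andThen vecSum andThen stringify

val out: String = pipeline(Vector(1.0, 1.0, 1.0))   // "9.0"
```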
You can also chain Estimators onto transformers via the andThen(estimator, data) or andThen(labelEstimator, data, labels) methods. The latter makes sense if you're training a supervised learning model which needs ground truth training labels.
Suppose you want to chain together a pipeline which takes a raw image, converts it to grayscale, fits a linear model on the pixel space, and returns the most likely class according to that model.
You can do this with some code that looks like the following:
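A sketch of what this could look like. The node names (GrayScaler, ImageVectorizer, LinearMapEstimator, MaxClassifier) and the two training RDDs below are illustrative stand-ins matching the description above, not verbatim library calls:

```scala
import breeze.linalg.Vector
import org.apache.spark.rdd.RDD

val trainImages: RDD[Image] = ???            // raw training images (Image is KeystoneML's image type)
val trainLabels: RDD[Vector[Double]] = ???   // ground-truth labels for those images

val pipe: Pipeline[Image, Int] =
  GrayScaler andThen                                        // Image => Image (grayscale)
  ImageVectorizer andThen                                   // Image => Vector[Double] (pixels)
  (LinearMapEstimator(), trainImages, trainLabels) andThen  // fit a linear model on the pixel space
  MaxClassifier                                             // Vector[Double] => Int (most likely class)
```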
In this example, pipe has type Pipeline[Image, Int] and predicts the most likely class of an input image according to the model fit on the training data.
While this pipeline won't give you a very high quality model (because pixels are poor predictors of an image's class), it demonstrates the APIs.
One of the main features of KeystoneML is the example pipelines and nodes it provides out of the box. These are designed to illustrate end-to-end real world pipelines in computer vision, speech recognition, and natural language processing.
We’ve included several example pipelines:
Example nodes fall in several categories:
For several nodes (particularly image nodes), we call into external libraries (both Java and C) that contain fast, high-quality implementations of the nodes in question, pushing reuse across language boundaries.
Full documentation of the nodes is available in the scaladoc.
A data loader is the entry point for your data into a batch training pipeline. We’ve included several data loaders for datasets that correspond to the example pipelines we’ve included.
Where possible, we redistribute the input data via Amazon S3. However, we lack data redistribution rights for some of the input datasets, so you’ll need to secure access to these yourself.
KeystoneML also provides several utilities for evaluating models once they've been trained, computing metrics like precision, recall, and accuracy on a test set.
Metrics are currently calculated for Binary Classification, Multiclass Classification, and Multilabel Classification, with more on the way.