Given a pipeline DAG and an additional set of nodes to cache, return a DAG with those nodes cached.
Estimates the total runtime of a pipeline given the set of cached nodes.
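
One plausible way to form such an estimate is sketched below. It assumes a cached node is computed once while an uncached node is recomputed on every access; the function name `estimate_runtime` and the inputs `times_ns` and `accesses` are hypothetical and not taken from the original code.

```python
def estimate_runtime(times_ns, accesses, cached):
    """Sum per-node cost times the number of times each node executes.

    times_ns: dict node -> estimated time of one execution (hypothetical input)
    accesses: dict node -> estimated number of times its output is accessed
    cached:   set of nodes whose outputs are cached
    """
    total = 0.0
    for node, t in times_ns.items():
        # Assumption: a cached node runs once; otherwise it reruns per access.
        executions = 1 if node in cached else accesses.get(node, 1)
        total += t * executions
    return total
```

For example, with `times_ns = {"a": 10, "b": 5}` and both nodes accessed five times, nothing cached, the estimate is 10 * 5 + 5 * 5.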
This method takes a sequence of profiles collected at different sample sizes and generalizes them to a new data scale. It fits linear models of memory and CPU usage as functions of data scale, then evaluates those models at the new scale.
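
A minimal sketch of this kind of generalization, assuming ordinary least squares with one line per resource. The `Profile` fields and the function names are illustrative assumptions, not the original API.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    ns: float         # CPU time in nanoseconds (hypothetical field name)
    mem_bytes: float  # memory footprint in bytes (hypothetical field name)


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # All samples share one scale: fall back to a constant model.
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def generalize_profiles(new_scale, samples):
    """samples: list of (scale, Profile) pairs measured at different scales."""
    scales = [scale for scale, _ in samples]
    t_intercept, t_slope = fit_line(scales, [p.ns for _, p in samples])
    m_intercept, m_slope = fit_line(scales, [p.mem_bytes for _, p in samples])
    return Profile(
        ns=t_intercept + t_slope * new_scale,
        mem_bytes=m_intercept + m_slope * new_scale,
    )
```

A linear fit extrapolates poorly for operators whose cost grows superlinearly, so the result is best treated as a rough estimate.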
Get a map representing the children of each node.
Note: This doesn't capture how many times each child depends on the node.
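
As an illustration, the children map can be derived by inverting a dependency map. The representation below (a dict from each node to the list of nodes it depends on) is an assumption made for this sketch; using sets drops edge multiplicity, which matches the note above.

```python
def get_children(dependencies):
    """Invert a dependency map.

    dependencies: dict node -> list of nodes it depends on (assumed format)
    returns:      dict node -> set of nodes that depend on it
    """
    children = {node: set() for node in dependencies}
    for node, parents in dependencies.items():
        for parent in parents:
            # A set records the relationship once, however often it occurs.
            children.setdefault(parent, set()).add(node)
    return children
```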
Get all descendants of all sources in the graph.
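
A breadth-first traversal is one straightforward way to collect these. The sketch assumes a children map (dict from node to the set of nodes that depend on it) and an explicit collection of sources; both names are illustrative.

```python
from collections import deque


def descendants_of_sources(children, sources):
    """Return every node reachable from any source, excluding nodes
    that are reachable only as starting points."""
    seen = set()
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```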
Get the operator weights: estimates of how many passes each operator will make over its input dependencies.
Get an estimate for how many times the output of each node will be accessed, assuming the given set of nodes have their outputs cached.
Note: This assumes all sinks are accessed exactly once!
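
The estimate can be sketched as a single pass over the nodes in reverse topological order. This is an assumed formulation, not necessarily the original one: each consumer contributes its operator weight times the number of times it runs, and a cached consumer runs once. All parameter names are hypothetical.

```python
def estimate_accesses(dependencies, linearization, weights, cached):
    """Estimate how many times each node's output is accessed.

    dependencies:  dict node -> list of nodes it depends on
    linearization: nodes ordered so dependencies precede their consumers
    weights:       dict node -> passes the node makes over its inputs
    cached:        set of nodes whose outputs are cached
    """
    accesses = {node: 0 for node in linearization}
    has_consumer = {p for parents in dependencies.values() for p in parents}
    # Consumers are visited before their dependencies, so a node's count
    # is final by the time it is propagated upward.
    for node in reversed(linearization):
        if node not in has_consumer:
            accesses[node] = 1  # assumption from the note: sinks are accessed once
        runs = 1 if node in cached else accesses[node]
        for parent in dependencies.get(node, []):
            accesses[parent] += weights.get(node, 1) * runs
    return accesses
```

Caching a node does not change how often its own output is read; it only stops those reads from propagating to its dependencies.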
Get the initial set of nodes whose results are effectively cached.
Get profiles of nodes in the pipeline
The pipeline DAG
A linearization of the nodes of the pipeline DAG
The nodes to collect profiling information for
The scales to profile at (expected number of data points per partition)
The number of times to profile at each scale
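
To make the roles of the scales and the trial count concrete, here is a rough sketch of a timing loop. It measures wall-clock time only and treats each operator as a plain callable over a list; the function name, the synthetic input, and the return shape are all assumptions for illustration.

```python
import time


def profile_nodes(operators, scales, num_trials):
    """Time each operator at each scale, averaging over num_trials runs.

    operators: dict node -> callable taking a list of data points (assumed)
    returns:   dict node -> list of (scale, mean elapsed nanoseconds)
    """
    results = {}
    for node, op in operators.items():
        samples = []
        for scale in scales:
            data = list(range(scale))  # synthetic stand-in for sampled data
            elapsed = []
            for _ in range(num_trials):
                start = time.perf_counter_ns()
                op(data)
                elapsed.append(time.perf_counter_ns() - start)
            samples.append((scale, sum(elapsed) / num_trials))
        results[node] = samples
    return results
```

Averaging over several trials damps timing noise, and profiling at more than one scale is what makes a later fit against data scale possible.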
Name for this rule, automatically inferred based on class name.
Returns true iff there is still an uncached node whose output is used more than once and that would fit in memory if cached.
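
The condition can be written as a single predicate. In this sketch `accesses`, `mem_bytes`, and `space_left` are hypothetical inputs, and "fits in memory" is assumed to mean the node's estimated size is at most the remaining space.

```python
def still_room(accesses, mem_bytes, cached, space_left):
    """True if some uncached node is used more than once and fits in memory.

    accesses:   dict node -> estimated number of accesses to its output
    mem_bytes:  dict node -> estimated size of its cached output
    cached:     set of nodes already cached
    space_left: remaining memory budget in bytes
    """
    return any(
        node not in cached
        and count > 1
        # Nodes with no size estimate are treated as too large to cache.
        and mem_bytes.get(node, float("inf")) <= space_left
        for node, count in accesses.items()
    )
```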