keystoneml.nodes.nlp

NGramsHashingTF

case class NGramsHashingTF(orders: Seq[Int], numFeatures: Int) extends Transformer[Seq[String], SparseVector[Double]] with Product with Serializable

Converts the n-grams of a sequence of terms to a sparse vector representing their frequencies, using the hashing trick: https://en.wikipedia.org/wiki/Feature_hashing

It computes a rolling MurmurHash3 instead of fully constructing the n-grams, which makes it more efficient than applying NGramsFeaturizer followed by HashingTF; it should nevertheless return exactly the same feature vector. The MurmurHash3 methods are copied from scala.util.hashing.MurmurHash3.

Individual terms are hashed using Scala's .## method. We may want to convert to MurmurHash3 for strings, as discussed for Spark's ML Pipelines in https://issues.apache.org/jira/browse/SPARK-10574

orders

Valid n-gram orders; must be consecutive positive integers.

numFeatures

The size of the feature space that the hashing trick maps n-grams into.
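The rolling-hash idea described above can be sketched with the Scala standard library alone. This is a minimal illustration, not KeystoneML's implementation: the object NGramHashingSketch and its methods naive and rolling are hypothetical names. It assumes that an n-gram's hash is built by folding each term's .## into scala.util.hashing.MurmurHash3.mix starting from seqSeed, then calling finalizeHash with the order, and that orders is non-empty and consecutive.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

object NGramHashingSketch {
  // Map a possibly negative hash to a bucket index in [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Baseline: materialize every n-gram, then hash it from scratch.
  def naive(terms: Seq[String], orders: Seq[Int], numFeatures: Int): Map[Int, Double] = {
    val counts = mutable.Map.empty[Int, Double]
    for (order <- orders; gram <- terms.sliding(order) if gram.length == order) {
      val h = gram.foldLeft(MurmurHash3.seqSeed)((acc, t) => MurmurHash3.mix(acc, t.##))
      val bucket = nonNegativeMod(MurmurHash3.finalizeHash(h, order), numFeatures)
      counts(bucket) = counts.getOrElse(bucket, 0.0) + 1.0
    }
    counts.toMap
  }

  // Rolling variant: hash each term once, then extend the running hash
  // one term at a time instead of rebuilding each n-gram.
  // Assumes `orders` is non-empty and consecutive (for example 1 to 3).
  def rolling(terms: Seq[String], orders: Seq[Int], numFeatures: Int): Map[Int, Double] = {
    val hashes = terms.map(_.##).toArray
    val minOrder = orders.min
    val maxOrder = orders.max
    val counts = mutable.Map.empty[Int, Double]
    for (start <- hashes.indices) {
      var h = MurmurHash3.seqSeed
      var order = 1
      while (order <= maxOrder && start + order <= hashes.length) {
        h = MurmurHash3.mix(h, hashes(start + order - 1))
        if (order >= minOrder) {
          val bucket = nonNegativeMod(MurmurHash3.finalizeHash(h, order), numFeatures)
          counts(bucket) = counts.getOrElse(bucket, 0.0) + 1.0
        }
        order += 1
      }
    }
    counts.toMap
  }
}
```

Under these assumptions both methods return the same bucket counts for the same input. The rolling version reuses the running hash of the (n-1)-gram when it computes the n-gram that starts at the same position.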

Linear Supertypes
Product, Equals, Transformer[Seq[String], SparseVector[Double]], Chainable[Seq[String], SparseVector[Double]], TransformerOperator, Serializable, Serializable, Operator, AnyRef, Any

Instance Constructors

  1. new NGramsHashingTF(orders: Seq[Int], numFeatures: Int)

    orders

    Valid n-gram orders; must be consecutive positive integers.

    numFeatures

    The size of the feature space that the hashing trick maps n-grams into.
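The constraint on orders can be checked before constructing the transformer. The helper below is a hypothetical sketch, not part of KeystoneML. It assumes that "consecutive" means ascending with a step of exactly one and that an empty sequence is not acceptable.

```scala
object OrdersCheck {
  // Hypothetical helper: true when `orders` is non-empty, starts at 1 or
  // higher, and each entry is exactly one more than the previous entry.
  def isValidOrders(orders: Seq[Int]): Boolean =
    orders.nonEmpty &&
      orders.head >= 1 &&
      orders.zip(orders.tail).forall { case (a, b) => b == a + 1 }
}
```

For example, 1 to 3 and Seq(2, 3) pass this check, while Seq(1, 3) and Seq(0, 1) do not.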

Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  5. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  6. final def andThen[C, L](est: LabelEstimator[SparseVector[Double], C, L], data: PipelineDataset[Seq[String]], labels: PipelineDataset[L]): Pipeline[Seq[String], C]

    Chains a label estimator onto the end of this pipeline, producing a new pipeline. If this pipeline has already been executed, it will not need to be fit again.

    est

    The estimator to chain onto the end of this pipeline

    data

    The training data to use (the estimator will be fit on the result of passing this data through the current pipeline)

    labels

    The labels to use when fitting the LabelEstimator. Must be zippable with the training data.

    Definition Classes
    Chainable
  7. final def andThen[C, L](est: LabelEstimator[SparseVector[Double], C, L], data: RDD[Seq[String]], labels: PipelineDataset[L]): Pipeline[Seq[String], C]

    Chains a label estimator onto the end of this pipeline, producing a new pipeline. If this pipeline has already been executed, it will not need to be fit again.

    est

    The estimator to chain onto the end of this pipeline

    data

    The training data to use (the estimator will be fit on the result of passing this data through the current pipeline)

    labels

    The labels to use when fitting the LabelEstimator. Must be zippable with the training data.

    Definition Classes
    Chainable
  8. final def andThen[C, L](est: LabelEstimator[SparseVector[Double], C, L], data: PipelineDataset[Seq[String]], labels: RDD[L]): Pipeline[Seq[String], C]

    Chains a label estimator onto the end of this pipeline, producing a new pipeline. If this pipeline has already been executed, it will not need to be fit again.

    est

    The estimator to chain onto the end of this pipeline

    data

    The training data to use (the estimator will be fit on the result of passing this data through the current pipeline)

    labels

    The labels to use when fitting the LabelEstimator. Must be zippable with the training data.

    Definition Classes
    Chainable
  9. final def andThen[C, L](est: LabelEstimator[SparseVector[Double], C, L], data: RDD[Seq[String]], labels: RDD[L]): Pipeline[Seq[String], C]

    Chains a label estimator onto the end of this pipeline, producing a new pipeline. If this pipeline has already been executed, it will not need to be fit again.

    est

    The estimator to chain onto the end of this pipeline

    data

    The training data to use (the estimator will be fit on the result of passing this data through the current pipeline)

    labels

    The labels to use when fitting the LabelEstimator. Must be zippable with the training data.

    Definition Classes
    Chainable
  10. final def andThen[C](est: Estimator[SparseVector[Double], C], data: PipelineDataset[Seq[String]]): Pipeline[Seq[String], C]

    Chains an estimator onto the end of this pipeline, producing a new pipeline. If this pipeline has already been executed, it will not need to be fit again.

    est

    The estimator to chain onto the end of this pipeline

    data

    The training data to use (the estimator will be fit on the result of passing this data through the current pipeline)

    Definition Classes
    Chainable
  11. final def andThen[C](est: Estimator[SparseVector[Double], C], data: RDD[Seq[String]]): Pipeline[Seq[String], C]

    Chains an estimator onto the end of this pipeline, producing a new pipeline. If this pipeline has already been executed, it will not need to be fit again.

    est

    The estimator to chain onto the end of this pipeline

    data

    The training data to use (the estimator will be fit on the result of passing this data through the current pipeline)

    Definition Classes
    Chainable
  12. final def andThen[C](next: Chainable[SparseVector[Double], C]): Pipeline[Seq[String], C]

    Chains a pipeline onto the end of this one, producing a new pipeline. If either this pipeline or the following has already been executed, it will not need to be fit again.

    next

    the pipeline to chain

    Definition Classes
    Chainable
  13. def apply(line: Seq[String]): SparseVector[Double]

    The application of this Transformer to a single input item. This method MUST be overridden by ML developers.

    returns

    The output value

    Definition Classes
    NGramsHashingTF → Transformer
  14. def apply(in: RDD[Seq[String]]): RDD[SparseVector[Double]]

    The application of this Transformer to an RDD of input items. This method may optionally be overridden by ML developers.

    in

    The bulk RDD input to pass into this transformer

    returns

    The bulk RDD output for the given input

    Definition Classes
    Transformer
  15. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  16. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  17. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  18. def execute(deps: Seq[Expression]): Expression

    Definition Classes
    TransformerOperator → Operator
  19. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  20. final def finalizeHash(hash: Int, length: Int): Int

    Finalize a hash to incorporate the length and make sure all bits avalanche.

  21. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  22. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  23. def label: String

    Definition Classes
    Operator
  24. final def mix(hash: Int, data: Int): Int

    Mix in a block of data into an intermediate hash value.

  25. final def mixLast(hash: Int, data: Int): Int

    May optionally be used as the last mixing step. Is a little bit faster than mix, as it does no further mixing of the resulting hash. For the last element this is not necessary as the hash is thoroughly mixed during finalization anyway.

  26. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  27. def nonNegativeMod(x: Int, mod: Int): Int

  28. final def notify(): Unit

    Definition Classes
    AnyRef
  29. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  30. val numFeatures: Int

    The size of the feature space that the hashing trick maps n-grams into.

  31. val orders: Seq[Int]

    Valid n-gram orders; must be consecutive positive integers.

  32. final val seqSeed: Int

  33. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  34. def toPipeline: Pipeline[Seq[String], SparseVector[Double]]

    A method that converts this object into a Pipeline. Must be implemented by anything that extends Chainable.

    Definition Classes
    Transformer → Chainable
  35. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  36. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  37. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
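
The relationship between the mix and mixLast members listed above can be illustrated against scala.util.hashing.MurmurHash3, which the class description names as the origin of these methods. This is a sketch under that assumption: MixSketch and mixViaMixLast are hypothetical names, and the constants are those used by the standard library's 32-bit MurmurHash3.

```scala
import scala.util.hashing.MurmurHash3

object MixSketch {
  // mix expressed through mixLast: the same block mixing, followed by one
  // extra rotate and multiply-add that mixLast skips.
  def mixViaMixLast(hash: Int, data: Int): Int = {
    val h = MurmurHash3.mixLast(hash, data)
    Integer.rotateLeft(h, 13) * 5 + 0xe6546b64
  }
}
```

Because the two functions return different intermediate values, swapping one for the other on the last element changes the finalized hash, so a caller has to pick one and use it consistently.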
