A family of NGramIndexer implementations that can unpack or strip off specific words, query the order of a packed ngram, etc.
Transformer that uses CoreNLP to (in order): - Tokenize the document - Lemmatize tokens - Replace entities with their type (e.
Converts a sequence of terms to a sparse vector representing their frequencies, using the hashing trick: https://en.
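A minimal sketch of the hashing trick follows, as an illustration only and not this library's implementation. The name `hashingTF`, the `numFeatures` parameter, and the use of `String.hashCode` as the hash function are all assumptions made for the example.

```scala
// Hypothetical helper (not the library's API): maps terms to a sparse
// frequency vector, represented here as index -> count.
def hashingTF(terms: Seq[String], numFeatures: Int): Map[Int, Double] = {
  // Non-negative modulo, so negative hash codes still land in [0, numFeatures).
  def index(term: String): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }
  // Terms that hash to the same index share one slot; their counts add up.
  terms.groupBy(index).map { case (i, ts) => i -> ts.size.toDouble }
}
```

Because the index is derived from the hash alone, no vocabulary has to be built or stored; the trade-off is that distinct terms can collide in the same slot.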
Partitions each ngram by hashing on its first two words (first meaning farthest from the current word), then taking the result modulo numPartitions.
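The partitioning rule above could be sketched as follows. This is an assumption-laden illustration: `partitionFor` is a hypothetical name, the ngram is modeled as a plain `Seq[String]` whose last element is the current word, and the standard `Seq.hashCode` stands in for whatever hash the actual class uses.

```scala
// Hypothetical sketch: choose a partition from the two context words that are
// farthest from the current (last) word of the ngram.
def partitionFor(ngram: Seq[String], numPartitions: Int): Int = {
  val context = ngram.take(2)
  val raw = context.hashCode % numPartitions
  // Keep the result in [0, numPartitions) even when the hash is negative.
  if (raw < 0) raw + numPartitions else raw
}
```

With this scheme, ngrams that share their first two words always land in the same partition, whatever words follow.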
Transformer that converts a String to lower case.
An NGram representation that is a thin wrapper over Array[String].
A simple transformer that represents each ngram as an NGram and counts the occurrences of each.
An ngram featurizer.
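As a rough illustration of what featurizing a token sequence into ngrams involves, the sketch below extracts all ngrams of the requested orders. The function name `ngrams` and its signature are hypothetical and are not taken from this library.

```scala
// Hypothetical sketch: emit every contiguous ngram of each requested order.
def ngrams(tokens: Seq[String], orders: Seq[Int]): Seq[Seq[String]] =
  orders.flatMap { n =>
    // sliding(n) yields one short window when tokens.size < n; drop it.
    tokens.sliding(n).filter(_.size == n).toSeq
  }
```

For example, `ngrams(Seq("a", "b", "c"), Seq(1, 2))` produces the three unigrams followed by the bigrams `("a", "b")` and `("b", "c")`.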
Converts the n-grams of a sequence of terms to a sparse vector representing their frequencies, using the hashing trick: https://en.
Estimates a Stupid Backoff ngram language model, which was introduced in the following paper:
Transformer that tokenizes a String into a Seq[String] by splitting on a regular expression.
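Regex-based tokenization of this kind can be sketched in a few lines. The function name `tokenize` and the default separator pattern (runs of punctuation or whitespace) are assumptions for the example; the transformer's actual default may differ.

```scala
// Hypothetical sketch: split on a regular expression and drop empty tokens.
// The default pattern is an assumption, not the library's documented default.
def tokenize(text: String, sep: String = "[\\p{Punct}\\s]+"): Seq[String] =
  text.split(sep).filter(_.nonEmpty).toSeq
```

The `filter` step matters because `String.split` returns a leading empty string when the input starts with a separator.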
Encodes string tokens as non-negative integers, which are indices of the tokens' positions in the sorted-by-frequency order.
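The encoding described above might look like the following sketch. The names `fitEncoder` and `encode` are hypothetical, ties in frequency are broken alphabetically for determinism, and mapping unseen tokens to -1 is an assumption of this example.

```scala
// Hypothetical sketch: the most frequent token gets index 0, the next 1, etc.
def fitEncoder(corpus: Seq[Seq[String]]): Map[String, Int] = {
  val counts = corpus.flatten.groupBy(identity).map { case (w, ws) => w -> ws.size }
  // Sort by descending count; break ties alphabetically (an assumption).
  counts.toSeq.sortBy { case (w, c) => (-c, w) }.map(_._1).zipWithIndex.toMap
}

// Tokens absent from the vocabulary map to `unknown` (assumed to be -1 here).
def encode(tokens: Seq[String], vocab: Map[String, Int], unknown: Int = -1): Seq[Int] =
  tokens.map(t => vocab.getOrElse(t, unknown))
```

Assigning small indices to frequent tokens keeps the most common IDs compact, which is convenient when the IDs are later bit-packed.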
Control flags used for NGramsCounts.
Packs up to 3 words (trigrams) into a single Long by bit packing.
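One way to pack up to three word IDs into a single Long is sketched below. The layout is an assumption chosen for illustration (20 bits per word ID, with the ngram order stored in the two bits above them) and is not necessarily the layout this class uses; `pack`, `order`, and `unpack` are hypothetical names.

```scala
// Assumed layout: bits 0-19 word 0, bits 20-39 word 1, bits 40-59 word 2,
// bits 60-61 the ngram order (1 to 3).
val BitsPerWord = 20
val WordMask = (1L << BitsPerWord) - 1

def pack(ids: Seq[Int]): Long = {
  require(ids.nonEmpty && ids.size <= 3, "supports unigrams to trigrams")
  require(ids.forall(id => id >= 0 && id <= WordMask), "each id must fit in 20 bits")
  // Shift each id into its own 20-bit slot and OR the slots together.
  val words = ids.zipWithIndex.foldLeft(0L) { case (acc, (id, i)) =>
    acc | (id.toLong << (i * BitsPerWord))
  }
  words | (ids.size.toLong << (3 * BitsPerWord))
}

// Read back the order stored above the three word slots.
def order(packed: Long): Int = ((packed >>> (3 * BitsPerWord)) & 0x3L).toInt

// Recover the word ids, reading only as many slots as the order indicates.
def unpack(packed: Long): Seq[Int] =
  (0 until order(packed)).map(i => ((packed >>> (i * BitsPerWord)) & WordMask).toInt)
```

Storing the order alongside the words lets a unigram whose ID is 0 be told apart from an empty slot, at the cost of two bits.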
Transformer that trims leading and trailing whitespace from a String.