Loads the 20 Newsgroups dataset. Designed to load the data extracted from 20news-bydate.tar.gz, available at http://qwone.com/~jason/20Newsgroups/
The train and test directories are each expected to have the layout: train_or_test_dir/class_label/docs_as_separate_plaintext_files
SparkContext to use
Directory of the training data
A NewsgroupsData object containing the loaded train & test data as RDDs
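The directory layout described above can be illustrated with a small local sketch. This is not the loader's actual implementation: it uses only the Python standard library rather than Spark, and the function name `load_labeled_docs`, the sorted label ordering, and the latin-1 decoding are all assumptions made for illustration. A Spark-based loader would presumably read each class directory into an RDD instead (for example with `SparkContext.wholeTextFiles`), but that is a guess about the approach, not something the documentation states.

```python
import os


def load_labeled_docs(data_dir):
    """Read a train_or_test_dir/class_label/doc_file tree.

    Returns (class_names, docs), where docs is a list of
    (label_index, document_text) pairs. Hypothetical helper, for
    illustration only.
    """
    # Each immediate subdirectory of data_dir is treated as one class label.
    # Sorting gives a deterministic label index (an assumption; the real
    # loader may fix its own label order).
    class_names = sorted(
        name for name in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, name))
    )

    docs = []
    for label_index, name in enumerate(class_names):
        class_dir = os.path.join(data_dir, name)
        for filename in sorted(os.listdir(class_dir)):
            path = os.path.join(class_dir, filename)
            if not os.path.isfile(path):
                continue  # skip nested directories
            # latin-1 can decode any byte sequence; the dataset's actual
            # encoding is assumed here, not confirmed.
            with open(path, encoding="latin-1") as handle:
                docs.append((label_index, handle.read()))
    return class_names, docs
```

Calling `load_labeled_docs` once on the train directory and once on the test directory would yield the two labeled collections that the loader is described as returning as RDDs.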
The 20 Newsgroups class labels (and directory names)