scala - Labelling points for classification in Spark
I'm trying to run multiple classifiers on this telecom dataset to predict churn. So far, I've loaded the dataset into a Spark RDD, but I'm not sure how I can select one column as the label - in this case, the last column. I'm not asking for code, just a short explanation of how RDDs and LabeledPoint work together. I looked at the examples provided in the official Spark GitHub, but they seem to use the LIBSVM format.
Question: how does LabeledPoint work, and how can I specify which column the label is?
My code so far, if it helps:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD, LogisticRegressionWithLBFGS, LogisticRegressionModel, NaiveBayes, NaiveBayesModel}

object Churn {
  def main(args: Array[String]): Unit = {
    // Setting up the Spark context
    val conf = new SparkConf().setAppName("churn")
    val sc = new SparkContext(conf)

    // Loading and mapping the data into an RDD
    val csv = sc.textFile("file://filename.csv")
    val data = csv.map(line => line.split(",").map(elem => elem.trim))

    /* the computer learns points from features and labels here */
  }
}

The dataset looks like this:
state,account length,area code,phone,int'l plan,vmail plan,vmail message,day mins,day calls,day charge,eve mins,eve calls,eve charge,night mins,night calls,night charge,intl mins,intl calls,intl charge,custserv calls,churn?
KS,128,415,382-4657,no,yes,25,265.100000,110,45.070000,197.400000,99,16.780000,244.700000,91,11.010000,10.000000,3,2.700000,1,False.
OH,107,415,371-7191,no,yes,26,161.600000,123,27.470000,195.500000,103,16.620000,254.400000,103,11.450000,13.700000,3,3.700000,1,False.
NJ,137,415,358-1921,no,no,0,243.400000,114,41.380000,121.200000,110,10.300000,162.600000,104,7.320000,12.200000,5,3.290000,0,False.
You need to decide what the features are: for example, a phone number is not a feature. So, some columns should be dropped. Then, you want to transform the string columns into numbers. Yes, you could do this with ML transformers, but it's overkill in this situation. I'd do it like this (showing the logic on a single line of data):
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val line = "NJ,137,415,358-1921,no,no,0,243.400000,114,41.380000,121.200000,110,10.300000,162.600000,104,7.320000,12.200000,5,3.290000,0,False"
val arrL = line.split(",").map(_.trim)
val mR = Map("no" -> "0.0", "yes" -> "1.0", "False" -> "0.0", "True" -> "1.0")
val stringVec = Array(arrL(2), mR(arrL(4)), mR(arrL(5))) ++ arrL.slice(6, 20)
val label = mR(arrL(20)).toDouble
val vec = stringVec.map(_.toDouble)
LabeledPoint(label, Vectors.dense(vec))

So, to answer your question: a LabeledPoint is the target variable (in this case the last column, as a Double, i.e. whether the customer has churned or not) plus a vector of numeric (Double) features describing the customer (vec in this case).
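To connect this back to your csv RDD, here is a minimal sketch of mapping the whole file to LabeledPoints. It assumes the file still contains the header row (so it is filtered out first) and that the columns match the sample above; the stripSuffix call is only there because the churn column in the sample file ends with a period ("False."):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val header = csv.first()              // the first line holds the column names
val rows = csv.filter(_ != header)    // keep only the data rows

val mR = Map("no" -> "0.0", "yes" -> "1.0", "False" -> "0.0", "True" -> "1.0")

val labeled = rows.map { l =>
  val arrL = l.split(",").map(_.trim)
  // area code plus the two yes/no plan columns, then the numeric usage columns
  val stringVec = Array(arrL(2), mR(arrL(4)), mR(arrL(5))) ++ arrL.slice(6, 20)
  // drop the trailing period from the churn column, e.g. "False."
  val label = mR(arrL(20).stripSuffix(".")).toDouble
  LabeledPoint(label, Vectors.dense(stringVec.map(_.toDouble)))
}

labeled is then an RDD[LabeledPoint] that can be passed straight to the MLlib trainers you import, e.g. SVMWithSGD.train(labeled, 100).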