scala - Labelling points for classification in Spark


I'm trying to run multiple classifiers on this telecom dataset to predict churn. So far, I've loaded the dataset into a Spark RDD, but I'm not sure how I can select one column as the label - in my case, the last column. I'm not asking for code, just a short explanation of how RDDs and LabeledPoint work together. I looked at the examples provided in the official Spark GitHub, but they all seem to use the LIBSVM format.

Question: how does LabeledPoint work, and how can I specify what the label is?

My code so far, if it helps:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD, LogisticRegressionWithLBFGS, LogisticRegressionModel, NaiveBayes, NaiveBayesModel}

object Churn {

  def main(args: Array[String]): Unit = {
    // Set up the Spark context
    val conf = new SparkConf().setAppName("churn")
    val sc = new SparkContext(conf)

    // Load the CSV and map each line to an array of trimmed fields
    val csv = sc.textFile("file://filename.csv")
    val data = csv.map(line => line.split(",").map(elem => elem.trim))

    /* the computer learns points from features and labels here */
  }
}

The dataset looks like this:

state,account length,area code,phone,int'l plan,vmail plan,vmail message,day mins,day calls,day charge,eve mins,eve calls,eve charge,night mins,night calls,night charge,intl mins,intl calls,intl charge,custserv calls,churn?
KS,128,415,382-4657,no,yes,25,265.100000,110,45.070000,197.400000,99,16.780000,244.700000,91,11.010000,10.000000,3,2.700000,1,False.
OH,107,415,371-7191,no,yes,26,161.600000,123,27.470000,195.500000,103,16.620000,254.400000,103,11.450000,13.700000,3,3.700000,1,False.
NJ,137,415,358-1921,no,no,0,243.400000,114,41.380000,121.200000,110,10.300000,162.600000,104,7.320000,12.200000,5,3.290000,0,False.

You need to decide what the features are: for example, the phone number is not a feature, so some columns get dropped. Then you want to transform the string columns into numbers. Yes, there are ML transformers for this, but they're overkill in this situation. I'd do it like this (showing the logic on a single line of data):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val line = "NJ,137,415,358-1921,no,no,0,243.400000,114,41.380000,121.200000,110,10.300000,162.600000,104,7.320000,12.200000,5,3.290000,0,false"
val arrL = line.split(",").map(_.trim)
// Map yes/no and true/false onto numeric strings
val mR = Map("no" -> "0.0", "yes" -> "1.0", "false" -> "0.0", "true" -> "1.0")
// Drop state, account length and phone number; convert the two plan columns, keep the numeric columns
val stringVec = Array(arrL(2), mR(arrL(4)), mR(arrL(5))) ++ arrL.slice(6, 20)
// The last column (churn?) is the label
val label = mR(arrL(20)).toDouble
val vec = stringVec.map(_.toDouble)
LabeledPoint(label, Vectors.dense(vec))

So, to answer your question: a labeled point is the target variable (in this case the last column, as a Double, i.e. whether the customer has churned or not), plus a vector of numeric (Double) features describing that customer (vec in this case).
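To tie this back to the code in the question, here is a minimal sketch of how the same per-line logic could be applied to the whole RDD and fed to one of the already-imported mllib classifiers. The parseLine helper and the header handling are my own additions (not part of the original snippet), it assumes csv is the RDD from the question's code, and it strips the trailing period on the False./True. values shown in the file above.

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.rdd.RDD

// Hypothetical helper wrapping the single-line logic above
def parseLine(line: String): LabeledPoint = {
  val arrL = line.split(",").map(_.trim)
  val mR = Map("no" -> 0.0, "yes" -> 1.0, "false" -> 0.0, "true" -> 1.0)
  // Label: the churn? column, e.g. "False." -> 0.0
  val label = mR(arrL(20).stripSuffix(".").toLowerCase)
  // Features: area code, the two plan flags, then the numeric usage columns
  val features = Array(arrL(2).toDouble, mR(arrL(4)), mR(arrL(5))) ++ arrL.slice(6, 20).map(_.toDouble)
  LabeledPoint(label, Vectors.dense(features))
}

// Skip the header row, parse every remaining line into a LabeledPoint
val header = csv.first()
val points: RDD[LabeledPoint] = csv.filter(_ != header).map(parseLine).cache()

// Train on RDD[LabeledPoint]; the other imported mllib classifiers take the same input
val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(points)

From there, comparing model.predict(p.features) with p.label over a held-out split gives you the usual evaluation loop, as in the MLlib examples.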

