r - Handling different factor levels in train and test data
I have a training data set of 20 columns of factors, which I have to use to train a model. I have also been given a test data set on which I have to apply the model, make predictions, and submit them.
While doing some initial data exploration, I checked the levels of the training data against the levels of the testing data out of curiosity, since I am dealing with categorical variables. To my dismay, some of the categories (variables) have different levels in the training and testing data sets.
For example:
    table(train$cap.shape)    # training data column levels

       b    c    f    k    x
     196    4 2356  828 2300

    table(test$cap.shape)     # test data

       b    f    s    x
     256  796   32 1356

Here I have a category s in the test data set that is not present in training. How can I handle these cases? Also, the count for category c in training is very low, so I am thinking of merging it into another level based on how the dependent variable is distributed, but I am stuck on how to handle the extra level in test.
More examples:

    table(train$odor)    # train

       c    f    m    n    p    s    y
     189 2155   36 2150    2  576  576

    table(test$odor)     # test

    c f l n p
    400 3 5 400 1378 254

In this column there are 2 levels in test with a substantial number of instances in the test data set. How can I handle these discrepancies?

    table(train$scolour)    # train

       b    h    k    n    o    r    w    y
      48 1627  700  753   48   72 2388   48

    table(test$scolour)     # test

       h    k    n    u
       5 1172 1215   48

Here I have a factor level of u that does not occur in training.
Should I first build the model on the training set, find the important predictors, and only then worry about the factor levels?
Having different feature sets violates a basic precept of machine learning: the training and test data must represent the same data space. These do not. Although each pair has a common kernel of features (dimensions), to use them on the same model you have to either reduce each set to the common features, or extend both to the union of features, filling in "don't care" or semantically null values for the missing features.
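As a rough illustration of the two options above, here is a minimal base R sketch for a single column. It assumes the `cap.shape` counts from the question and rebuilds toy `train` and `test` data frames from them; the helper variable names (`all_levels`, `common`, `train_u`, and so on) are made up for this example and are not from the original post.

```r
# Toy data rebuilt from the cap.shape counts in the question (assumption:
# only this one column is reproduced, not the full 20-column data set).
train <- data.frame(cap.shape = factor(rep(c("b", "c", "f", "k", "x"),
                                           times = c(196, 4, 2356, 828, 2300))))
test  <- data.frame(cap.shape = factor(rep(c("b", "f", "s", "x"),
                                           times = c(256, 796, 32, 1356))))

# Step 1: find the levels that occur on only one side.
only_in_test  <- setdiff(levels(test$cap.shape), levels(train$cap.shape))
only_in_train <- setdiff(levels(train$cap.shape), levels(test$cap.shape))

# Option A: extend both columns to the union of levels.
# Levels missing on one side simply get a count of zero there.
all_levels <- union(levels(train$cap.shape), levels(test$cap.shape))
train_u <- factor(train$cap.shape, levels = all_levels)
test_u  <- factor(test$cap.shape,  levels = all_levels)

# Option B: reduce both columns to the common levels.
# Values outside the common set become NA and have to be
# dropped, imputed, or merged into another level afterwards.
common  <- intersect(levels(train$cap.shape), levels(test$cap.shape))
train_c <- factor(train$cap.shape, levels = common)
test_c  <- factor(test$cap.shape,  levels = common)

table(test_u)            # level "c" and "k" now appear with count 0
sum(is.na(test_c))       # rows whose level was not shared with train
```

Option A only makes the two columns structurally compatible. A model fitted on `train_u` still has no training rows for level s, so whatever it predicts for those test rows is not informed by the data; Option B makes that gap explicit as `NA` values.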