r - Handling different Factor Levels in Train and Test data


I have a training data set of 20 columns of factors, which I have to use to train a model. I have also been given a test data set, on which I have to apply the model, make predictions, and submit them.

I was doing initial data exploration and, out of curiosity, checked the levels of the training data against the levels of the testing data, since I am dealing with categorical variables. To my dismay, the categories (variables) have different levels in the training and testing data sets.

For example:

    table(train$cap.shape) # training data column levels
       b    c    f    k    x
     196    4 2356  828 2300

    table(test$cap.shape) # test data
       b    f    s    x
     256  796   32 1356

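Here there is a category s in the test data set that does not occur in the training data. How can I handle these cases? The count for category c in training is very low, so I was thinking of merging that level with another level based on how the dependent variable is distributed, but I am stuck on how to handle the new level in test.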
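One possible way to act on the merging idea, sketched under assumptions: rare training levels are collapsed into a catch-all level, and any test level that was not kept from training is mapped to the same catch-all. The helper `lump_levels`, the threshold `min_count`, the label `"other"`, and the toy counts are all hypothetical and not taken from the real data.

```r
# Hypothetical helper: keep only training levels with at least `min_count`
# rows; every other value (rare in train, or unseen in train) becomes `other`.
# Note: NA values are also mapped to `other` by this simple version.
lump_levels <- function(train_x, test_x, min_count = 10, other = "other") {
  counts <- table(train_x)
  keep   <- names(counts)[counts >= min_count]
  recode <- function(x) {
    x <- as.character(x)
    x[!(x %in% keep)] <- other
    factor(x, levels = c(keep, other))
  }
  list(train = recode(train_x), test = recode(test_x))
}

# Made-up vectors loosely in the spirit of cap.shape: "c" is rare in train,
# "s" appears only in test.
train_shape <- factor(rep(c("b", "c", "f", "x"), times = c(20, 2, 30, 25)))
test_shape  <- factor(rep(c("b", "f", "s", "x"), times = c(5, 8, 3, 9)))

lumped <- lump_levels(train_shape, test_shape, min_count = 10)
print(table(lumped$train))
print(table(lumped$test))
```

Both vectors end up with identical levels, so a model fitted on the lumped training column can score the lumped test column. Whether lumping by frequency is appropriate, rather than by the distribution of the dependent variable, is a modelling decision this sketch does not settle.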

More examples:

    table(train$odor) # train
       c    f    m    n    p    s    y
     189 2155   36 2150    2  576  576

    table(test$odor) # test
       a    c    f    l    n    p
     400    3    5  400 1378  254

In this column there are 2 levels in test that do not occur in training, and they account for a substantial number of instances in the test data set. How can I handle these discrepancies?

    table(train$scolour) # train
       b    h    k    n    o    r    w    y
      48 1627  700  753   48   72 2388   48

    table(test$scolour) # test
       h    k    n    u
       5 1172 1215   48

Here there is a new factor level, u, in test.
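The column-by-column comparisons above could be automated rather than done with one `table()` call at a time. A minimal sketch, assuming both data frames store these columns as factors; the toy `train` and `test` frames below use made-up values and only stand in for the real data.

```r
# Toy stand-ins for the real train/test data frames (values are made up).
train <- data.frame(
  cap.shape = factor(c("b", "c", "f", "k", "x", "x")),
  odor      = factor(c("c", "f", "m", "n", "p", "s"))
)
test <- data.frame(
  cap.shape = factor(c("b", "f", "s", "x")),
  odor      = factor(c("a", "c", "l", "n"))
)

# For every column present in both frames, list the levels that occur
# in test but not in train.
shared_cols <- intersect(names(train), names(test))
new_in_test <- lapply(shared_cols, function(col) {
  setdiff(levels(test[[col]]), levels(train[[col]]))
})
names(new_in_test) <- shared_cols

print(new_in_test)
```

Swapping the two arguments of `setdiff` gives the opposite view: levels seen in training that never appear in test.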

Should I first build the model on the training set, find the important predictors, and only then worry about the factor levels?

Having different feature sets violates a basic precept of machine learning: the training and test data must represent the same data space. These do not. Although each pair has a common kernel of features (dimensions), to use them on the same model you have to either reduce each set to the common features, or extend both to the union of features, filling in "don't care" or semantically null values for the missing features.
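Both options can be sketched in base R for a single factor column. This is only an illustration under assumptions: the vectors `train_odor` and `test_odor` are made-up stand-ins with one row per level, and using `NA` as the "semantically null" value in the reduced version is one possible choice, not the only one.

```r
# Made-up stand-ins for one column of the train and test sets.
train_odor <- factor(c("c", "f", "m", "n", "p", "s", "y"))
test_odor  <- factor(c("a", "c", "f", "l", "n", "p"))

# Option 1: extend both columns to the union of levels, so both factors
# describe the same space even where a level has zero rows.
all_levels <- union(levels(train_odor), levels(test_odor))
train_u <- factor(train_odor, levels = all_levels)
test_u  <- factor(test_odor,  levels = all_levels)

# Option 2: reduce both columns to the common levels; values outside the
# common set become NA and need separate handling afterwards.
common  <- intersect(levels(train_odor), levels(test_odor))
train_c <- factor(train_odor, levels = common)
test_c  <- factor(test_odor,  levels = common)

print(levels(train_u))
print(table(test_c, useNA = "ifany"))
```

With the union, prediction code no longer sees an unknown level, but the model still has no training rows for the levels that exist only in test, so what it predicts for them depends on the model type.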

