python - sklearn: Naive Bayes classifier gives low accuracy
I have a dataset that includes 200,000 labelled training examples. Each training example has 10 features, including both continuous and discrete ones. I'm trying to use the sklearn
package in Python in order to train a model and make predictions, but I am having troubles (and questions too).
First, let me write the code I have written so far:
    from sklearn.naive_bayes import GaussianNB

    # data contains 200 000 examples
    # targets contain the corresponding label for each training example
    gnb = GaussianNB()
    gnb.fit(data, targets)
    predicted = gnb.predict(data)
The problem is the low accuracy (too many misclassified labels) - around 20%. I am not quite sure whether there is a problem with the data (e.g. more data is needed, or something else) or with the code.
Is this the proper way to implement a Naive Bayes classifier given a dataset with both discrete and continuous features?
Furthermore, in machine learning I know that the dataset should be split into training and validation/testing sets. Is this automatically performed by sklearn, or should I fit the model using the training dataset and call predict using the validation set?
Any thoughts or suggestions are appreciated.
> The problem is the low accuracy (too many misclassified labels) - around 20%. I am not quite sure whether there is a problem with the data (e.g. more data is needed, or something else) or with the code.
This is not a big error for Naive Bayes. It is an extremely simple classifier and you should not expect it to be strong; more data won't help. Your Gaussian estimators are probably already good; it is the naive independence assumptions that are the problem. Use a stronger model. You can start with a random forest, since it is easy to use even for non-experts in the field.
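A minimal sketch of that suggestion, using a synthetic dataset as a stand-in for the asker's 200,000 examples (`data` and `targets` here are generated, not the real data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset: 10 features, as in the question
# (smaller sample here so the sketch runs quickly).
data, targets = make_classification(n_samples=2000, n_features=10,
                                    n_informative=6, random_state=0)

# Hold out a test set so accuracy is measured on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    data, targets, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # held-out accuracy
```

Random forests handle mixed continuous/discrete features without any distributional assumptions, which is exactly where Naive Bayes struggles here.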
> Is this the proper way to implement a Naive Bayes classifier given a dataset with both discrete and continuous features?
No, it is not. You should use different distributions for the discrete features, but scikit-learn does not support that out of the box, so you would have to do it manually. As said before - I would change the model.
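One way to do it manually, sketched here under assumed data (the column split into continuous and discrete blocks is hypothetical): fit `GaussianNB` on the continuous columns and `CategoricalNB` on the discrete ones, then combine the per-model log-posteriors, subtracting the class log-prior once so it is not counted twice. The per-sample evidence terms differ from the true joint by a constant per row, so the argmax over classes is unaffected.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB

# Hypothetical split: 6 continuous columns, 4 discrete (small-integer) columns.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
X_cont = rng.normal(loc=y[:, None], size=(500, 6))   # continuous features
X_disc = rng.integers(0, 3, (500, 4))                # discrete features

gnb = GaussianNB().fit(X_cont, y)
cnb = CategoricalNB().fit(X_disc, y)

# Add log-posteriors from both models; subtract the doubly counted log-prior.
log_prior = np.log(gnb.class_prior_)
joint = (gnb.predict_log_proba(X_cont)
         + cnb.predict_log_proba(X_disc)
         - log_prior)
pred = gnb.classes_[np.argmax(joint, axis=1)]
```

`CategoricalNB` expects the discrete features to be encoded as non-negative integers; `OrdinalEncoder` can produce that encoding for string-valued categories.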
> Furthermore, in machine learning I know that the dataset should be split into training and validation/testing sets. Is this automatically performed by sklearn, or should I fit the model using the training dataset and call predict using the validation set?
Nothing is done automatically in this manner; you need to do it on your own (scikit-learn has lots of tools for this - see the cross-validation packages).