It is hard to make specific suggestions without knowing more information about the exact problem you are attempting to solve. However, I will make a few recommendations for general time series prediction.

First, it is important to note that independence of observations is often a poor assumption for time series, due to serial correlation of observations. Additionally, assuming that the time series is stationary may or may not be a good assumption. These issues complicate the partitioning of your data into training, cross-validation, and test sets. If there is a significant degree of autocorrelation, it will reduce your effective sample size and increase your chances of overfitting, especially for complex machine learning algorithms like multilayer neural networks that have many tunable weights.

Autocorrelation also affects how you split your data set. If you randomly sample from your entire data set to create cross-validation and testing sets, then you will likely introduce a look-ahead bias into your out-of-sample error estimates. This happens because the out-of-sample sets contain points that are temporally adjacent to in-sample points; that violates the independence assumption if there is significant autocorrelation and will artificially reduce your out-of-sample error estimates. This problem often leads to splitting the data set at a particular time point, so that every observation before the split is used for training and every observation after the split for cross-validation/testing. This alleviates some of the issues with autocorrelation, but adds other complications if the in-sample time series differs significantly from the out-of-sample time series (a non-stationary time series).

Considering the issues above, I would start as simple as possible and work my way up in complexity as needed. I would begin by temporally splitting my data set into an in-sample and an out-of-sample set, making sure that the out-of-sample set contains cycles and statistical properties similar to those of the in-sample set. Then I would approach the in-sample data set with a Random Forest. Not knowing the size of your data set, 20 input variables may be a conservative use of your degrees of freedom.

---

These decisions IMHO can only be made in a sensible way with intimate knowledge about the problem and the data at hand (search terms: no free lunch theorem for pattern recognition/classification). So all we can tell you here are very general rules of thumb.

The more statistically independent cases you have for training, the more complex models you can afford. Simpler models (e.g. linear ones) are very often chosen because more complex models cannot be afforded with the given amount of data, and less because of really being convinced that the class boundaries are actually linear. See bias-variance tradeoff and model complexity. Knowledge about the nature of your problem and data may also suggest sensible ways of feature generation.

If you don't have terribly many samples, but absolutely need nonlinear boundaries and therefore get unstable models, then ensemble models (like the random forest) can help. You can aggregate not only decision trees but all other kinds of models as well.

There are rumours* that for the final quality of the model, the choice of model often matters less than the experience the user has with the chosen type of model. I try to collect some evidence about this rumour in this question. The conclusion would be to look for someone to consult who has experience with the classifiers you consider or, even better, with classification of your type of data (that would need a more detailed description than just saying it is time series). Note: the first three can also be set up to output posterior probabilities.

*I don't know any scientific study that reports this, but I have heard numerous people report this observation, and there are a number of descriptions of the differences between types of models that conclude that, in practice, the theoretical differences hardly ever matter.
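The temporal splitting scheme discussed above can be sketched in a few lines of plain Python. The function names here are mine, and this is only a minimal sketch: a single-point split, plus an expanding-window "walk-forward" variant in which each fold trains on everything up to a point and tests on the next block, so test observations are always strictly later than training observations (in practice, scikit-learn's `TimeSeriesSplit` implements a similar scheme).

```python
def temporal_split(series, cutoff):
    """Split a time-ordered sequence at a single index: everything before
    `cutoff` is in-sample, everything from `cutoff` on is out-of-sample.
    No shuffling, so no out-of-sample point precedes an in-sample point."""
    return series[:cutoff], series[cutoff:]


def walk_forward_splits(n, n_splits):
    """Yield (train_indices, test_indices) pairs for an expanding window:
    fold k trains on observations [0, k*fold) and tests on the next block,
    so test data is always strictly later than training data."""
    fold = n // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = k * fold
        yield (list(range(train_end)),
               list(range(train_end, min(train_end + fold, n))))


# Usage: split 100 ordered observations at t = 80.
train, holdout = temporal_split(list(range(100)), 80)
```

Note that neither variant removes the caveat from the text: if the series is non-stationary, the later test blocks may differ systematically from the earlier training data.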
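The point that you can aggregate not only decision trees but any kind of model can be sketched with a generic bagging helper. Everything here is illustrative (the names `bagged_fit` and `fit_1nn` are mine, and the base learner is a toy 1-nearest-neighbour fit); in practice you would reach for something like scikit-learn's `BaggingClassifier`/`BaggingRegressor`, which accept an arbitrary base estimator.

```python
import random
from statistics import mean


def bagged_fit(fit, X, y, n_models=25, seed=0):
    """Bootstrap-aggregate an arbitrary base learner. `fit(X, y)` must
    return a callable predictor; the ensemble predicts the mean of the
    individual predictions, which stabilises unstable base models."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Resample the training set with replacement (a bootstrap sample).
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(fit([X[i] for i in idx], [y[i] for i in idx]))
    return lambda x: mean(m(x) for m in models)


def fit_1nn(X, y):
    """Toy unstable base learner: predict the label of the nearest point."""
    data = list(zip(X, y))
    return lambda x: min(data, key=lambda pair: abs(pair[0] - x))[1]


# Usage: bag ten 1-NN models over a tiny labelled series.
predict = bagged_fit(fit_1nn, [1.0, 2.0, 3.0, 10.0], [1, 1, 1, 0], n_models=10)
```

The only contract is that `fit` returns a predictor, so the same helper bags linear models, neural networks, or anything else, which is exactly the "not only decision trees" observation above.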