Quantifying variable importance with Monte Carlo simulation
Finding the smallest subset of input variables that still yields an accurate model (that is, a parsimonious solution) is often the biggest challenge in a data mining project. Data sets commonly contain tens to hundreds of candidate inputs, and with such so-called "wide" data sets models can be over-trained or simply fail to build. Removing unimportant variables to find the sweet spot between model accuracy and stability is where experienced data miners deliver significant value.
The primary method of variable selection in Modeler is Feature Selection. The Feature Selection process assesses the significance of each variable individually, and variables that are not statistically significant at a specified p-value are dropped. While this technique works well with simple data sets and "main effects" models such as regression, it completely ignores interactions between variables. As often happens, the interaction...
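To make the idea concrete, here is a minimal sketch, in Python, of the kind of univariate screening just described: each candidate input is tested against the target on its own, and anything that fails a significance cutoff is dropped. The function name univariate_screen, the alpha cutoff, and the use of a Pearson correlation test are illustrative assumptions for a continuous target, not the actual logic of Modeler's Feature Selection, which chooses its test by field type.

    import numpy as np
    from scipy import stats

    def univariate_screen(X, y, alpha=0.05):
        """Keep only the columns of X that are individually significant for y.

        X     : 2-D array of candidate inputs (rows = records, columns = variables)
        y     : 1-D array, continuous target
        alpha : p-value cutoff; variables with p >= alpha are dropped
        """
        keep = []
        for j in range(X.shape[1]):
            # Test each input on its own; interactions between inputs are ignored,
            # which is exactly the limitation noted above.
            r, p_value = stats.pearsonr(X[:, j], y)
            if p_value < alpha:
                keep.append(j)
        return keep

    # Illustrative use with synthetic data: only columns 0 and 3 drive the target.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = 2 * X[:, 0] - X[:, 3] + rng.normal(size=200)
    print(univariate_screen(X, y))

Because each variable is scored in isolation, a sketch like this will keep inputs that matter on their own but can miss inputs whose value only appears in combination with others.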