Show the implementation of the Naïve Bayes algorithm.
The Microsoft Naive Bayes algorithm is a
classification algorithm based on Bayes' theorem, provided by Microsoft SQL
Server Analysis Services for use in predictive modeling. The word naïve in the
name Naïve Bayes derives from the fact that the algorithm uses Bayesian
techniques but does not take into account dependencies that may exist among the
input attributes. For more information about Bayesian methods, see Microsoft Research Community.
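The "naïve" assumption can be written compactly. The form below is the standard textbook statement of a naive Bayes classifier, not something taken from the Analysis Services documentation; the symbols C (the predictable column), c (one of its states), and x_1 ... x_n (the observed input column states) are introduced here only for illustration.

```latex
P(C = c \mid x_1, \dots, x_n) \;\propto\; P(C = c) \prod_{i=1}^{n} P(x_i \mid C = c)
```

Each input contributes its own factor P(x_i | C = c), and no factor depends on any other input. That is exactly the sense in which dependencies among inputs are ignored.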
This algorithm is less computationally intense
than other Microsoft algorithms, and therefore is useful for quickly generating
mining models to discover relationships between input columns and predictable
columns. You can use this algorithm to do initial exploration of data, and then
later you can apply the results to create additional mining models with other
algorithms that are more computationally intense and more accurate.
The Microsoft Naive Bayes algorithm calculates
the probability of every state of each input column, given each possible state
of the predictable column.
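The calculation described above can be sketched in a few lines of Python. This is only an illustrative sketch using raw frequency counts; it is not the Analysis Services implementation, and the column names (commute, owns_car, bike_buyer), the toy records, and the helper conditional_tables are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy records; every column is already discrete.
rows = [
    {"commute": "short", "owns_car": "no",  "bike_buyer": "yes"},
    {"commute": "short", "owns_car": "no",  "bike_buyer": "yes"},
    {"commute": "short", "owns_car": "yes", "bike_buyer": "yes"},
    {"commute": "long",  "owns_car": "yes", "bike_buyer": "no"},
    {"commute": "long",  "owns_car": "yes", "bike_buyer": "no"},
    {"commute": "short", "owns_car": "yes", "bike_buyer": "no"},
]

def conditional_tables(rows, inputs, target):
    """Estimate P(input column = state | target = class) from frequency counts."""
    class_counts = Counter(r[target] for r in rows)
    joint = defaultdict(Counter)  # (column, class) -> counts of each state
    for r in rows:
        for col in inputs:
            joint[(col, r[target])][r[col]] += 1
    # Divide each state count by the number of rows in that class.
    return {
        key: {state: n / class_counts[key[1]] for state, n in counts.items()}
        for key, counts in joint.items()
    }

tables = conditional_tables(rows, ["commute", "owns_car"], "bike_buyer")
for (col, cls), dist in sorted(tables.items()):
    print(f"P({col} | bike_buyer={cls}) = {dist}")
```

Each printed line is one distribution of an input column's states for one state of the predictable column, which is the same kind of information the text says the algorithm computes.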
The Microsoft Naive Bayes Viewer lists each
input column in the dataset, and shows how the states of each column are
distributed, given each state of the predictable column.
You would use this view of the model to identify
the input columns that are important for differentiating between states of the
predictable column.
Data Required for Naive Bayes Models
When you prepare data for use in training a
Naive Bayes model, you should understand the requirements for the algorithm,
including how much data is needed, and how the data is used.
The requirements for a Naive Bayes model are as
follows:
A single key column: Each model must contain one numeric or text
column that uniquely identifies each record. Compound keys are not allowed.
Input columns: In a Naive Bayes model, all columns must be either discrete or
discretized columns. For information about discretizing columns, see
Discretization Methods (Data Mining).
For a Naive Bayes model, it is also important to
ensure that the input attributes are independent of each other. This is
particularly important when you use the model for prediction.
The reason is that, if you use two columns of
data that are already closely related, the effect would be to multiply the
influence of those columns, which can obscure other factors that influence the
outcome.
Conversely, the ability of the algorithm to
identify correlations among variables is useful when you are exploring a model
or dataset, to identify relationships among inputs.
At least one predictable column: The predictable attribute must contain
discrete or discretized values.
The values of the predictable column can be
treated as inputs. This practice can be useful when you are exploring a new
dataset, to find relationships among the columns.
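The point made above about closely related input columns can be seen in a small experiment. The sketch below is a plain frequency-count naive Bayes written for illustration only, with no smoothing and no claim to match the Analysis Services implementation; the data, the column names, and the helper posterior are hypothetical. It scores the same case twice: once with a single input column, and once after adding an exact copy of that column.

```python
from collections import Counter

def posterior(rows, inputs, target, query):
    """Naive Bayes posterior P(class | query) from raw frequency estimates."""
    class_counts = Counter(r[target] for r in rows)
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / len(rows)  # prior P(class)
        for col in inputs:
            matches = sum(
                1 for r in rows if r[target] == cls and r[col] == query[col]
            )
            score *= matches / n_cls  # P(col = value | class)
        scores[cls] = score
    norm = sum(scores.values())  # assumes at least one class has a nonzero score
    return {cls: s / norm for cls, s in scores.items()}

# Hypothetical data: 4 buyers (3 short commutes), 4 non-buyers (1 short commute).
pairs = [("short", "yes")] * 3 + [("long", "yes")] \
      + [("short", "no")] + [("long", "no")] * 3
rows = [{"commute": c, "bike_buyer": b} for c, b in pairs]

# Add a second column that carries no new information.
for r in rows:
    r["commute_copy"] = r["commute"]

one = posterior(rows, ["commute"], "bike_buyer", {"commute": "short"})
two = posterior(
    rows,
    ["commute", "commute_copy"],
    "bike_buyer",
    {"commute": "short", "commute_copy": "short"},
)
print("one column :", one["yes"])
print("with copy  :", two["yes"])
```

With this toy data the estimated probability of "yes" moves from 0.75 to 0.9 even though the copied column adds no new evidence, because its factor is multiplied in a second time.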