Introduction
Analytic Solver Data Science contains two techniques for transforming continuous data: Binning and Rescaling.
Bin Continuous Data
Binning groups measured data into classes, which can reduce the effect of minor errors in the dataset and make the data easier to understand and visualize. For example, consider exact ages versus the categories "child", "adult", and "elderly". In most analyses the three categories suffice and are easier to visualize than exact ages. In Analytic Solver Data Science, the user decides what values the binned variable should take.
A variable can be binned in the following ways.
- Equal count: The data is binned so that each bin contains the same number of records. When this option is selected, the options Rank of the bin, Mean of the bin, and Median of the bin are enabled.
  - Rank of the bin: Each value in the variable is assigned a rank according to the start and interval values specified by the user.
  - Mean of the bin: The mean is calculated as the average of the values lying in the bin interval. This mean value is assigned to each record that lies in that interval.
  - Median of the bin: The median of the input values that fall in the same bin is calculated. This median value is assigned to each record in that bin.
- Equal Interval: This method is based on bin size. The whole range is divided into bins of the size specified by the user. The options Rank of the bin and Mid value are available with this method.
  - Rank of the bin: Each value in the variable is assigned a rank according to the start and increment values specified by the user.
  - Mid value: The mean is calculated as the average of the values lying in the bin interval. This mean value is assigned to each value of the variable that lies in that interval.
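The binning options above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the code Analytic Solver Data Science runs: the function names (`equal_count_bins`, `equal_interval_bins`, `assign`) and the `start`/`increment` parameters are hypothetical, and the handling of ties and of uneven record counts is an assumption.

```python
from statistics import mean, median


def equal_count_bins(values, n_bins):
    """Return a 0-based bin index per value so bins hold (nearly) equal record counts.

    Assumption: tied values may land in different bins, and when the record
    count is not divisible by n_bins the bin sizes differ by at most one.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for position, i in enumerate(order):
        bins[i] = position * n_bins // len(values)
    return bins


def equal_interval_bins(values, bin_size):
    """Return a 0-based bin index per value using fixed-width bins from the minimum."""
    lo = min(values)
    return [int((v - lo) // bin_size) for v in values]


def assign(values, bins, how, start=1, increment=1):
    """Replace each value by its bin's rank, mean, or median."""
    if how == "rank":
        return [start + b * increment for b in bins]
    if how not in ("mean", "median"):
        raise ValueError(f"unknown option: {how}")
    stat = mean if how == "mean" else median
    groups = {}
    for v, b in zip(values, bins):
        groups.setdefault(b, []).append(v)
    per_bin = {b: stat(g) for b, g in groups.items()}
    return [per_bin[b] for b in bins]


ages = [3, 8, 25, 31, 47, 52, 70, 84]
count_bins = equal_count_bins(ages, 4)        # two records per bin
print(assign(ages, count_bins, "rank"))       # rank of the bin
print(assign(ages, count_bins, "mean"))       # mean of the bin
print(assign(ages, count_bins, "median"))     # median of the bin
print(equal_interval_bins(ages, 30))          # bins of width 30 starting at the minimum
```

Keeping bin assignment separate from the value written back mirrors the way the options are organized: the method (Equal count or Equal Interval) decides which bin a record falls in, and the sub-option decides what the binned variable contains.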
Rescale Continuous Data
The Rescaling utility was introduced in Analytic Solver Data Science V2017. Use this utility to normalize one or more features in your data. Many data science workflows include feature scaling or normalization during the data preprocessing stage. In addition to this general-purpose utility, you can access rescaling functionality directly from the dialogs for the supervised algorithms available in the Analytic Solver Data Science application.
Analytic Solver Data Science provides the following methods for feature scaling: Standardization, Normalization, Adjusted Normalization and Unit Norm.
- Standardization makes the feature values have zero mean and unit variance. The formula is (x−mean)/std.dev.
- Normalization scales the data values to the [0,1] range. The formula is (x−min)/(max−min). The Correction option specifies a small positive number ε that is applied as a correction to the formula. The corrected formula is widely used in Neural Networks when the Logistic Sigmoid function is used to activate the neurons in hidden layers; it ensures that the data values never reach the asymptotic limits of the activation function. The corrected formula is [x−(min−ε)]/[(max+ε)−(min−ε)].
- Adjusted Normalization scales the data values to the [-1,1] range. The formula is [2(x−min)/(max−min)]−1. The Correction option specifies a small positive number ε that is applied as a correction to the formula. The corrected formula is widely used in Neural Networks when the Hyperbolic Tangent function is used to activate the neurons in hidden layers; it ensures that the data values never reach the asymptotic limits of the activation function. The corrected formula is {2[(x−(min−ε))/((max+ε)−(min−ε))]}−1.
- Unit Normalization (Unit Norm) is another frequently used method; it scales the data so that the feature vector has unit length. This usually means dividing each value by the Euclidean length (L2-norm) of the vector. In some applications, it can be more practical to use the Manhattan length (L1-norm) instead.
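The four rescaling formulas above translate directly into code. The sketch below is a minimal illustration, not the product's implementation: the function names and the `eps` and `order` parameters are hypothetical, and using the population standard deviation (`pstdev`) for Standardization is an assumption, since the text does not say whether the population or the sample formula applies.

```python
import math
from statistics import mean, pstdev


def standardize(xs):
    """(x - mean) / std.dev; population std.dev is an assumption."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]


def normalize(xs, eps=0.0):
    """[x - (min - eps)] / [(max + eps) - (min - eps)]; eps=0 gives the plain formula."""
    lo, hi = min(xs) - eps, max(xs) + eps
    return [(x - lo) / (hi - lo) for x in xs]


def adjusted_normalize(xs, eps=0.0):
    """2 * normalized - 1, which maps the data to [-1, 1]."""
    return [2 * v - 1 for v in normalize(xs, eps)]


def unit_norm(xs, order=2):
    """Divide by the L2 (Euclidean) length, or by the L1 length when order=1."""
    if order == 1:
        length = sum(abs(x) for x in xs)
    else:
        length = math.sqrt(sum(x * x for x in xs))
    return [x / length for x in xs]


feature = [2.0, 4.0, 6.0, 8.0]
print(standardize(feature))
print(normalize(feature))                   # endpoints reach 0 and 1
print(normalize(feature, eps=0.5))          # endpoints stay strictly inside (0, 1)
print(adjusted_normalize(feature, eps=0.5)) # endpoints stay strictly inside (-1, 1)
print(unit_norm([3.0, 4.0]))                # L2 length of the result is 1
```

With a positive `eps`, the smallest and largest values no longer map exactly onto the ends of the range, which is the property the Correction option relies on to keep inputs away from the asymptotes of the sigmoid or hyperbolic tangent.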