When does data need to be centered and normalized?

In regression problems, in some machine learning algorithms, and when training neural networks, it is often necessary to pre-process the raw data with zero-centering (mean subtraction) and standardization (normalization).
Purpose: after centering and normalization, each feature has a mean of 0 and a standard deviation of 1, the same mean and standard deviation as a standard normal distribution.
The calculation is given by the following equation: $x' = \frac{x - \mu}{\sigma}$
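As a minimal sketch of this calculation, the Python snippet below applies it to a single feature using only the standard library. It assumes that mu is the mean of the feature and sigma is its population standard deviation; the function name `standardize` and the `areas` values are made up for illustration.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: subtract the mean, then divide by the standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values, mu)  # population standard deviation (assumption)
    if sigma == 0:
        raise ValueError("feature is constant; standard deviation is zero")
    return [(x - mu) / sigma for x in values]


# Hypothetical house areas, for illustration only.
areas = [50.0, 80.0, 100.0, 120.0, 150.0]
z = standardize(areas)

# After standardization the mean is (numerically) 0 and the standard deviation 1.
print(fmean(z), pstdev(z))
```

In practice the mean and standard deviation are usually computed on the training data and then reused unchanged for validation and test data.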
The following explains why these pre-processing steps are necessary.
In some practical problems, the sample data are multi-dimensional, i.e., each sample is described by multiple features. For example, in the problem of predicting house prices, the factors that affect the house price $y$ include the house area $x_{1}$, the number of bedrooms $x_{2}$, and so on. The samples are then points such as $(x_{1}, x_{2})$, where $x_{1}$ and $x_{2}$ are called features. Clearly, these features differ in scale and in range of values. If we use the raw values directly when predicting the house price, the features influence the prediction to different degrees simply because of their scales; standardization puts the features on the same scale. Then, when we use gradient descent to learn the parameters, the different features have a comparable degree of influence on the parameter updates.
In short, when the scales (units) of the features in different dimensions of the raw data are inconsistent, a standardization step is needed to pre-process the data.
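To make the gradient descent argument concrete, here is a rough sketch that fits the house-price model once on raw features and once on standardized features, with the same learning rate and the same number of steps. Everything in it is an assumption for illustration: the data are invented, the model is plain linear regression, and the learning rate (0.1) and step count (30) are arbitrary choices. With these settings the raw-feature run diverges because the area feature is so much larger than the others, while the standardized run reduces the loss.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Z-score one feature column."""
    mu = fmean(column)
    sigma = pstdev(column, mu)
    return [(x - mu) / sigma for x in column]


def gradient_descent(x1, x2, y, lr, steps):
    """Fit y ~ w1*x1 + w2*x2 + b by batch gradient descent.

    Returns the final loss, defined as half the mean squared error.
    """
    w1 = w2 = b = 0.0
    n = len(y)
    for _ in range(steps):
        errors = [w1 * a + w2 * c + b - t for a, c, t in zip(x1, x2, y)]
        g1 = sum(e * a for e, a in zip(errors, x1)) / n
        g2 = sum(e * c for e, c in zip(errors, x2)) / n
        gb = sum(errors) / n
        w1 -= lr * g1
        w2 -= lr * g2
        b -= lr * gb
    errors = [w1 * a + w2 * c + b - t for a, c, t in zip(x1, x2, y)]
    return sum(e * e for e in errors) / (2 * n)


# Hypothetical data: area, number of bedrooms, and price.
areas = [50.0, 80.0, 100.0, 120.0, 150.0]
bedrooms = [1.0, 2.0, 3.0, 3.0, 4.0]
prices = [150.0, 240.0, 310.0, 360.0, 450.0]

# Loss at the starting point (all parameters zero), for reference.
initial_loss = sum(p * p for p in prices) / (2 * len(prices))

loss_raw = gradient_descent(areas, bedrooms, prices, lr=0.1, steps=30)
loss_std = gradient_descent(standardize(areas), standardize(bedrooms), prices,
                            lr=0.1, steps=30)

print("initial loss:      ", initial_loss)
print("raw features:      ", loss_raw)
print("standardized:      ", loss_std)
```

A much smaller learning rate would keep the raw-feature run stable, but it would then make slow progress along the small-scale directions; this is the practical cost of inconsistent feature scales.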
The following figure uses two-dimensional data as an example: the left panel shows the original data; the middle panel shows the centered data, which have been shifted so that they surround the origin; the right panel divides the zero-centered data by the standard deviation to obtain the normalized data. The scale is then consistent in each dimension (the length of the red line segment indicates the scale).
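The three stages described for the figure can also be checked numerically. The sketch below is only an illustration on invented two-dimensional points (not the data behind the figure): it prints the per-dimension mean and standard deviation for the original, centered, and normalized data.

```python
from statistics import fmean, pstdev

# Hypothetical 2-D samples (x1, x2) with very different scales.
points = [(50.0, 1.0), (80.0, 2.0), (100.0, 3.0), (120.0, 3.0), (150.0, 4.0)]


def column_stats(data):
    """Return (mean, population standard deviation) for each dimension."""
    return [(fmean(col), pstdev(col)) for col in zip(*data)]


# Step 1 (middle panel): subtract the per-dimension mean.
means = [m for m, _ in column_stats(points)]
centered = [tuple(v - m for v, m in zip(p, means)) for p in points]

# Step 2 (right panel): divide the centered data by the per-dimension
# standard deviation.
stds = [s for _, s in column_stats(centered)]
normalized = [tuple(v / s for v, s in zip(p, stds)) for p in centered]

for name, data in [("original", points),
                   ("centered", centered),
                   ("normalized", normalized)]:
    print(name, column_stats(data))
```

Centering changes only the means; the spread in each dimension stays the same until the division by the standard deviation.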

In fact, centering and normalization serve different purposes in different problems.
For example, when training neural networks, normalizing the data can accelerate the convergence of the weight parameters.
In addition, principal component analysis (PCA) also requires pre-processing steps such as centering and normalization of the data.
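For the PCA case, a small sketch can show why the scale of the features matters. It uses the closed-form direction of the first principal component for two-dimensional data (the principal axis of the 2x2 covariance matrix); the helper names and the data are hypothetical. On the raw data the first component points almost entirely along the large-scale feature, while on standardized data both features contribute equally.

```python
import math
from statistics import fmean, pstdev


def standardize(column):
    """Z-score one feature column."""
    mu = fmean(column)
    sigma = pstdev(column, mu)
    return [(x - mu) / sigma for x in column]


def first_pc_direction(x, y):
    """Unit vector of the first principal component of 2-D data.

    The data are centered here first, since PCA works on the covariance
    of mean-subtracted data.
    """
    mx, my = fmean(x), fmean(y)
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    n = len(x)
    sxx = sum(a * a for a in dx) / n
    syy = sum(b * b for b in dy) / n
    sxy = sum(a * b for a, b in zip(dx, dy)) / n
    # Angle of the principal axis of the covariance matrix [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return math.cos(theta), math.sin(theta)


# Hypothetical data: area (large scale) and number of bedrooms (small scale).
areas = [50.0, 80.0, 100.0, 120.0, 150.0]
bedrooms = [1.0, 2.0, 3.0, 3.0, 4.0]

pc_raw = first_pc_direction(areas, bedrooms)
pc_std = first_pc_direction(standardize(areas), standardize(bedrooms))

print("raw data:         ", pc_raw)
print("standardized data:", pc_std)
```

Whether dividing by the standard deviation is appropriate for PCA depends on the problem; it is most useful when the features have different units, as in this example.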