What is meant by data affected by noise?

 The term noise refers to a random perturbation within the values of a numerical attribute, usually resulting in noticeable anomalies.

First, the outliers in a dataset need to be identified, so that subsequently they can either be corrected and regularized or the entire records containing them can be eliminated.

In this section, we will describe a few simple techniques for identifying and regularizing data affected by noise, while in Chapter 7 we will describe in greater detail the tools from exploratory data analysis used to detect outliers.

The easiest way to identify outliers is based on the statistical concept of dispersion. The sample mean μ̄j and the sample variance σ̄²j of the numerical attribute aj are calculated. If the attribute follows a distribution that is not too far from normal, the values falling outside an appropriate interval centered around the mean value μ̄j are identified as outliers, by virtue of the central limit theorem. More precisely, with a confidence of 100(1 − α)% (95% for α = 0.05), it is possible to consider as outliers those values that fall outside the interval

(μ̄j − zα/2 σ̄j , μ̄j + zα/2 σ̄j),

where zα/2 is the upper α/2 quantile of the standard normal distribution (z0.025 ≈ 1.96). This technique is simple to use, although it has the drawback of relying on the critical assumption that the distribution of the values of the attribute is bell-shaped and roughly normal. However, by applying Chebyshev's theorem, described in Chapter 7, it is possible to obtain analogous bounds independent of the distribution, with intervals that are only slightly less stringent. Once the outliers have been identified, it is possible to correct them with values that are deemed more plausible or to remove the entire records containing them.
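The dispersion-based test above can be sketched in a few lines of Python. This is a minimal illustration, not code from the book: the function name and the example data are invented, and the critical value z is hard-coded for two common choices of α to avoid a SciPy dependency.

```python
import numpy as np

def dispersion_outliers(values, alpha=0.05):
    """Flag values falling outside (mean - z*std, mean + z*std),
    where z is the standard-normal critical value for alpha."""
    values = np.asarray(values, dtype=float)
    mu = values.mean()
    sigma = values.std(ddof=1)              # sample standard deviation
    # Upper alpha/2 critical value of the standard normal distribution,
    # hard-coded for common alphas (assumption: no SciPy available).
    z = {0.05: 1.96, 0.01: 2.576}[alpha]
    lower, upper = mu - z * sigma, mu + z * sigma
    return (values < lower) | (values > upper)

data = [10, 11, 9, 10, 12, 11, 10, 9, 11, 100]   # 100 is anomalous
print(dispersion_outliers(data))
```

Only the last value is flagged: the remaining observations lie well inside the interval centered on the sample mean, while 100 falls far beyond its upper bound.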

An alternative technique, illustrated in Figure 6.1, is based on the distance between observations and the use of clustering methods. Once the clusters have been identified, representing sets of records having a mutual distance that is less than the distance from the records included in other groups, the observations that are not placed in any of the clusters are identified as outliers. Clustering techniques offer the advantage of simultaneously considering several attributes, while methods based on dispersion can only take into account each attribute separately.
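As a sketch of the clustering-based approach, the example below uses DBSCAN from scikit-learn, which directly labels observations assigned to no cluster as noise (label −1). The choice of DBSCAN and the toy two-dimensional data are mine, not the book's; the chapter describes the general idea rather than a specific algorithm.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two compact clusters plus one isolated point (assumed toy data).
points = np.array([
    [1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.1],
    [9.0, 0.0],                     # far from both clusters
])

# DBSCAN groups points that are mutually close (within eps) and
# labels points assigned to no cluster as -1, i.e. as outliers.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
outliers = points[labels == -1]
print(outliers)
```

Note that both attributes (coordinates) are considered jointly: the isolated point is detected because of its distance from every cluster, not because either coordinate alone is extreme.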


A variant of clustering methods, also based on the distances between the observations, detects the outliers by means of two parameters, p and d, assigned by the user. An observation xi is identified as an outlier if at least a percentage p of the observations in the dataset is found at a distance greater than d from xi.
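This (p, d) rule translates directly into code. The sketch below uses Euclidean distance and a brute-force pairwise scan; the function name and the sample data are illustrative assumptions.

```python
import numpy as np

def distance_outliers(X, p, d):
    """Flag x_i as an outlier if at least a fraction p of the other
    observations lies at a distance greater than d from x_i."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    flags = np.zeros(n, dtype=bool)
    for i in range(n):
        # Euclidean distances from x_i to all observations.
        dist = np.linalg.norm(X - X[i], axis=1)
        # Fraction of the *other* points farther than d (x_i itself
        # contributes distance 0, so it is never counted as "far").
        far = np.sum(dist > d) / (n - 1)
        flags[i] = far >= p
    return flags

X = [[0, 0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [8, 8]]
print(distance_outliers(X, p=0.9, d=2.0))
```

With p = 0.9 and d = 2.0, only the point (8, 8) is flagged: all four of the other observations lie farther than 2.0 from it, whereas each of the remaining points has only one distant neighbor.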

The above techniques can be combined with the opinion of experts in order to identify actual outliers with respect to regular observations, even though these fall outside the intervals where regular records are expected to lie. In marketing applications, in particular, it is appropriate to consult with experts before adopting corrective measures in the case of anomalous observations.


Unlike the above methods, aimed at identifying and correcting each single anomaly, there also exist regularization techniques that automatically correct anomalous data. For example, simple or multiple regression models predict the value of the attribute aj that one wishes to regularize based on other variables existing in the dataset.

Once the regression model has been developed, and the corresponding confidence interval around the prediction curve has been calculated, it is possible to substitute the value computed along the prediction curve for the values of the attribute aj that fall outside the interval.
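A minimal sketch of this regularization, using simple linear regression via NumPy's polyfit: in place of the full confidence interval around the prediction curve described above, it uses a simpler band of ±k residual standard deviations around the fitted line, which is my simplification, not the book's procedure.

```python
import numpy as np

def regression_regularize(x, y, k=2.0):
    """Fit y on x with simple linear regression and replace values of y
    falling outside a band of +/- k residual standard deviations around
    the fitted line with the fitted (predicted) values themselves."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    y_hat = slope * x + intercept           # values on the prediction line
    resid = y - y_hat
    s = resid.std(ddof=2)                   # residual standard deviation
    # Substitute the predicted value wherever the residual is too large.
    return np.where(np.abs(resid) > k * s, y_hat, y)

x = np.arange(10)
y = 3.0 * x + 1.0
y[5] = 60.0                                  # inject an anomalous value
print(regression_regularize(x, y))
```

The anomalous observation is pulled back onto the prediction line, while all regular values pass through unchanged.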

A further automatic regularization technique, described in Section 6.3.4, relies on data discretization and grouping based on the proximity of the values of the attribute aj.
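As an illustration of smoothing by discretization and grouping, the sketch below implements classic equal-frequency binning with replacement by bin means. The function name, bin count, and data are assumptions for the example; the precise technique is the subject of Section 6.3.4.

```python
import numpy as np

def smooth_by_bin_means(values, n_bins=3):
    """Split the sorted values into equal-frequency bins and replace
    each value with the mean of its bin (smoothing by bin means)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)               # group values by proximity
    smoothed = np.empty_like(values)
    for chunk in np.array_split(order, n_bins):
        smoothed[chunk] = values[chunk].mean()
    return smoothed

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(data, n_bins=3))
```

Each value is drawn toward the local level of its neighbors, so isolated spikes are dampened without any value being discarded.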


