What is an Incomplete data ?

To partially correct incomplete data one may adopt several techniques.

Elimination. It is possible to discard all records for which the values of one or more attributes are missing. In the case of a supervised data mining analysis, it is essential to eliminate a record if the value of the target attribute is missing. A policy based on systematic elimination of records may be ineffective when the distribution of missing values varies in an irregular way across the different attributes, since one may run the risk of incurring a substantial loss of information.


Inspection. Alternatively, one may opt for an inspection of each missing value, carried out by experts in the application domain, in order to obtain recommendations on possible substitute values. Obviously, this approach suffers from a high degree of arbitrariness and subjectivity and is rather burdensome and time-consuming for large datasets. On the other hand, experience indicates that it is one of the most accurate corrective actions if skillfully exercised.


Identification. As a third possibility, a conventional value might be used to encode and identify missing values, making it unnecessary to remove entire records from the given dataset. For example, for a continuous attribute that assumes only positive values, it is possible to assign the value {−1} to all missing data. By the same token, for a categorical attribute, one might replace missing values with a new value that differs from all those assumed by the attribute.


Substitution. Several criteria exist for the automatic replacement of missing data, although most of them appear somehow arbitrary.

For instance, missing values of an attribute may be replaced with the mean of the attribute calculated for the remaining observations.

This technique can only be applied to numerical attributes, but it will clearly be ineffective in the case of an asymmetric distribution of values. In a supervised analysis, it is also possible to replace missing values by calculating the mean of the attribute only for those records having the same target class. Finally, the maximum likelihood value, estimated using regression models or Bayesian methods, can be used as a replacement for missing values.

However, estimate procedures can become rather complex and time-consuming for a large dataset with a high percentage of missing data.



0 Comments

Follow Me On Instagram