What is data reduction?

When dealing with a small dataset, the transformations described above are usually adequate to prepare input data for a data mining analysis.

However, when facing a large dataset it is also appropriate to reduce its size, in order to make learning algorithms more efficient, without sacrificing the quality of the results obtained.

There are three main criteria to determine whether a data reduction technique should be used: efficiency, accuracy, and simplicity of the models generated.


Efficiency. The application of learning algorithms to a dataset smaller than the original one usually means a shorter computation time.

If the complexity of the algorithm is a superlinear function of the dataset size, as is the case for most known methods, the improvement in efficiency resulting from a reduction in the dataset size may be dramatic. As described in Chapter 5, within the data mining process it is customary to run several alternative learning algorithms in order to identify the most accurate model.

Therefore, a reduction in processing times allows the analyses to be carried out more quickly.
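
To make the efficiency argument concrete, here is a minimal sketch. It assumes, purely for illustration, that running time grows as n raised to a fixed exponent; the function name relative_cost and the chosen exponents are hypothetical and not taken from any specific algorithm.

```python
def relative_cost(fraction_kept: float, exponent: float) -> float:
    """Running time on the reduced dataset divided by running time on the
    full dataset, ASSUMING the cost grows as n ** exponent (illustrative)."""
    return fraction_kept ** exponent


# Keep half of the observations and compare linear, quadratic and cubic costs.
for exponent in (1.0, 2.0, 3.0):
    print(f"exponent {exponent}: relative cost {relative_cost(0.5, exponent)}")
```

Under this assumption, halving the dataset halves the cost of a linear algorithm but cuts the cost of a quadratic one to a quarter, which is why the gain can be dramatic for superlinear methods.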


Accuracy. In most applications, the accuracy of the models generated represents a critical success factor, and it is, therefore, the main criterion followed in order to select one class of learning methods over another. 

As a consequence, data reduction techniques should not significantly compromise the accuracy of the model generated. As shown below, it may also be the case that some data reduction techniques, based on attribute selection, will lead to models with a higher generalization capability on future records.


Simplicity. In some data mining applications, concerned more with interpretation than with prediction, it is important that the models generated be easily translated into simple rules that can be understood by experts in the application domain. 

As a trade-off for achieving simpler rules, decision-makers are sometimes willing to allow a slight decrease in accuracy. Data reduction often represents an effective technique for deriving models that are more easily interpretable.

Since it is difficult to develop a data reduction technique that represents the optimal solution for all the criteria described, the analyst will aim for a suitable trade-off among all the requirements outlined.

Data reduction can be pursued in three distinct directions, described below: a reduction in the number of observations through sampling, a reduction in the number of attributes through selection and projection, and a reduction in the number of values through discretization and aggregation.
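
The three directions can be sketched on a toy dataset as follows. This is only an illustration under stated assumptions: the records (age, income, a constant flag) are made up, simple random sampling stands in for sampling in general, dropping constant columns is a deliberately crude form of attribute selection, and equal-width binning is just one possible discretization. The helper names informative_columns and discretize are hypothetical.

```python
import random

# Hypothetical toy dataset: each record is (age, income, constant_flag).
random.seed(0)
records = [
    (random.randint(18, 80), random.randint(20_000, 120_000), 1)
    for _ in range(1000)
]

# 1. Fewer observations: simple random sampling without replacement.
sample = random.sample(records, k=100)


# 2. Fewer attributes: keep only columns whose value actually varies
#    (a crude stand-in for attribute selection).
def informative_columns(rows):
    n_cols = len(rows[0])
    return [j for j in range(n_cols) if len({row[j] for row in rows}) > 1]


kept = informative_columns(sample)
selected = [tuple(row[j] for j in kept) for row in sample]


# 3. Fewer values: map a numeric value to one of n_bins equal-width bins.
def discretize(value, low, high, n_bins):
    width = (high - low) / n_bins
    # Clamp so that value == high falls into the last bin.
    return min(int((value - low) / width), n_bins - 1)


binned = [(discretize(age, 18, 80, 4), income) for age, income in selected]

print(len(records), "->", len(sample), "records")
print(len(records[0]), "->", len(selected[0]), "attributes")
print("distinct age values:", len({age for age, _ in selected}),
      "->", len({b for b, _ in binned}))
```

Each step reduces a different dimension of the data: the number of rows, the number of columns, and the number of distinct values within a column.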




