We will use the Numbers data set, which contains images of handwritten digits. Recognizing handwritten digits is already a mature technology, so the task of this project is different: extract features and cluster the images into homogeneous groups. The groups need not correspond to the digits themselves; they may also reflect the way a digit is written. Each image has 28x28 pixels with 256 gray levels (8 bits per pixel).
Follow the CRISP-DM framework.
A) Data Preparation
i) Describe several ways you could preprocess the data and extract features. Explain why these steps might be helpful.
ii) Construct at least 3 additional features (more is better!).
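The kind of feature construction asked for above could look like the following minimal sketch. It assumes each image is available as 28 rows of 28 integers in 0..255; the function name `extract_features`, the `threshold` parameter, and the four features are illustrative choices, not part of the assignment. Plain Python is used so that the sketch runs without extra libraries.

```python
def extract_features(image, threshold=128):
    """Compute a few hand-crafted features for one grayscale image.

    Assumption: `image` is a list of rows, each a list of ints in 0..255.
    """
    rows, cols = len(image), len(image[0])
    total = sum(sum(row) for row in image)

    # Ink density: mean gray value scaled to [0, 1].
    density = total / (255 * rows * cols)

    # Intensity-weighted center of mass (image center if the image is blank).
    if total > 0:
        center_row = sum(r * sum(image[r]) for r in range(rows)) / total
        center_col = sum(c * image[r][c]
                         for r in range(rows) for c in range(cols)) / total
    else:
        center_row, center_col = (rows - 1) / 2, (cols - 1) / 2

    # Bounding box of all pixels at or above the threshold.
    ink = [(r, c) for r in range(rows) for c in range(cols)
           if image[r][c] >= threshold]
    if ink:
        height = max(r for r, _ in ink) - min(r for r, _ in ink) + 1
        width = max(c for _, c in ink) - min(c for _, c in ink) + 1
    else:
        height = width = 0
    aspect = width / height if height else 0.0

    # Left-right symmetry: 1.0 means the image equals its mirror image.
    diff = sum(abs(image[r][c] - image[r][cols - 1 - c])
               for r in range(rows) for c in range(cols))
    symmetry = 1.0 - diff / (255 * rows * cols)

    return {
        "density": density,
        "center_row": center_row,
        "center_col": center_col,
        "aspect": aspect,
        "symmetry": symmetry,
    }


# Synthetic example: a vertical bar in column 14, rows 4 to 23.
bar = [[255 if c == 14 and 4 <= r <= 23 else 0 for c in range(28)]
       for r in range(28)]
features = extract_features(bar)
```

Features like these are far lower-dimensional than the 784 raw pixels and are less sensitive to small shifts of the digit, which is one reason they might help distance-based clustering.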
B) Modeling
i) Perform cluster analysis using several methods (at least k-means and hierarchical clustering) for different features.
ii) How did you determine a suitable number of clusters for each method?
iii) Use internal validation measures to describe and compare the clusterings and the clusters (visual methods are encouraged).
iv) Use external validation measures to describe the clusterings and the clusters. You can find the actual digits in the images in the file number_labels.csv.
C) Evaluation
i) Describe your results. What findings are the most interesting?
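The clustering and validation steps listed above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not a reference solution: the 2-D `points` stand in for real feature vectors, `truth` stands in for the labels from number_labels.csv, and the helpers `kmeans`, `silhouette`, and `purity` are hypothetical names written out in plain Python. Hierarchical clustering is not shown. The mean silhouette width serves as the internal measure (and as one possible way to pick the number of clusters), and purity serves as the external measure.

```python
import math
import random
from collections import Counter


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct data points as initial centers
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest center by Euclidean distance.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = [sum(x) / len(members) for x in zip(*members)]
    return labels


def silhouette(points, labels):
    """Mean silhouette width over all points (internal validation)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return 0.0
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def purity(labels, truth):
    """Share of points carrying the majority true label of their cluster."""
    by_cluster = {}
    for l, t in zip(labels, truth):
        by_cluster.setdefault(l, Counter())[t] += 1
    return sum(c.most_common(1)[0][1] for c in by_cluster.values()) / len(labels)


# Toy data: two well-separated groups (stand-ins for feature vectors).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
truth = [0, 0, 0, 0, 1, 1, 1, 1]

# Choose k by the largest mean silhouette width, then check against the labels.
scores = {k: silhouette(points, kmeans(points, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
external = purity(kmeans(points, best_k), truth)
```

On real data a library such as scikit-learn or SciPy would normally replace these hand-written helpers, and purity is only one of several external measures; it tends to rise as the number of clusters grows, so it should be read alongside the internal measures.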