Intelligent Data Analysis
SUSTech Summer Semester
Course Material and Useful Links
Peter Tino
P.Tino@cs.bham.ac.uk


Assignment 📝

Last updated: 2017/06/30 (Marked with green background)

You can choose any data set(s) from the list below, but finding your own data set to investigate is strongly preferred (and you'll be rewarded)!
If you decide to go the easier route and use some of the data sets provided, un-tar the relevant file and go to the corresponding folder.
The folder contains the data set, as well as additional information about the data. Read the available information, especially the description of the features (data dimensions).
You will need to clean the data so that it contains only numerical features (dimensions) and the features are space-separated (not comma-separated).
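As a rough illustration of the cleaning step, here is a minimal Python sketch. It assumes a small comma-separated input in which every row has the same number of fields; the sample values and the helper names (`is_number`, `clean_rows`, `to_space_separated`) are invented for this example and are not part of the provided data sets. It keeps only the columns that are numeric in every row and writes them out space-separated.

```python
import csv
import io


def is_number(text):
    """Return True if the text can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def clean_rows(rows):
    """Keep only the columns that are numeric in every row."""
    rows = [r for r in rows if r]  # skip blank lines
    n_cols = len(rows[0])          # assumes all rows have equal length
    keep = [j for j in range(n_cols) if all(is_number(r[j]) for r in rows)]
    return [[r[j].strip() for j in keep] for r in rows]


def to_space_separated(rows):
    """Join the fields of each row with single spaces, one row per line."""
    return "\n".join(" ".join(r) for r in rows)


# Hypothetical comma-separated input: age, income, class name.
raw = "23,1500.5,low\n47,3200,high\n35,2100.0,low\n"
rows = list(csv.reader(io.StringIO(raw)))
cleaned = clean_rows(rows)
print(to_space_separated(cleaned))
```

Non-numeric columns dropped this way (such as a class name) need not be thrown away: they can be saved separately and used as labels for the plots.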

To make the plots informative, you should come up with a labelling scheme for data points.
If the data can be classified into several classes (find out in the data and feature description!), use that information as the basis for your labelling scheme. In that case exclude the class information from the data dimensions.
Alternatively, you can make labels out of any dimension, e.g. by quantising it into several intervals. For example, if the data dimension represents age of a person, you can quantise it into 5 labels (classes) [child, teenager, young adult, middle age, old].
Associate the data labels with different markers and use the markers to show what kind of data points get projected to different regions of the visualization plot (computer screen).
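The quantisation idea can be sketched in a few lines of Python. The cut points in `AGE_EDGES` and the marker symbols in `MARKERS` are assumptions chosen purely for illustration (the symbols follow the single-character style that many plotting tools accept); pick intervals that make sense for your own data.

```python
from bisect import bisect_right

# Assumed interval boundaries in years; an age equal to a boundary
# falls into the interval that starts at that boundary.
AGE_EDGES = [13, 20, 36, 60]
AGE_LABELS = ["child", "teenager", "young adult", "middle age", "old"]

# Assumed marker symbol for each label, to be passed to a plotting tool.
MARKERS = {
    "child": "o",
    "teenager": "s",
    "young adult": "^",
    "middle age": "x",
    "old": "+",
}


def age_label(age):
    """Map a numerical age to one of the five labels."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


ages = [5, 13, 19, 25, 45, 70]
labels = [age_label(a) for a in ages]
markers = [MARKERS[label] for label in labels]

for age, label, marker in zip(ages, labels, markers):
    print(age, label, marker)
```

Each projected data point would then be drawn with the marker of its label, so that regions of the plot dominated by one label become visible.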

Report

In the report describe:

You should demonstrate that you

Please Note

You will be marked solely on how well you applied the techniques presented in the course to your data. If your data cannot be reasonably explained in two dimensions, or it does not have a clear cluster structure, that is fine, as long as you can clearly explain why you think this is the case through proper use and analysis of PCA, SOM and clustering.
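One common way to argue whether a two-dimensional view is reasonable is to look at how much of the total variance the first two principal components capture. The sketch below assumes NumPy is available and uses synthetic data generated only for illustration; the function name `explained_variance_ratio` is our own, and this is one possible check, not the required method.

```python
import numpy as np


def explained_variance_ratio(X):
    """Fraction of total variance carried by each principal component."""
    Xc = X - X.mean(axis=0)                  # centre each feature
    cov = np.cov(Xc, rowvar=False)           # feature covariance matrix
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigenvalues, largest first
    eigvals = np.clip(eigvals, 0.0, None)    # guard against round-off
    return eigvals / eigvals.sum()


rng = np.random.default_rng(0)

# Synthetic data: 200 points with 5 features whose variance lies
# mostly in two latent directions, plus a little noise.
latent = rng.normal(size=(200, 2)) * [5.0, 3.0]
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

ratio = explained_variance_ratio(X)
print("variance ratio per component:", np.round(ratio, 4))
print("captured by first two components:", ratio[:2].sum())
```

If the first two components capture only a modest share of the variance, that is itself a finding worth reporting: it suggests that a two-dimensional PCA plot hides a substantial part of the structure in the data.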

Data sets

Before You Start ...

Before starting to work on the assignment, please carefully study the example I prepared using the boston database. Un-tar the file boston.ex.tar.gz and go to the folder "BOSTON.EX".
The subfolder "FIGURES" contains all the relevant figures as eps or gif files.
Please consult the "boston.read.me" file in BOSTON.EX.

In the labelling scheme, concentrate on more than one coordinate (dimension). For example, in the "boston example", consider not just the price feature, but also run separate experiments that use the per capita crime rate in the town, or the pupil-teacher ratio in the town, instead of the price coordinate.

For examples of nice past reports developed on the wine dataset (do not use this data in your report!) using just PCA, please see the reports by Christoph Stich and Josephf Preece. Many thanks, Chris and Joe!