Portfolio - Classification and partitioning

This coursework accounts for 40% of the total mark for the portfolio.

In addition to the combined marks for each of the portfolio tasks, you will also be graded on the structure, presentation and clarity of the portfolio as a whole. So your work should be professionally presented, with good use of English.

In the real world, you will be expected to communicate the results from any analysis you perform to non-specialists, so you should conclude each task with a brief explanation of your results, presented in terms a lay person would understand.

Task 1

This task uses the well-known Iris data set.

The data were first collected by the American botanist Edgar Anderson, but became a popular data set for exploring multivariate statistical methods after Ronald Fisher used them to illustrate discriminant analysis in 1936. This version is from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Iris

The data consist of four different measurements taken from 50 irises of each of three different species. The original data set does not include any identification label for the observations, but I have added one. You may find it useful when assessing your results, but do not include it as a variable in any analysis.

For some of the tasks, you will need to separate the data into training and testing data sets. As the data is ordered, you will need to use some method of randomisation or randomised sampling, which you should do using the appropriate software.
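As a rough illustration of randomised sampling (the coursework itself expects the sampling functions of R and RapidMiner), the Python sketch below shuffles the ordered observations with a fixed seed before splitting them. The function name `train_test_split`, the 70/30 split and the seed are assumptions made for this example, and the rows are stand-ins that carry only an identification label and a species.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows with a fixed seed and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for the 150 ordered observations: (id, species) pairs only.
species = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
rows = list(enumerate(species, start=1))

train, test = train_test_split(rows)
print(len(train), len(test))  # 105 45
```

Fixing the seed makes the split reproducible, which helps when the same training and testing sets are reused in a second package.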

You should employ the sampling functions of the data mining software you use. For consistency, and to assess the relative strengths of the software and algorithms used, you may use the sets from one package in another. But I want to see evidence that you are using as much of the relevant functionality in your software as possible.

In each case, consider whether the strength of your models can be improved by restricting the variables used.

Compare the R and RapidMiner results, giving an account of their similarities and differences, and assessing their relative strengths and weaknesses.

a) Perform suitable exploratory analyses to examine the data, in particular how the values of the variables change with the species.
Use your results to decide whether you need to standardise the data in any way for the models you will build.
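One common form of standardisation is the z-score, which rescales a variable to mean 0 and standard deviation 1 so that variables measured on different scales contribute comparably to distance-based models. The Python sketch below is illustrative only: the function name `standardise` and the three values are made up for the example and are not taken from the Iris data.

```python
from statistics import mean, stdev

def standardise(column):
    """Return z-scores: subtract the mean, divide by the sample standard deviation."""
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]

# Made-up measurements, for illustration only.
measurements = [5.0, 6.0, 7.0]
z = standardise(measurements)
print(z)  # [-1.0, 0.0, 1.0]
```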

b) Use the k-NN algorithm to produce an assignment model for the data, using R and RapidMiner. In both cases, check the accuracy of the predictions, and use appropriate methods to try to improve it if necessary.
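To make the k-NN idea concrete, here is a minimal Python sketch: each test observation is assigned the majority label among its k nearest training observations (Euclidean distance), and accuracy is the proportion of correct assignments. The function names, the choice k = 3 and the toy points are assumptions for illustration; they do not reproduce the behaviour of any particular R or RapidMiner operator.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """train: list of (features, label). Return the majority label of the k nearest rows."""
    nearest = sorted(train, key=lambda row: dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, test, k=3):
    """Proportion of test rows whose predicted label matches the true label."""
    hits = sum(knn_predict(train, features, k) == label for features, label in test)
    return hits / len(test)

# Made-up two-variable observations in two classes, for illustration only.
train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((0.9, 1.1), "A"),
         ((3.0, 3.0), "B"), ((3.1, 2.9), "B"), ((2.9, 3.2), "B")]
test = [((1.1, 1.0), "A"), ((3.0, 3.1), "B")]

print(accuracy(train, test, k=3))
```

Re-running `accuracy` over a range of k values, or on standardised variables, is one simple way to look for improvement.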

c) Perform a k-means cluster analysis on the data. Explain your choice of value for k and assess the strength of your results in terms of accuracy of partitioning. Can you learn anything from changing the value of k?
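The effect of changing k can be explored by comparing the total within-cluster sum of squares (WSS) across values of k. The Python sketch below implements the basic k-means iteration (assign each point to its nearest centre, then recompute the centres) under simple assumptions: random starting centres drawn from the data with a fixed seed, Euclidean distance, and made-up points. The names `assign` and `kmeans` are hypothetical and are not the R or RapidMiner implementations.

```python
import random
from math import dist

def assign(points, centres):
    """Index of the nearest centre for each point."""
    return [min(range(len(centres)), key=lambda j: dist(p, centres[j])) for p in points]

def kmeans(points, k, seed=0, iterations=100):
    """Return (labels, within-cluster sum of squares) for a basic k-means run."""
    centres = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        labels = assign(points, centres)
        new_centres = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centre if a cluster happens to be empty.
            new_centres.append(
                tuple(sum(c) / len(members) for c in zip(*members)) if members else centres[j])
        if new_centres == centres:
            break
        centres = new_centres
    labels = assign(points, centres)
    wss = sum(dist(p, centres[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, wss

# Made-up points forming two obvious groups, for illustration only.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
for k in (1, 2, 3):
    labels, wss = kmeans(points, k)
    print(k, round(wss, 3))
```

A sharp drop in WSS followed by only small further gains is the usual informal signal for a reasonable k. Cross-tabulating the cluster labels against the known classes then gives the accuracy of partitioning.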

Use hierarchical cluster analysis to justify (or otherwise) your value for k.
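One way a hierarchical analysis can support a choice of k is through the heights at which clusters merge: a large jump in merge height suggests that the clusters present just before that merge are well separated. The Python sketch below performs a naive agglomerative clustering with complete linkage on made-up points. The choice of linkage and the function names are assumptions for illustration, and a real analysis would normally read the same information from a dendrogram.

```python
from itertools import combinations
from math import dist

def complete_linkage(a, b):
    """Distance between two clusters: the largest pairwise point distance."""
    return max(dist(p, q) for p in a for q in b)

def merge_heights(points):
    """Repeatedly merge the two closest clusters; return the merge heights in order."""
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: complete_linkage(clusters[ij[0]], clusters[ij[1]]))
        heights.append(complete_linkage(clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return heights

# Made-up points forming two obvious groups, for illustration only.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
heights = merge_heights(points)
print([round(h, 3) for h in heights])
```

In this toy example the final merge is far higher than the others, which is consistent with two clusters.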

d) Build a decision tree for the data using RapidMiner and R. Use appropriate methods to refine the tree to try to achieve maximum leaf purity based on the outcome variable species.
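Leaf purity can be quantified in several ways; Gini impurity is one common measure (0 means every observation in the node belongs to one class). The Python sketch below computes it and searches a single variable for the threshold that gives the purest pair of child nodes. It is an assumption-laden illustration: the function names and values are made up, and the tree learners in R and RapidMiner may use other splitting criteria and pruning rules.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted Gini) for the best split of the form value <= threshold."""
    best = None
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (t, score)
    return best

# Made-up values of one variable and their classes, for illustration only.
values = [1.0, 1.5, 2.0, 4.0, 4.5, 5.0]
labels = ["A", "A", "A", "B", "B", "B"]

print(gini(labels))                # 0.5
print(best_split(values, labels))  # (2.0, 0.0)
```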

e) Use RapidMiner and R to produce a discriminant analysis of the data, with the goal of finding a set of discriminant equations which will best assign observations to their actual species.
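To show what a discriminant equation looks like, the Python sketch below fits a linear discriminant for a single variable, assuming normally distributed classes with a common (pooled) variance. Each class k gets a score delta_k(x) = x * mean_k / var - mean_k^2 / (2 * var) + log(prior_k), and an observation is assigned to the class with the largest score. The real task uses several variables, where var becomes a pooled covariance matrix; the function names and data here are made up for illustration.

```python
from math import log
from statistics import mean

def fit_lda_1d(values, labels):
    """Return {class: (slope, intercept)} so that delta_k(x) = slope * x + intercept."""
    classes = sorted(set(labels))
    groups = {k: [v for v, l in zip(values, labels) if l == k] for k in classes}
    means = {k: mean(g) for k, g in groups.items()}
    # Pooled within-class variance, shared by all classes.
    pooled = sum((v - means[k]) ** 2 for k, g in groups.items() for v in g) \
        / (len(values) - len(classes))
    return {k: (means[k] / pooled,
                -means[k] ** 2 / (2 * pooled) + log(len(groups[k]) / len(values)))
            for k in classes}

def classify(model, x):
    """Assign x to the class with the largest discriminant score."""
    return max(model, key=lambda k: model[k][0] * x + model[k][1])

# Made-up values of one variable and their classes, for illustration only.
values = [1.0, 1.5, 2.0, 4.0, 4.5, 5.0]
labels = ["A", "A", "A", "B", "B", "B"]

model = fit_lda_1d(values, labels)
print(classify(model, 2.9), classify(model, 3.1))  # A B
```

With equal priors and a shared variance the boundary falls midway between the two class means, here at 3.0.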

f) Give an overall summary of your results above to give a description of how the combination of classification techniques builds a picture of the data set.

Identify which methods, algorithms, software etc. do the best job of explaining the data, and in particular, if the results from one method helped you refine another.

Are there any observations which cause problems for the different methods?

Task 2

These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. (A cultivar is a grouping of plants which have similar, usually sought-after properties.) The analysis determined the quantities of 13 constituents found in each of the three types of wines.

The data is originally attributed to M. Forina, and may have been much larger. This version was donated to the UCI Machine Learning Repository by Stephan Aeberhard.
See http://archive.ics.uci.edu/ml/datasets/Wine

(A slightly reduced version is available within your R installation, but this is the most complete version I could find.)

Note that this is a larger and more complex data set than was used in Task 1, and is therefore more like the data typically encountered in practice.

a) Perform suitable exploratory analyses to examine the data, in particular how the values of the variables change with the three different cultivars.

Note that as you have 13 numeric variables in this data set, you may find that you can reduce the size of your models based on your EDA observations.
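A per-class summary is one simple exploratory step for spotting variables that separate the classes well, and therefore candidates to keep in a reduced model. The Python sketch below groups one variable by class and reports the mean and sample standard deviation of each group. The function name and values are made up for illustration and are not taken from the wine data.

```python
from collections import defaultdict
from statistics import mean, stdev

def summarise_by_group(values, groups):
    """Return {group: (mean, sample standard deviation)} for one variable."""
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    return {g: (mean(vs), stdev(vs)) for g, vs in by_group.items()}

# Made-up values of one variable for two classes, for illustration only.
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
groups = [1, 1, 1, 2, 2, 2]

summary = summarise_by_group(values, groups)
print(summary)  # {1: (2.0, 1.0), 2: (8.0, 1.0)}
```

Group means that differ by much more than the within-group spread, as in this toy case, point to a variable that discriminates well.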

b) Use the k-NN algorithm to produce an assignment model for the data, using R and RapidMiner. In both cases, check the accuracy of the predictions, and use appropriate methods to try to improve it if necessary.

c) Perform a k-means cluster analysis on the data. Explain your choice of value for k and assess the strength of your results in terms of accuracy of partitioning. Can you learn anything from changing the value of k?

Use hierarchical cluster analysis to justify (or otherwise) your value for k.

d) Build a decision tree for the data using RapidMiner and R.

Use appropriate methods to refine the tree to try to achieve maximum leaf purity based on the outcome variable cultivars.

e) Use RapidMiner and R to produce a discriminant analysis of the data, with the goal of finding a set of discriminant equations which will best assign observations to their actual cultivars.

f) In the above sections you built your models based on classifying wines according to the cultivar from which they were made.

One could quite reasonably explore some other way of classifying wines - alcohol content, for example.

Using the results of your exploratory data analysis, find a suitable method of classifying wines by their alcohol content and re-run your data mining modules to reflect this.
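One possible way to turn a continuous variable such as alcohol content into classes is to cut it at its quantiles, which gives groups of roughly equal size. The Python sketch below does this with three bands; the band names, the function name and the values are assumptions made for illustration, and your own exploratory analysis may suggest different cut points (for example, natural gaps in the distribution).

```python
from collections import Counter
from statistics import quantiles

def bin_by_quantiles(values, names=("low", "medium", "high")):
    """Label each value by the quantile band it falls in (one band per name)."""
    cuts = quantiles(values, n=len(names))  # len(names) - 1 cut points

    def label(v):
        for cut, name in zip(cuts, names):
            if v <= cut:
                return name
        return names[-1]

    return [label(v) for v in values]

# Made-up alcohol percentages, for illustration only.
alcohol = [11.0, 11.5, 12.0, 12.5, 13.0, 13.5, 14.0, 14.5, 15.0]
bands = bin_by_quantiles(alcohol)
print(Counter(bands))
```

The resulting band labels can then take the place of the cultivar as the outcome variable when the models are re-run.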

How do your results compare to the first set of models?

g) Give an overall summary of your results above to give a description of how the combination of classification techniques builds a picture of the data set.

Identify which methods, algorithms, software etc. do the best job of explaining the data.

Are there any observations which cause problems for the different methods?

Attachment:- Data.rar
