
Part 1: K-means Clustering

K-means clustering is an iterative, unsupervised method for partitioning data into k groups. The algorithm can be described as follows:

0) Determine how many (k) clusters you will search for.

1) Randomly assign points in your data to each of the clusters.

2) Once all values have been assigned to a cluster, calculate the mean (centroid) of the values in each cluster.

3) Reassign values to clusters by associating each value in the data set with the nearest mean (in Euclidean distance).

4) Repeat steps 2 and 3 until convergence. Convergence occurs when no values are reassigned to a new cluster.
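The steps above can be sketched in R roughly as follows. Here 'dat' is the data matrix and 'groups' holds the initial cluster assignments; these names are assumptions, not the provided starter code, so treat this as a sketch rather than a reference solution:

```r
k <- 3  # step 0: number of clusters to search for
repeat {
  # step 2: mean (centroid) of each cluster, as a k x 2 matrix
  mus <- t(sapply(1:k, function(g) colMeans(dat[groups == g, , drop = FALSE])))
  # step 3: squared Euclidean distance from every point to every centroid (n x k)
  d <- sapply(1:k, function(g) colSums((t(dat) - mus[g, ])^2))
  new_groups <- apply(d, 1, which.min)  # nearest centroid wins
  # step 4: converged when no point changes cluster
  if (all(new_groups == groups)) break
  groups <- new_groups
}
```

Minimizing squared Euclidean distance gives the same assignments as minimizing Euclidean distance, so the square root can be skipped.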

Task 1 - Write code to perform k-means clustering on the values in the matrix 'dat'.

The true group labels are provided in the vector 'true_groups'. Of course, you can't use that until the very end where you will perform some verification.

Requirements:

1) So that everyone gets consistent results, I have performed the initial assignment of points to clusters.

2) With each iteration, plot the data, colored by their current groupings, and the updated means.

3) Convergence is reached when group assignments no longer change. Your k-means clustering algorithm should reach convergence fairly quickly.

4) Print out a 'confusion' matrix showing how well the k-means clustering algorithm classified the data.

You'll probably want to write a function that calculates distances from the three means. I also highly recommend avoiding loops within each iteration; try to use apply functions instead. My final code takes 15 lines. You don't need to write as efficiently as I do. You can write a hundred lines; as long as it gets the job done correctly, it'll be accepted.
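For the distance helper and requirement 4, one possible sketch follows ('dist_to_means' is a hypothetical name, and 'groups' is assumed to hold the converged assignments):

```r
# Distances from one point x to each of the k means, written with apply
# (mus is the k x 2 matrix of current means)
dist_to_means <- function(x, mus) apply(mus, 1, function(m) sum((x - m)^2))

# Requirement 4: 'confusion' matrix cross-tabulating true labels vs. assignments
table(true = true_groups, assigned = groups)
```

Note that k-means cluster numbers are arbitrary, so a good result shows one large count per row, not necessarily on the diagonal.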

Part 2: EM Algorithm

The Expectation-Maximization algorithm is an iterative algorithm for finding maximum likelihood estimates of parameters when some of the data is missing. In our case, we are trying to estimate the model parameters of a mixture of multivariate Gaussian distributions, but we are missing the assignments of the data points.

The general form of the EM algorithm alternates between an expectation step and a maximization step. In the expectation step, we compute the expected log-likelihood of the joint distribution of the observed data $X$ and the missing values $Z$, where the expectation is taken over $Z$ given $X$ under the current estimates of $\theta$. In the maximization step, we find the values of $\theta$ that maximize this expected log-likelihood.

The great thing is that the solution to the maximization step can often be found analytically (rather than having to search for it numerically). For example, the estimate of the mean that maximizes the likelihood of the data is just the sample mean.

EM Algorithm for Gaussian Mixtures -

This (brilliant) algorithm can be applied to perform clustering of Gaussian mixtures (among many other applications) in a manner similar to the k-means algorithm. A key difference between the two is that the EM algorithm is probabilistic: while k-means assigns each value to the group with the nearest mean, the EM algorithm calculates the probability that a point belongs to each group.

In the context of a Gaussian mixture, we have the following components:

1) $X$ is our observed data

2) $Z$ is the missing data: the class or cluster to which the observations $X$ belong.

3) $X$ comes from normal distributions defined by the unknown parameters $\Theta$ (the means $\mu$ and covariances $\Sigma$).

4) $Z$ is generated by a categorical distribution based on the unknown class mixing parameters $\alpha$. ($\sum \alpha_i = 1$)

The EM algorithm for Gaussian Mixtures will behave as follows:

1) Begin with some random or arbitrary starting values of $\Theta$ and $\alpha$.

2) E-Step. In the E-step, we will use Bayes' theorem to calculate the posterior probability that an observation $i$ belongs to component $k$.

3) M-step. Based on the estimates of the 'weights' found in the E-step, we will now perform Maximum Likelihood estimation for the model parameters.

This turns out to be fairly straightforward, as the MLEs for a normal distribution are easy to obtain analytically.

For each component, we will find the mean, variance, and mixing proportion based on the data points that are "assigned" to the component. The data points are not actually "assigned" to the components like they are in k-means, but rather the components are given a "weight" or "responsibility" for each observation.

4) Iterate between steps 2 and 3 until convergence is reached.
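Written out, the two steps use the standard Gaussian-mixture updates, with $w_{ik}$ the responsibility of component $k$ for observation $i$ and $n$ the number of observations:

$$w_{ik} = \frac{\alpha_k \, N(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \alpha_j \, N(x_i \mid \mu_j, \Sigma_j)}$$

$$N_k = \sum_{i=1}^{n} w_{ik}, \qquad \alpha_k = \frac{N_k}{n}, \qquad \mu_k = \frac{1}{N_k} \sum_{i=1}^{n} w_{ik} \, x_i, \qquad \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{n} w_{ik} (x_i - \mu_k)(x_i - \mu_k)^T$$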

Coding the EM algorithm for Gaussian Mixtures -

Coding the algorithm is a matter of turning the above steps into code.

The package 'mvtnorm' handles multivariate normal distributions. The function 'dmvnorm()' can be used to find the density $N(x_i \mid \mu_k, \Sigma_k)$. It is vectorized, so you can avoid loops when computing the densities.

You are dealing with a 1000 x 2 matrix of data points.

A few key things to remember / help you troubleshoot your code:

1) Your matrix of 'weights' will be 1000 x 3. (One row for each observation, one column for each cluster.)

2) $N_k$ is a vector of three elements. It is effectively the column sums of the weight matrix $w$.

3) $\alpha$ is a vector of three elements. The elements will add to 1.

4) $\mu$ is a 3 x 2 matrix. One row for each cluster, one column for each x variable.

5) Each covariance matrix $\Sigma_k$ is a 2 x 2 matrix. There are three clusters, so there are three covariance matrices.
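Putting these hints together, one iteration of the algorithm might look like the following sketch. The variable names 'dat', 'alpha', 'mu', and 'Sigma' are assumptions; 'Sigma' is taken to be a list of three 2 x 2 matrices:

```r
library(mvtnorm)

k <- 3
# E-step: responsibilities via Bayes' theorem, normalized so each row sums to 1
w <- sapply(1:k, function(j) alpha[j] * dmvnorm(dat, mean = mu[j, ], sigma = Sigma[[j]]))
w <- w / rowSums(w)                       # 1000 x 3 weight matrix

# M-step: weighted maximum likelihood estimates
Nk    <- colSums(w)                       # effective counts (length 3)
alpha <- Nk / nrow(dat)                   # mixing proportions sum to 1
mu    <- t(sapply(1:k, function(j) colSums(w[, j] * dat) / Nk[j]))
Sigma <- lapply(1:k, function(j) {
  centered <- sweep(dat, 2, mu[j, ])      # subtract component mean from each row
  crossprod(centered * w[, j], centered) / Nk[j]
})
# Repeat the two steps until the log-likelihood (or the parameters) stabilizes.
```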

Part 3: Open-ended analysis

For this third part, I will ask you to perform a bit of open ended analysis with real data.

Our data comes from Kaggle. It is a list of all the menu items offered at McDonald's restaurants along with their nutrition information. Your task is to perform some cluster analysis on the McDonald's menu items.

While the task is open-ended, there are a few things that you will be required to do.

Requirements:

Data Preparation:

1) Remove the following variables, as each is almost perfectly collinear with another variable:

- Calories from Fat

- Total Fat (% Daily Value)

- Saturated Fat (% Daily Value)

- Cholesterol (% Daily Value)

- Sodium (% Daily Value)

- Carbohydrates (% Daily Value)

- Dietary Fiber (% Daily Value)

2) Perform your analysis on the scaled data (just run 'scale()'). This prevents variables with large variances from dominating the analysis.
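A possible sketch of the preparation step, assuming the CSV has been read into a data frame called 'menu'; the exact entries of 'drop_cols' depend on how read.csv mangles the column names, so treat them as placeholders to check against your own data:

```r
menu_num <- menu[, sapply(menu, is.numeric)]        # keep the numeric nutrition columns
drop_cols <- c("Calories.from.Fat",
               "Total.Fat....Daily.Value.",
               "Saturated.Fat....Daily.Value.",
               "Cholesterol....Daily.Value.",
               "Sodium....Daily.Value.",
               "Carbohydrates....Daily.Value.",
               "Dietary.Fiber....Daily.Value.")
menu_num    <- menu_num[, !(names(menu_num) %in% drop_cols)]
menu_scaled <- scale(menu_num)                      # center and scale each variable
```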

Initial Visualization:

3) Produce between 1 and 3 plots of the menu items and nutrition information. Try to identify variables that would be useful in clustering the data. (You are allowed to use whatever packages you want for this: ggplot2, scatterplot3d, etc.) [Don't spend too much time (over 20 minutes) on this task. You are not trying to win a DataFest visualization award. Keep it simple.]

PCA as "pre-processing":

4) Perform Principal components analysis on the scaled McDonald's data. Use the built in function 'prcomp()'.

5) Plot your data using the first two principal components

6) Look at the resulting rotation matrix and identify which variables are weighted most heavily in each of the first two principal components.
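A sketch for steps 4 through 6, assuming the scaled matrix from the data-preparation step is called 'menu_scaled':

```r
pca <- prcomp(menu_scaled)            # data already scaled, so defaults are fine

# Step 5: menu items in the space of the first two principal components
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")

# Step 6: variables with the largest-magnitude loadings on PC1 and PC2
sort(abs(pca$rotation[, 1]), decreasing = TRUE)[1:5]
sort(abs(pca$rotation[, 2]), decreasing = TRUE)[1:5]
```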

K-means on the PCA matrix

7) Use R's 'kmeans()' function to perform k-means clustering using just the first two components in the PCA matrix. (Just the first two columns of $x in the prcomp output)

8) You will need to decide how many clusters to look for. You can plot $tot.withinss against the number of clusters to see where adding more clusters no longer appreciably reduces the total within-cluster sum of squares.


9) Plot the results of the k-means analysis on the plot of the first two principal components.
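Steps 7 through 9 might be sketched as follows; the choice of four clusters below is a placeholder, and your elbow plot should guide the actual choice:

```r
scores <- pca$x[, 1:2]                 # first two principal components

# Step 8: elbow plot of total within-cluster SS for k = 1..10
wss <- sapply(1:10, function(k) kmeans(scores, centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "number of clusters", ylab = "total within SS")

# Steps 7 and 9: fit the chosen k and color the PC plot by cluster
km <- kmeans(scores, centers = 4, nstart = 20)
plot(scores, col = km$cluster, xlab = "PC1", ylab = "PC2")
points(km$centers, pch = 4, cex = 2)   # mark the cluster centers
```

Setting nstart above 1 reruns kmeans from several random starts and keeps the best solution, which guards against a poor local minimum.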

Explaining it all

10) Take a look at the clusters formed by your k-means analysis. Pair them up with the names of the menu items. Can you give each cluster a name? For example, you might call one category "sugary drinks and desserts" or something similar.

11) Using the results of your principal components analysis, try to pick the two or three most important variables in the original data. Produce a plot using those variables and the k-means groups.

Attachment: Assignment Files.rar
