Ask Question, Ask an Expert

+61-413 786 465

info@mywordsolution.com

Ask Statistics and Probability Expert

10-701 Machine Learning - Spring 2012 - Problem Set 1

Q1. Basic probability and statistics

1.1 Probability distribution-

Mickey is a novice in probability theory and is facing a seeming paradox. Consider a uniform probability distribution of variable X on the interval [0, ½], namely f(x) = 2, for x ∈ [0, ½]. Given that probabilities under a distribution should sum to 1, Mickey is puzzled that f(x) exceeds their sum. Explain the paradox. In other words, what is confusing Mickey?

1.2 Mean and variance-

Compute the mean and variance of X described in Section 1.1. from first principles (i.e. write down the definitions and compute their values).

1.3 Sampling-

Mickey knows how to sample from a canonical uniform distribution g(y) = 1, for y ∈ [0, 1]. Explain how Mickey can simulate the distribution described in Section 1.1 based on samples of Y, i.e. simulate a uniform distribution on [0, ½].

Show analytically that with your proposal, the new distribution would be identical in mean and variance as obtained in Section 1.2-in other words, prove that the simulated distribution converges to the desired distribution (hint: use transformation of variables).

1.4 More on expectations-

The odds that a particular face of a fair six-sided die is thrown can be approximated by a discrete uniform distribution. Equipped with his knowledge in probability, Mickey goes to gamble and hedges his bet on throwing a 6. What is the expected number of times he must throw the die to get a 6? Show your workings in clear steps (hint: use recursion).

1.5 Detective Bayes-

Mickey tosses a die multiple times, hoping for a 6. The sequence of his 10 tosses is 1, 3, 4, 2, 3, 3, 2, 5, 1, 6. Mickey is suspicious whether the die is biased towards 3 (or fair). Conduct a simple analysis based on the Bayes theorem to inform on Mickey-to what degree is the die biased? Explain your reasoning.

Assume in general every 100 dice contain 5 unfair dice that are biased towards 3 with the probability distribution of the six faces (1, 2, 3, 4, 5, 6) as P = [0.1, 0.1, 0.5, 0.1, 0.1, 0.1].

Q2. Linear regression

In regression, the input variable X is mapped to the response variable Y via a function Y = f(X) + E subject to some noise E. Linear regression assumes a linear form of the function and finds weight W that minimizes the sum of squared errors:

mini=1n(yi - wxi)2                                                                   (1)

2.1 Probabilistic interpretation-

Suppose there are n indenpendent pairs of input and response {(x1, y1), ...,(xn, yn)}. Assume these are linearly related subject to standard Gaussian noise yi = wxi + ∈i, ∈i ∼ N(0, 1). Using maximum likelihood, show that maximizing the log-likelihood function L(Y|X) = log(P(Y|X)) is equivalent to minimizing the sum of squared errors in Equation 1 (hint: first derive the likelihood expression).

2.2 Geometric interpretation-

Now think of the responses {y1,..., yn} as a vector y = [y1, ..., yn] in n-dimensional space, similarly for the estimates in linear regression yˆ = [wx1, ..., wxn]. What does Equation 1 mean geometrically to vectors y and yˆ? Provide a simple proof.

2.3 Regularization-

Real-world problems often involve input variables with multiple dimensions and limited data samples. In these scenarios, a common technique is often used to regularize the weights in linear regression and hence provides smoothing and improves generalization performance. Using matrix notations, the sum of squared errors in Equation 1 can be re-written as

arg min ||y - Xw|| + λ||w||2                                                      (2)

where X ∈ Rn×m, y ∈ Rm×1 and λ regularizes the norm of the weights. Derive an analytical expression that minimizes the term in Equation 2 with respect to w using vector derivatives. Explain briefly what happens to the weights w when λ is increased.

Q3. Density estimation

In K-nearest neighbors (KNN), the classification is achieved by majority vote in the vicinity of data. Suppose there are two classes of data each of n/2 points overlapped to some extent in a 2-dimensional space.

1. Describe what happens to the training error (using all available data) when the neighbor size k varies from n to 1.

2. Predict and explain with a sketch how the generalization error (e.g. holding out some data for testing) would change when k varies? Explain your reasoning.

3. Propose a method to determine an appropriate value for k.

4. In KNN, once k is determined, all neighbors within the perimeter of k are weighted equally in deciding the class label. Suggest a modification to the algorithm that improves this caveat.

5. Give two reasons why KNN may be undesirable when the input dimension is high.

Q4. Naive Bayes and classification

In this section, you will implement the full and naive Bayes classifiers and compare their performances on several data sets.

4.1 The rationale of "naive"-

Suppose you are modeling some N-dimensional data X with a full Gaussian distribution X ∼ N(µ, Σ), what is the number of parameters to be estimated? Now instead using the naive Bayes assumption, what is the number of parameters to be estimated? Show your calculations clearly.

4.2 Programming task-

First, simulate 10 data sets each containing two classes of data from the 2-dimensional Gaussian distributions specified in Table 1. Plot separately sets 5 and 10 on a 2-D plane with the two classes coded in different symbols and colors. Save the simulated 10 sets.

Parameter 1: µ0 = [-2 0]; Σ0 = [2 0; 0 3]; µ1 = [2 0]; Σ1 = [2 0; 0 3].

Parameter 2: µ0 = [-2 0]; Σ0 = [2 - 1.8; -1.8 3]; µ1 = [2 0]; Σ1 = [2 - 1.8; -1.8 3].

Set

# data per class

Gaussian parameters

1

20

Parameter 1

2

40

Parameter 1

3

100

Parameter 1

4

300

Parameter 1

5

1200

Parameter 1

6

20

Parameter 2

7

40

Parameter 2

8

100

Parameter 2

9

300

Parameter 2

10

1200

Parameter 2

Table 1 - Specification for the data sets

Now write two separate routines that implement the full Gaussian and naïve Bayes classifiers respectively. Apply each of these classifiers to the simulated 10 data sets using 10-fold cross validated classification-use equal number of data points from class 0 and 1 as training or testing (held-out) data in each fold by writing a generic routine that performs classification with k-fold cross validation. Save the learned parameters of each classifier in each fold (for full Gaussian Bayes, these are the means and covariances; for naive Bayes, these are the means and variances at each dimension).

1. Plot the cross-validated classification accuracies with error bars under Parameter 1 (sets 1 - 5) and Parameter 2 (sets 6 - 10) on separate figures. For each figure, overlay the accuracies of the two classifiers for comparison.

2. Carefully examine and report the parameters learned for the two classifiers under Parameter 1 and 2. How do these vary when the data size increases from 20 to 1200? For full Gaussian and naive Bayes, how close are the estimated means and covariances to the ground truth-you may use the parameters averaged across 10-fold validation as the estimated parameters.

3. Compare and comment on the performance of these classifiers under different parameterizations. What is your prediction about their performances? Do the results reflect your prediction? Explain your reasoning.

4. Submit the complete (MATLAB) code in the appendix of your write-up in compact format (e.g. double columns). These include 1) the routines for the two classifiers 2) the routine for k-fold cross validation. Do not attach or print out the simulated data sets.

Q5. Bonus question: a small real-world challenge

The US postal service uses machine learning methods to robustly identify zip codes. In this section, you will design and implement a classifier that does a simpler job-classifying hand-written digits of "0"s and "1"s. You can load these digits in MATLAB with the command "load -ascii digitX" where X= 0 or 1. Each of the ".mat" files contains 700 samples of digits in 8 × 8 pixels. To get a feel for the data, you can visualize these samples with the command "imagesc(reshape(digitX(i,:),8,8)); colorbar gray", where i is the row index.

1. Your task is to perform a leave-one-out cross validated classification on the digit data-we recommend the Naive Bayes classifier although you are free to develop your own algorithm (we do not recommend you spending too much time on developing an over-complex model). In any case, specify clearly the assumptions and formulations of your algorithm. For example, for the Naive Bayes classifier, you should specify the prior, the likelihood function and any manipulation or transformation on the data.

2. Implement and apply your classifier on the data. Report the cross-validated accuracy (this should be a single number). Did your classifier work well-explain why or why not?

3. Submit your MATLAB code in compact format and attach it in the write-up.

Attachment:- Data.rar

Statistics and Probability, Statistics

  • Category:- Statistics and Probability
  • Reference No.:- M91839116
  • Price:- $60

Priced at Now at $60, Verified Solution

Have any Question?


Related Questions in Statistics and Probability

Letnbspnmnbspv denote the normal random variable with

Let  N ( m ,  v ) denote the normal random variable with mean  m  and  variance   v . (a) If  X  is a  Standard Normal  variable, what is the distribution of the random variable 4 X  - 1? (b) If  X  ~  N (3,1) and  Y  ~  ...

A random sample ofnbsp86nbspeighth grade students scores on

A random sample of 86 eighth grade? students' scores on a national mathematics assessment test has a mean score of 292. This test result prompts a state school administrator to declare that the mean score for the? state' ...

A professor records the following final exam scores 68 6 m

A professor records the following final exam scores: 68 6 (M SD). Students scoring in the top 18% get an A. What is the cutoff score for the top 18% in this example?

Project q costs 240 it provides inflows of 120 per year for

Project Q costs 240. It provides inflows of 120 per year for three years. The cost of funds is 6%. Find the replacement chain value needed to compare it to a six year project.

1nbspis the magnitude of an earthquake related to the depth

1) Is the magnitude of an earthquake related to the depth below the surface at which the quake occurs? Let  x  be the magnitude of an earthquake (on the Richter scale), and let  y  be the depth (in kilometers) of the qua ...

You want to borrow 36000 from your local bank to buy a new

You want to borrow $36,000 from your local bank to buy a new sailboat. You can afford to make monthly payments of $750, but no more. Assuming monthly compounding, what is the highest rate you can afford on a 60-month APR ...

Suppose that systolic blood pressure for 35- to

Suppose that systolic blood pressure for 35- to 44-year-olds is normally distributed with mean 125 in mm Hg and standard deviation 10 in mm Hg. What proportion of individuals will have a blood pressure reading between 12 ...

What is the 99 confidence interval for a sample of 52 seat

What is the 99% confidence interval for a sample of 52 seat belts that have a mean length of 85.6 inches long and a standard deviation of 3.8 inches?

Gala apples sampled at random from a specific orchard in

Gala apples sampled at random from a specific orchard in New Zealand have a mean weight of μ= 147g with a variance of σ2= 841g^2, and the weights follow a normal distribution quite well. The grower wants to sell the appl ...

In a pre-election poll a candidate for district attorney

In a? pre-election poll, a candidate for district attorney receives 253 of 500 votes. Assuming that the people polled represent a random sample of the voting? population, test the claim that a majority of voters support ...

  • 4,153,160 Questions Asked
  • 13,132 Experts
  • 2,558,936 Questions Answered

Ask Experts for help!!

Looking for Assignment Help?

Start excelling in your Courses, Get help with Assignment

Write us your full requirement for evaluation and you will receive response within 20 minutes turnaround time.

Ask Now Help with Problems, Get a Best Answer

Why might a bank avoid the use of interest rate swaps even

Why might a bank avoid the use of interest rate swaps, even when the institution is exposed to significant interest rate

Describe the difference between zero coupon bonds and

Describe the difference between zero coupon bonds and coupon bonds. Under what conditions will a coupon bond sell at a p

Compute the present value of an annuity of 880 per year

Compute the present value of an annuity of $ 880 per year for 16 years, given a discount rate of 6 percent per annum. As

Compute the present value of an 1150 payment made in ten

Compute the present value of an $1,150 payment made in ten years when the discount rate is 12 percent. (Do not round int

Compute the present value of an annuity of 699 per year

Compute the present value of an annuity of $ 699 per year for 19 years, given a discount rate of 6 percent per annum. As