Ask Question, Ask an Expert

+61-413 786 465

info@mywordsolution.com

Ask Statistics and Probability Expert

10-701 Machine Learning - Spring 2012 - Problem Set 3

Q1. Decision trees

Short decision trees with few levels are more desirable in practice because they generalize well to new data and are less prone to overfit.

1.1 Information Gain-

Most of algorithms for learning the structure of decision trees rely on a heuristic to pick an attribute to split the trees in sequential steps. In this question, we are particularly interested in discrete valued attributes. We explore a popular measure: Information Gain (IG) and relate it to other concepts in probability.

We start with these definitions. Assume we have two discrete random variables X and Y taking values {1, . . . , m} and {1, . . . , n} respectively.

-Entropy

H(X) = x=1m-p(X = x) log p(X = x)                                                                              (1)

-Conditional entropy of Y given X: H(Y|X). Note that this is different from entropy of Y conditioned on X = x: H(Y|X = x).

H(Y|X) = x=1mp(X = x)H(Y|X = x)                                                                               (2)

-Information Gain

Y) = H(X) - H(X|Y)                                                                                                        (3)

-Kullback-Leibeler divergence (K-L divergence) between two distributions.

DKL(P||Q) = ∑iP(i) log P(i)/Q(i)                                                                                      (4)

-Mutual Information

I(X; Y) = x=1my=1np(X = x, Y = y) log (p(X = x, Y = y)/p(X = x)p(Y = y))                   (5)

1. Show that IG(X; Y) = IG(Y ; X).

2. Show that IG(X; Y) = H(X) + H(Y) - H(X, Y).

Having a Car

Student

Owning a house

Risk

Count

Yes

Yes

Yes

Yes

5

Yes

Yes

No

Yes

3

Yes

No

Yes

Yes

15

Yes

No

No

Yes

2

No

Yes

Yes

Yes

5

No

Yes

No

Yes

0

No

No

Yes

Yes

4

No

No

No

Yes

1

Yes

Yes

Yes

No

3

Yes

Yes

No

No

3

Yes

No

Yes

No

2

Yes

No

No

No

2

No

Yes

Yes

No

2

No

Yes

No

No

10

No

No

Yes

No

3

No

No

No

No

10

Table 1: Data from Problem 1.

3. Write the Information Gain in terms of the K-L divergence of two distributions. In other words, find P and Q such that IG(X, Y ) = DKL(P||Q).

1.2 Tree Construction-

It is that time of the year to file tax returns to the IRS. Bob is hired by a famous tax software company to assess the risk of audit for customers. Unfortunately, being a poor grad student, he has paid no attention to taxes. He decides to build a decision tree to classify customers into high and low risk groups. The data is shown in Table 1.

1. Show the steps in constructing the decision tree (without postpruning) and the final tree using the algorithm we described in class. What is the classification error?

2. Bob would like to add two new attributes: Income and Age. Assume that the decision tree splits on an attribute X by splitting the examples into two sets: X < a and X ≥ a. How about when an attribute has K distinct values? How should we determine the best splitting value a?

3. Real datasets may not be perfect, e.g. some may contain systematic errors. In each of these situations, if there are systematic errors, describe how you detect them and what we should do.

  • An attribute I has only one single value.
  • Attributes T1 and T2 are duplicates. That means all examples having the same values for these two attributes.

Q2. Neural Networks

2.1 Binary classification-

119_Figure.png

Which classifiers: kNN, logistic regression, neural networks, naive Bayes can learn the decision boundary shown in Figure 1?

2.2 Boolean functions and neural networks-

For this problem, we only consider neural networks with a hidden layer. Each unit has a weight w and uses the following activation functions:

1. Hard-threshold function: for a constant a,

580_Figure1.png

2. Linear: h(x) = wTx                                                                            (7)

In class, we show that this type of networks can compute the boolean functions OR, AND and XOR.

1. How many boolean functions f(x1, x2), where x1, 2 ∈ {0, 1}, are there?

2. For any boolean function f(x1, x2), can we build a neural network to compute f?

If yes, please prove by construction? If no, explain in details.

2.3 Two layer neural networks-

We now consider neural networks with a hidden layer with only one input signal x and one output signal y. Using hard threshold and/or linear activation functions. Which of the following cases can be exactly represented by a neural network? For each case, if yes, draw the network with choice of activations for both levels and briefly explain how the function is represented. If no, explain why.

1. Polynomials of degree one.

2. Polynomials of degree two.

3. Hinge loss: L(x) = max(1 - x, 0).

2.4 Network Training-

We train a neural network on a set of input vectors {xn} where n = 1, . . . , N, together with a corresponding set of target K-dimensional vectors {tn}. We assume that t has a Gaussian distribution with an x-dependent mean, so that:

p(t|x, w) = N (t|y(x, w), β-1I)                                                 (8)

where β is the precision of the Gaussian noise. We would like to find the parameters w using the MLE principle:

wMLE = argmaxw n=1Np(tn|xn, w)                                        (9)

1. Show that solving (9) is equivalent to minimizing the sum-of-squares error function:

E(w) = ½n=1N||y(xn, w) - tn||22                                          (10)

2. Consider a network with one hidden layer as shown in Figure 2. Assume that each hidden unit j computes a weighted sum of its inputs x of the form:

= ∑iwjixi                                                                              (11)

278_Figure2.png

where wj and z are the weights and input of this hidden unit; and use an activation function h(aj ) = zj. Also the output unit k computes a weighed sum of its inputs z of the form

bk = ∑jw'kj zj                                                                       (12)

where w'k are the weights of this output unit. This unit outputs yk = bk. Together, the output of this network is:

y(xn, w) = [y1, y2, . . . yK]T                                                 (13)

To train a neural network, we need to compute the derivatives of E(w). We can do so by considering a simpler quantity En = ½||y(xn, w) - tn||22. Show that

∂En/∂wji = (h'(aj)∑kw'kj δk)xi                                                (14)

where w'k are weights of the output units and δk = yk -tk. Hints: This is covered in section 5.3.1 of PRML. You should try to follow the derivation there.

Q3. SVM

In this problem, we consider training a SVM on a set of inputs {xn} where n = 1, . . . , N, together with a corresponding set of target values {tn} (tn ∈ {-1, 1}). The goal is to maximize the margin and at the same time allows some small misclassifications. An example is classify as follows:

2185_Figure3.png

where C ≥ 0 is a parameter that controls the trade-off between the stack variable penalty and the margin.

3.1 Slackness and Kernel-

Figure 3 shows the decision boundaries of training four SVMs using different values of C and different kernels. For each of the examples, specify which setting was used to get the result.

1988_Figure4.png

1. C = 1 and no kernel is used.

2. C = 0.1 and no kernel is used.

3. C = 0.1 and the kernel is K(xi, xj) = exp -(||xi-xj ||2/2.

4. C = 0.1 and the kernel is K(xi, xj) = xiTixj + (xiT xj)2.

3.2 Dual formalization-

 1. Using the Lagrangian multipliers, show that the corresponding Lagrangian is

L(w, b, a) = ½||w||2 + Cn=1Nξn - n=1Nan{tny(xn) - 1 + ξn} - n=1Nµnξn

Be explicit in specifying which multiplier is correspondent to which constraint.

Hints: A brief introduction to Lagrange multipliers is in Appendix E of the PRML book.

2. The Karush-Kuhn-Tucker (KKT) conditions for the optimal solutions are:

an ≥ 0 (19)

tny(xn) - 1 + ξn ≥ 0 (20)

an{tny(xn) - 1 + ξn} = 0 (21)

µn ≥ 0 (22)

ξn ≥ 0 (23)

µnξn = 0 (24)

3. By taking the derivatives of L(w, b, a), show that the dual problem is:

maxn=1Nan - ½ n=1Nm=1Nanamtntmk(xn, xm)                                  (25)

s.t. 0 ≤ an ≤ C, n = 1, . . . , N                                                                 (26)

n=1∑N antn = 0                                                                                     (27)

Please show all the steps in very details.

4. Given a solution to the dual problem, which inputs are support vectors?

5. Can you compute w from the dual solutions? If yes, give the formula, if no, explain why? Hints: Do we always have φ(xn)?

6. How can we classify a new example x'?

3.3 Implementation-

1. Implement SVM in MATLAB.

function [y_new, test_error, training_error] = train_svm(x, t, x_new, C)

This function trains a SVM with inputs x, t, C and classifies new examples x_new and outputs the predictions in y_new.

Your implementation should satisfy these requirements:

  • You cannot use any existing implementations of SVM including MATLAB's built-in functions. However, you can use a quadratic programming(QP) solver (e.g. MATLAB's quadprog).
  • You should use a subroutine to compute the Kernel distance between two input examples so that it is easy to use different kernels with your implementation.

2. A dangerous mutant of bird flu viruses was discovered and a widespread pandemic is imminent. A vaccine was developed for improving immunity against this potent virus. However, this new vaccine is very expensive and only effective for a subset of patients. Expression of two genes X, Y are shown as predictors of the vaccine's effectiveness. You are asked to train a SVM to predict the vaccine's effectiveness on patients using gene expression measurement. Download the dataset from the course website. The .mat file contains two matrices: train and test. There are 160 training examples and 40 testing examples.

(a) Using no kernel (or φ(xn) = xn)), train a SVM using different values of C = 0, 0.1, 0.3, 0.5, 1, 2, 5, 8, 10. Use 4-fold cross validation and report the train and test errors. Produce a plot of 2 curves: training and testing error. The x-axis shows different values of C and the y-axis shows the error rate.

Which value of C yields the best training and testing error rate?

Which value of C should you use?

Finally, plot the training and testing examples along with the decision boundary. What do you observe?

Hints: To plot the decision boundary, you can use the code in svmplot.m.

(b) Now use Gaussian kernels K(xi, xj) = exp{-(||xi-xj ||2/2)} and repeat the previous experiment.

(c) Summarize the result of two experiments. Should we use Gaussian kernels at all?

Q4. Hierarchical clustering v.s. K-means

Single linkage and K-means can give very different clustering results and one may be more preferred than the others. In each of the examples in Figure 4, write down whether the clusters are produced by K-means (KM) or agglomerative clustering with single linkage (SL).

2201_Figure5.png

Attachment:- Data.rar

Statistics and Probability, Statistics

  • Category:- Statistics and Probability
  • Reference No.:- M91839470
  • Price:- $50

Priced at Now at $50, Verified Solution

Have any Question?


Related Questions in Statistics and Probability

How can i describe the relationship between diversifiable

How can I describe the relationship between diversifiable risk and total risk, along with the role of Beta?

Hardboard is graded with the following frequenciesgrade a

Hardboard is graded with the following frequencies: Grade A: 1/5 Grade B: 3/4 Cull: 1/20 If six pieces come off of a hot press, find the probability that two are Grade A, three are Grade B, and one is a Cull.

The following frequency table summarizes the distances in

The following frequency table summarizes the distances in miles of 100 patients from a regional hospital. Distance           Frequency 0-2                  30 2-4                  35 4-6                  20 6-8           ...

In order to fund her retirement helen needs her portfolio

In order to fund her retirement, Helen needs her portfolio to have an expected return of 13.8 percent per year over the next 30 years. She has decided to invest in Stocks 1, 2, and 3, with 25 percent in Stock 1, 50 perce ...

A study showed that in 1990 47 of all those involved in a

A study showed that in? 1990, 47?% of all those involved in a fatal car crash wore seat belts. Of those in a fatal crash who wore seat? belts,44?% were injured and 27?% were killed. For those not wearing seat? belts, the ...

Ten students were sampled at random from a student

Ten students were sampled at random from a student population. Each was asked how many courses he or she was planning on studying in the upcoming year. The following is a list of the reported data values: 1, 2, 2, 3, 4, ...

The mean of the annual return for common stocks from 1926

The mean of the annual return for common stocks from 1926 to 1992 was 16.5%, and the standard deviation of the annual return was 19%. In later parts of the question we will ask: a. What is the probability that the stock ...

Question 1 assume you have noted the following prices for

Question: 1. Assume you have noted the following prices for paperback books and the number of pages that each book contains. Develop a least-squares estimated regression line. i. Compute the coefficient of determination ...

On average the parts from a supplier have a mean of 975

On average, the parts from a supplier have a mean of 97.5 inches and a standard deviation of 12.2 inches. Find the probability that a randomly selected part from this supplier will have a value between 85.3 and 109.7 inc ...

A puck company wants to sponsor the players with the 20

A puck company wants to sponsor the players with the 20% quickest goals in hockey games. The times of first goals are normally distributed with a mean of 8.54 minutes and a standard deviation of 4.91 minutes. How fast wo ...

  • 4,153,160 Questions Asked
  • 13,132 Experts
  • 2,558,936 Questions Answered

Ask Experts for help!!

Looking for Assignment Help?

Start excelling in your Courses, Get help with Assignment

Write us your full requirement for evaluation and you will receive response within 20 minutes turnaround time.

Ask Now Help with Problems, Get a Best Answer

Why might a bank avoid the use of interest rate swaps even

Why might a bank avoid the use of interest rate swaps, even when the institution is exposed to significant interest rate

Describe the difference between zero coupon bonds and

Describe the difference between zero coupon bonds and coupon bonds. Under what conditions will a coupon bond sell at a p

Compute the present value of an annuity of 880 per year

Compute the present value of an annuity of $ 880 per year for 16 years, given a discount rate of 6 percent per annum. As

Compute the present value of an 1150 payment made in ten

Compute the present value of an $1,150 payment made in ten years when the discount rate is 12 percent. (Do not round int

Compute the present value of an annuity of 699 per year

Compute the present value of an annuity of $ 699 per year for 19 years, given a discount rate of 6 percent per annum. As