Assignment: Decision Tree and Naïve Bayes

Build Decision Tree Model

Packages required: Install and load the C50, caret, and rminer packages.

Data: The data are taken from Shmueli et al. (2010). The data set consists of 2201 airplane flights in January 2004 from the Washington DC area into the NYC area. The characteristic of interest (the response) is whether or not a flight has been delayed by more than 15 min.

The explanatory variables include three different arrival airports (Kennedy, Newark, and LaGuardia); three different departure airports (Reagan, Dulles, and Baltimore); eight carriers; a categorical variable for 16 different hours of departure (6 am to 10 pm); weather condition (0 = good and 1 = bad); and day of week (1 = Monday, 2 = Tuesday, 3 = Wednesday, ... , 6 = Saturday and 7 = Sunday). The objective is to identify flights that are likely to be delayed.

Tasks:

1) Import and explore data

a. Open FlightDelay.csv and store the result in a data frame, e.g., called datFlight. All character values should be imported as factors. Convert the numerically coded categorical variables, such as weather condition, day of week and day of month, to factors.

b. Use the str() and summary() commands to list the imported columns and their basic statistics. Make sure that the data types are imported as expected.

2) Prepare data for classification

a. Using a seed of 100, randomly select 60% of the rows into training (e.g. called traindata). Divide the other 40% of the rows evenly into two holdout test/validation sets (e.g., called testdata1 and testdata2).

b. Inspect (show) the distributions of the target variable in the subsets. They should preserve the distribution of the target variable in the whole data set.
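The split described in task 2 can be sketched as follows. This is only an illustration in plain Python of what a class-preserving 60/20/20 split does; the assignment itself expects R (for example with the caret package). The function names `stratified_split` and `class_shares` are hypothetical helpers, and the label counts are made up, not taken from FlightDelay.csv.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=100):
    """Return one list of row indices per fraction, sampling within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    parts = [[] for _ in fractions]
    for indices in by_class.values():
        rng.shuffle(indices)
        start = 0
        for k, frac in enumerate(fractions):
            # The last part takes whatever remains, so no row is dropped.
            if k == len(fractions) - 1:
                end = len(indices)
            else:
                end = start + round(frac * len(indices))
            parts[k].extend(indices[start:end])
            start = end
    return parts

def class_shares(labels, indices):
    """Proportion of each class among the selected rows."""
    counts = Counter(labels[i] for i in indices)
    return {cls: n / len(indices) for cls, n in counts.items()}

# Hypothetical labels, NOT the real distribution in FlightDelay.csv.
labels = ["ontime"] * 800 + ["delay"] * 200
traindata, testdata1, testdata2 = stratified_split(labels)

print("all  ", len(labels), class_shares(labels, range(len(labels))))
for name, part in [("train", traindata), ("test1", testdata1), ("test2", testdata2)]:
    print(name, len(part), class_shares(labels, part))
```

Because rows are sampled within each class, every subset keeps roughly the same share of delayed flights as the whole data set, which is the property task 2b asks you to verify.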

3) C5.0 decision tree classifiers

a. Build/train a tree model

i. Build the tree using the C5.0 function (from the C50 package) with default settings.

ii. Show the (textual) model/tree.

iii. How many leaves are in the tree? (Note: In C50, the size of tree is the number of leaves. In J48, the size of the tree is the number of nodes, and J48 also provides the number of leaves.)

iv. What is the predictor that first splits the tree?

b. Find rules (paths) in the tree

i. Find one path in the tree to a leaf node that is classified as ontime. Starting with the condition on the first (or top) branch of the path, write down the conditions on the tree branches belonging to this path. Enclose each condition in a pair of parentheses and precede the rule with "If", e.g.

If (house <= 600469), ..., and (income < 57578), then STAY

ii. How many conditions and how many unique predictors are in your selected rule?

iii. What is this rule's misclassification error rate (e.g., 20/50 misclassified)?

iv. Similarly, describe a rule that classifies an instance to delay.

v. What is this rule's misclassification error?

vi. Find a rule for ontime that is shorter or longer (has fewer or more conditions) than the previous rules. Repeat this for delay. Show these two rules and their misclassification errors.

vii. What are the reasons that long rules are included in a decision tree model?

viii. What is the disadvantage of a long rule?

c. Apply and evaluate the trained model to two hold-out testing sets, one set at a time. The process for each data set includes:

i. Generate predictions (i.e. estimations) of the values of the target variable for the testing instances.

ii. Generate a confusion matrix that shows the counts of true-positive, true-negative, false-positive and false-negative predictions for both testdata1 and testdata2. Consider Ontime as positive class.

iii. Generate seven performance metrics: accuracy (percent of all testing instances that are correctly classified), plus precision (percent of the instances predicted to belong to a class that actually belong to it), recall (also called the true positive rate) and F-measure (also called F-score) for Ontime and for Delay respectively. (Note: References to performance metrics in the rest of the assignment refer to these seven metrics or to a set of metrics that includes them.)

iv. Report all performance differences in the same performance metric between the two data sets that are more than 10%. Does this tree generalize well over these two testing sets? Explain the reason for your answer.
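The seven metrics in step iii and the comparison in step iv can be sketched as follows, with Ontime as the positive class. This is a language-neutral illustration in Python rather than the R code the assignment expects. The helper names (`class_metrics`, `seven_metrics`, `large_gaps`) are hypothetical, the confusion counts are invented, and "more than 10%" is read here as an absolute gap of more than 0.10; adjust if your course defines it as a relative difference.

```python
def class_metrics(correct, wrongly_predicted, missed):
    """Precision, recall and F1 for one class from raw counts."""
    predicted = correct + wrongly_predicted  # everything predicted as this class
    actual = correct + missed                # everything truly in this class
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def seven_metrics(tp, tn, fp, fn):
    """Ontime is the positive class: tp = ontime predicted ontime, etc."""
    total = tp + tn + fp + fn
    p_on, r_on, f_on = class_metrics(tp, fp, fn)
    # For Delay the roles swap: TN are the correct Delay predictions,
    # FN are ontime flights wrongly predicted Delay, FP are missed delays.
    p_de, r_de, f_de = class_metrics(tn, fn, fp)
    return {
        "Accuracy": (tp + tn) / total,
        "Precision_Ontime": p_on, "Recall_Ontime": r_on, "F1_Ontime": f_on,
        "Precision_Delay": p_de, "Recall_Delay": r_de, "F1_Delay": f_de,
    }

def large_gaps(metrics_a, metrics_b, threshold=0.10):
    """Metrics whose absolute difference between two test sets exceeds threshold."""
    return {name: (metrics_a[name], metrics_b[name])
            for name in metrics_a
            if abs(metrics_a[name] - metrics_b[name]) > threshold}

# Hypothetical confusion counts for two holdout sets (not real results).
m1 = seven_metrics(tp=330, tn=45, fp=40, fn=25)
m2 = seven_metrics(tp=325, tn=30, fp=55, fn=30)
for name in m1:
    print(f"{name:17s} {m1[name]:.3f} {m2[name]:.3f}")
print("gaps over 0.10:", large_gaps(m1, m2))
```

If no metric differs by more than the threshold between testdata1 and testdata2, that is one piece of evidence that the tree generalizes consistently across the two holdout sets.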

4) C50 pruning

a. Build another C50 tree using the train set by changing the confidence factor to 0.05 (i.e. CF = 0.05 in the control argument of the C5.0 function, set through C5.0Control).

b. Describe the size of the tree built.

c. Generate predictions, confusion matrices and performance metrics using two test sets.

d. Report all performance differences in the same performance metric between the two data sets that are more than 10%. Does this tree generalize well over these two testing sets? Explain the reason for your answer.

e. Would you adopt this pruning setting? Why or why not?

5) Returning to the default pruning setting, build another C50 tree with only two predictors of your choice.

a. Build a tree using the predictors of your choice in the train set.

b. Describe the size of the tree built.

c. Generate predictions, confusion matrices and performance metrics using two test sets.

d. Report all performance differences in the same performance metric between the two data sets that are more than 10%. Does this tree generalize well over these two testing sets?

Build Naïve Bayes Model

1) e1071 naiveBayes classifiers

a. Prepare the FlightDelay data for building and evaluating Naïve Bayes classifiers. Load the caret package. Using seeds of 100, 500 and 900, randomly split the data three times: each time, select 67% of the rows into a training set and save the other 33% as the corresponding testing set. Calculate the average number of examples in the three testing sets.

b. Use a for loop to build and understand e1071 naiveBayes models with all predictors for delay.

i. Load the e1071 and rminer packages.

ii. Build a Naïve Bayes model using the naiveBayes function in e1071 with each traindata.

iii. Show each model. What are the values of A-priori probabilities - P(Delay) for the delay class and P(Ontime) for the ontime class for each model?

iv. Generate predictions (i.e. estimations) of the values of the target variable for instances in each testdata.

v. Save the values of TP, TN, FP and FN for each model, and calculate the average of each of these four values after the loop.

vi. Save the value of performance metrics of three models on their corresponding testing samples and fill out the following table.

 

          Accuracy   Precision_Delay   Precision_Ontime   Recall_Delay   Recall_Ontime   F1(Delay)   F1(Ontime)
Model1
Model2
Model3

Cost Sensitive Learning

1) Imbalanced target variable class distribution

a. What is the class distribution (proportions) of the target variable in the original FlightDelay data set? Which one is the majority class (more instances) and which one is the minority class (fewer instances)?

b. A simple majority_rule model always classifies all instances as the majority class, which is the class that has more instances in a data set. This rule is a heuristic (man-made) rule. (No code is needed for this question.)

i. Use the majority_rule model to classify all of the instances in FlightDelay.csv. How many TP, TN, FP and FN will this model generate? What is the accuracy rate of applying this model to FlightDelay.csv?
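Although no code is required for this question, the arithmetic behind a majority_rule model can be sketched in a few lines. The function name `majority_rule_confusion` is hypothetical, Ontime is taken as the positive class (as in the decision tree tasks), and the class counts are invented rather than read from FlightDelay.csv.

```python
def majority_rule_confusion(n_ontime, n_delay):
    """Confusion counts when every instance is assigned to the majority class.

    Ontime is treated as the positive class.
    """
    total = n_ontime + n_delay
    if n_ontime >= n_delay:
        # Everything is predicted Ontime: no Delay predictions at all.
        counts = {"TP": n_ontime, "FP": n_delay, "TN": 0, "FN": 0}
    else:
        # Everything is predicted Delay: no Ontime predictions at all.
        counts = {"TP": 0, "FP": 0, "TN": n_delay, "FN": n_ontime}
    counts["Accuracy"] = (counts["TP"] + counts["TN"]) / total
    return counts

# Hypothetical class counts, NOT the real FlightDelay.csv distribution.
print(majority_rule_confusion(n_ontime=900, n_delay=100))
```

The accuracy of such a model always equals the share of the majority class, which is why accuracy alone can look good on an imbalanced data set even though no delayed flight is ever identified.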

2) Cost-benefit calculations and cost-sensitive models using all of the predictors

a. Using the mean values of TP, FP, TN and FN from three C50 classifier testing results and the average number of test instances over all three test sets, calculate and print the average net-benefit per flight over all three testing results. Assume the following cost and benefit factors.

o Cost of sending a notification message for a flight classified as delayed (Predicted as Delay): $50 per flight.
o Loss from delay waiting: $1000 per flight for providing food and hotel service to customers.
o Benefit of correctly predicting a delayed flight: $500.
o No additional benefit from correctly classifying an actual on-time flight.

 

                      Actual
Predicted as     On Time     Delay
On Time                0     -1000
Delay                -50       500

b. Using the mean values of TP, FN, TN and FP from three naiveBayes classifier testing results and the average number of test instances over all three test sets, calculate and print the average net-benefit per customer over all three testing results. Assume the same cost and benefit factors.

c. Create a cost matrix that sets the cost of misclassifying a delayed flight as an on-time flight to be 10 times the cost of misclassifying an on-time flight as delayed.

d. In a for loop, build, predict and evaluate C50 classifiers using this cost matrix with three pairs of train and test sets. These are C50 cost-sensitive classifiers. Print the performance metrics for each testing set as well as the average value of each performance metric over the three testing sets. Generate a confusion matrix for each test set. Calculate and save the mean values of TP, FP, TN and FN over the three confusion matrices of testing results.

e. Using the mean values of TP, FN, TN and FP from three C50 cost-sensitive classifier testing results and the average number of test instances over all three test sets, calculate and print the average net-benefit per customer over all three testing results. Assume the same cost and benefit factors.
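The net-benefit calculation used in parts a, b and e can be sketched as below. It is an illustration in Python, not the R code the assignment expects, and it rests on two assumptions: the payoff table is read with rows as the predicted class and columns as the actual class, and Ontime is the positive class (so FP is a delayed flight predicted as on time). The names `PAYOFF` and `net_benefit_per_flight` are hypothetical, and the mean counts are invented.

```python
# Payoff per flight, keyed by (predicted, actual); values from the table above.
PAYOFF = {
    ("ontime", "ontime"): 0,      # TP: nothing gained, nothing lost
    ("ontime", "delay"): -1000,   # FP: missed delay, food and hotel cost
    ("delay", "ontime"): -50,     # FN: unnecessary notification
    ("delay", "delay"): 500,      # TN: delay correctly anticipated
}

def net_benefit_per_flight(mean_tp, mean_fp, mean_fn, mean_tn, mean_n_test):
    """Average net benefit per flight from mean confusion counts."""
    total = (mean_tp * PAYOFF[("ontime", "ontime")]
             + mean_fp * PAYOFF[("ontime", "delay")]
             + mean_fn * PAYOFF[("delay", "ontime")]
             + mean_tn * PAYOFF[("delay", "delay")])
    return total / mean_n_test

# Hypothetical mean counts over three test sets (not real results).
mean_tp, mean_fp, mean_fn, mean_tn = 500.0, 100.0, 50.0, 76.0
mean_n_test = mean_tp + mean_fp + mean_fn + mean_tn
print(round(net_benefit_per_flight(mean_tp, mean_fp, mean_fn, mean_tn, mean_n_test), 2))
```

Running the same function on the mean counts from the default C50, naiveBayes and cost-sensitive C50 models gives three directly comparable figures. With these payoffs a missed delay costs 20 times as much as an unnecessary notification, so a model that trades a few extra notifications for fewer missed delays can come out ahead even if its accuracy is lower.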
