Data Mining - Using WEKA for Classification

Step 1: Understanding File Format

Before we start using Weka, let's spend a few minutes understanding the file format of the input data. In Notepad or another text editor, open the file ‘sunburn.arff’. The file is in the attribute-relation file format (ARFF), one of the input formats that Weka accepts; Weka can also read other formats, such as CSV. In an ARFF file, lines beginning with a % sign are comments. Following the comments at the beginning of the file are the name of the relation (‘sunburn’) and a block defining the attributes (‘hair’, ‘height’, ‘weight’, ‘lotion’, ‘burned’). Nominal attributes are followed by the set of values they can take on, enclosed in curly braces. Numeric attributes are followed by the keyword numeric. There are two further attribute types, string and date.

Although the problem is to predict the class value ‘burned' from the values of the other attributes, the class attribute is not distinguished in any way in the data file. The ARFF format merely gives a dataset; it does not specify which of the attributes is the one that is supposed to be predicted. This means that the same file can be used for investigating how well each attribute can be predicted from the others, or to find association rules, or for clustering.

Following the attribute definitions is an @data line that signals the start of the instances in the dataset. Instances are written one per line, with values for each attribute in turn, separated by commas. If a value is missing, it is represented by a single question mark.
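To make the layout described above concrete, here is a minimal Python sketch that reads a small ARFF-style text. It is only an illustration: the nominal value sets and the last two data rows are assumptions invented for the example and are not the real contents of ‘sunburn.arff’, and the function name parse_arff is hypothetical. The sketch handles comments, @relation, @attribute and @data lines only; it does not cover quoted values containing commas or the sparse ARFF format.

```python
# Illustrative ARFF text. The attribute value sets and the last two rows are
# assumptions made for this sketch, not the real contents of sunburn.arff.
SAMPLE_ARFF = """\
% A comment line
@relation sunburn
@attribute hair {blonde, brown, red}
@attribute height {short, average, tall}
@attribute weight {light, average, heavy}
@attribute lotion {yes, no}
@attribute burned {burned, none}
@data
blonde, average,light, no, burned
brown, tall, ?, yes, none
red, short, heavy, no, burned
"""


def parse_arff(text):
    """Parse a simple ARFF string into (relation, attributes, instances)."""
    relation = None
    attributes = []   # list of (name, type keyword or list of nominal values)
    instances = []    # list of rows; a missing value ('?') becomes None
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            instances.append([None if v == "?" else v for v in values])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1].strip("'\"")
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            name = name.strip("'\"")
            if spec.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                spec = [v.strip() for v in spec.strip("{}").split(",")]
            attributes.append((name, spec))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, instances


relation, attributes, instances = parse_arff(SAMPLE_ARFF)
print(relation, [name for name, _ in attributes], len(instances))
```

Note that nothing in the parsed result marks ‘burned’ as the class; as discussed above, that choice is made later in the tool, not in the file.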

Step 2: Exploring Training Data

Launch Weka by clicking on: RunWeka.bat

Select ‘Explorer’ from the list of Applications.

Select the ‘Preprocess’ tab and click on ‘Open File’. Choose the file ‘sunburn.arff’, which contains the training data set.

Once the file is open, spend some time exploring the training data set. Weka gives a summary of the relation in the dataset and shows a list of attributes in the relation. An attribute can be selected from the attribute list. Once the attribute is selected, a summary of the attribute is displayed, which includes the list of attribute values (labels) and their counts in the dataset. Finally, the class attribute can be selected and the class distributions for the different attribute values are visualized.

Q1. What is the name of the relation in the training data set? How many instances are in the data set? How many attributes are in the relation?

Q2. How many distinct values does the attribute "weight" have? What are the counts for these attribute values? If you select attribute "burned" as the class attribute, what are the class distributions for the distinct values of attribute "weight"? If you select attribute "height" as the class attribute, what are the class distributions for the distinct values of attribute "hair"?

Step 3: Exploring Classifiers and Decision Trees

Select the ‘Classify’ tab and make sure that "J48" is chosen from the classifier list and "Use training set" is selected as the test option. Note that attribute "burned" is chosen by default as the class attribute, but the class attribute can be changed if needed.

Clicking ‘Start’ creates a classification model (classifier) from the training dataset. The classifier is listed in the ‘Result list’, and its details are displayed in the ‘Classifier output’ window.

Right-click the ‘trees.J48’ entry in the ‘Result list’ and select ‘Visualize tree’. This opens the "Tree View" window.

A decision tree representation of the classifier will be displayed. Now spend some time examining the decision tree. Each leaf node shows a class label followed by one or two numbers in parentheses. For instance, the rightmost leaf node of the tree is "burned (9.0/2.0)". This means that 9 instances in the training dataset reach that node, of which 2 are classified incorrectly. Adding up the first number across all the leaf nodes gives 16 instances in total.

The displayed decision tree is learned using an implementation (J48 in this case) of the C4.5 classification algorithm. This algorithm uses entropy as the impurity function for selecting the splitting attribute. We have yet to cover the algorithm. However, we have learned another impurity function, the Gini index (Gini impurity). Can you generate a decision tree using Hunt's algorithm with the Gini index as the impurity function?
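As a reminder of how the two impurity functions mentioned above are computed, here is a short Python sketch. The function names and the eight-instance split are hypothetical and chosen only for illustration; they are not taken from the sunburn data. When choosing a splitting attribute, the candidate split with the lowest weighted impurity of its child nodes is preferred.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini index of a list of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Entropy of a list of class labels: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def weighted_impurity(groups, impurity):
    """Impurity of a split: child impurities weighted by child size."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * impurity(g) for g in groups)


# Hypothetical split of 8 instances on a two-valued attribute.
left = ["burned", "burned", "burned", "none"]
right = ["none", "none", "none", "none"]

print(weighted_impurity([left, right], gini))
print(weighted_impurity([left, right], entropy))
```

A pure node (all labels equal) has impurity 0 under both measures, which is why the second group contributes nothing to either weighted sum.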

Q3. Generate the optimal decision tree by hand using Hunt's algorithm with the Gini index.

You can then compare the decision tree generated by the C4.5 algorithm with the one generated by Hunt's algorithm.

Q4. Are these two decision trees the same?

Step 4: Examining Classifier Output

The classifier output window shows the full output. At the beginning, the Run information provides a summary of the classifier, the training data set, and the test option. Then comes the classifier model, which shows a pruned decision tree in textual form. In the tree, the first split is on attribute "lotion", and then, at the second level, the split is on attribute "hair". In the tree structure, a colon introduces the class label that has been assigned to a particular leaf node, followed by the number of instances that reach that node. If any of those instances are classified incorrectly, their number appears too.

The next part of the output gives a summary of the evaluation on the dataset chosen as the test option. In this case, the evaluation results are obtained using the training set.

Now you can have a look at the evaluation results.

Q5. What are the accuracy and error rates of the evaluation? How do you calculate each of these rates?

Next comes the Detailed Accuracy By Class. Here we have a table that contains the TP rate, FP rate, Precision, Recall, F-Measure, etc.

Q6. What are the TP, FP, Precision, Recall and F-Measure for the "burned" class? What does each of them measure? How are these metrics calculated?

Finally comes the Confusion Matrix.

Q7. How do you interpret the Confusion Matrix? What does each of the four cells in the table represent?
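The per-class figures and the confusion matrix are closely related, and the following Python sketch shows one way to derive the former from the latter. The matrix counts are hypothetical and are not the output Weka produces for the sunburn data; the function names are also made up for this example. The sketch assumes the layout Weka prints, with rows for the actual class and columns for the predicted class, and it does not guard against division by zero for classes that are never predicted.

```python
def class_metrics(matrix, k):
    """Per-class metrics for class index k.

    matrix[i][j] is the number of instances whose actual class is i and
    whose predicted class is j (rows = actual, columns = predicted).
    """
    n_classes = len(matrix)
    tp = matrix[k][k]                                      # true positives
    fn = sum(matrix[k]) - tp                               # false negatives
    fp = sum(matrix[i][k] for i in range(n_classes)) - tp  # false positives
    tn = sum(map(sum, matrix)) - tp - fn - fp              # true negatives
    recall = tp / (tp + fn)       # identical to the TP rate
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "tp_rate": recall,
        "fp_rate": fp_rate,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(map(sum, matrix))


# Hypothetical two-class confusion matrix (20 instances in total).
example = [[5, 1],
           [2, 12]]

metrics = class_metrics(example, 0)
print(metrics)
print(accuracy(example), 1 - accuracy(example))  # accuracy and error rate
```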

Step 5: Using Cross-validation and examining the classification results

You can easily run J48 again with a different evaluation method. Select the "Cross-validation" test option, keep the default of 10 folds, and click ‘Start’ again. The classifier output is quickly replaced to show how well the learned model performs under cross-validation. As you can see, 25% of the instances (4 out of 16) are misclassified in the cross-validation. This indicates that the result obtained earlier on the training set (12.5% of the instances, 2 out of 16, misclassified) is optimistic compared with what might be obtained from an independent test set from the same source.

Q8. How do the figures under the Detailed Accuracy By Class (e.g., TP, FP, Precision, Recall and F-Measure) compare with the ones obtained on the training set?
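The idea behind k-fold cross-validation can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a reproduction of Weka's procedure: Weka shuffles and stratifies the folds, whereas the sketch below assigns instances to folds in round-robin order, and it uses a trivial majority-class predictor in place of J48. The label list and function names are hypothetical.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds whose sizes differ by at most one."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds


def cross_validated_error(labels, k):
    """Pooled error rate of a majority-class predictor under k-fold CV.

    Each fold is held out once; the predictor is 'trained' on the other
    folds and tested on the held-out one. Errors are summed over all folds.
    """
    errors = 0
    for test_idx in k_fold_indices(len(labels), k):
        held_out = set(test_idx)
        train = [y for i, y in enumerate(labels) if i not in held_out]
        majority = Counter(train).most_common(1)[0][0]
        errors += sum(labels[i] != majority for i in test_idx)
    return errors / len(labels)


# Hypothetical class labels for 16 instances (not the real sunburn data).
labels = ["burned"] * 6 + ["none"] * 10

print([len(f) for f in k_fold_indices(16, 10)])
print(cross_validated_error(labels, 10))

# The two rates quoted in the text are simple ratios of error counts:
print(4 / 16, 2 / 16)  # cross-validation vs. training-set error rate
```

With 16 instances and 10 folds, some folds hold two instances and others only one, so every instance is still tested exactly once by a model that never saw it during training.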

Q9. Have you observed any changes to the Confusion Matrix? If so what are the changes?

Step 6:

In Notepad or another text editor, open the file ‘sunburn2.arff’.

Add an additional attribute ‘shade’ at the top of the list of attributes (before ‘hair’), because its values will be listed first in each instance:

@ATTRIBUTE 'shade' {yes, no}

The values for ‘shade’ should be listed at the start of each instance. For example, the first instance:
blonde, average,light, no, burned

becomes:
no,blonde, average,light, no, burned

Values (in order, top to bottom) for each instance are as follows:

no, no, no, no, no, no, no, no, no, no, no, yes, yes, no, no, no

Accordingly, update each instance in the file ‘sunburn2.arff’ and then save the file.

In WEKA Explorer click the ‘Preprocess’ tab and then click ‘Open File’. Select the new file ‘sunburn2.arff’.
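If you would rather script the edit than type it by hand, the following Python sketch shows one possible approach. It works on the text of the file, inserting the new declaration before the first attribute and prepending one value to each data line. The helper name add_leading_attribute and the two-row sample text are hypothetical; only the ‘shade’ declaration, the list of 16 values and the first data row come from the instructions above.

```python
# The 16 'shade' values, in the order given above (top to bottom).
SHADE_VALUES = ["no"] * 11 + ["yes", "yes"] + ["no"] * 3


def add_leading_attribute(arff_text, declaration, values):
    """Insert `declaration` before the first @attribute line and prepend one
    value to each data line. Comment and blank lines are left untouched.
    Raises IndexError if there are more data lines than values."""
    out = []
    remaining = list(values)
    declared = False
    in_data = False
    for line in arff_text.splitlines():
        stripped = line.strip()
        lower = stripped.lower()
        if in_data and stripped and not stripped.startswith("%"):
            line = remaining.pop(0) + "," + line
        elif not declared and lower.startswith("@attribute"):
            out.append(declaration)
            declared = True
        elif lower.startswith("@data"):
            in_data = True
        out.append(line)
    if remaining:
        raise ValueError("more values than data lines")
    return "\n".join(out) + "\n"


# Small made-up sample; the second data row and the value sets are
# assumptions for this sketch, not the contents of sunburn2.arff.
sample = """\
@relation sunburn
@attribute hair {blonde, brown, red}
@attribute burned {burned, none}
@data
blonde, average,light, no, burned
brown, tall, heavy, yes, none
"""

updated = add_leading_attribute(
    sample, "@ATTRIBUTE 'shade' {yes, no}", ["no", "yes"]
)
print(updated)
```

To apply the same idea to the real file, you could read its text with open(), pass SHADE_VALUES as the values, and write the result back; keep a copy of the original file first.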

Step 7:

Repeat Step 3 and use J48 to create a new decision tree with this file.

Q10. Does the classification accuracy increase or decrease for this new file?

Q11. Does the J48 decision tree change? If so, in what way?

Step 8:

In WEKA Explorer stay in the ‘Classify’ tab. Select the ‘Supplied test set’ radio button and click the ‘Set’ button, followed by the ‘Open file’ button. Choose and open the file ‘sunburn2TEST.arff’ and click ‘Close’.

Click the ‘More options’ button and ensure there is a tick beside ‘Output predictions’, then press ‘OK’.

Right-click the ‘trees.J48’ entry in the ‘Result list’ and select ‘Re-evaluate model on current test set’.

The prediction results will appear in the ‘Classifier output’ window under the heading ‘Predictions on test set’.

Compare the predictions to the instances in the file ‘sunburn2TEST.arff’.

Q12. Are the predictions reasonable? Are the predictions as you would expect?

Attachment:- Practical1.rar
