Ask Computer Engineering Expert

Data Warehousing and Data Mining:

Assignment 1: Analysis of Czech Bank Data

Learning Outcomes

This assignment will be assessed upon your ability to:

LO1. Describe and critically evaluate the role and relevance of [analytical investigation] to the solution of business information problems.

LO2. Explain the concepts that underpin the subject area of [data mining] making reference to main established concepts and some developing areas.

LO3. Apply concepts and justify decisions when modelling, designing and constructing practical examples or paper descriptions of applications in this area.

Notes

- You are required to submit this assignment twice once through turn-it-in, to generate a turn-it-in report and again, through the SHU assignment submission area so that the work can be easily marked. These will be made available to you on blackboard in due course.
- All references must follow the APA‡ format.

Problem outline

For this assignment you are required to analyse a data set taken from the data mining competition prior to the third international conference of Principles and Practices of knowledge discovery in data bases (PKDD). This conference was held in Prague in 1999. One of the challenges given for the competition was a set of datasets concerning financial transactions and details for customers at a Czech bank. Several of these files have been combined to give 4 500 observations of various financial and personal details relating to different accounts. Details of the variables included are given below.

Details of the relationships are shown below:

1369_Figure.jpg

Figure 1: Database Relationships for Czech Bank Data

You are required to analyse one table resulting from a query from this database as detailed below. Full details of the fields in this table are given below and in the appendix.

Data Provided

The final query is saved as a SAS dataset for use in Enterprise Miner. It is called czech16.sas7bdat. It is available on the SHU server in the path: D:\SASMetaShare\!SharedData\DWDM. You will need to create a library to access the data.

Details of the Query and resulting data

We wish to investigate if there are any groups of accounts with similar properties. Also we wish to determine which accounts have a loan attached to that account. For this purpose a subset of variables (fields) are selected from the final query on the database for each account.
These variables can be seen to represent, for each account different types of credits and withdrawals that take place:

Credits (this is where customers pay into their account):
- Cash
- Salary
- Pension
- Interest

Withdrawal (taking money out i.e. customers spend their money):
- Cash
- Insurance payment
- Credit Card
- Overdraft Penalty
- Statement Payment
- Household Payment
- Other bank withdrawal

All amounts are measured in Czech Koruna (Kc). For each of these the mean size of the payment is calculated as the total value of all payments over the total number of payments. So for example the variable for mean insurance payments (minsure) is calculated as the total value of insurance payments divided by the number of insurance payments. For an account which has made ten monthly insurance payments of £12, and then a further 10 of £15 (perhaps their premium went up), the value of minsure would be:

2246_Figure1.jpg

In addition to the mean for each of the variables above the average gap between cash credits (mcashcrcgap) and cash withdrawals(mcashwdcgap) is also included. (These are NOT included for the other credit/debit methods). So if an account shows a cash payment in (cash credit) each week on the same day, then the value for mcashcrcgap would be 7 days, an account with no cash credits or only one would have mcashcrcgap=0. Similarly, suppose an account shows a gap of 3 days between the first cash withdrawal and the second withdrawal.

Then if subsequent cash withdrawals occur again after 5 further days, 4 further days and finally after a further 10 days; then the value of mcashwdcgap = (3 + 5 + 4 + 10)/4 = 5.5 days.

Finally additional information is held about each account:

Account id, Age of primary account holder, if they have a credit card or not (with this bank), number of days account open, if they have a loan or not, if there is a second user of the account and the gender of the main account holder (sex). This gives the set of variables as shown in the appendix

Questions

The bank wishes to see if different customers have similar profiles and have therefore asked that the czech16 data be clustered. They are looking for about five clusters. You will need to build an Enterprise Miner diagram and use appropriate nodes to answer these questions. A typical diagram is shown in Figure 3 on page 7.

1) Since cluster analysis requires the use of fields that are symmetrical as possible you should first investigate each of the interval fields in the czech16 data.

a) Produce suitable summary measures (means, standard deviation, skewness etc.) of the data and fully interpret your results. (No marks will be awarded for discussing the kurtosis). For example, what is the largest average type of credit? Which is the most consistent? What about withdrawals? Are there any other interesting features?

b) Produce suitable plots of each of these fields. Are there any unusual features to any of your plots? By examining the actual data records can you discern why these features might occur? Are they what you might expect?

(You may initially find it best to achieve this with the "explore" feature available within Enterprise Miner, to obtain detailed statistics use a StatExplore node with loan set as input. Remember to reset loan back to target for building the tree in question 9) below).

2) Also investigate the remaining binary variables by producing suitable plots. Fully discuss your results.

3) Use the transform node in Enterprise Miner and the "Maximum Normal" option for interval variables to find suitable transformations of the interval variables. The aim is to make them as symmetrical as possible. You should ensure that in your settings, you still retain a copy of the original variables (set both Hide and Reject to "no").

a) Explain what actual transformations the software has picked.

b) Produce further plots of the transformed variables and use these to present evidence of whether the transformations have been successful. Comment clearly on your results. (Please note that it may not be possible to make all variables totally symmetrical). Consequently, state for each interval variable whether subsequent analysis should use the original (untransformed) variable or the new transformed variable. Hence explain which set of interval variables you would use for clustering.

4) a) Clustering of these data has been carried out using code resulting in the dendrogram on the following page (Figure 2). Fully discuss this dendrogram and explain how many clusters you might pick. You should give all reasonable possibilities and then explain what your final choice is likely to be.

b) Use a clustering node to cluster the interval variables only. You should use the variable set that you specified in 3)b) above. Ensure that the resulting set of interval variables are suitably standardised. Set the "Initial Cluster Seed" options as follows:
- Seed Initialization Method → MacQueen
- Minimum Radius → 1
- Drift during Training → 1
Hence obtain a CCC plot for cluster solutions of the interval variables.

c) Fully discuss the CCC plot and clearly explain how many clusters you would pick.

5) Change the settings of the clustering node so as to fit five clusters. Discuss the resulting "segment size" plot and "Mean statistics" in the output.

1484_Figure 2.jpg

Figure 2: Dendrogram Produced Using Interval Variables Only

6) Produce suitable plots to investigate the nature of each cluster. Hence illustrate the validity of your cluster solution by profiling one of your clusters and interpret the factors that make it unique. Draw any conclusions regarding your cluster solution.

7) a) From within the cluster results, obtain the decision tree for your cluster solution. Discuss this tree explaining how many leaves it has, its depth and any other relevant features. (Little credit will be given for just writing the tree out in words.) Give two examples of how the results of the tree support your answer to question 6) above.

b) Explain the disadvantages and advantages of using this tree rather than a new tree, built by adding a tree node to the stream.

8) Use all of your results above to discuss how the cluster solution may be utilised by the bank. How may the results of any of the analyses above be utilised to carry out further supervised data mining? What other possible targets might be appropriate?

9) Add a partition node and a tree node to your diagram. Set the field loan as the target in the input (czech16) node. Connect the input (czech16) to the partition node and the partition node to the tree node. A typical final Enterprise Miner diagram is shown in Figure 2

1807_Figure 3.jpg

Figure 3

a) Run the tree node and examine the results.

b) Explain the type of customer that may be more likely to have a loan. How might this information be used by the bank? How reliable is this tree likely to be on new data?

c) A customer has the following profile:

Age =30 years, card = "N", days = 500 days, mInsure =0Kc, mInterest = 3Kc, moverdt = 0Kc, mpension = 0Kc, Mregsal = 15,000Kc, mstment = 15Kc, mcardwdl = 5Kc, mcashcr = 0Kc, mcashcrcgap = 40days, Mcashwd = 7,000Kc, mcashwdcgap = 30 days, mhousehold = 10Kc, mothbwd = 0Kc, second ="N", Sex = "F".

Use your tree to find out the probability they will have a loan. Hence would you predict that they would have a loan? Justify your answer.

10) Find an example in the academic literature (i.e. an article or paper from a reputable academic journal) where data mining has been used. You should select an article where either clustering or a decision tree or another data mining technique has been used. The article must be clearly referenced using the APA** format now used at SHU. In no more than 500 words discuss your chosen article. Briefly describe the situation in which it was applied, what was discovered and whether (with reasons) you think data mining was used
effectively.

Report

Although this assignment comprises various data analyses together with a discussion of a piece of academic literature, you should write all your findings in a technical report. This should include the technical reasoning of your findings but be written in a way that would be suitable for the data science section of the bank. It should include any relevant output either in the main body of the report or in a suitable appendix (as appropriate). Marks will be lost for aspects that depart from a professional report such as:
- using an inappropriate style for this purpose (e.g. use of first person),
- poor English,
- lack of numbered headings and sub-headings,
- lack of figure labels,
- being untidy,
- inappropriate screenshots
- no page numbers
- inappropriate balance between the appendix and the main report,
- including the questions in the assignment
- incorrect length
The report should not be of more than 3000 words including question 10.

APPENDIX 1

item

meaning

remark

account_id

Identification of the account

 

age

Age of primary account holder

 

card

Whether the account holder has a credit card with this bank

 

days

Total number of days between first transaction and last

Over time period of database which is from 1/1/93 - 31/12/98.

loan

=N       if the account has a loan

=Y       if the account does not have a loan

 

mInsure

Average of Insurance payments

 

 

 

Measured over the period of the transaction data base (see comment above). The shaded rows represent withdrawals (or debits) and the un-shaded credits (money paid into the account).

mInterest

Average value of interest payments

moverdt

Average value of overdraft payments

mpension

Average value of pension payments

mregsal

Average value of regular salary payments

mstment

Average value of statement payments

mcardwdl

Average value of credit card withdrawals

mcashcr

Average value of cash credits

mcashcrcgap

Average gap between cash credits

mcashwd

Average value of cash withdrawals

mcashwdcgap

Average gap between cash withdrawals

mhousehold

Average value of household payments

mothbwd

Average value of other bank payments

second

=N       if no second account holder

=Y       if there is a secondary account holder

 

sex

Gender of primary account holder

 

Attachment:- Data Mining Assignment.rar

Computer Engineering, Engineering

  • Category:- Computer Engineering
  • Reference No.:- M92230224
  • Price:- $90

Guranteed 48 Hours Delivery, In Price:- $90

Have any Question?


Related Questions in Computer Engineering

Does bmw have a guided missile corporate culture and

Does BMW have a guided missile corporate culture, and incubator corporate culture, a family corporate culture, or an Eiffel tower corporate culture?

Rebecca borrows 10000 at 18 compounded annually she pays

Rebecca borrows $10,000 at 18% compounded annually. She pays off the loan over a 5-year period with annual payments, starting at year 1. Each successive payment is $700 greater than the previous payment. (a) How much was ...

Jeff decides to start saving some money from this upcoming

Jeff decides to start saving some money from this upcoming month onwards. He decides to save only $500 at first, but each month he will increase the amount invested by $100. He will do it for 60 months (including the fir ...

Suppose you make 30 annual investments in a fund that pays

Suppose you make 30 annual investments in a fund that pays 6% compounded annually. If your first deposit is $7,500 and each successive deposit is 6% greater than the preceding deposit, how much will be in the fund immedi ...

Question -under what circumstances is it ethical if ever to

Question :- Under what circumstances is it ethical, if ever, to use consumer information in marketing research? Explain why you consider it ethical or unethical.

What are the differences between four types of economics

What are the differences between four types of economics evaluations and their differences with other two (budget impact analysis (BIA) and cost of illness (COI) studies)?

What type of economic system does norway have explain some

What type of economic system does Norway have? Explain some of the benefits of this system to the country and some of the drawbacks,

Among the who imf and wto which of these governmental

Among the WHO, IMF, and WTO, which of these governmental institutions do you feel has most profoundly shaped healthcare outcomes in low-income countries and why? Please support your reasons with examples and research/doc ...

A real estate developer will build two different types of

A real estate developer will build two different types of apartments in a residential area: one- bedroom apartments and two-bedroom apartments. In addition, the developer will build either a swimming pool or a tennis cou ...

Question what some of the reasons that evolutionary models

Question : What some of the reasons that evolutionary models are considered by many to be the best approach to software development. The response must be typed, single spaced, must be in times new roman font (size 12) an ...

  • 4,153,160 Questions Asked
  • 13,132 Experts
  • 2,558,936 Questions Answered

Ask Experts for help!!

Looking for Assignment Help?

Start excelling in your Courses, Get help with Assignment

Write us your full requirement for evaluation and you will receive response within 20 minutes turnaround time.

Ask Now Help with Problems, Get a Best Answer

Why might a bank avoid the use of interest rate swaps even

Why might a bank avoid the use of interest rate swaps, even when the institution is exposed to significant interest rate

Describe the difference between zero coupon bonds and

Describe the difference between zero coupon bonds and coupon bonds. Under what conditions will a coupon bond sell at a p

Compute the present value of an annuity of 880 per year

Compute the present value of an annuity of $ 880 per year for 16 years, given a discount rate of 6 percent per annum. As

Compute the present value of an 1150 payment made in ten

Compute the present value of an $1,150 payment made in ten years when the discount rate is 12 percent. (Do not round int

Compute the present value of an annuity of 699 per year

Compute the present value of an annuity of $ 699 per year for 19 years, given a discount rate of 6 percent per annum. As