
Data Mining in the Real World

"I'm not really a contrarian about data mining. I believe in it. After all, it's my career. But data mining in the real world is a lot different from the way it's described in textbooks.

"There are many reasons it's different. One is that the data are always dirty, with missing values, values way out of the range of possibility, and time values that make no sense. Here's an example: Somebody sets the server system clock incorrectly and runs the server for a while with the wrong time. When they notice the mistake, they set the clock to the correct time. But all of the transactions that were running during that interval have an ending time before the starting time. When we run the data analysis and compute elapsed time, the results are negative for those transactions.

"Missing values are a similar problem. Consider the records of just 10 purchases. Suppose that two of the records are missing the customer number and one is missing the year part of the transaction date. So you throw out three records, which is 30 percent of the data. You then notice that two more records have dirty data, and so you throw them out, too. Now you've lost half your data.

"Another problem is that you know the least when you start the study. So you work for a few months and learn that if you had another variable, say the customer's ZIP code, or age, or something else, you could do a much better analysis. But those other data just aren't available. Or maybe they are available, but to get the data you have to reprocess millions of transactions, and you don't have the time or budget to do that.

"Overfitting is another problem, a huge one. I can build a model to fit any set of data you have. Give me 100 data points and in a few minutes, I can give you 100 different equations that will predict those 100 data points. With neural networks, you can create a model of any level of complexity you want, except that none of those equations will predict new cases with any accuracy at all. When using neural nets, you have to be very careful not to overfit the data.

"Then, too, data mining is about probabilities, not certainty. Bad luck happens. Say I build a model that predicts the probability that a customer will make a purchase. Using the model on new-customer data, I find three customers who have a 0.7 probability of buying something. That's a good number, well over a 50-50 chance, but it's still possible that none of them will buy. In fact, the probability that none of them will buy is 0.3 × 0.3 × 0.3, or 0.027, which is 2.7 percent.

"Now suppose I give the names of the three customers to a salesperson who calls on them, and sure enough, we have a stream of bad luck and none of them buys. This bad result doesn't mean the model is wrong. But what does the salesperson think? He thinks the model is worthless and that he can do better on his own. He tells his manager, who tells her associate, who tells the Northeast Region, and sure enough, the model has a bad reputation all across the company.

"Another problem is seasonality. Say all your training data are from the summer. Will your model be valid for the winter? Maybe, but maybe not. You might even know that it won't be valid for predicting winter sales, but if you don't have winter data, what do you do?

"When you start a data mining project, you never know how it will turn out. I worked on one project for 6 months, and when we finished, I didn't think our model was any good. We had too many problems with data: wrong, dirty, and missing. There was no way we could know ahead of time that it would happen, but it did.

"When the time came to present the results to senior management, what could we do? How could we say we took 6 months of our time and substantial computer resources to create a bad model? We had a model, but I just didn't think it would make accurate predictions. I was a junior member of the team, and it wasn't for me to decide. I kept my mouth shut, but I never felt good about it. Fortunately, the project was cancelled later for other reasons.

"However, I'm only talking about my bad experiences. Some of my projects have been excellent. On many, we found interesting and important patterns and information, and a few times I've created very accurate predictive models. It's not easy, though, and you have to be very careful. Also, lucky!"
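The two pieces of arithmetic in the account above (the share of records thrown out, and the chance that none of three likely buyers makes a purchase) can be checked with a short script. This is only a sketch: the function names are made up for illustration, and multiplying the probabilities, as the speaker does, assumes that the three customers decide independently of one another.

```python
from fractions import Fraction


def share_discarded(total_records: int, discarded: int) -> Fraction:
    """Fraction of the records lost after unusable ones are thrown out."""
    if total_records <= 0 or not 0 <= discarded <= total_records:
        raise ValueError("discarded must be between 0 and total_records")
    return Fraction(discarded, total_records)


def prob_none_buy(p_buy: Fraction, customers: int) -> Fraction:
    """Probability that no customer buys, ASSUMING independent decisions."""
    return (1 - p_buy) ** customers


# Ten purchase records: three have missing fields, two more are dirty.
after_missing = share_discarded(10, 3)
after_dirty = share_discarded(10, 3 + 2)

# Three customers, each with a 0.7 probability of buying.
none_buy = prob_none_buy(Fraction(7, 10), 3)

print(f"Lost to missing values: {float(after_missing):.0%}")   # 30%
print(f"Lost after dirty data too: {float(after_dirty):.0%}")  # 50%
print(f"P(none of the three buys): {float(none_buy):.3f}")     # 0.027
```

Exact fractions are used so that 0.3 × 0.3 × 0.3 comes out as exactly 27/1000 rather than a rounded floating-point value. If the customers' decisions were correlated (for example, all three work at the same firm), the 2.7 percent figure would no longer hold.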

Discussion Questions

1. Summarize the concerns expressed by this contrarian.

2. Do you think the concerns raised here are sufficient to avoid data mining projects altogether?

3. If you were a junior member of a data mining team and you thought that the model that had been developed was ineffective, maybe even wrong, what would you do? If your boss disagrees with your beliefs, would you go higher in the organization? What are the risks of doing so? What else might you do?

