Book excerpt: Java Data Mining concepts

Familiarize yourself with data mining functions and algorithms

Data mining has its origins in conventional artificial intelligence, machine learning, statistics, and database technologies, so much of its terminology and many of its concepts derive from these fields. This article, an excerpt from Java Data Mining: Strategy, Standard, and Practice by Mark F. Hornick, Erik Marcade, and Sunil Venkayala (Morgan Kaufmann, 2007), introduces data mining concepts for those new to the field and familiarizes data mining experts with the terminology and capabilities specific to the Java Data Mining API (JDM). It details and expands, by example, the concepts associated with mining functions and algorithms. Although we discuss the algorithms at a high level to give some intuition about how each one works, a detailed discussion of data mining algorithms is beyond the scope of this article.

Note: This excerpt was printed with permission from Morgan Kaufmann, a division of Elsevier. Copyright 2007. Java Data Mining: Strategy, Standard, and Practice by Mark F. Hornick, Erik Marcade, and Sunil Venkayala. For more information about this title and other similar books, please visit www.mkp.com.

This article explores data mining concepts in financial services by using a business problem faced by a hypothetical consumer bank called ABCBank. ABCBank provides banking and financial services for individuals and corporate businesses. It has branch offices throughout the country. In addition, it has online banking services for its customers. ABCBank offers products such as bank accounts for checking, savings, and certificates, and many types of credit cards, loans, and other financial services. ABCBank has a diverse customer base, distributed nationally. Customer demographics vary widely in income levels, education and professional qualifications, ethnic backgrounds, age groups, and family status.

This article introduces a business problem faced by ABCBank, its solution, and the concepts associated with the related mining function. While developing a solution for the problem, we discuss the concepts related to the data mining technique used to solve it. We follow a common description pattern for the problem, starting with a problem definition, solution approach, data description, available settings for tuning the solution, and an overview of relevant algorithms. For supervised functions, we also describe how to evaluate a model's performance, and apply a model to obtain prediction results. For unsupervised functions, we describe model content and how to use models to solve the problem.

Problem definition: How to reduce customer attrition

ABCBank is losing customers to its competitors and wants to gain a better understanding of the type of customers who are closing their accounts. ABCBank also wants to be proactive in retaining existing customers by taking appropriate measures to improve customer satisfaction. This is commonly known as the customer attrition problem in the financial services industry.

Solution approach: Predict customers who are likely to attrite

ABCBank can use customer data collected in its transactional and analytical databases to find the patterns associated with customers likely, or unlikely, to attrite. Using the data mining classification function, ABCBank can predict customers who are likely to attrite and understand the characteristics, or profiles, of such customers. Gaining a better understanding of customer behavior enables ABCBank to develop business plans to retain customers.

Classification is used to assign cases, such as customers, to discrete values, called classes or categories, of the target attribute. The target is the attribute whose values are predicted using data mining. In this problem, the target is the attribute attrite with two possible values: Attriter and Non-attriter. When referring to the model build dataset, the value Attriter indicates that the customer closed all accounts, and Non-attriter indicates the customer has at least one account at ABCBank. When referring to the prediction in the model apply dataset, the value Attriter indicates that the customer is likely to attrite and Non-attriter indicates that the customer is not likely to attrite. The prediction is often associated with a probability indicating how likely the customer is to attrite. When a target attribute has only two possible values, the problem is referred to as a binary classification problem. When a target attribute has more than two possible values, the problem is known as a multiclass classification problem.

Data specification: CUSTOMERS dataset

An important step in any data mining project is to collect related data from enterprise data sources. Identifying which attributes should be used for data mining is one of the challenges faced by the data miner and relies on appropriate domain knowledge of the data. In this example, we introduce a subset of possible customer attributes as listed in Table 1. In real-world scenarios, there may be hundreds or even thousands of customer attributes available in enterprise databases.

Table 1 lists physical attribute details of the CUSTOMERS dataset, which include name, datatype, and description. The attribute name refers to either a column name of a database table or a field name of a flat file. The attribute data type refers to the allowed type of values for that attribute. JDM defines integer, double, and string data types, which are commonly used data types for mining. JDM conformance rules allow a vendor to add more data types if required. Attribute description can be used to explain the meaning of the attribute or describe the allowed values. In general, physical data characteristics are captured by database metadata.

Table 1. Customers Table physical attribute details

Attribute name Data type Attribute description
CUST_ID INTEGER Unique customer identifier
NAME STRING Name of the customer
ADDRESS STRING Address of the customer
CITY STRING City of residence
COUNTY STRING County
STATE STRING State
EDU STRING Educational level, e.g., diploma, bachelor’s, master’s, Ph.D.
MAR_STATUS STRING Marital status, e.g., married, single, widowed, divorced
OCCUPATION STRING Occupation of the customer, e.g., clerical, manager, sales, etc.
INCOME DOUBLE Annual income in thousands of dollars
ETHNIC_GROUP STRING Ethnic group
AGE DOUBLE Age
CAP_GAIN DOUBLE Current capital gains or losses
SAV_BALANCE DOUBLE Average monthly savings balance
CHECK_BALANCE DOUBLE Average monthly checking balance
RETIRE_BALANCE DOUBLE Current retirement account balance
MORTGAGE_AMOUNT DOUBLE Current mortgage/home loan balance
NAT_COUNTRY STRING Native country
CREDIT_RISK STRING Relative credit risk, e.g., high, medium, low
ATTRITE STRING The target attribute indicating whether a customer will attrite or not. Values include “attriter” and “non-attriter.”

Users may also specify logical attribute characteristics specific to data mining. For example, physical attribute names in the table or file can be cryptic: HHSIZE, for instance, stands for household size, the number of people living as one family. Users can map physical names to logical names that are more descriptive and hence easier to understand. Logical data characteristics also include the data mining attribute type, attribute usage type, and data preparation type, which indicate how these attributes should be interpreted in data mining operations. Table 2 lists the logical data specification details for the CUSTOMERS dataset shown in Table 1.

The attribute type indicates the attribute data characteristics, such as whether the attribute should be treated as numerical, categorical, or ordinal. Numerical attributes are those whose values should be treated as continuous numbers. Categorical attributes are those where attribute values correspond to discrete, nominal categories. Ordinal attributes are also those with discrete values, but their order is significant. In Table 2, the attribute type column specifies attributes such as city, county, state, education, and marital status as categorical attributes. The attribute capital gains is a numerical attribute as it has continuous data values, such as $12,500.94. The attribute credit risk is an ordinal attribute as it has high, medium, or low as ordered relative values.

The attribute usage type specifies whether an attribute is active (used as input to mining), inactive (excluded from mining), or supplementary (brought forward with the input values but not used explicitly for mining). In Table 2, the usage type column specifies the attributes customer ID, name, and address as inactive because these attributes are identifiers or will not generalize to predict whether a customer is an attriter. All other attributes are active and used as input for data mining. In this example, we have not included supplementary attributes. However, consider a derived attribute computed as the capital gains divided by the square of age, called ageCapitalGainRatio. From the user's perspective, if the derived attribute ageCapitalGainRatio appears in a model rule, it may be difficult to interpret the underlying values as they relate to the business. In such a case, the model can reference supplementary attributes, for example, age and capital gain. Although these supplementary attributes are not directly used in the model build, they can be presented in model details to facilitate rule understanding using the corresponding values of age and capital gain.
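As a worked illustration of the derived attribute described above, the following minimal sketch computes the ratio as capital gains divided by the square of age. The class name, method name, and sample values are assumptions made for illustration; nothing here is part of the JDM API.

```java
// Hypothetical helper for illustration only; not a JDM class.
class DerivedAttributes {

    // Capital gains divided by the square of age, as defined in the text.
    static double ageCapitalGainRatio(double capGain, double age) {
        if (age <= 0) {
            throw new IllegalArgumentException("age must be positive");
        }
        return capGain / (age * age);
    }

    public static void main(String[] args) {
        // Made-up customer: age 50 with capital gains of 12,500.
        // 12500 / (50 * 50) = 5.0
        System.out.println(ageCapitalGainRatio(12500.0, 50.0));
    }
}
```

A rule that mentions a ratio of 5.0 says little on its own, which is why carrying age and capital gain along as supplementary attributes helps when reading model details.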

Table 2. Customers Table logical data specification

Attribute name Logical name Attribute type Usage type Preparation
CUST_ID Customer ID   Inactive  
NAME Name   Inactive  
ADDRESS Address   Inactive  
CITY City Categorical Active Prepared
COUNTY County Categorical Active Prepared
STATE State Categorical Active Prepared
EDU Education Categorical Active Prepared
MAR_STATUS Marital status Categorical Active Prepared
OCCUPATION Occupation Categorical Active Prepared
INCOME Annual income level Numerical Active Not prepared
ETHNIC_GROUP Ethnic group Categorical Active Prepared
AGE Age Numerical Active Not prepared
CAP_GAIN Capital gains Numerical Active Not prepared
SAV_BALANCE Avg. savings balance Numerical Active Not prepared
CHECK_BALANCE Avg. checking balance Numerical Active Not prepared
RETIRE_BALANCE Retirement balance Numerical Active Not prepared
MORTGAGE_AMOUNT Home loan balance Numerical Active Not prepared
NAT_COUNTRY Native country Categorical Active Prepared
CREDIT_RISK Credit risk Ordinal Active Prepared
ATTRITE Attrite Target    
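To make the logical data specification more concrete, here is a minimal sketch that models a few rows of Table 2 as plain Java objects and selects the active attributes. The types LogicalAttribute, AttributeType, and UsageType are hypothetical stand-ins written for this example; they are not the javax.datamining classes, and a real JDM implementation will expose its own interfaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative model of a logical data specification (assumed types, not JDM).
class LogicalSpecSketch {

    enum AttributeType { NUMERICAL, CATEGORICAL, ORDINAL }

    enum UsageType { ACTIVE, INACTIVE, SUPPLEMENTARY }

    static final class LogicalAttribute {
        final String physicalName;
        final String logicalName;
        final AttributeType type;  // null where Table 2 leaves the cell empty
        final UsageType usage;
        final boolean prepared;    // false also stands in for an empty cell

        LogicalAttribute(String physicalName, String logicalName,
                         AttributeType type, UsageType usage, boolean prepared) {
            this.physicalName = physicalName;
            this.logicalName = logicalName;
            this.type = type;
            this.usage = usage;
            this.prepared = prepared;
        }
    }

    // A few rows copied from Table 2.
    static List<LogicalAttribute> sampleSpec() {
        return Arrays.asList(
            new LogicalAttribute("CUST_ID", "Customer ID", null,
                                 UsageType.INACTIVE, false),
            new LogicalAttribute("CITY", "City", AttributeType.CATEGORICAL,
                                 UsageType.ACTIVE, true),
            new LogicalAttribute("INCOME", "Annual income level",
                                 AttributeType.NUMERICAL, UsageType.ACTIVE, false),
            new LogicalAttribute("CREDIT_RISK", "Credit risk",
                                 AttributeType.ORDINAL, UsageType.ACTIVE, true));
    }

    // Physical names of the attributes that would be used as mining input.
    static List<String> activeInputs(List<LogicalAttribute> spec) {
        List<String> names = new ArrayList<>();
        for (LogicalAttribute a : spec) {
            if (a.usage == UsageType.ACTIVE) {
                names.add(a.physicalName);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(activeInputs(sampleSpec())); // [CITY, INCOME, CREDIT_RISK]
    }
}
```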

In addition to the usual extraction, transformation, and loading (ETL) operations used for loading and transforming data, data mining can involve algorithm-specific data preparation. Such data preparation includes transformations such as binning and normalization. One may choose to prepare data manually to leverage domain-specific knowledge or to fine-tune data to improve results. The data preparation type is used to indicate whether data is manually prepared. In Table 2, the preparation column lists which attributes are already prepared for model building. (Note: ETL is the process of extracting data from operational or external data sources; transforming the data through cleansing, aggregation, summarization, integration, and other transformations; and loading the data into a data mart or data warehouse.)
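The two transformations named above can be sketched in a few lines. The example below shows min-max normalization and equal-width binning, which are only two of several common variants; the class name, method names, and the age range of 20 to 70 are assumptions chosen for illustration and do not come from JDM or from the CUSTOMERS data.

```java
// Illustrative data preparation helpers (assumed names, not JDM).
class PrepSketch {

    // Min-max normalization: maps [min, max] linearly onto [0, 1].
    static double minMaxNormalize(double x, double min, double max) {
        if (max <= min) {
            throw new IllegalArgumentException("max must exceed min");
        }
        return (x - min) / (max - min);
    }

    // Equal-width binning: splits [min, max] into numBins intervals of the
    // same width and returns the zero-based bin index. Values outside the
    // range are clamped into the first or last bin.
    static int equalWidthBin(double x, double min, double max, int numBins) {
        if (max <= min || numBins <= 0) {
            throw new IllegalArgumentException("invalid range or bin count");
        }
        int bin = (int) Math.floor((x - min) / (max - min) * numBins);
        return Math.max(0, Math.min(numBins - 1, bin));
    }

    public static void main(String[] args) {
        // Assumed age range of 20 to 70 for this example.
        System.out.println(minMaxNormalize(45.0, 20.0, 70.0));  // 0.5
        System.out.println(equalWidthBin(45.0, 20.0, 70.0, 5)); // 2
    }
}
```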

Specify settings: Fine-tune the solution to the problem

After exploring attribute values in the CUSTOMERS dataset, the data miner found some oddities in the data. The capital gains attribute has some extreme values that are out of range from the general population. Figure 1 illustrates the distribution of capital gains values in the data. Note that there are very few customers who have capital gains greater than $1,000,000; in this example such values are treated as outliers. Outliers are the values of a given attribute that are unusual compared to the rest of that attribute's data values. For example, if customers have capital gains over 1 million dollars, these values could skew mining results involving the attribute capital gains.

In this example, the capital gains attribute has a valid range of $2,000 to $1,000,000 based on the value distribution, shown in Figure 1. In JDM, we use outlier identification settings to specify the valid range, or interval, to identify outliers for the model building process. Some data mining engines (DMEs) automatically identify and treat outliers as part of the model building process. JDM allows data miners to specify an outlier treatment option per attribute to inform algorithms how to treat outliers in the build data. The outlier treatment specifies whether attribute outlier values are treated asMissing (should be handled as missing values) or asIs (should be handled as the original values). Based on the problem requirements and vendor-specific algorithm implementations, data miners can either explicitly choose the outlier treatment or leave it to the DME.

Figure 1. Capital gains value distribution
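The effect of the two outlier treatment options can be sketched as follows, using the valid range of $2,000 to $1,000,000 given above. This is a minimal illustration rather than JDM code: the enum and method names are hypothetical, and representing a missing value as Double.NaN is an assumption made for this example.

```java
// Illustrative outlier handling (assumed names, not JDM).
class OutlierSketch {

    enum OutlierTreatment { AS_MISSING, AS_IS }

    // A value is an outlier when it falls outside the valid interval.
    static boolean isOutlier(double value, double lower, double upper) {
        return value < lower || value > upper;
    }

    // Returns a copy of the values with the chosen treatment applied.
    // Assumption: a missing value is represented as Double.NaN.
    static double[] treat(double[] values, double lower, double upper,
                          OutlierTreatment treatment) {
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            boolean outlier = isOutlier(values[i], lower, upper);
            out[i] = (outlier && treatment == OutlierTreatment.AS_MISSING)
                    ? Double.NaN
                    : values[i];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] capGain = {5000.0, 2500000.0, 1500.0}; // made-up values
        double[] treated = treat(capGain, 2000.0, 1000000.0,
                                 OutlierTreatment.AS_MISSING);
        System.out.println(java.util.Arrays.toString(treated));
    }
}
```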

In assessing the data, the data miner noticed that the state attribute has some invalid entries. All ABCBank customers who are U.S. residents must have a state value that is the two-letter abbreviation of one of the 50 states or the District of Columbia. To indicate valid attribute values to the model build, a category set can be specified in the logical data specification. The category set characterizes the values found in a categorical attribute. In this example, the category set for the state attribute contains values {AL, AK, AZ, ..., WY}. State values that are not in this set are considered invalid during the model build, and may either be treated as missing or cause the build to terminate.
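A category set check of this kind can be sketched as below. The set is deliberately truncated to a handful of abbreviations, and the class, enum, and method names are hypothetical; how a given DME actually reacts to invalid values is vendor-specific.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative category set validation (assumed names, not JDM).
class CategorySetSketch {

    // Truncated for brevity; a real set would list every valid abbreviation.
    static final Set<String> STATE_SET =
            new HashSet<>(Arrays.asList("AL", "AK", "AZ", "CA", "NY", "WY"));

    enum InvalidPolicy { TREAT_AS_MISSING, FAIL }

    // Returns the value if it is in the category set. Otherwise the value is
    // either replaced by null (our stand-in for "missing") or rejected.
    static String validate(String value, Set<String> categorySet,
                           InvalidPolicy policy) {
        if (value == null || categorySet.contains(value)) {
            return value; // already missing, or valid
        }
        if (policy == InvalidPolicy.FAIL) {
            throw new IllegalArgumentException("Invalid category: " + value);
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(validate("CA", STATE_SET, InvalidPolicy.TREAT_AS_MISSING));
        System.out.println(validate("XX", STATE_SET, InvalidPolicy.TREAT_AS_MISSING));
    }
}
```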

Our CUSTOMERS dataset has a disproportionate number of Non-attriters: 20 percent of the cases are Attriters, and 80 percent are Non-attriters. To build an unbiased model, the data miner uses stratified sampling to balance the input dataset so that it contains an equal number of cases for each target value. In JDM, prior probabilities are used to represent the original distribution of the target values. The prior probabilities should be specified whenever the original target value distribution has been changed, so that the algorithm can take them into account. However, not all algorithms support prior probability specification, so you will need to consult a given tool's documentation.
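The sketch below illustrates both ideas on a toy dataset with the same 20/80 split: it computes the prior probabilities from the original target column, then balances the data by down-sampling each target value to the size of the smallest group. Down-sampling with a seeded shuffle is only one way to do stratified sampling, and all names here are hypothetical rather than taken from JDM.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Illustrative balancing and prior computation (assumed names, not JDM).
class BalanceSketch {

    // Fraction of cases per target value in the ORIGINAL data.
    static Map<String, Double> priors(String[] targets) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String t : targets) {
            counts.merge(t, 1, Integer::sum);
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            result.put(e.getKey(), e.getValue() / (double) targets.length);
        }
        return result;
    }

    // Row indices of a balanced sample: the same number of randomly chosen
    // cases for every target value, limited by the smallest group.
    static List<Integer> balancedIndices(String[] targets, long seed) {
        Map<String, List<Integer>> strata = new TreeMap<>();
        for (int i = 0; i < targets.length; i++) {
            strata.computeIfAbsent(targets[i], k -> new ArrayList<>()).add(i);
        }
        int perClass = Integer.MAX_VALUE;
        for (List<Integer> rows : strata.values()) {
            perClass = Math.min(perClass, rows.size());
        }
        Random rnd = new Random(seed);
        List<Integer> selected = new ArrayList<>();
        for (List<Integer> rows : strata.values()) {
            Collections.shuffle(rows, rnd);
            selected.addAll(rows.subList(0, perClass));
        }
        return selected;
    }

    public static void main(String[] args) {
        String[] targets = new String[100];
        for (int i = 0; i < targets.length; i++) {
            targets[i] = i < 20 ? "Attriter" : "Non-attriter";
        }
        System.out.println(priors(targets));                     // original distribution
        System.out.println(balancedIndices(targets, 42L).size()); // 40 rows, 20 per value
    }
}
```

The priors computed from the original data are what would be handed to an algorithm that supports them, so that predictions made from the balanced sample can still reflect the true 20/80 distribution.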

ABCBank management informed the data miner that it is more expensive when an Attriter is misclassified, that is, predicted as a Non-attriter. This is because losing an existing customer and acquiring a new one costs much more than trying to retain an existing customer. To capture this, JDM allows the specification of a cost matrix, an N x N table that defines the cost associated with each incorrect prediction, where N is the number of possible target values. In this example, the data miner specifies a cost matrix indicating that predicting a customer would not attrite when in fact he would is three times costlier than predicting the customer would attrite when he actually would not. The cost matrix for this problem is illustrated in Figure 2.

Figure 2. Cost matrix table
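One common way to use such a cost matrix, sketched below, is to choose the prediction with the lowest expected cost given the predicted probabilities. The sketch uses relative costs of 3 and 1 to reflect the three-to-one ratio described above. Whether and how a particular DME applies costs is vendor-specific, so this is an illustration of the idea under that assumption, with hypothetical names, and not a description of any JDM implementation.

```java
// Illustrative cost-sensitive decision (assumed names, not JDM).
class CostMatrixSketch {

    static final String[] CLASSES = {"Attriter", "Non-attriter"};

    // COST[actual][predicted], in relative units. Missing an Attriter is
    // three times as costly as flagging a Non-attriter; correct predictions
    // on the diagonal cost nothing.
    static final double[][] COST = {
        {0.0, 3.0},  // actual Attriter
        {1.0, 0.0}   // actual Non-attriter
    };

    // Expected cost of making the given prediction, weighted by the
    // probability of each actual class.
    static double expectedCost(double[] probs, int predicted) {
        double cost = 0.0;
        for (int actual = 0; actual < probs.length; actual++) {
            cost += probs[actual] * COST[actual][predicted];
        }
        return cost;
    }

    // Picks the class with the lowest expected cost; ties keep the first class.
    static String predict(double probAttriter) {
        double[] probs = {probAttriter, 1.0 - probAttriter};
        int best = 0;
        for (int k = 1; k < CLASSES.length; k++) {
            if (expectedCost(probs, k) < expectedCost(probs, best)) {
                best = k;
            }
        }
        return CLASSES[best];
    }

    public static void main(String[] args) {
        System.out.println(predict(0.30)); // Attriter
        System.out.println(predict(0.20)); // Non-attriter
    }
}
```

With these costs, a customer is flagged as an Attriter once the attrition probability exceeds 0.25 rather than 0.5, which is how the cost matrix shifts the model toward catching more potential attriters.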

In this example, we are more interested in the customers who are likely to attrite, so the Attriter value is considered the positive target value, that is, the value we are interested in predicting. The positive target value is necessary when computing lift and the ROC test metric. The Non-attriter value is considered the negative target value. This allows us to use the terminology false positive and false negative. A false positive (FP) occurs when a case is known to have the negative target value, but the model predicts the positive target value. A false negative (FN) occurs when a case is known to have the positive target value, but the model predicts the negative target value. The true positives are the cases where the predicted and actual positive target values are in agreement, and the true negatives are the cases where the predicted and actual negative target values are in agreement. In Figure 2, note that the false negative cost is $150, the false positive cost is $50, and all diagonal elements have cost "0," because there is no cost for correct predictions.
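The four outcomes defined above can be counted directly from actual and predicted target values, and the costs from Figure 2 can then be applied to the errors. The following is a minimal sketch with made-up predictions; the class and method names are hypothetical and not part of JDM.

```java
// Illustrative confusion counts and misclassification cost (assumed names).
class ConfusionSketch {

    // Returns {TP, FP, FN, TN} relative to the given positive target value.
    static int[] confusion(String[] actual, String[] predicted, String positive) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        int tp = 0, fp = 0, fn = 0, tn = 0;
        for (int i = 0; i < actual.length; i++) {
            boolean actualPos = actual[i].equals(positive);
            boolean predictedPos = predicted[i].equals(positive);
            if (actualPos && predictedPos) {
                tp++;
            } else if (!actualPos && predictedPos) {
                fp++;
            } else if (actualPos) {
                fn++;
            } else {
                tn++;
            }
        }
        return new int[] {tp, fp, fn, tn};
    }

    // Only errors carry a cost; correct predictions cost nothing.
    static double totalCost(int[] counts, double fpCost, double fnCost) {
        return counts[1] * fpCost + counts[2] * fnCost;
    }

    public static void main(String[] args) {
        // Made-up cases: A = Attriter, N = Non-attriter.
        String[] actual    = {"A", "A", "N", "N", "N"};
        String[] predicted = {"A", "N", "A", "N", "N"};
        int[] c = confusion(actual, predicted, "A");
        // One FP at $50 plus one FN at $150.
        System.out.println(totalCost(c, 50.0, 150.0)); // 200.0
    }
}
```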
