Book excerpt: Java Data Mining concepts

Familiarize yourself with data mining functions and algorithms

Data mining has its origins in conventional artificial intelligence, machine learning, statistics, and database technologies, so much of its terminology and many of its concepts derive from those fields. This article, an excerpt from Java Data Mining: Strategy, Standard, and Practice by Mark F. Hornick, Erik Marcade, and Sunil Venkayala (Morgan Kaufmann, 2007), introduces data mining concepts for those new to data mining and familiarizes data mining experts with the terminology and capabilities specific to the Java Data Mining API (JDM). It details and expands, by example, the concepts associated with mining functions and algorithms. Although we discuss the higher-level workings of each algorithm to give some intuition about how it operates, a detailed treatment of data mining algorithms is beyond the scope of this article.

Note: This excerpt was printed with permission from Morgan Kaufmann, a division of Elsevier. Copyright 2007. Java Data Mining: Strategy, Standard, and Practice by Mark F. Hornick, Erik Marcade, Sunil Venkayala. For more information about this title and other similar books, please visit the publisher's website.

This article explores data mining concepts in financial services by using a business problem faced by a hypothetical consumer bank called ABCBank. ABCBank provides banking and financial services for individuals and corporate businesses. It has branch offices throughout the country. In addition, it has online banking services for its customers. ABCBank offers products such as bank accounts for checking, savings, and certificates, and many types of credit cards, loans, and other financial services. ABCBank has a diverse customer base, distributed nationally. Customer demographics vary widely in income levels, education and professional qualifications, ethnic backgrounds, age groups, and family status.

This article introduces a business problem faced by ABCBank, its solution, and the concepts associated with the related mining function. While developing a solution for the problem, we discuss the concepts related to the data mining technique used to solve it. We follow a common description pattern for the problem, starting with a problem definition, solution approach, data description, available settings for tuning the solution, and an overview of relevant algorithms. For supervised functions, we also describe how to evaluate a model's performance, and apply a model to obtain prediction results. For unsupervised functions, we describe model content and how to use models to solve the problem.

Problem definition: How to reduce customer attrition

ABCBank is losing customers to its competitors and wants to gain a better understanding of the type of customers who are closing their accounts. ABCBank also wants to be proactive in retaining existing customers by taking appropriate measures to improve customer satisfaction. This is commonly known as the customer attrition problem in the financial services industry.

Solution approach: Predict customers who are likely to attrite

ABCBank can use customer data collected in its transactional and analytical databases to find the patterns associated with customers likely, or unlikely, to attrite. Using the data mining classification function, ABCBank can predict customers who are likely to attrite and understand the characteristics, or profiles, of such customers. Gaining a better understanding of customer behavior enables ABCBank to develop business plans to retain customers.

Classification is used to assign cases, such as customers, to discrete values, called classes or categories, of the target attribute. The target is the attribute whose values are predicted using data mining. In this problem, the target is the attribute attrite with two possible values: Attriter and Non-attriter. When referring to the model build dataset, the value Attriter indicates that the customer closed all accounts, and Non-attriter indicates the customer has at least one account at ABCBank. When referring to the prediction in the model apply dataset, the value Attriter indicates that the customer is likely to attrite and Non-attriter indicates that the customer is not likely to attrite. The prediction is often associated with a probability indicating how likely the customer is to attrite. When a target attribute has only two possible values, the problem is referred to as a binary classification problem. When a target attribute has more than two possible values, the problem is known as a multiclass classification problem.
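As a minimal illustration (plain Java with illustrative names, not the JDM API), a classifier's output for a case can be viewed as a set of probabilities over the possible target values; the predicted class is the one with the highest probability. The same argmax rule covers both the binary case and the multiclass case:

```java
import java.util.Map;

public class Classifier {
    /** Return the target value with the highest predicted probability. */
    static String predict(Map<String, Double> classProbabilities) {
        String best = null;
        double bestProbability = -1.0;
        for (Map.Entry<String, Double> e : classProbabilities.entrySet()) {
            if (e.getValue() > bestProbability) {
                bestProbability = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Binary classification: exactly two possible target values.
        System.out.println(predict(Map.of("Attriter", 0.83, "Non-attriter", 0.17))); // Attriter
    }
}
```

For a binary problem, the probability attached to the predicted value also serves as the confidence that the customer will (or will not) attrite.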

Data specification: CUSTOMERS dataset

An important step in any data mining project is to collect related data from enterprise data sources. Identifying which attributes should be used for data mining is one of the challenges faced by the data miner and relies on appropriate domain knowledge of the data. In this example, we introduce a subset of possible customer attributes as listed in Table 1. In real-world scenarios, there may be hundreds or even thousands of customer attributes available in enterprise databases.

Table 1 lists the physical attribute details of the CUSTOMERS dataset, which include name, data type, and description. The attribute name refers either to a column name of a database table or to a field name of a flat file. The attribute data type refers to the type of values allowed for that attribute. JDM defines integer, double, and string data types, which are the data types commonly used for mining. JDM conformance rules allow a vendor to add more data types if required. The attribute description can be used to explain the meaning of an attribute or to describe its allowed values. In general, physical data characteristics are captured by database metadata.

Table 1. Customers Table physical attribute details

Attribute name     Data type  Attribute description
CUST_ID            INTEGER    Unique customer identifier
NAME               STRING     Name of the customer
ADDRESS            STRING     Address of the customer
CITY               STRING     City of residence
EDUCATION          STRING     Educational level, e.g., diploma, bachelor's, master's, Ph.D.
MAR_STATUS         STRING     Marital status, e.g., married, single, widowed, divorced
OCCUPATION         STRING     Occupation of the customer, e.g., clerical, manager, sales
INCOME             DOUBLE     Annual income in thousands of dollars
CAP_GAIN           DOUBLE     Current capital gains or losses
SAV_BALANCE        DOUBLE     Average monthly savings balance
CHECK_BALANCE      DOUBLE     Average monthly checking balance
RETIRE_BALANCE     DOUBLE     Current retirement account balance
MORTGAGE_AMOUNT    DOUBLE     Current mortgage/home loan balance
CREDIT_RISK        STRING     Relative credit risk, e.g., high, medium, low
ATTRITE            STRING     The target attribute indicating whether a customer will attrite. Values are "attriter" and "non-attriter."

Users may also specify logical attribute characteristics specific to data mining. For example, physical attribute names in a table or file can be cryptic; HHSIZE, for instance, means household size, representing the number of people living as one family. Users can map physical names to logical names that are more descriptive and hence easier to understand. Logical data characteristics also include the specification of the data mining attribute type, attribute usage type, and data preparation type, which indicate how these attributes should be interpreted in data mining operations. Table 2 lists the logical data specification details for the CUSTOMERS dataset shown in Table 1.

The attribute type indicates the attribute's data characteristics, such as whether the attribute should be treated as numerical, categorical, or ordinal. Numerical attributes are those whose values should be treated as continuous numbers. Categorical attributes are those whose values correspond to discrete, nominal categories. Ordinal attributes also take discrete values, but their order is significant. In Table 2, the attribute type column specifies attributes such as city, country, state, education, and marital status as categorical attributes. The attribute capital gains is a numerical attribute, as it has continuous data values, such as $12,500.94. The attribute credit risk is an ordinal attribute, as it has high, medium, and low as ordered relative values.

The attribute usage type specifies whether an attribute is active (used as input to mining), inactive (excluded from mining), or supplementary (carried along with the input values but not used explicitly for mining). In Table 2, the usage type column marks the attributes customer ID, name, and address as inactive, because these attributes are identifiers or will not generalize to predict whether a customer is an attriter. All other attributes are active and used as input for data mining. This example includes no supplementary attributes. However, consider a derived attribute computed as capital gains divided by the square of age, called ageCapitalGainRatio. From the user's perspective, if the derived attribute ageCapitalGainRatio appears in a model rule, it may be difficult to interpret the underlying values as they relate to the business. In such a case, the model can reference supplementary attributes, for example, age and capital gain. Although these supplementary attributes are not directly used in the model build, they can be presented in model details to facilitate rule understanding using the corresponding values of age and capital gain.
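A tiny, hypothetical sketch of such a derived attribute shows why its raw values are opaque on their own, and why carrying the underlying inputs as supplementary attributes helps rule interpretation:

```java
public class DerivedAttribute {
    /** The derived attribute from the text: capital gains divided by the square of age. */
    static double ageCapitalGainRatio(double capitalGains, double age) {
        return capitalGains / (age * age);
    }

    public static void main(String[] args) {
        // A 45-year-old with $12,500.94 in capital gains yields roughly 6.17.
        // The ratio itself means little to a business user, so the model can
        // present the supplementary values (age 45, capital gains $12,500.94)
        // alongside any rule that mentions the derived attribute.
        System.out.println(ageCapitalGainRatio(12_500.94, 45));
    }
}
```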

Table 2. Customers Table logical data specification

Attribute name     Logical name           Attribute type  Usage type  Preparation
CUST_ID            Customer ID                            Inactive
NAME               Name                                   Inactive
ADDRESS            Address                                Inactive
MAR_STATUS         Marital status         Categorical     Active      Prepared
INCOME             Annual income level    Numerical       Active      Not prepared
ETHNIC_GRP         Ethnic group           Categorical     Active      Prepared
AGE                Age                    Numerical       Active      Not prepared
CAP_GAIN           Capital gains          Numerical       Active      Not prepared
SAV_BALANCE        Avg. savings balance   Numerical       Active      Not prepared
CHECK_BALANCE      Avg. checking balance  Numerical       Active      Not prepared
RETIRE_BALANCE     Retirement balance     Numerical       Active      Not prepared
MORTGAGE_AMOUNT    Home loan balance      Numerical       Active      Not prepared
NAT_COUNTRY        Native country         Categorical     Active      Prepared
CREDIT_RISK        Credit risk            Ordinal         Active      Prepared

In addition to the usual ETL (Extraction, Transformation, and Loading) operations used for loading and transforming data, data mining can involve algorithm-specific data preparation, such as binning and normalization. One may choose to prepare data manually to leverage domain-specific knowledge or to fine-tune the data to improve results. The data preparation type indicates whether data has been manually prepared. In Table 2, the preparation column lists which attributes are already prepared for model building. (Note: ETL is the process of extracting data from operational or external data sources; transforming the data, which includes cleansing, aggregation, summarization, integration, and other transformations; and loading the data into a data mart or data warehouse.)
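The two transformations named above can be sketched in a few lines of plain Java (illustrative names, not JDM API calls): equal-width binning maps a continuous value into one of a fixed number of intervals, and min-max normalization rescales it into [0, 1].

```java
public class DataPrep {
    /** Equal-width binning: map a value to one of nBins bins over [min, max]. */
    static int bin(double value, double min, double max, int nBins) {
        if (value <= min) return 0;             // clamp below-range values into the first bin
        if (value >= max) return nBins - 1;     // clamp above-range values into the last bin
        return (int) ((value - min) / (max - min) * nBins);
    }

    /** Min-max normalization: rescale a value into the interval [0, 1]. */
    static double normalize(double value, double min, double max) {
        return (value - min) / (max - min);
    }

    public static void main(String[] args) {
        // Hypothetical example: a $55k income with an observed range of [$10k, $110k].
        System.out.println(bin(55.0, 10.0, 110.0, 5));    // 2
        System.out.println(normalize(55.0, 10.0, 110.0)); // 0.45
    }
}
```

Manual preparation like this lets a data miner choose bin boundaries that reflect domain knowledge (for example, income brackets meaningful to ABCBank) rather than leaving the choice to the algorithm.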

Specify settings: Fine-tune the solution to the problem

After exploring attribute values in the CUSTOMERS dataset, the data miner found some oddities in the data. The capital gains attribute has some extreme values that are out of range from the general population. Figure 1 illustrates the distribution of capital gains values in the data. Note that there are very few customers who have capital gains greater than $1,000,000; in this example such values are treated as outliers. Outliers are the values of a given attribute that are unusual compared to the rest of that attribute's data values. For example, if customers have capital gains over 1 million dollars, these values could skew mining results involving the attribute capital gains.

In this example, the capital gains attribute has a valid range of $2,000 to $1,000,000 based on the value distribution, shown in Figure 1. In JDM, we use outlier identification settings to specify the valid range, or interval, to identify outliers for the model building process. Some data mining engines (DMEs) automatically identify and treat outliers as part of the model building process. JDM allows data miners to specify an outlier treatment option per attribute to inform algorithms how to treat outliers in the build data. The outlier treatment specifies whether attribute outlier values are treated asMissing (should be handled as missing values) or asIs (should be handled as the original values). Based on the problem requirements and vendor-specific algorithm implementations, data miners can either explicitly choose the outlier treatment or leave it to the DME.
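The two treatment options can be mimicked in a self-contained Java sketch (the class and method names are illustrative, not the JDM interfaces): asMissing nulls out values falling outside the valid interval, while asIs passes them through unchanged.

```java
public class OutlierHandling {
    /** asMissing: values outside [lower, upper] are replaced by null (a missing value). */
    static Double asMissing(double value, double lower, double upper) {
        return (value < lower || value > upper) ? null : value;
    }

    /** asIs: outlier values are kept as the original values. */
    static double asIs(double value) {
        return value;
    }

    public static void main(String[] args) {
        // Capital gains valid interval from the example: $2,000 to $1,000,000.
        System.out.println(asMissing(5_000_000, 2_000, 1_000_000)); // null
        System.out.println(asMissing(12_500.94, 2_000, 1_000_000)); // 12500.94
    }
}
```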

Figure 1. Capital gains value distribution

In assessing the data, the data miner noticed that the state attribute has some invalid entries. All ABCBank customers who are U.S. residents must have a state value that is the two-letter abbreviation of one of the 50 states or the District of Columbia. To indicate valid attribute values to the model build, a category set can be specified in the logical data specification. The category set characterizes the values found in a categorical attribute. In this example, the category set for the state attribute contains the values {AL, AK, AS, AZ, ..., WY}. State values not in this set are considered invalid during the model build and, depending on the implementation, may be treated as missing values or may cause the build to terminate.
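A category set amounts to membership in a fixed set of values; a minimal Java sketch (illustrative names, abbreviated set, treating invalid entries as missing) looks like this:

```java
import java.util.Set;

public class CategorySetCheck {
    // Abbreviated category set for the STATE attribute; the full set
    // contains all 50 state abbreviations plus DC.
    static final Set<String> STATES = Set.of("AL", "AK", "AZ", "AR", "CA", "DC", "WY");

    /** One possible policy: values outside the category set become missing (null). */
    static String validate(String state) {
        return STATES.contains(state) ? state : null;
    }

    public static void main(String[] args) {
        System.out.println(validate("CA")); // CA
        System.out.println(validate("ZZ")); // null
    }
}
```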

Our CUSTOMERS dataset has a disproportionate number of Non-attriters: 20 percent of the cases are Attriters and 80 percent are Non-attriters. To build an unbiased model, the data miner balances the input dataset, using stratified sampling, so that it contains an equal number of cases for each target value. In JDM, prior probabilities are used to represent the original distribution of target values. Prior probabilities should be specified whenever the original target value distribution has been changed, so that the algorithm can take them into account. However, not all algorithms support the specification of prior probabilities, so you will need to consult a given tool's documentation.
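The prior probabilities for this example are simply the class proportions of the original, unbalanced data, recorded before stratified sampling evens out the build set. A small sketch (plain Java, names are illustrative rather than JDM API):

```java
import java.util.HashMap;
import java.util.Map;

public class Priors {
    /**
     * Compute prior probabilities from the original (unbalanced) target value
     * counts, so they can be supplied to an algorithm after the build data
     * has been rebalanced.
     */
    static Map<String, Double> priorProbabilities(Map<String, Integer> counts) {
        double total = counts.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Double> priors = new HashMap<>();
        counts.forEach((target, n) -> priors.put(target, n / total));
        return priors;
    }

    public static void main(String[] args) {
        // The example's original distribution: 20% Attriters, 80% Non-attriters.
        System.out.println(priorProbabilities(Map.of("Attriter", 20, "Non-attriter", 80)));
        // e.g. {Attriter=0.2, Non-attriter=0.8} (map ordering may vary)
    }
}
```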

ABCBank management informed the data miner that it is more expensive when an attriter is misclassified, that is, predicted as a Non-attriter. This is because losing an existing customer and acquiring a new one costs much more than trying to retain an existing customer. For this, JDM allows data miners to specify a cost matrix that captures the costs associated with incorrect predictions. A cost matrix is an N x N table that defines the cost of each incorrect prediction, where N is the number of possible target values. In this example, the data miner specifies a cost matrix indicating that predicting a customer will not attrite when in fact he would is three times costlier than predicting a customer will attrite when he actually would not. The cost matrix for this problem is illustrated in Figure 2.

Figure 2. Cost matrix table

In this example, we are more interested in the customers who are likely to attrite, so the Attriter value is considered the positive target value, the value we are interested in predicting; the Non-attriter value is considered the negative target value. The positive target value is needed when computing the lift and ROC test metrics. This terminology lets us speak of false positives and false negatives. A false positive (FP) occurs when a case is known to have the negative target value but the model predicts the positive target value. A false negative (FN) occurs when a case is known to have the positive target value but the model predicts the negative target value. True positives (TP) are the cases where the predicted and actual positive target values agree, and true negatives (TN) are the cases where the predicted and actual negative target values agree. In Figure 2, note that the false negative cost is $150, the false positive cost is $50, and all diagonal elements have a cost of 0, because there is no cost for correct predictions.
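To see how these quantities interact, here is a self-contained Java sketch (illustrative names, not the JDM API) that tallies TP/FP/FN/TN counts, totals the misclassification cost under the Figure 2 matrix, and shows how the 3:1 cost ratio changes the decision rule:

```java
public class ConfusionMatrix {
    static final double FN_COST = 150.0; // actual Attriter predicted as Non-attriter
    static final double FP_COST = 50.0;  // actual Non-attriter predicted as Attriter

    int tp, fp, tn, fn;

    /** Tally one case; "Attriter" is the positive target value. */
    void add(String actual, String predicted) {
        boolean actualPos = actual.equals("Attriter");
        boolean predictedPos = predicted.equals("Attriter");
        if (actualPos && predictedPos) tp++;        // true positive
        else if (!actualPos && predictedPos) fp++;  // false positive
        else if (actualPos) fn++;                   // false negative
        else tn++;                                  // true negative
    }

    /** Total misclassification cost; the diagonal elements (TP, TN) cost 0. */
    double totalCost() {
        return fn * FN_COST + fp * FP_COST;
    }

    /** Cost-sensitive decision: predict the class with the lower expected cost. */
    static String costSensitivePredict(double pAttrite) {
        double costIfNegative = pAttrite * FN_COST;        // risk of a false negative
        double costIfPositive = (1 - pAttrite) * FP_COST;  // risk of a false positive
        return costIfPositive <= costIfNegative ? "Attriter" : "Non-attriter";
    }

    public static void main(String[] args) {
        ConfusionMatrix cm = new ConfusionMatrix();
        cm.add("Attriter", "Attriter");         // TP
        cm.add("Attriter", "Non-attriter");     // FN: costs $150
        cm.add("Non-attriter", "Attriter");     // FP: costs $50
        cm.add("Non-attriter", "Non-attriter"); // TN
        System.out.println(cm.totalCost()); // 200.0
        System.out.println(costSensitivePredict(0.30)); // Attriter
    }
}
```

Note the effect of the cost matrix on the decision rule: predicting Attriter is the cheaper choice whenever 50(1 - p) <= 150p, that is, whenever the predicted attrition probability p is at least 50/(150 + 50) = 0.25, rather than the 0.5 threshold an uncosted model would use.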
