
R Package 'smbinning': Optimal Binning for Scoring Modeling

by Herman Jopia



What is Binning?

Binning is the term used in scoring modeling for what is also known in Machine Learning as Discretization, the process of transforming a continuous characteristic into a finite number of intervals (the bins), which allows for a better understanding of its distribution and of its relationship with a binary variable. The bins generated by this process will eventually become the attributes of a predictive characteristic, the key component of a Scorecard.
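
As a quick illustration, the sketch below (simulated data and made-up variable names, not part of the package) discretizes a continuous characteristic with base R and cross-tabulates the bins against a binary outcome:

  # Hedged sketch: simulated data, hypothetical names (tob, good)
  set.seed(1)
  tob  <- round(runif(1000, 0, 60))              # hypothetical "time on books" in months
  good <- rbinom(1000, 1, plogis(-1 + tob/30))   # hypothetical good (1) / bad (0) flag

  # Three fixed cutpoints turn the continuous characteristic into four bins
  bins <- cut(tob, breaks = c(-Inf, 12, 24, 36, Inf))
  table(bins, good)                  # goods and bads per bin
  prop.table(table(bins, good), 1)   # good/bad mix within each bin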

Why Binning?

Though there is some reticence toward it [1], the benefits of binning are fairly straightforward.

Unsupervised Discretization

Unsupervised Discretization divides a continuous feature into groups (bins) without taking into account any other information. It is basically a partition with two options: equal length intervals and equal frequency intervals (a small R sketch illustrating both follows Table 2).

Equal length intervals

Table 1. Time on Books and Credit Performance (equal length intervals). Bin 6 has no bads, producing indeterminate metrics.

Equal frequency intervals

Table 2. Time on Books and Credit Performance (equal frequency intervals). Different cutpoints may improve the Information Value (0.4969).
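
Under the same assumptions as the earlier sketch (the hypothetical tob vector), both unsupervised options can be reproduced directly in base R; note that neither looks at the target variable:

  # Equal length intervals: 6 bins of identical width
  eq_len  <- cut(tob, breaks = 6)

  # Equal frequency intervals: 6 bins holding roughly the same number of cases
  eq_freq <- cut(tob,
                 breaks = quantile(tob, probs = seq(0, 1, length.out = 7)),
                 include.lowest = TRUE)

  table(eq_len)    # similar widths, possibly very different counts per bin
  table(eq_freq)   # similar counts, possibly very different widths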

Supervised Discretization

Supervised Discretization divides a continuous feature into groups (bins) mapped to a target variable. The central idea is to find the cutpoints that maximize the difference between the groups.

In the past, analysts iteratively moved from Fine Binning to Coarse Binning, a very time-consuming process of manually and visually finding the right cutpoints (if ever). Nowadays, with algorithms like ChiMerge or Recursive Partitioning, two out of several available techniques [2], analysts can find the optimal cutpoints in seconds and evaluate the relationship with the target variable using metrics such as Weight of Evidence and Information Value.
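
Both metrics have simple definitions once the bins and the good/bad counts are fixed. The sketch below is a hedged illustration reusing the hypothetical tob and good vectors from above; the woe_iv helper is not part of the package (which reports these figures in result$ivtable):

  # Weight of Evidence per bin: log of (share of goods in bin / share of bads in bin)
  # Information Value: sum over bins of (share of goods - share of bads) * WoE
  woe_iv <- function(bins, good) {
    tab       <- table(bins, good)
    dist_bad  <- tab[, "0"] / sum(tab[, "0"])   # share of all bads falling in each bin
    dist_good <- tab[, "1"] / sum(tab[, "1"])   # share of all goods falling in each bin
    woe <- log(dist_good / dist_bad)            # infinite when a bin has no goods or no bads
    iv  <- sum((dist_good - dist_bad) * woe)    # total Information Value of the characteristic
    list(woe = woe, iv = iv)
  }
  woe_iv(cut(tob, breaks = c(-Inf, 12, 24, 36, Inf)), good)

A bin with no bads (or no goods), like Bin 6 in Table 1, makes the WoE infinite, which is the indeterminate metric mentioned above.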

An Example With 'smbinning'

Using the 'smbinning' package and its dataset (chileancredit), whose documentation can be found on its supporting website, the characteristic Time on Books is grouped into bins taking into account Credit Performance (Good/Bad) to establish the optimal cutpoints and obtain meaningful, statistically different groups. The R code below, Table 3, and Figure 1 show the result of this application, which clearly surpasses the previous methods with the highest Information Value (0.5353).

  # Load package and its data
  library(smbinning)
  data(chileancredit)
  # Training and testing samples
  chileancredit.train=subset(chileancredit,FlagSample==1)
  chileancredit.test=subset(chileancredit,FlagSample==0)
  # Run and save results
  result=smbinning(df=chileancredit.train,y="FlagGB",x="TOB",p=0.05)
  result$ivtable
  # Relevant plots (2x2 Page)
  par(mfrow=c(2,2))
  boxplot(chileancredit.train$TOB~chileancredit.train$FlagGB,
          horizontal=T, frame=F, col="lightgray", main="Distribution")
  mtext("Time on Books (Months)",3)
  smbinning.plot(result,option="dist",sub="Time on Books (Months)")
  smbinning.plot(result,option="badrate",sub="Time on Books (Months)")
  smbinning.plot(result,option="WoE",sub="Time on Books (Months)")
Table 3. Time on Books cutpoints mapped to Credit Performance.

Figure 1. Plots generated by the package.
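
Once the cutpoints have been established on the training sample, they can also be applied to the testing sample to check that the bins behave consistently. The sketch below is a hedged follow-up assuming the smbinning.gen() utility from the package and the objects created above; the column name TOB_bin is arbitrary:

  # Add the binned characteristic to both samples (hypothetical column name "TOB_bin")
  chileancredit.train <- smbinning.gen(chileancredit.train, result, chrname = "TOB_bin")
  chileancredit.test  <- smbinning.gen(chileancredit.test,  result, chrname = "TOB_bin")

  # Good rate per bin on each sample, to check that the cutpoints remain stable
  tapply(chileancredit.train$FlagGB, chileancredit.train$TOB_bin, mean, na.rm = TRUE)
  tapply(chileancredit.test$FlagGB,  chileancredit.test$TOB_bin,  mean, na.rm = TRUE)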

In the middle of the "data era", it is critical to speed up the development of scoring models. Binning, and more specifically automated binning, helps to significantly reduce the time-consuming process of generating predictive characteristics, which is why companies like SAS and FICO have developed their own proprietary algorithms to implement this functionality in their respective software. For analysts who do not have these specific tools or modules, the R package 'smbinning' offers a statistically robust alternative to run their analysis faster.

For more information about binning, the package's documentation on CRAN lists some references related to the algorithm behind it, and its supporting website lists references for scoring model development.

References

[1] Dinero, T. (1996). Seven Reasons Why You Should Not Categorize Continuous Data. Journal of Health & Social Policy, 8(1), 63-72.
[2] Garcia, S. et al. (2013). A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Transactions on Knowledge and Data Engineering, 25(4).
