
R Package 'smbinning': Optimal Binning for Scoring Modeling

by Herman Jopia



What is Binning?

Binning is the term used in scoring modeling for what is also known in Machine Learning as Discretization, the process of transforming a continuous characteristic into a finite number of intervals (the bins), which allows for a better understanding of its distribution and of its relationship with a binary variable. The bins generated by this process will eventually become the attributes of a predictive characteristic, the key component of a Scorecard.
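
As a quick illustration, the sketch below (simulated data and made-up variable names, not part of the package) discretizes a continuous characteristic with base R and cross-tabulates the bins against a binary outcome:

  # Hedged sketch: simulated data, hypothetical names (tob, good)
  set.seed(1)
  tob  <- round(runif(1000, 0, 60))              # hypothetical "time on books" in months
  good <- rbinom(1000, 1, plogis(-1 + tob/30))   # hypothetical good (1) / bad (0) flag

  # Three fixed cutpoints turn the continuous characteristic into four bins
  bins <- cut(tob, breaks = c(-Inf, 12, 24, 36, Inf))
  table(bins, good)                  # goods and bads per bin
  prop.table(table(bins, good), 1)   # good/bad mix within each bin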

Why Binning?

Though there is some reticence toward it [1], the benefits of binning are fairly straightforward.

Unsupervised Discretization

Unsupervised Discretization divides a continuous feature into groups (bins) without taking into account any other information. It is basically a partition with two options: equal length intervals and equal frequency intervals (a small R sketch illustrating both follows Table 2).

Equal length intervals

Table 1. Time on Books and Credit Performance (equal length intervals). Bin 6 has no bads, producing indeterminate metrics.

Equal frequency intervals

Table 2. Time on Books and Credit Performance (equal frequency intervals). Different cutpoints may improve the Information Value (0.4969).
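
Under the same assumptions as the earlier sketch (the hypothetical tob vector), both unsupervised options can be reproduced directly in base R; note that neither looks at the target variable:

  # Equal length intervals: 6 bins of identical width
  eq_len  <- cut(tob, breaks = 6)

  # Equal frequency intervals: 6 bins holding roughly the same number of cases
  eq_freq <- cut(tob,
                 breaks = quantile(tob, probs = seq(0, 1, length.out = 7)),
                 include.lowest = TRUE)

  table(eq_len)    # similar widths, possibly very different counts per bin
  table(eq_freq)   # similar counts, possibly very different widths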

Supervised Discretization

Supervised Discretization divides a continuous feature into groups (bins) mapped to a target variable. The central idea is to find the cutpoints that maximize the difference between the groups.

In the past, analysts iteratively moved from Fine Binning to Coarse Binning, a very time-consuming process of manually and visually finding the right cutpoints (if ever). Nowadays, with algorithms like ChiMerge or Recursive Partitioning, two out of several available techniques [2], analysts can find the optimal cutpoints in seconds and evaluate the relationship with the target variable using metrics such as Weight of Evidence and Information Value.
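
Both metrics have simple definitions once the bins and the good/bad counts are fixed. The sketch below is a hedged illustration reusing the hypothetical tob and good vectors from above; the woe_iv helper is not part of the package (which reports these figures in result$ivtable):

  # Weight of Evidence per bin: log of (share of goods in bin / share of bads in bin)
  # Information Value: sum over bins of (share of goods - share of bads) * WoE
  woe_iv <- function(bins, good) {
    tab       <- table(bins, good)
    dist_bad  <- tab[, "0"] / sum(tab[, "0"])   # share of all bads falling in each bin
    dist_good <- tab[, "1"] / sum(tab[, "1"])   # share of all goods falling in each bin
    woe <- log(dist_good / dist_bad)            # infinite when a bin has no goods or no bads
    iv  <- sum((dist_good - dist_bad) * woe)    # total Information Value of the characteristic
    list(woe = woe, iv = iv)
  }
  woe_iv(cut(tob, breaks = c(-Inf, 12, 24, 36, Inf)), good)

A bin with no bads (or no goods), like Bin 6 in Table 1, makes the WoE infinite, which is the indeterminate metric mentioned above.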

An Example With 'smbinning'

Using the 'smbinning' package and its dataset (chileancredit), whose documentation can be found on its supporting website, the characteristic Time on Books is grouped into bins taking into account Credit Performance (Good/Bad) to establish the optimal cutpoints and obtain meaningful, statistically different groups. The R code below, Table 3, and Figure 1 show the result of this application, which clearly surpasses the previous methods with the highest Information Value (0.5353).

  # Load package and its data
  library(smbinning)
  data(chileancredit)
  # Training and testing samples
  chileancredit.train=subset(chileancredit,FlagSample==1)
  chileancredit.test=subset(chileancredit,FlagSample==0)
  # Run and save results
  result=smbinning(df=chileancredit.train,y="FlagGB",x="TOB",p=0.05)
  result$ivtable
  # Relevant plots (2x2 Page)
  par(mfrow=c(2,2))
  boxplot(chileancredit.train$TOB~chileancredit.train$FlagGB,
          horizontal=T, frame=F, col="lightgray", main="Distribution")
  mtext("Time on Books (Months)",3)
  smbinning.plot(result,option="dist",sub="Time on Books (Months)")
  smbinning.plot(result,option="badrate",sub="Time on Books (Months)")
  smbinning.plot(result,option="WoE",sub="Time on Books (Months)")
Table 3. Time on Books cutpoints mapped to Credit Performance.

Figure 1. Plots generated by the package.
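
Once the cutpoints have been established on the training sample, they can also be applied to the testing sample to check that the bins behave consistently. The sketch below is a hedged follow-up assuming the smbinning.gen() utility from the package and the objects created above; the column name TOB_bin is arbitrary:

  # Add the binned characteristic to both samples (hypothetical column name "TOB_bin")
  chileancredit.train <- smbinning.gen(chileancredit.train, result, chrname = "TOB_bin")
  chileancredit.test  <- smbinning.gen(chileancredit.test,  result, chrname = "TOB_bin")

  # Good rate per bin on each sample, to check that the cutpoints remain stable
  tapply(chileancredit.train$FlagGB, chileancredit.train$TOB_bin, mean, na.rm = TRUE)
  tapply(chileancredit.test$FlagGB,  chileancredit.test$TOB_bin,  mean, na.rm = TRUE)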

In the middle of the "data era", it is critical to speed up the development of scoring models. Binning, and more specifically automated binning, helps to significantly reduce the time-consuming process of generating predictive characteristics, which is why companies like SAS and FICO have developed their own proprietary algorithms to implement this functionality in their respective software. For analysts who do not have these specific tools or modules, the R package 'smbinning' offers a statistically robust alternative to run their analysis faster.

For more information about binning, the package's documentation on CRAN lists some references related to the algorithm behind it, and its supporting website lists references for scoring model development.

References

[1] Dinero, T. (1996). Seven Reasons Why You Should Not Categorize Continuous Data. Journal of Health & Social Policy, 8(1), 63-72.
[2] Garcia, S. et al. (2013). A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Transactions on Knowledge and Data Engineering, 25(4).
