@Macux
2015-12-01T06:52:13.000000Z
R Language: Study Notes
1. Preparation:
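> library(randomForest)
> # stringsAsFactors=TRUE keeps y a factor, which randomForest needs for classification (R >= 4.0 reads strings as character by default).
> # If your copy of the file is semicolon-delimited, also pass sep=";".
> bank1 <- read.csv("bank-full.csv",stringsAsFactors=TRUE)
> # Split the rows by outcome; select=c(1:17) keeps columns 1 to 17.
> book.sample1 <- subset(bank1,subset=(y=="yes"))
> book.sample2 <- subset(bank1,select=c(1:17),subset=(y=="no"))
> # Draw 200 rows from each class without replacement, then stack them into a balanced set of 400 rows.
> bank.sample1 <- book.sample1[sample(1:nrow(book.sample1),200,replace=FALSE),]
> bank.sample2 <- book.sample2[sample(1:nrow(book.sample2),200,replace=FALSE),]
> sample <- rbind(bank.sample1,bank.sample2)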
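The step above undersamples each class to the same size so that the model trains on a balanced set. A minimal base-R sketch of that idea on synthetic data follows; the data frame `toy` and the helper `balanced_sample` are hypothetical names used only for illustration and are not part of the original notes.

```r
# Hypothetical toy data: an imbalanced two-class outcome, similar in spirit to y in the bank data.
set.seed(111)
toy <- data.frame(
  x = rnorm(1000),
  y = factor(c(rep("no", 900), rep("yes", 100)))
)

# Draw n rows without replacement from each level of the outcome column.
balanced_sample <- function(df, outcome, n) {
  groups <- split(df, df[[outcome]])
  picked <- lapply(groups, function(g) g[sample(nrow(g), n), , drop = FALSE])
  do.call(rbind, picked)
}

balanced <- balanced_sample(toy, "y", 50)
print(table(balanced$y))
```

Calling set.seed() before the sampling step, as this sketch does, makes the drawn rows repeatable from run to run.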
2. Build the random forest model:
> set.seed(111)
> bank.rf <- randomForest(y ~ .,data=sample,importance=TRUE,proximity=TRUE,ntree=1000)
3. Output the confusion matrix:
> bank.rf
Call:
randomForest(formula = y ~ ., data = sample, importance = TRUE, proximity = TRUE, ntree = 1000)
Type of random forest: classification
Number of trees: 1000
No. of variables tried at each split: 4
OOB estimate of error rate: 21.75%
Confusion matrix:
     no yes class.error
no  151  49       0.245
yes  38 162       0.190
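The error figures in this printout can be checked by hand from the four counts. The sketch below retypes the matrix (rows are the actual classes, columns the out-of-bag predictions) and recomputes the per-class and overall error; the variable names are illustrative. The last line assumes the data has 16 predictors (17 columns minus y), in which case the default number of variables tried per split for classification, floor(sqrt(p)), gives the 4 shown above.

```r
# Confusion matrix copied from the printed output (rows = actual, columns = predicted).
cm <- matrix(c(151, 49,
               38, 162),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("no", "yes"),
                             predicted = c("no", "yes")))

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(cm) / rowSums(cm)

# Overall OOB error: all misclassified cases over all cases.
oob_error <- 1 - sum(diag(cm)) / sum(cm)

print(round(class_error, 3))  # no: 0.245, yes: 0.190
print(oob_error)              # 0.2175, i.e. 21.75%

# Default mtry for classification, assuming p = 16 predictors.
default_mtry <- floor(sqrt(16))
print(default_mtry)           # 4
```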
4. Plot the importance of each indicator (variable):
> varImpPlot(bank.rf)
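Notes
MeanDecreaseAccuracy (left panel)
Measures how much the forest's prediction accuracy drops when one variable's values are replaced with random values (randomForest does this by permuting the variable in the out-of-bag samples). The larger the value, the more important the variable.
MeanDecreaseGini (right panel)
Measures how much the splits on each variable reduce the heterogeneity (Gini impurity) of the observations at the nodes of the classification trees, which gives a basis for comparing variables. The larger the value, the more important the variable.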
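Both measures can be illustrated in a few lines of base R. This is a simplified sketch, not the randomForest implementation: a logistic regression stands in for the forest, the shuffle is applied to the full data rather than per tree on out-of-bag rows, and the Gini decrease is computed for a single hand-picked split. The data frame `toy` and the helpers `accuracy`, `permutation_drop`, `gini` and `gini_decrease` are hypothetical names.

```r
set.seed(111)
n <- 500
# Hypothetical data: 'signal' drives the outcome, 'noise' is unrelated to it.
toy <- data.frame(signal = rnorm(n), noise = rnorm(n))
toy$y <- factor(ifelse(toy$signal + rnorm(n, sd = 0.5) > 0, "yes", "no"))

# Stand-in model (logistic regression, not a random forest).
fit <- glm(y ~ signal + noise, data = toy, family = binomial)

accuracy <- function(model, data) {
  prob_yes <- predict(model, newdata = data, type = "response")
  mean(ifelse(prob_yes > 0.5, "yes", "no") == data$y)
}

# Accuracy drop after shuffling one column, which breaks its link to y.
permutation_drop <- function(model, data, var) {
  shuffled <- data
  shuffled[[var]] <- sample(shuffled[[var]])
  accuracy(model, data) - accuracy(model, shuffled)
}

drops <- sapply(c("signal", "noise"), function(v) permutation_drop(fit, toy, v))
print(drops)

# Gini impurity of a vector of class labels: 1 - sum(p_k^2).
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Impurity decrease from splitting 'labels' into left / right by a logical vector.
gini_decrease <- function(labels, left) {
  n_all <- length(labels)
  gini(labels) -
    (sum(left) / n_all) * gini(labels[left]) -
    (sum(!left) / n_all) * gini(labels[!left])
}

split_gain <- c(signal = gini_decrease(toy$y, toy$signal > 0),
                noise  = gini_decrease(toy$y, toy$noise > 0))
print(split_gain)
```

On this toy data the informative column should show the larger value under both measures, which is the pattern the two panels of the importance plot are read for.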
Conclusion:
The plot shows that "duration" is the most important indicator.
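A ranking read off a plot can also be confirmed numerically with importance() from the randomForest package. The sketch below uses the built-in iris data as a stand-in so that it runs on its own; the object names `rf`, `imp`, `ranked` and `top_variable` are illustrative, and the result says nothing about the bank data itself.

```r
library(randomForest)

set.seed(111)
# iris stands in for the bank data so the block is self-contained.
rf <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 500)

# type = 1 selects the permutation measure (MeanDecreaseAccuracy).
imp <- importance(rf, type = 1)

# Sort the variables from most to least important.
ranked <- imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
print(ranked)

top_variable <- rownames(ranked)[1]
print(top_variable)
```

Applied to the fitted bank model, the same call with type = 2 would return the MeanDecreaseGini column instead.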