@Macux
2015-12-01T06:52:13.000000Z
R Language: Study Notes
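1. Preparation:
> library(randomForest)
> # stringsAsFactors=TRUE keeps y a factor on R >= 4.0, so randomForest runs classification.
> # If your copy of the file is semicolon-delimited, also pass sep=";".
> bank1 <- read.csv("bank-full.csv",stringsAsFactors=TRUE)
> # Split the rows by class, then draw 200 rows from each to get a balanced sample.
> book.sample1 <- subset(bank1,subset=(y=="yes"))
> book.sample2 <- subset(bank1,select=c(1:17),subset=(y=="no"))
> bank.sample1 <- book.sample1[sample(1:nrow(book.sample1),200,replace=FALSE),]
> bank.sample2 <- book.sample2[sample(1:nrow(book.sample2),200,replace=FALSE),]
> sample <- rbind(bank.sample1,bank.sample2)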
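The preparation step draws 200 rows from each class so that the training set is balanced. A minimal sketch of that idea on toy data follows. The helper `balanced_sample` and the data frame `toy` are hypothetical and not part of the original notes. The seed is set before sampling here so that the draw is repeatable.

```r
# Hypothetical helper: draw n_per_class rows from every level of a label column.
balanced_sample <- function(df, label, n_per_class) {
  # Group row indices by class, then sample without replacement inside each group
  idx_by_class <- split(seq_len(nrow(df)), df[[label]])
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), n_per_class)]
  })
  df[unlist(picked, use.names = FALSE), , drop = FALSE]
}

# Toy data standing in for bank-full.csv: 900 "no" rows and 100 "yes" rows
set.seed(111)
toy <- data.frame(
  x = rnorm(1000),
  y = factor(rep(c("no", "yes"), times = c(900, 100)))
)

balanced <- balanced_sample(toy, "y", 50)
table(balanced$y)
```

Sampling through `sample.int(length(idx), ...)` rather than `sample(idx, ...)` avoids the base R pitfall where a length-one numeric vector is treated as a range.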
2. Build the random forest model:
> set.seed(111)
> bank.rf <- randomForest(y ~ .,data=sample,importance=TRUE,proximity=TRUE,ntree=1000)
3. Print the confusion matrix:
> bank.rf

Call:
 randomForest(formula = y ~ ., data = sample, importance = TRUE, proximity = TRUE, ntree = 1000)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 4

        OOB estimate of  error rate: 21.75%
Confusion matrix:
     no yes class.error
no  151  49       0.245
yes  38 162       0.190
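The figures in this output can be checked by hand. The sketch below copies the four counts from the confusion matrix (rows are the actual classes) and recomputes the per-class error and the overall OOB error rate. The variable names are illustrative only.

```r
# Confusion matrix counts copied from the output above (rows = actual class)
cm <- matrix(c(151,  49,
                38, 162),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("no", "yes"),
                             predicted = c("no", "yes")))

# Per-class error: misclassified cases in a row divided by the row total
class_error <- 1 - diag(cm) / rowSums(cm)
class_error   # no: 49/200 = 0.245, yes: 38/200 = 0.190

# Overall OOB error: all off-diagonal cases divided by the grand total
oob_error <- 1 - sum(diag(cm)) / sum(cm)
oob_error     # (49 + 38) / 400 = 0.2175, i.e. 21.75%
```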
4. Plot the importance of each indicator (variable):
> varImpPlot(bank.rf)

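Notes
MeanDecreaseAccuracy (left panel)
Measures how much the forest's prediction accuracy drops when the values of a variable are randomly permuted. The larger the value, the more important the variable.
MeanDecreaseGini (right panel)
Measures how much splits on a variable reduce the heterogeneity (Gini impurity) of the observations at the nodes of the classification trees, which gives a basis for comparing variables. The larger the value, the more important the variable.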
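Both measures can be illustrated on toy data with base R alone. This is a simplified sketch and not how randomForest computes them internally: the package measures the accuracy drop per tree on out-of-bag cases and averages it, and it accumulates the Gini decrease over every split on a variable across all trees. Here a fixed rule (`predict_rule`) stands in for a fitted model, and a single split stands in for a tree. All names and data are hypothetical.

```r
set.seed(111)
n <- 400
toy <- data.frame(signal = rnorm(n), noise = rnorm(n))
toy$y <- factor(ifelse(toy$signal > 0, "yes", "no"))

# --- Accuracy-based importance (permutation) ---
# Stand-in for a fitted model (an assumption): a rule that only looks at `signal`
predict_rule <- function(df) ifelse(df$signal > 0, "yes", "no")
accuracy <- function(df) mean(predict_rule(df) == df$y)

# Accuracy lost after shuffling one column, which breaks its link to y
permutation_drop <- function(df, column) {
  shuffled <- df
  shuffled[[column]] <- sample(shuffled[[column]])
  accuracy(df) - accuracy(shuffled)
}

drops <- c(signal = permutation_drop(toy, "signal"),
           noise  = permutation_drop(toy, "noise"))
drops

# --- Gini-based importance (impurity decrease of one split) ---
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Parent impurity minus the size-weighted impurity of the two child nodes
gini_decrease <- function(labels, go_left) {
  n <- length(labels)
  left <- labels[go_left]
  right <- labels[!go_left]
  gini(labels) - length(left) / n * gini(left) - length(right) / n * gini(right)
}

gini_gain <- c(signal = gini_decrease(toy$y, toy$signal > 0),
               noise  = gini_decrease(toy$y, toy$noise > 0))
gini_gain
```

Shuffling `noise` leaves the accuracy unchanged because the rule never reads it, while shuffling `signal` removes most of the rule's accuracy. In the same way, splitting on `signal` yields pure child nodes and the larger Gini decrease.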
Conclusion:
The plot shows that the indicator "duration" is the most important one.
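A ranking read off a plot can also be checked numerically with `importance()` from the randomForest package, which returns the values that `varImpPlot()` draws. The sketch below uses the built-in `iris` data as a stand-in, since the bank data is not bundled with R. The object names are illustrative, and the same calls would apply to `bank.rf`.

```r
library(randomForest)

# iris is a built-in dataset, used here only as a stand-in for the bank data
set.seed(111)
iris.rf <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 500)

# importance() returns a matrix with one row per variable
imp <- importance(iris.rf)

# Sort by MeanDecreaseAccuracy so the most important variable comes first
imp_sorted <- imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]
imp_sorted[, c("MeanDecreaseAccuracy", "MeanDecreaseGini")]

# Name of the top-ranked variable
rownames(imp_sorted)[1]
```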
