@Macux 2015-12-01T06:52:13.000000Z

R Language: RandomForest

R Language: Study Notes

1. Preparation:

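> library(randomForest)
> # stringsAsFactors = TRUE keeps y a factor so that randomForest runs classification
> # (R >= 4.0 reads strings as character by default). If your copy of the file is
> # semicolon-delimited, add sep = ";".
> bank1 <- read.csv("bank-full.csv", stringsAsFactors = TRUE)
> # Split the data by class; select = c(1:17) keeps columns 1 through 17
> book.sample1 <- subset(bank1, subset = (y == "yes"))
> book.sample2 <- subset(bank1, select = c(1:17), subset = (y == "no"))
> # Draw 200 rows from each class without replacement to get a balanced sample
> bank.sample1 <- book.sample1[sample(1:nrow(book.sample1), 200, replace = FALSE), ]
> bank.sample2 <- book.sample2[sample(1:nrow(book.sample2), 200, replace = FALSE), ]
> # This data frame shares its name with base::sample(); calls to sample() still work
> sample <- rbind(bank.sample1, bank.sample2)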
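The balanced sampling above can be tried on synthetic data first. The following is a minimal sketch in base R, assuming a made-up data frame `toy` with a rare "yes" class. The helper name `draw_balanced` and all values are illustrative and are not part of the original notes.

```r
# Hypothetical toy data: 1000 rows, roughly 12% "yes" (not the bank data)
set.seed(42)
toy <- data.frame(
  x = rnorm(1000),
  y = factor(sample(c("no", "yes"), 1000, replace = TRUE, prob = c(0.88, 0.12)))
)

# Illustrative helper: draw n rows per class without replacement
draw_balanced <- function(df, label, n) {
  parts <- lapply(levels(df[[label]]), function(cls) {
    rows <- which(df[[label]] == cls)
    stopifnot(length(rows) >= n)  # each class must have at least n rows
    df[sample(rows, n, replace = FALSE), ]
  })
  do.call(rbind, parts)
}

balanced <- draw_balanced(toy, "y", 50)
table(balanced$y)  # 50 rows of each class
```

Calling `set.seed()` before the sampling step, as done here, makes the drawn rows reproducible as well as the model.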

2. Build the random forest model:

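> set.seed(111)  # fix the RNG so the forest itself is reproducible
> # importance = TRUE stores the variable importance measures,
> # proximity = TRUE stores the proximity matrix, ntree = 1000 grows 1000 trees
> bank.rf <- randomForest(y ~ ., data = sample, importance = TRUE, proximity = TRUE, ntree = 1000)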

3. Print the confusion matrix:

> bank.rf
 Call:
 randomForest(formula = y ~ ., data = sample, importance = TRUE,proximity = TRUE, ntree = 1000) 
 Type of random forest: classification
 Number of trees: 1000
 No. of variables tried at each split: 4
 OOB estimate of  error rate: 21.75%
 Confusion matrix:
     no yes class.error
no  151  49       0.245
yes  38 162       0.190
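The error figures in this output can be recomputed by hand from the confusion matrix. The sketch below uses base R only and copies the four counts from the output above. The variable names are illustrative, and the count of 16 predictors is an assumption based on the `select=c(1:17)` call (17 columns including `y`).

```r
# Confusion matrix from the output above
# (rows = actual class, columns = out-of-bag prediction)
cm <- matrix(c(151, 49,
               38, 162),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("no", "yes"), predicted = c("no", "yes")))

# Per-class error: misclassified cases in a row divided by the row total
class_error <- 1 - diag(cm) / rowSums(cm)
class_error  # no = 0.245, yes = 0.190

# Overall OOB error: all off-diagonal cases divided by all cases
oob_error <- 1 - sum(diag(cm)) / sum(cm)
oob_error  # 0.2175, i.e. 21.75%

# Default number of variables tried per split for classification is
# floor(sqrt(p)); with an assumed p = 16 predictors that gives 4
default_mtry <- floor(sqrt(16))
default_mtry  # 4
```

So 49 of the 200 "no" cases and 38 of the 200 "yes" cases were misclassified, 87 of 400 in total.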

4. Plot the importance of each variable:

> varImpPlot(bank.rf)

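[Figure: variable importance plot, with MeanDecreaseAccuracy in the left panel and MeanDecreaseGini in the right panel]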

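Notes
- MeanDecreaseAccuracy (left panel)
  Measures how much the forest's prediction accuracy drops when the values of one variable are randomly permuted in the out-of-bag data. The larger the value, the more important the variable.
- MeanDecreaseGini (right panel)
  Measures how much splits on a variable reduce node heterogeneity (Gini impurity) across the trees, which gives a second basis for comparing variables. The larger the value, the more important the variable.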
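The idea behind MeanDecreaseAccuracy can be illustrated by hand. The following is a minimal sketch under stated assumptions: the data are synthetic, a logistic regression (`glm`) stands in for the random forest so the example runs in base R, and accuracy is measured on a held-out split rather than on out-of-bag cases. It is not how `randomForest` computes the measure internally, only the same permutation idea.

```r
# Hypothetical data: x1 drives the label, x2 is pure noise
set.seed(1)
n <- 600
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- factor(ifelse(dat$x1 + rnorm(n, sd = 0.5) > 0, "yes", "no"))

train <- dat[1:400, ]
test  <- dat[401:600, ]

# Stand-in model: logistic regression, not a random forest
fit <- glm(y ~ x1 + x2, data = train, family = binomial)

# Share of correctly predicted labels on newdata
accuracy <- function(model, newdata) {
  p <- predict(model, newdata = newdata, type = "response")
  pred <- ifelse(p > 0.5, "yes", "no")
  mean(pred == newdata$y)
}

baseline <- accuracy(fit, test)

# Permute one column at a time and record the drop in accuracy
acc_drop <- sapply(c("x1", "x2"), function(v) {
  shuffled <- test
  shuffled[[v]] <- sample(shuffled[[v]])
  baseline - accuracy(fit, shuffled)
})
acc_drop  # expect a large drop for x1 and a drop near zero for x2
```

A variable whose permutation barely changes accuracy carries little information the model relies on, which is why larger values indicate more important variables.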
Conclusion:
The plot shows that "duration" is the most important variable.
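A ranking like this can also be read as numbers instead of from the plot. The sketch below uses the `importance()` function from the randomForest package on the built-in `iris` data, which stands in for the bank sample. The object name `iris.rf` and the choice of 500 trees are illustrative assumptions.

```r
library(randomForest)

# Built-in iris data stands in for the bank sample
set.seed(111)
iris.rf <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 500)

# type = 1: mean decrease in accuracy; type = 2: mean decrease in Gini
imp <- importance(iris.rf, type = 1)

# Sort variables from most to least important
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
```

For the model in these notes, the equivalent call would be `importance(bank.rf, type = 1)`.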
