\(\leftarrow\) Ka|Ve

This study draws on the chapter “Sağlık Bilimlerinde R ile Veri Madenciliği” (Data Mining with R in Health Sciences), written by Dr. Çiğdem Selçukcan Erol, in the book R ile Veri Madenciliği Uygulamaları (Data Mining Applications with R).

Obtaining the data

We download the health dataset, which contains 215 patients and 6 attributes, from the UCI Machine Learning Repository.

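# new-thyroid.data has no header row; fields are comma-separated
veri_linki <- "https://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/new-thyroid.data"
# read.table() already returns a data frame, so no as.data.frame() wrapper is needed
veri <- read.table(veri_linki, header = FALSE, sep = ",", dec = ".")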


Let us name the rows and columns.

rownames(veri) <- paste0("Hasta", 1:dim(veri)[1])
colnames(veri) <- c("Diyagnoz", "RT3U", "T4", "T3", "TSH","DTSH")
veri

An overview of the dataset

summary(veri)
    Diyagnoz          RT3U             T4               T3       
 Min.   :1.000   Min.   : 65.0   Min.   : 0.500   Min.   : 0.20  
 1st Qu.:1.000   1st Qu.:103.0   1st Qu.: 7.100   1st Qu.: 1.35  
 Median :1.000   Median :110.0   Median : 9.200   Median : 1.70  
 Mean   :1.442   Mean   :109.6   Mean   : 9.805   Mean   : 2.05  
 3rd Qu.:2.000   3rd Qu.:117.5   3rd Qu.:11.300   3rd Qu.: 2.20  
 Max.   :3.000   Max.   :144.0   Max.   :25.300   Max.   :10.00  
      TSH             DTSH       
 Min.   : 0.10   Min.   :-0.700  
 1st Qu.: 1.00   1st Qu.: 0.550  
 Median : 1.30   Median : 2.000  
 Mean   : 2.88   Mean   : 4.199  
 3rd Qu.: 1.70   3rd Qu.: 4.100  
 Max.   :56.40   Max.   :56.300  

Let us visualize the data statistics.

boxplot(veri)

Box plots of the dataset

pairs(~T3+T4+TSH+Diyagnoz, data = veri)

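Diyagnoz (the target attribute) is coded 1, 2, 3, so it can be treated as a categorical variable with 3 classes. Let us convert it into a factor with three named levels: otrioid (euthyroid, i.e. normal), hiper (hyperthyroid) and hipo (hypothyroid).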

hedef_veri <- as.factor(x = veri[[1]])
# install.packages("plyr")
library(plyr)
veri$Diyagnoz <- revalue(hedef_veri,c("1"="otrioid", "2"="hiper", "3"="hipo"))
summary(veri)
    Diyagnoz        RT3U             T4               T3       
 otrioid:150   Min.   : 65.0   Min.   : 0.500   Min.   : 0.20  
 hiper  : 35   1st Qu.:103.0   1st Qu.: 7.100   1st Qu.: 1.35  
 hipo   : 30   Median :110.0   Median : 9.200   Median : 1.70  
               Mean   :109.6   Mean   : 9.805   Mean   : 2.05  
               3rd Qu.:117.5   3rd Qu.:11.300   3rd Qu.: 2.20  
               Max.   :144.0   Max.   :25.300   Max.   :10.00  
      TSH             DTSH       
 Min.   : 0.10   Min.   :-0.700  
 1st Qu.: 1.00   1st Qu.: 0.550  
 Median : 1.30   Median : 2.000  
 Mean   : 2.88   Mean   : 4.199  
 3rd Qu.: 1.70   3rd Qu.: 4.100  
 Max.   :56.40   Max.   :56.300  
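
The same relabeling can also be expressed with base R alone. The following is a minimal sketch, assuming a toy vector kodlar that is not part of the thyroid data; it only illustrates how factor() pairs each code in levels with the label at the same position in labels.

```r
# Toy diagnosis codes (hypothetical values, not the thyroid data)
kodlar <- c(1, 1, 2, 3, 1, 2)

# factor() maps each entry of `levels` to the entry of `labels` at the same position
etiketli <- factor(kodlar,
                   levels = c(1, 2, 3),
                   labels = c("otrioid", "hiper", "hipo"))

# Count how many observations fall into each named class
print(table(etiketli))
```

This avoids the extra plyr dependency; revalue() remains convenient when only some of the levels need new names.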

Let us split the data into two parts, egitim_verisi (70%) and test_verisi (30%). We use the caret package for this.

# install.packages("caret")
library(caret)
Loading required package: lattice
Loading required package: ggplot2
set.seed(123)
egitim_indeks <- createDataPartition( y = veri$Diyagnoz, p = 0.7, list = FALSE)
egitim_verisi <- veri[egitim_indeks,]
test_verisi <- veri[-egitim_indeks,]
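
Because createDataPartition() is given the class labels as y, the split is stratified by class. To illustrate the idea, here is a simplified base-R sketch on made-up labels. The names sinif and egitim_idx are illustrative, and this is not caret's actual implementation; it only shows how sampling about 70% of the rows within each class keeps the class proportions of the full data.

```r
set.seed(123)

# Toy class labels with an imbalanced distribution (hypothetical data)
sinif <- factor(rep(c("otrioid", "hiper", "hipo"), times = c(20, 10, 10)))

# Sample about 70% of the row indices within each class separately
egitim_idx <- unlist(lapply(split(seq_along(sinif), sinif), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

# Class proportions of the full vector and of the training part
print(prop.table(table(sinif)))
print(prop.table(table(sinif[egitim_idx])))
```

A plain sample() over all rows would give the same proportions only on average, which matters here because the hiper and hipo classes are small.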

Classification with a C4.5 Decision Tree

C4.5 can handle both categorical and continuous attributes. We use the RWeka package, whose J48() function is Weka's implementation of C4.5.

# install.packages("RWeka")
library(RWeka)
Siniflandirma_kurallari <- J48(Diyagnoz ~ ., data = egitim_verisi)
show(Siniflandirma_kurallari)
J48 pruned tree
------------------

T4 <= 5.6: hipo (21.0/1.0)
T4 > 5.6
|   T4 <= 13.8
|   |   T3 <= 2.5: otrioid (102.0/2.0)
|   |   T3 > 2.5
|   |   |   T4 <= 11.9: otrioid (3.0)
|   |   |   T4 > 11.9: hiper (3.0)
|   T4 > 13.8: hiper (22.0/1.0)

Number of Leaves  :     5

Size of the tree :  9
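
The numeric cut points in the printed tree (for example T4 <= 5.6) are how a C4.5-style tree handles continuous attributes: it turns them into binary tests. The sketch below is a simplified illustration of an entropy-based threshold search, not J48's exact procedure. The data and the names entropi, bilgi_kazanci, t4 and sinif are made up for the example, and using midpoints between sorted values as candidates is an assumption of this sketch.

```r
# Shannon entropy (in bits) of a vector of class labels
entropi <- function(etiket) {
  p <- table(etiket) / length(etiket)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of splitting the numeric vector x at "x <= esik"
bilgi_kazanci <- function(x, etiket, esik) {
  sol <- etiket[x <= esik]
  sag <- etiket[x > esik]
  entropi(etiket) -
    (length(sol) / length(etiket)) * entropi(sol) -
    (length(sag) / length(etiket)) * entropi(sag)
}

# Toy data (hypothetical): low values are "hipo", high values are "otrioid"
t4    <- c(2.1, 3.4, 4.8, 6.0, 7.5, 9.2)
sinif <- c("hipo", "hipo", "hipo", "otrioid", "otrioid", "otrioid")

# Candidate thresholds: midpoints between consecutive sorted values
sirali    <- sort(unique(t4))
adaylar   <- (head(sirali, -1) + tail(sirali, -1)) / 2
kazanclar <- sapply(adaylar, function(e) bilgi_kazanci(t4, sinif, e))

# The candidate with the largest information gain becomes the split point
en_iyi_esik <- adaylar[which.max(kazanclar)]
print(en_iyi_esik)
```

On this toy data the best candidate lies between 4.8 and 6.0, where the two classes separate perfectly.
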
summary(Siniflandirma_kurallari)

=== Summary ===

Correctly Classified Instances         147               97.351  %
Kappa statistic                          0.9436
Mean absolute error                      0.0342
Root mean squared error                  0.1308
Relative absolute error                 10.8452 %
Root relative squared error             33.0637 %
Total Number of Instances              151     

=== Confusion Matrix ===

   a   b   c   <-- classified as
 103   1   1 |   a = otrioid
   1  24   0 |   b = hiper
   1   0  20 |   c = hipo
plot(Siniflandirma_kurallari)
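
The kappa statistic reported by summary() can be checked by hand from the training confusion matrix. The sketch below retypes that matrix (rows are the actual classes, columns the predicted ones) and applies the usual Cohen's kappa formula; the variable names km, po, pe and kappa_degeri are illustrative.

```r
# Training confusion matrix as printed by summary() (rows: actual, columns: predicted)
km <- matrix(c(103,  1,  1,
                 1, 24,  0,
                 1,  0, 20),
             nrow = 3, byrow = TRUE,
             dimnames = list(GERCEK = c("otrioid", "hiper", "hipo"),
                             TAHMIN = c("otrioid", "hiper", "hipo")))

n  <- sum(km)
po <- sum(diag(km)) / n                      # observed agreement (accuracy)
pe <- sum(rowSums(km) * colSums(km)) / n^2   # agreement expected by chance
kappa_degeri <- (po - pe) / (1 - pe)

print(round(c(dogruluk = po, kappa = kappa_degeri), 4))
```

The result agrees with the 97.351 % accuracy and the kappa of 0.9436 shown in the summary. Kappa is lower than accuracy because it discounts the agreement that the large otrioid class would produce by chance.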

Let us apply the classification rules learned from the training data to the test data.

kestirim <- predict(Siniflandirma_kurallari, test_verisi, type = "class")
show(kestirim)
 [1] otrioid hipo    otrioid otrioid otrioid otrioid otrioid otrioid otrioid
[10] otrioid otrioid otrioid otrioid otrioid otrioid otrioid otrioid otrioid
[19] otrioid otrioid otrioid otrioid hiper   otrioid otrioid otrioid otrioid
[28] otrioid otrioid otrioid otrioid otrioid otrioid otrioid otrioid otrioid
[37] otrioid otrioid otrioid otrioid otrioid otrioid otrioid otrioid otrioid
[46] hiper   otrioid hiper   hiper   hiper   hiper   hiper   hiper   hiper  
[55] hiper   otrioid hipo    hipo    otrioid hipo    hipo    hipo    hipo   
[64] hipo   
Levels: otrioid hiper hipo

Let us build the confusion matrix.

karisiklik_matrisi <- table(test_verisi$Diyagnoz, kestirim, dnn = c("GERCEK", "TAHMIN"))
show(karisiklik_matrisi)
         TAHMIN
GERCEK    otrioid hiper hipo
  otrioid      43     1    1
  hiper         1     9    0
  hipo          2     0    7
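# Linear (column-major) indices of the diagonal cells of an r x r matrix
r <- nrow(karisiklik_matrisi)
kosegen <- seq_len(r) + (seq_len(r) - 1) * r
# Accuracy = correctly classified test rows / all test rows
dogruluk <- sum(karisiklik_matrisi[kosegen]) / sum(karisiklik_matrisi)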
show(paste("Doğruluk = ", dogruluk))
[1] "Doğruluk =  0.921875"
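
Overall accuracy hides how each class fares, which matters here because hiper and hipo have only 10 and 9 test rows. As a hedged sketch, the test confusion matrix can be retyped and per-class recall and precision read off with diag(), rowSums() and colSums(). The names km_test, duyarlilik and kesinlik are illustrative.

```r
# Test confusion matrix of the C4.5 tree (rows: actual, columns: predicted)
km_test <- matrix(c(43, 1, 1,
                     1, 9, 0,
                     2, 0, 7),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(GERCEK = c("otrioid", "hiper", "hipo"),
                                  TAHMIN = c("otrioid", "hiper", "hipo")))

# Accuracy: diagonal (correct) cells over all cells
dogruluk_test <- sum(diag(km_test)) / sum(km_test)

# Recall: share of each actual class that was predicted correctly
duyarlilik <- diag(km_test) / rowSums(km_test)

# Precision: share of each predicted class that was actually correct
kesinlik <- diag(km_test) / colSums(km_test)

print(dogruluk_test)
print(round(rbind(duyarlilik, kesinlik), 3))
```

diag() picks the same cells as the hand-built diagonal index, so the accuracy matches the value printed above, while the recall row shows that the hipo class is the weakest for this tree.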

Classification with the Random Forest Algorithm

# install.packages("randomForest")
library(randomForest)
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:ggplot2’:

    margin
orman <- randomForest(Diyagnoz ~., data = egitim_verisi)
show(orman)

Call:
 randomForest(formula = Diyagnoz ~ ., data = egitim_verisi) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 3.97%
Confusion matrix:
        otrioid hiper hipo class.error
otrioid     102     1    2  0.02857143
hiper         1    24    0  0.04000000
hipo          2     0   19  0.09523810
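
The OOB (out-of-bag) error above is computed without touching the test set: each tree is grown on a bootstrap sample of the training rows, and the rows that a tree never saw are used to score it. In the matrix above, 6 of the 151 training rows are misclassified this way (6/151 ≈ 3.97%). The sketch below illustrates only the bootstrap step for one tree, assuming sampling with replacement at the full training size; the variable names are illustrative.

```r
set.seed(123)
n <- 151  # number of training rows, as in egitim_verisi

# One bootstrap sample: n row indices drawn with replacement
bootstrap_idx <- sample.int(n, size = n, replace = TRUE)

# Rows never drawn are "out-of-bag" for this tree
oob_idx   <- setdiff(seq_len(n), bootstrap_idx)
oob_orani <- length(oob_idx) / n

# Expected out-of-bag share: probability that a given row is missed n times in a row
teorik <- (1 - 1 / n)^n

print(c(gozlenen = oob_orani, teorik = teorik))
```

Roughly a third of the rows are out-of-bag for any single tree, so across 500 trees every training row is scored by many trees that did not train on it.
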
# Predict on the test data
kestirim_orman <- predict(orman, test_verisi, type = "class")
karisiklik_matrisi_orman <- table(test_verisi$Diyagnoz, kestirim_orman, dnn = c("GERCEK", "TAHMIN"))
show(karisiklik_matrisi_orman)  
         TAHMIN
GERCEK    otrioid hiper hipo
  otrioid      45     0    0
  hiper         1     9    0
  hipo          0     0    9
dogruluk_orman <- sum(karisiklik_matrisi_orman[kosegen]) / sum(karisiklik_matrisi_orman)
show(paste("Doğruluk = ", dogruluk_orman))
[1] "Doğruluk =  0.984375"

Comparison of the C4.5 and Random Forest methods

karsilastirma <- data.frame(c(dogruluk, dogruluk_orman)) 
colnames(karsilastirma) <- "Dogruluk"
rownames(karsilastirma) <- c("C4.5", "Orman")
show(karsilastirma)
      Dogruluk
C4.5  0.921875
Orman 0.984375

Bonus: Running Python code in an R Notebook

# -*- coding: utf-8 -*-
# Test: Python code runs in an R Notebook.
report = """{} mumdur:)"""  
for i in range(1,4):
  print(report.format(i))  
1 mumdur:)
2 mumdur:)
3 mumdur:)

