Chapter 06. Statistical and Machine Learning

2017. 10. 29. 13:42

※ This post summarizes my study notes on the book <Mastering Python Data Visualization> by Kirthi Raman.

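Machine learning has recently become one of the most important parts of artificial intelligence.

Advances in computing performance, such as processing speed, have made it far more feasible to build AI systems through machine learning.

Linear Regression

Linear regression models the relationship between given input variables (X) and a response variable (Y) as a linear equation.

We fit a linear regression to the following data. The target response variable (Y) is acceptance.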

import pandas as pd

import statsmodels.formula.api as smf

from matplotlib import pyplot as plt

df = pd.read_csv('./sports.csv', index_col=0)

fig, axs = plt.subplots(1, 3, sharey=True)

df.plot(kind='scatter', x='sports', y='acceptance', ax=axs[0], figsize=(16, 8))

df.plot(kind='scatter', x='music', y='acceptance', ax=axs[1])

df.plot(kind='scatter', x='academic', y='acceptance', ax=axs[2])

# Fit the model

lm = smf.ols(formula='acceptance ~ music', data=df).fit()

X_new = pd.DataFrame({'music': [df.music.min(), df.music.max()]})

preds = lm.predict(X_new)

df.plot(kind='scatter', x='music', y='acceptance', figsize=(12,12), s=50)

plt.title("Linear Regression - Fitting Music vs Acceptance Rate", fontsize=20)

plt.xlabel("Music", fontsize=16)

plt.ylabel("Acceptance", fontsize=16)

# Plot the least-squares line

plt.plot(X_new, preds, c='red', linewidth=2)

Many mathematical Python libraries support linear regression.

The ones most commonly used and worth remembering are scikit-learn, seaborn, statsmodels, and mlpy.

Reference : http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
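As a point of comparison with the statsmodels code above, here is a minimal sketch of the same kind of fit using scikit-learn's LinearRegression. The data is synthetic and purely an assumption for illustration (a line with slope 2 and intercept 1 plus noise); it is not the sports.csv data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): y = 2.0 * x + 1.0 plus small Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)

# scikit-learn expects a 2-D feature matrix of shape (n_samples, n_features).
model = LinearRegression().fit(x.reshape(-1, 1), y)
slope = model.coef_[0]
intercept = model.intercept_
print("slope:", slope, "intercept:", intercept)
```

The fitted slope and intercept should land close to the values used to generate the data, which is a quick sanity check before moving on to real data.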

Decision Tree

We look at an example that decides whether the weather makes it a good or bad day for outdoor activities.

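from io import StringIO

from sklearn import tree
import pydot

# Each row is one observation with four categorical features encoded as integers.
X = [[1,1,1,0],[1,1,1,1],[2,1,1,0],[2,3,2,1],[1,2,1,0],[1,3,2,0],
     [3,2,1,0],[3,3,2,0],[3,3,2,1],[3,2,2,0],[1,2,2,1],[2,2,1,1],
     [2,1,2,0],[3,2,1,0]]
# Class label for each row.
Y = [0,0,1,1,0,1,1,1,0,1,1,1,1,0]

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

# Export the fitted tree in Graphviz DOT format and render it to a PDF.
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data)
# Recent pydot versions return a list of graphs, so take the first one.
graphs = pydot.graph_from_dot_data(dot_data.getvalue())
graphs[0].write_pdf("game.pdf")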
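If Graphviz is not installed, the learned rules can also be inspected as text. The following is a small sketch using the same X and Y as above; the feature names f0 to f3 are placeholders I made up, since the post does not say what each column means.

```python
from sklearn import tree

X = [[1,1,1,0],[1,1,1,1],[2,1,1,0],[2,3,2,1],[1,2,1,0],[1,3,2,0],
     [3,2,1,0],[3,3,2,0],[3,3,2,1],[3,2,2,0],[1,2,2,1],[2,2,1,1],
     [2,1,2,0],[3,2,1,0]]
Y = [0,0,1,1,0,1,1,1,0,1,1,1,1,0]

clf = tree.DecisionTreeClassifier(random_state=0).fit(X, Y)

# Print the tree as indented if/else rules (feature names are hypothetical).
rules = tree.export_text(clf, feature_names=["f0", "f1", "f2", "f3"])
print(rules)

# Classify one observation and measure accuracy on the training rows.
prediction = clf.predict([[2, 1, 1, 0]])[0]
train_accuracy = clf.score(X, Y)
print(prediction, train_accuracy)
```

Note that the row [3,2,1,0] appears twice in the data with different labels, so no tree can classify every training row correctly.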

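Naive Bayes Classifier

Before looking at naive Bayes classification, we need to understand Bayes' theorem.

#. The Bayes Theorem

Suppose every person belongs to a universal set U. Let A be the set of people who have breast cancer and B the set of people diagnosed with breast cancer by a screening test. The two sets overlap in the intersection A∩B.

Two other regions deserve attention.

(1) B - A∩B : people who tested positive but do not actually have breast cancer

(2) A - A∩B : people who tested negative but actually have breast cancer

Suppose the information we want is the probability that a person who tested positive actually has breast cancer.

Mathematically, this is the conditional probability of A (actually having breast cancer) given B (a positive diagnosis): P(A|B) = P(A∩B) / P(B).

(In other words, because we know that B occurred, the sample space for the probability calculation narrows from U to B.)

Similarly, the probability that the test is positive for a person known to have cancer is P(B|A) = P(A∩B) / P(A).

Eliminating P(A∩B) from the two expressions gives Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B).

The theorem applies to events A and B whenever the probability of B is nonzero.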
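To make the theorem concrete, here is a small numeric sketch. The prevalence, sensitivity, and false positive rate below are invented numbers chosen only for illustration; they are not taken from the book or from any real screening data.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return P(A|B) = P(B|A) P(A) / P(B).

    prior               -- P(A), probability of having the disease
    sensitivity         -- P(B|A), probability of a positive test given disease
    false_positive_rate -- P(B|not A), probability of a positive test without disease
    """
    # P(B) by the law of total probability.
    p_b = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / p_b

# Hypothetical inputs: 1% prevalence, 90% sensitivity, 8% false positives.
p = posterior(prior=0.01, sensitivity=0.9, false_positive_rate=0.08)
print(round(p, 4))
```

With these made-up inputs the posterior comes out at roughly 10%, which shows how a positive result for a rare condition can still leave the actual probability of disease fairly low.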

#. Naive Bayes Classifier

Naive Bayes classification is well suited to inputs with high dimensionality (a large number of variables).

It is relatively simple, yet often performs well.

(References : http://scikit-learn.org/stable/modules/naive_bayes.html and http://sebastianraschka.com/Articles/2014_naive_bayes_1.html )
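As a rough illustration of how little code a naive Bayes classifier needs, the sketch below fits scikit-learn's GaussianNB to two synthetic, well-separated groups of points. The data and class centers are assumptions made up for this example.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two hypothetical classes drawn around (0, 0) and (4, 4).
rng = np.random.default_rng(0)
class0 = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
class1 = rng.normal(loc=4.0, scale=1.0, size=(50, 2))
X = np.vstack([class0, class1])
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)

# Predict the class of points at each center, and the class probabilities
# of a point halfway between the two centers.
pred = clf.predict([[0.0, 0.0], [4.0, 4.0]])
proba = clf.predict_proba([[2.0, 2.0]])
print(pred, proba)
```

GaussianNB treats each feature as conditionally independent given the class, which is the "naive" assumption that keeps the model simple even when there are many features.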

Consider the following example.

Red : the group with breast cancer

Blue : the group that tested positive

White : a new person

From this data we can learn the following.

#. Naive Bayes classifier using the TextBlob library

TextBlob is a collection of tools for text processing.

It also provides an API for natural language processing (NLP) tasks such as classification, noun phrase extraction, part-of-speech tagging, and sentiment analysis.

TextBlob can be installed from the Anaconda Prompt with the following command.

conda install -c conda-forge textblob

Next, download the corpora needed for the examples as follows (also from the Anaconda Prompt).

python -m textblob.download_corpora

With TextBlob it is easy to build a custom text classifier.

As of TextBlob 0.6.0, the following classifiers are available.

- BaseClassifier

- DecisionTreeClassifier

- MaxEntClassifier

- NLTKClassifier

- NaiveBayesClassifier

- PositiveNaiveBayesClassifier

The following example performs sentiment analysis with NaiveBayesClassifier.

from textblob.classifiers import NaiveBayesClassifier

from textblob.blob import TextBlob

train = [

('I like this new tv show.', 'pos'),

('This is a very exciting event!', 'pos'),

('I feel very good about after I workout.', 'pos'),

('This is my most accomplished work.', 'pos'),

("What an awesome country", 'pos'),

("They have horrible service", 'neg'),

('I do not like this new restaurant', 'neg'),

('I am tired of waiting for my new book.', 'neg'),

("I can't deal with my toothache", 'neg'),

("The fun events in costa rica were amazing",'pos'),

('He is my worst boss!', 'neg'),

('People do have bad writing skills on facebook', 'neg')

]

test = [

('The beer was good.', 'pos'),

('I do not enjoy my job', 'neg'),

("I feel amazing!", 'pos'),

('Mark is a friend of mine.', 'pos'),

("I can't believe I was asked to do this.", 'neg')

]

cl = NaiveBayesClassifier(train)

print(cl.classify("The new movie was amazing.")) # "pos"

print(cl.classify("I don't like their noodles.")) # "neg"

print("Test Results")

cl.update(test)

# Classify a TextBlob

blob = TextBlob("The food was good. But the service was horrible. "

"My father was not pleased.", classifier=cl)

print(blob)

print(blob.classify())

for sentence in blob.sentences:
    print(sentence)
    print(sentence.classify())

# Compute accuracy

print("Accuracy: {0}".format(cl.accuracy(test)))

# Show the 10 most informative features

cl.show_informative_features(10)

Viewing Positive Sentiments using Word Clouds

Word clouds are a popular way to show how frequently each word appears in a document.

Each word is drawn at a size that depends on how often it occurs, so the most frequent word appears largest.

The wordcloud package can be installed from the Anaconda Prompt with the following command.

conda install -c conda-forge wordcloud

In the following example code, STOPWORDS is a set of words such as a, an, the, is, was, at, and in that carry no meaning for the analysis and are therefore excluded.

from wordcloud import WordCloud, STOPWORDS

import matplotlib.pyplot as plt

from os import path

d = path.dirname("__file__")

text = open(path.join(d, './results.txt')).read()

wordcloud = WordCloud(

#font_path='/Users/MacBook/kirthi/RemachineScript.ttf',

stopwords=STOPWORDS,

background_color='#222222',

width=1000,

height=800).generate(text)

# Open a plot of the generated image.

plt.figure(figsize=(13,13))

plt.imshow(wordcloud)

plt.axis("off")

plt.show()
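The sizing of a word cloud rests on a simple word count. The sketch below shows that counting step with only the standard library; the tiny stop-word set and the sample sentence are assumptions for illustration, and this is not how the wordcloud package is implemented internally.

```python
from collections import Counter
import re

# A tiny, hypothetical stop-word list; the wordcloud package ships a larger one.
STOP = {"a", "an", "the", "is", "was", "at", "in"}

def word_frequencies(text, stopwords=STOP):
    """Count how often each non-stop word occurs in text (case-insensitive)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)

freqs = word_frequencies(
    "The food was good. The service was good, the view was great."
)
print(freqs.most_common(3))
```

The word with the highest count here would be the one drawn largest in the cloud.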

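#. K-nearest Neighbors (KNN)

KNN is one of the easiest classification methods to understand, and it is particularly useful when there is little or no prior knowledge about the distribution of the data.

Consider the following data and example, which classify apples, pears, and bananas.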

import csv

import matplotlib.pyplot as plt

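# Shape category (x), weight (y), and a plot colour per fruit class (z).
x = []
y = []
z = []

with open('./fruits_data.csv', 'r') as csvf:
    reader = csv.reader(csvf, delimiter=',')
    for count, row in enumerate(reader):
        if count == 0:
            continue  # skip the header row
        # Convert to float so that matplotlib treats the values as numbers.
        x.append(float(row[0]))
        y.append(float(row[1]))
        if row[2] == 'Apple':
            z.append('r')
        elif row[2] == 'Pear':
            z.append('g')
        else:
            z.append('y')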

plt.figure(figsize=(11,11))

recs=[]

classes=['Apples', 'Pear', 'Bananas']

class_colours = ['r','g','y']

plt.title("Apples, Bananas and Pear by Weight and Shape", fontsize=18)

plt.xlabel("Shape category number", fontsize=14)

plt.ylabel("Weight in ounces", fontsize=14)

plt.scatter(x,y,s=600,c=z)

The plot produced by this code visualizes the given data.

( Red : apple / Green : pear / Yellow : banana )

Given the measurements of a fruit of unknown type, the following example code predicts what it is.

from math import sqrt

def determineFruit(xv, yv, threshold_radius):
    # Count the labelled fruits that lie within threshold_radius of (xv, yv)
    # and return the most common class among them.
    pear_count = 0
    apple_count = 0
    banana_count = 0
    for i in range(len(x)):
        xdif = float(x[i]) - xv
        ydif = float(y[i]) - yv
        # Euclidean distance from the query point to sample i.
        if sqrt(xdif**2 + ydif**2) < threshold_radius:
            if z[i] == 'g': pear_count += 1
            if z[i] == 'r': apple_count += 1
            if z[i] == 'y': banana_count += 1
    # Ties are resolved in the order apple, pear, banana.
    if apple_count >= pear_count and apple_count >= banana_count:
        return "apple"
    elif pear_count >= apple_count and pear_count >= banana_count:
        return "pear"
    else:
        return "banana"

determine = determineFruit(3.5, 6.2, 1)
print(determine)

banana
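The function above votes among all samples inside a fixed radius. For comparison, here is a sketch of the more usual k-nearest-neighbours rule, which votes among exactly the k closest samples. The fruit measurements are made-up values, not the contents of fruits_data.csv, and math.dist requires Python 3.8 or later.

```python
from collections import Counter
from math import dist

def knn_predict(points, labels, query, k=3):
    """Return the majority label among the k points closest to query."""
    # Sort sample indices by Euclidean distance to the query point.
    order = sorted(range(len(points)), key=lambda i: dist(points[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (shape category, weight) measurements with known labels.
points = [(1.0, 5.0), (1.2, 5.5), (0.8, 4.8),
          (2.0, 7.0), (2.2, 7.4),
          (3.4, 6.0), (3.6, 6.3), (3.5, 5.9)]
labels = ["apple", "apple", "apple",
          "pear", "pear",
          "banana", "banana", "banana"]

result = knn_predict(points, labels, (3.5, 6.2), k=3)
print(result)
```

Fixing k rather than a radius guarantees that every query gets the same number of votes, even in sparse regions of the data.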

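Logistic Regression

For a categorical response variable (Y = 1 or 0), logistic regression, which uses the sigmoid function, is a better fit than linear regression.

The reason is that a linear model, built as a sum of products of input variables and coefficients, ranges over (-∞, +∞), whereas the sigmoid function takes values in (0, 1).

The formula below expresses the probability of Y for p input variables using the sigmoid function.

P(Y = 1 | x_1, ..., x_p) = 1 / (1 + exp(-(β_0 + β_1 x_1 + ... + β_p x_p)))

Plotting the standard sigmoid function in Python makes it easier to see why this function is used.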

import matplotlib.pyplot as plt

import numpy as np

plt.title("Sigmoid Functions vs LineSpace")

x = np.linspace(-10,10,100)

y1 = 1.0 / (1.0+np.exp(-x))

plt.plot(x,y1,'r-',lw=2)

y2 = 1.0 / (1.0+np.exp(-x/2))

plt.plot(x,y2,'g-',lw=2)

y3 = 1.0 / (1.0+np.exp(-x/10))

plt.plot(x,y3,'b-',lw=2)

plt.xlabel("x")

plt.ylabel("y")

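Note that the coefficient of x decreases in the order red (1), green (1/2), blue (1/10).

For the red curve, the probability that y is 1 rises sharply as x increases, whereas the blue curve has a much gentler slope.

The values also always lie in the range (0, 1), so they can be used as the probability of belonging to the class.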
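The two observations above can be checked numerically. This sketch evaluates the sigmoid for the same three coefficients and also computes the slope at x = 0, which for sigmoid(a * x) equals a / 4; the helper name sigmoid is my own.

```python
import numpy as np

def sigmoid(z):
    """Standard logistic function, mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-10, 10, 101)
results = {}
for a in (1.0, 0.5, 0.1):
    y = sigmoid(a * x)
    # Derivative of sigmoid(a * x) at x = 0 is a * s * (1 - s) with s = 0.5.
    slope_at_zero = a * sigmoid(0.0) * (1.0 - sigmoid(0.0))
    results[a] = (y.min(), y.max(), slope_at_zero)
    print(a, y.min(), y.max(), slope_at_zero)
```

A larger coefficient gives a steeper transition around x = 0, while the output stays strictly between 0 and 1 in every case.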

Now let us apply logistic regression to the dataset that Kaggle provides for the 'Titanic survivor prediction' task.

import numpy as np

import pandas as pd

import sklearn.linear_model as lm

import sklearn.model_selection as cv  # sklearn.cross_validation was removed in scikit-learn 0.20

import matplotlib.pyplot as plt

train = pd.read_csv('./titanic_train.csv')

test = pd.read_csv('./titanic_test.csv')

train[train.columns[[2,4,5,1]]].head()

   Pclass     Sex   Age  Survived
0       3    male  22.0         0
1       1  female  38.0         1
2       3  female  26.0         1
3       1  female  35.0         1
4       3    male  35.0         0

data = train[['Sex', 'Age', 'Pclass', 'Survived']].copy()

data['Sex'] = data['Sex'] == 'female'

data = data.dropna()

data_np = data.astype(np.int32).values

X = data_np[:,:-1]

y = data_np[:,-1]

female = X[:,0] == 1

survived = y == 1

# This vector contains the age of the passengers.

age = X[:,1]

# We compute a few histograms.

bins_ = np.arange(0, 121, 5)

S = {'male': np.histogram(age[survived & ~female],

bins=bins_)[0],

'female': np.histogram(age[survived & female],

bins=bins_)[0]}

D = {'male': np.histogram(age[~survived & ~female],

bins=bins_)[0],

'female': np.histogram(age[~survived & female],

bins=bins_)[0]}

bins = bins_[:-1]

plt.figure(figsize=(15,8))

for i, sex, color in zip((0, 1),('male', 'female'), ('#3345d0', '#cc3dc0')):
    plt.subplot(121 + i)
    plt.bar(bins, S[sex], bottom=D[sex], color=color,
            width=5, label='Survived')
    plt.bar(bins, D[sex], color='#aaaaff', width=5, label='Died', alpha=0.4)
    plt.xlim(0, 80)
    plt.grid(None)
    plt.title(sex + " Survived")
    plt.xlabel("Age (years)")
    plt.legend()

(X_train, X_test, y_train, y_test) = cv.train_test_split(X, y, test_size=.05)

print(X_train, y_train)

# Logistic Regression from linear_model

logreg = lm.LogisticRegression();

logreg.fit(X_train, y_train)

y_predicted = logreg.predict(X_test)

plt.figure(figsize=(15,8));

plt.imshow(np.vstack((y_test, y_predicted)),

interpolation='none', cmap='bone');

plt.xticks([]); plt.yticks([]);

plt.title(("Actual and predicted survival outcomes on the test set"))

logreg.coef_

array([[ 2.40922883, -0.03177726, -1.13610297]])
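One way to read these coefficients is to exponentiate them into odds ratios. The sketch below uses the values printed above, which come from one random train/test split and will vary between runs; the column order (Sex, Age, Pclass) follows the data frame built earlier.

```python
import numpy as np

# Coefficients as reported above, in the column order Sex, Age, Pclass.
coef = np.array([2.40922883, -0.03177726, -1.13610297])
names = ["Sex (female=1)", "Age", "Pclass"]

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in that variable, holding the others fixed.
odds_ratios = np.exp(coef)
for name, ratio in zip(names, odds_ratios):
    print(name, round(float(ratio), 3))
```

Under this model, being female multiplies the odds of survival by roughly eleven, while each additional year of age and each step down in passenger class reduce the odds.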

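Support Vector Machine (SVM)

SVM is a supervised learning algorithm that can be applied to both numeric and categorical response variables (Y), and it is known empirically to perform well.

It has become established as a successful extension to non-linear models in many areas, including bioinformatics, text, and speech recognition.

SVM has the advantage that it is not computationally complex and is easy to implement, but it has the drawback that it tends to underfit, which can lower accuracy.

The following is a Python example that uses the scikit-learn (sklearn) library.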

import numpy as np

from sklearn.svm import SVR

import matplotlib.pyplot as plt

X = np.sort(5 * np.random.rand(40, 1), axis=0)

y = (np.cos(X)+np.sin(X)).ravel()

y[::5] += 3 * (0.5 - np.random.rand(8))

svr_rbfmodel = SVR(kernel='rbf', C=1e3, gamma=0.1)

svr_linear = SVR(kernel='linear', C=1e3)

svr_polynom = SVR(kernel='poly', C=1e3, degree=2)

y_rbfmodel = svr_rbfmodel.fit(X, y).predict(X)

y_linear = svr_linear.fit(X, y).predict(X)

y_polynom = svr_polynom.fit(X, y).predict(X)

plt.figure(figsize=(11,11))

plt.scatter(X, y, c='k', label='data')


plt.plot(X, y_rbfmodel, c='g', label='RBF model')

plt.plot(X, y_linear, c='r', label='Linear model')

plt.plot(X, y_polynom, c='b', label='Polynomial model')

plt.xlabel('data')

plt.ylabel('target')

plt.title('Support Vector Regression')

plt.legend()

plt.show()
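Besides comparing the three curves by eye, the fits can be compared numerically. This sketch regenerates data in the same way as above (with a fixed seed, which is my addition) and reports the mean squared error of each kernel on the training points; it measures fit to the training data only, not generalization.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Same data recipe as the example above, with a seed for reproducibility.
rng = np.random.default_rng(0)
X = np.sort(5 * rng.random((40, 1)), axis=0)
y = (np.cos(X) + np.sin(X)).ravel()
y[::5] += 3 * (0.5 - rng.random(8))  # add noise to every fifth target

models = {
    "rbf": SVR(kernel="rbf", C=1e3, gamma=0.1),
    "linear": SVR(kernel="linear", C=1e3),
    "poly": SVR(kernel="poly", C=1e3, degree=2),
}

# Fit each model and compute its training mean squared error.
errors = {name: mean_squared_error(y, m.fit(X, y).predict(X))
          for name, m in models.items()}
for name, err in errors.items():
    print(name, err)
```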

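k-Means Clustering

k-means clustering finds the k points in a dataset that best represent the centers of k regions of that dataset.

A drawback of the algorithm is that the analyst must choose the number of clusters (k) before running it.

Nevertheless, k-means is very widely used for clustering, and one of its strengths is that it does not require any assumptions to be specified beforehand.

The k-means algorithm can be summarized as follows.

- A set of n points (x, y) and a set of k centroids are given. (The centroids are initially assigned at random.)

- For each point (x, y), find the nearest centroid and assign the point to it.

- Compute the mean of each cluster and use these values as the next set of k centroids.

- Repeat until the centroids no longer change.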
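The steps above can be sketched directly in NumPy. This is a minimal illustration under my own assumptions (random initial centroids picked from the data, a cap on iterations, synthetic points around three made-up centers), not the implementation used by scikit-learn.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means: returns (centroids, labels) for an (n, 2) array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid.
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical data: 30 points around each of three centers.
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
                    for c in ([2, 2], [6, 6], [10, 2])])
centroids, labels = kmeans(points, k=3)
print(centroids)
```

Because the starting centroids are random, a single run can settle in a poor local optimum, which is one reason library implementations typically restart several times and keep the best result.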

The following example is Python code that implements k-means clustering with the scikit-learn library.

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

import csv

x=[]

y=[]

with open('./cluster_input.csv', 'r') as csvf:
    reader = csv.reader(csvf, delimiter=',')
    for row in reader:
        x.append(float(row[0]))
        y.append(float(row[1]))

# Pair the coordinates into [x, y] rows, one per point.
data = [[xi, yi] for xi, yi in zip(x, y)]

plt.figure(figsize=(10,10))

plt.xlim(0,12)

plt.ylim(0,12)

plt.xlabel("X values",fontsize=14)

plt.ylabel("Y values", fontsize=14)

plt.title("Before Clustering ", fontsize=20)

plt.plot(x, y, '.', color='#0080ff', markersize=35, alpha=0.6)

kmeans = KMeans(init='k-means++', n_clusters=3, n_init=10)

kmeans.fit(data)

plt.figure(figsize=(10,10))

plt.xlim(0,12)

plt.ylim(0,12)

plt.xlabel("X values",fontsize=14)

plt.ylabel("Y values", fontsize=14)

plt.title("After K-Means Clustering (from scikit-learn)", fontsize=20)

plt.plot(x, y, '.', color='#ffaaaa', markersize=45, alpha=0.6)

# Plot the centroids as a blue X

centroids = kmeans.cluster_centers_

plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=200,

linewidths=3, color='b', zorder=10)

plt.show()
