k平均法(k-Means)

k平均法(k-Means)#

クラスタリングにおいて最もシンプルで一般的な手法であるk-Meansについて学びましょう。

k-Meansとは#

概要#

k-meansとは、教師なし学習の一種です。クラスタリングに利用されます。

●k-NNの立ち位置

機械学習
- 教師あり学習
- 教師なし学習
  - クラスタリング
    - 階層型クラスタリング
    - 非階層型クラスタリング
      - ハードクラスタリング
        
        k-Means
      - ソフトクラスタリング

さて、教師なし学習であるクラスタリングは教師あり学習とは異なり、パラメータの学習に「訓練ラベル（教師ラベル）」を利用しません。

K-Meansのパラメータ学習段階では、何らかの指標に基づいて、データを任意の数のクラスタ（グループ）に振り分けます。
このクラスタ（cluster）はクラス（class）とは別物ですので注意してください。

アルゴリズム#

各点\(x_i\)に対してランダムにクラスタを割り振る

各クラスタに割り当てられた点について重心を計算する

各点について上記で計算された重心からの距離を計算し、距離が一番近いクラスタに割り当て直す。

2.と3.の工程を、割り当てられるクラスタが変化しなくなるまで行う

(引用 https://qiita.com/g-k/items/0d5d22a12a4507ecbf11)

k-Meansの仕組みについては K-means 法を D3.js でビジュアライズしてみた - てっく煮ブログが非常に分かりやすいので，確認してください．

k-MeansのScikit-Learn実装を使った演習#

パッケージの用意とデータの確認#

k-Meansをsklearnを使って試してみましょう。

インポートするパッケージは以下の通りです。(matplotlibとplotlyをimportしていますが、どちらか得意な方を利用すれば問題ありません）

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
SEED = 2022

さて、まずはデータを読み込みましょう。
今回はwine datasetを利用します。

Note

wine datasetは以下で公開されています．

UCI Machine Learning Repository

また，scikit-learnのdatasetsパッケージにはload_wine関数が良いされているので，公開元から直接ダウンロードする必要はありません．

sklearn.datasets.load_wine

load_wine関数にはas_frameというオプションがあります。これをTrueにすることで、DataFrameを含んだ返り値をくれるので、使ってみましょう。

wine = load_wine(as_frame=True)
wine_df = pd.DataFrame(wine.frame)
wine_df.head(20)

	alcohol	malic_acid	ash	alcalinity_of_ash	magnesium	total_phenols	flavanoids	nonflavanoid_phenols	proanthocyanins	color_intensity	hue	od280/od315_of_diluted_wines	proline
0	14.23	1.71	2.43	15.6	127.0	2.80	3.06	0.28	2.29	5.64	1.04	3.92	1065.0
1	13.20	1.78	2.14	11.2	100.0	2.65	2.76	0.26	1.28	4.38	1.05	3.40	1050.0
2	13.16	2.36	2.67	18.6	101.0	2.80	3.24	0.30	2.81	5.68	1.03	3.17	1185.0
3	14.37	1.95	2.50	16.8	113.0	3.85	3.49	0.24	2.18	7.80	0.86	3.45	1480.0
4	13.24	2.59	2.87	21.0	118.0	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735.0
5	14.20	1.76	2.45	15.2	112.0	3.27	3.39	0.34	1.97	6.75	1.05	2.85	1450.0
6	14.39	1.87	2.45	14.6	96.0	2.50	2.52	0.30	1.98	5.25	1.02	3.58	1290.0
7	14.06	2.15	2.61	17.6	121.0	2.60	2.51	0.31	1.25	5.05	1.06	3.58	1295.0
8	14.83	1.64	2.17	14.0	97.0	2.80	2.98	0.29	1.98	5.20	1.08	2.85	1045.0
9	13.86	1.35	2.27	16.0	98.0	2.98	3.15	0.22	1.85	7.22	1.01	3.55	1045.0
10	14.10	2.16	2.30	18.0	105.0	2.95	3.32	0.22	2.38	5.75	1.25	3.17	1510.0
11	14.12	1.48	2.32	16.8	95.0	2.20	2.43	0.26	1.57	5.00	1.17	2.82	1280.0
12	13.75	1.73	2.41	16.0	89.0	2.60	2.76	0.29	1.81	5.60	1.15	2.90	1320.0
13	14.75	1.73	2.39	11.4	91.0	3.10	3.69	0.43	2.81	5.40	1.25	2.73	1150.0
14	14.38	1.87	2.38	12.0	102.0	3.30	3.64	0.29	2.96	7.50	1.20	3.00	1547.0
15	13.63	1.81	2.70	17.2	112.0	2.85	2.91	0.30	1.46	7.30	1.28	2.88	1310.0
16	14.30	1.92	2.72	20.0	120.0	2.80	3.14	0.33	1.97	6.20	1.07	2.65	1280.0
17	13.83	1.57	2.62	20.0	115.0	2.95	3.40	0.40	1.72	6.60	1.13	2.57	1130.0
18	14.19	1.59	2.48	16.5	108.0	3.30	3.93	0.32	1.86	8.70	1.23	2.82	1680.0
19	13.64	3.10	2.56	15.2	116.0	2.70	3.03	0.17	1.66	5.10	0.96	3.36	845.0

wine_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   alcohol                       178 non-null    float64
 1   malic_acid                    178 non-null    float64
 2   ash                           178 non-null    float64
 3   alcalinity_of_ash             178 non-null    float64
 4   magnesium                     178 non-null    float64
 5   total_phenols                 178 non-null    float64
 6   flavanoids                    178 non-null    float64
 7   nonflavanoid_phenols          178 non-null    float64
 8   proanthocyanins               178 non-null    float64
 9   color_intensity               178 non-null    float64
 10  hue                           178 non-null    float64
 11  od280/od315_of_diluted_wines  178 non-null    float64
 12  proline                       178 non-null    float64
 13  target                        178 non-null    int64  
dtypes: float64(13), int64(1)
memory usage: 19.6 KB

wine_df.describe()

	alcohol	malic_acid	ash	alcalinity_of_ash	magnesium	total_phenols	flavanoids	nonflavanoid_phenols	proanthocyanins	color_intensity	hue	od280/od315_of_diluted_wines	proline	target
count	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000
mean	13.000618	2.336348	2.366517	19.494944	99.741573	2.295112	2.029270	0.361854	1.590899	5.058090	0.957449	2.611685	746.893258	0.938202
std	0.811827	1.117146	0.274344	3.339564	14.282484	0.625851	0.998859	0.124453	0.572359	2.318286	0.228572	0.709990	314.907474	0.775035
min	11.030000	0.740000	1.360000	10.600000	70.000000	0.980000	0.340000	0.130000	0.410000	1.280000	0.480000	1.270000	278.000000	0.000000
25%	12.362500	1.602500	2.210000	17.200000	88.000000	1.742500	1.205000	0.270000	1.250000	3.220000	0.782500	1.937500	500.500000	0.000000
50%	13.050000	1.865000	2.360000	19.500000	98.000000	2.355000	2.135000	0.340000	1.555000	4.690000	0.965000	2.780000	673.500000	1.000000
75%	13.677500	3.082500	2.557500	21.500000	107.000000	2.800000	2.875000	0.437500	1.950000	6.200000	1.120000	3.170000	985.000000	2.000000
max	14.830000	5.800000	3.230000	30.000000	162.000000	3.880000	5.080000	0.660000	3.580000	13.000000	1.710000	4.000000	1680.000000	2.000000

このDataFrameを自分で操作して、データがどのような形なのかを確認してください。
データは全部で178個、3クラス。 irisと異なりクラスごとのデータの数はバラバラです。
確認してみましょう。

for i in range(3):
    n_items = (wine_df["target"] == i).sum()
    print(f"クラス{i}のデータ数: {n_items}個")

クラス0のデータ数: 59個
クラス1のデータ数: 71個
クラス2のデータ数: 48個

ちなみに、クラスを基準に可視化するとこのようになります。

px.scatter_matrix(wine_df, 
                  dimensions=wine_df.columns[:5], 
                  color=wine_df["target"],
                  height=700,width=800,
                  title="一部の説明変数ペアを使ったクラスの確認"
                 )

train_test_splitはdataframeを渡すこともできます。確認してみましょう。

X_train,X_test, y_train, y_test = train_test_split(wine_df.iloc[:,:-1], # 分割したいデータを列挙
                                                   wine_df.target, # 分割したいデータを列挙
                                                   test_size=0.3, # テストデータの割合
                                                   stratify=wine_df.target, # 層化サンプリングの指針になるlabelを指定
                                                   random_state=SEED # 乱数シードの設定
                                                  )

print(type(X_train))
X_train.head()

<class 'pandas.core.frame.DataFrame'>

	alcohol	malic_acid	ash	alcalinity_of_ash	magnesium	total_phenols	flavanoids	nonflavanoid_phenols	proanthocyanins	color_intensity	hue	od280/od315_of_diluted_wines	proline
118	12.77	3.43	1.98	16.0	80.0	1.63	1.25	0.43	0.83	3.40	0.70	2.12	372.0
47	13.90	1.68	2.12	16.0	101.0	3.10	3.39	0.21	2.14	6.10	0.91	3.33	985.0
14	14.38	1.87	2.38	12.0	102.0	3.30	3.64	0.29	2.96	7.50	1.20	3.00	1547.0
37	13.05	1.65	2.55	18.0	98.0	2.45	2.43	0.29	1.44	4.25	1.12	2.51	1105.0
174	13.40	3.91	2.48	23.0	102.0	1.80	0.75	0.43	1.41	7.30	0.70	1.56	750.0

print(type(X_test))
X_test.head()

<class 'pandas.core.frame.DataFrame'>

	alcohol	malic_acid	ash	alcalinity_of_ash	magnesium	total_phenols	flavanoids	nonflavanoid_phenols	proanthocyanins	color_intensity	hue	od280/od315_of_diluted_wines	proline
115	11.03	1.51	2.20	21.5	85.0	2.46	2.17	0.52	2.01	1.900000	1.71	2.87	407.0
158	14.34	1.68	2.70	25.0	98.0	2.80	1.31	0.53	2.70	13.000000	0.57	1.96	660.0
171	12.77	2.39	2.28	19.5	86.0	1.39	0.51	0.48	0.64	9.899999	0.57	1.63	470.0
18	14.19	1.59	2.48	16.5	108.0	3.30	3.93	0.32	1.86	8.700000	1.23	2.82	1680.0
108	12.22	1.29	1.94	19.0	92.0	2.36	2.04	0.39	2.08	2.700000	0.86	3.02	312.0

print(type(y_train))
y_train.head()

<class 'pandas.core.series.Series'>

  1
   0
   0
   0
  2
Name: target, dtype: int64

print(type(y_test))
y_test.head()

<class 'pandas.core.series.Series'>

  1
  2
  2
   0
  1
Name: target, dtype: int64

sklearnはDataFrameを入力に受け取っても正しく動作するので、今回はこのまま利用していくことにしましょう。

データの前処理#

それではsklearnのk-Meansを利用してみましょう。

クラスからインスタンスを生成
fitで訓練
各種評価指標でクラスタリングの性能を評価

この流れにそって、実験を行います。

しかしその前に、データの前処理を行います。

参考: https://www.rm48.net/post/python-k-meansクラスタリングをするときはデータの規格化、標準化をする

scaler =  StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)

sklearnによる訓練#

K-Meansクラスの使い方は?を使って確認しておきます。

>>>KMeans?

Init signature:
KMeans(
    n_clusters=8, # クラスタの数
    *,
    init='k-means++', # 初期化の方法。k-means++は非常に優秀な初期化方法です。ナイーブに実装した場合の初期化は"randam"が相当します。
    n_init=10, # k-meansの性能は初期値に大きく影響されます。sklearnは異なるseedで初期化をn_init回行い、最も良いものを選択します。
    max_iter=300, # 最大更新回数
    tol=0.0001, # 許容誤差, これ以下になれば最大更新回数に達していなくても学習を終了します。
    verbose=0,  # 詳しい解析結果を表示するなら1、何も表示しないならば0
    random_state=None, # 擬似乱数生成器に渡すSEED
    copy_x=True, # 
    algorithm='auto',
)

algorithm

{“lloyd”、”elkan”、”auto”、”full”}、default =”lloyd”
使用するK-meansのアルゴリズムを設定します。
古典的なEM式アルゴリズムは “lloyd “ですが、”elkan “は三角形の不等式を用いることにより、クラスタが明確に定義されたデータセットではより効率的となることがあります。しかし、形状の配列(n_samples, n_clusters) を追加で割り当てるため、よりメモリを消費します。

“auto”と “full”は非推奨で、Scikit-Learn 1.3 で削除される予定です。これらは両方とも “lloyd” のエイリアスです。

それではインスタンスを生成します。

skmodel = KMeans(3, random_state=SEED)

次に、fitの使い方を確認してから、実行しましょう。

skmodel.fit?

skmodel.fit(scaled_X_train)

KMeans(n_clusters=3, random_state=2022)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

結果の可視化#

クラスタリングの結果を散布図にして確認してみます。
ここでは、説明変数が多すぎるので幾つかのペアを取り出して可視化を行います。

result_df = X_train.iloc[:,:5].copy()
result_df["cluster_index"] = skmodel.predict(scaled_X_train)
result_df.head()

	alcohol	malic_acid	ash	alcalinity_of_ash	magnesium	cluster_index
118	12.77	3.43	1.98	16.0	80.0	2
47	13.90	1.68	2.12	16.0	101.0	0
14	14.38	1.87	2.38	12.0	102.0	0
37	13.05	1.65	2.55	18.0	98.0	0
174	13.40	3.91	2.48	23.0	102.0	2

px.scatter_matrix(result_df, 
                  dimensions=result_df.columns[:4], 
                  color="cluster_index",
                  height=700,width=800,
                  title="一部の説明変数ペアを使ったクラスタの確認"
                 )

割り当てたクラスタが色に対応するような可視化を行いました。黄色クラスタとそれ以外に関しては比較的わかりやすい分かれ方をしていますが、青と紫に関してはかぶりが多いようです。

エルボー法によるクラスタ数の決定#

もっともらしいクラスタの数はどうやって決めれば良いのでしょうか。ここでは、エルボー法という最もシンプルな方法でのクラスタ数決定に挑戦します。

エルボー法
Kの値を変えながら、SSE(Sum of Squared errors of prediction, クラスタ内誤差平方和)をプロットする。SSEが急激に減少した点（Elbow（エルボー、肘）のように折れ線がカクッと折れ曲がる点）がより良いクラスタ数である。

SSEはデータ点からクラスタ重心までの距離の合計を凝集度とします。
それを全クラスタに関して計算し足し合わせます。
足し合わせた値を歪み(distortion)と言っているところが多いです。

sklearnではSSEをself.inertia_として持っています。これを使ってエルボー法を試します。

score_df = {"num_cluster":[], "cluster inertia (SSE)":[]}
for i in range(2,20):
    skmodel = KMeans(n_clusters=i, random_state=SEED)
    skmodel.fit(scaled_X_train)
    score_df["num_cluster"] += [i]
    score_df["cluster inertia (SSE)"] += [skmodel.inertia_]

px.line(score_df, 
        x="num_cluster", y="cluster inertia (SSE)",
        title="エルボー法: クラスタ数とクラスタ内誤差平方和（SSE）の関係",
       )

ここではk=3が良さそうだとわかります。

skmodel.transform(scaled_X_test)

array([[6.55874475, 4.18748589, 7.48529467, ..., 4.80866115, 6.61572635,
        5.26553473],
       [6.45395404, 6.99780301, 5.46578485, ..., 6.74317749, 6.1764118 ,
        6.96826739],
       [6.65194357, 5.30732135, 3.08178342, ..., 6.36043976, 2.87516042,
        7.43434015],
       ...,
       [2.53297739, 3.61712581, 5.50516809, ..., 3.90846158, 4.34940304,
        4.72671423],
       [1.27874731, 4.93845711, 6.44699755, ..., 4.09845814, 5.83402466,
        4.68153257],
       [6.07682719, 5.73098528, 3.90684894, ..., 5.94310883, 4.07570302,
        6.61594602]])

k-MeansのNumpyによる実装#

このセクションではk-NNをNumPyを使って実装してもらいます。

k-Meansクラス実装のヒント#

k-Meansクラスの雛形を示します。

class KMeansClustering():
    def __init__(self,k, max_iter, delta, seed=1000):
        """初期化

        Parameters
        ----------
        k : int, optional
            クラスタ数, by default 3
        max_iter : int, optional
            最大更新回数, by default 100
        delta : float, optional
            許容誤差, by default 1e-15
        seed : int, optional
            疑似乱数シード, by default None.
            Noneなら[0,2**20)の範囲でランダムにSEEDを生成
        """
        ...
        self.is_fitted = False
        
        
    def fit(self, X,y=None):
        self.inertia_ = ... # クラスタ内SSEを保存しましょう。
        ...
        return self
    
    def predict(self,X:np.ndarray)->np.ndarray:
        # 入力データに対してクラスタを割り当てる
        pred_labels = ...
        return pred_labels
    
    def transform(self, X:np.ndarray)->np.ndarray:
        """入力データとすべてのクラスタ重心との距離を計算し、行列として返す

        Parameters
        ----------
        X : np.ndarray
            入力データ Shape(n_data, n_features)

        Returns
        -------
        embedded
            Shape(n_data, self.k)
        """
        embedded = ...
        return embedded

[基礎]以下の要点を満たすように、`KMeansClustering`クラスを修正してk-Meansを実装してください。#

__init__について

docstringを参考に引数の表記を修正 (ex. seed:int=1000)
すべての引数を同様の名前のインスタンス変数として束縛
クラス内からはself.rngの名前でnumpyの乱数生成器インスタンスにアクセスできるようにする

fitについて

クラスタ内SSEをself.inertia_に保存する
データ点と各クラスタの暫定重心とのユークリッド距離を基準に訓練を行う
t時点の重心とt-1時点の重心のユークリッド距離がself.delta以下になったら訓練終了

transformについて

Xのdtypeとembeddedのdtypeを揃える
このメソッドはfitが実行されていないのに呼び出された場合、警告する
このメソッドの入力Xがfitの入力Xと異なる次元数（特徴数）である場合、警告する

KMeansClustering クラスの実装例#

上の問題の回答にあたる，KMeansClusteringクラスの実装例を示します．

Show code cell source Hide code cell source

class KMeansClustering():
    def __init__(self,k:int=3,max_epochs:int=100, n_init:int=1, delta:float=1e-15, seed:int|None=None):
        """初期化

        Parameters
        ----------
        k : int, optional
            クラスタ数, by default 3
        max_epochs : int, optional
            最大更新回数, by default 100
        n_init : int, optional
            初期インスタンス生成数, by default 1
        delta : float, optional
            許容誤差, by default 1e-15
        seed : int, optional
            疑似乱数シード, by default None
        """
        self.k = k 
        self.max_epochs = max_epochs
        self.delta = delta # Tolerance for stopping criterion. 
        self.seed = seed if isinstance(seed, int) else np.random.randint(2**20)
        self.rng = np.random.default_rng(self.seed)
        self.n_init=n_init
        
    def fit(self, X,y=None):
        num_data, num_features = X.shape
        
        # initialize centroids
        """
        self.centroids = np.zeros((self.k, num_features))
        for i in range(num_features):
            max_value = np.max(X[:,i])
            min_value = np.min(X[:,i])
            centroid = self.rng.uniform(min_value,max_value, size=(self.k,))
            self.centroids[:,i] = centroid
        """
        # initialize labels
        labels = self.rng.integers(self.k, size=(num_data))
        
        # initialize centroids
        centroids = []
        for label in range(self.k):
            current_members = X[labels == label]
            new_centroid = current_members.mean(axis=0)
            centroids.append(new_centroid)
        self.centroids = np.vstack(centroids)
        self._labels = [labels]
        # initialize labels  
        labels = self.predict(X)
        self._labels.append(labels)
        
        for epoch in range(self.max_epochs):
            delta = np.zeros(self.k)
            # update labels  
            labels = self.predict(X)
            self._labels.append(labels)
            
            # update centroids
            for label in range(self.k):
                mask = labels == label
                assert mask.sum()!=0, f"更新{epoch}回目でクラスタ{label}が消滅した. {mask.sum()}"
                
                current_members = X[mask]
                new_centroid = current_members.mean(axis=0)
                delta[label] = np.linalg.norm(self.centroids[label] - new_centroid)
                self.centroids[label] = new_centroid
                
            # breaking condition
            if self.delta >= delta.mean():
                #print(f"breaking at epoch-{epoch}")
                #labels = self.predict(X)
                #self._labels.append(labels)
                break
                
        return self
    
    def predict(self,X):
        # update labels 
        labels = np.zeros(X.shape[0], dtype=int)
        for index, x in enumerate(X):
            distance = np.linalg.norm(self.centroids - x, axis=1)
            labels[index] = np.argmin(distance)
        return labels
    
    def transform(self, X):
        embed = []
        for index, x in enumerate(X):
            distance = np.linalg.norm(self.centroids - x, axis=1)
            embed.append(distance)
        
        embed = np.concatenate(embed)
        return embed
    
# ------ 訓練と推論の実行
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X = scaler.transform(X_train)

model = KMeansClustering(k=3)
model.fit(X)

print("予測したクラスタ",model.predict(X))


# ------ 学習中のクラスタの変化を可視化

from sklearn.decomposition import PCA

pca = PCA(2)
X_embed = pca.fit_transform(X)

tmp = []
for i,l in enumerate(model._labels):
    df= pd.DataFrame(X_embed, columns=["PC1", "PC2"])
    #df= pd.DataFrame(X_train, columns=iris.feature_names)
    df["cluster_index"] = l
    df["epoch"] = i
    tmp.append(df)

df = pd.concat(tmp)

#px.scatter(df, x='sepal length (cm)', y='sepal width (cm)',color="cluster_index",
px.scatter(df, x='PC1', y='PC2',color="cluster_index",
           width=700, height=600, 
           animation_frame="epoch"
          )

予測したクラスタ [0 2 2 2 0 2 0 0 0 2 2 1 2 1 0 1 1 1 0 1 2 0 1 1 1 1 0 2 1 2 0 1 2 0 2 1 2
2 1 0 0 0 0 1 1 2 0 2 0 2 1 0 1 0 0 0 1 0 1 0 2 2 1 1 2 0 2 2 2 1 1 1 2
0 1 2 2 1 2 2 1 0 2 2 0 1 1 2 2 1 1 2 2 2 2 0 0 0 1 2 0 1 1 0 0 2 2 1 2
2 1 0 2 1 1 2 1 1 1 1 2]

関数として実装した例を以下に示します．

[基礎] KMeansClustering`を使って、iris datasetに適切なkをエルボー法で求めよ#

pass

[発展] クラスタリングの評価指標として利用されるPurityについて調べて、関数として実装せよ。またK=2からk=20までのpurityを計算し、折れ線グラフで示せ。#

純度(Purity)はクラスター内に含まれる教師ラベルの内，最もクラスター内に多くのインスタンス数を持つものの割合の重み付き和のこと。

pass

[発展]シルエット分析について調べ、これを実装せよ。またK=3,10の場合の結果を示せ。#

pass

https://hkawabata.github.io/technical-note/note/ML/k-means.html