A Complete Guide to Ensemble Learning: Titanic Survival Prediction, from Random Forest to XGBoost

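What Is an Ensemble?

Part 5 compared individual classification algorithms. This post covers ensemble learning, which combines multiple models into a stronger predictor.

"Three people together have the wisdom of Manjushri." Combining several weak models can yield a strong one, as long as their errors do not all point the same way.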

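Method	Core idea	Characteristics
Bagging	Train multiple models in parallel → majority vote / average	Reduces variance, mitigates overfitting
Boosting	Train models sequentially → each one corrects the previous models' mistakes	Reduces bias, pushes accuracy higher
Stacking	Use the predictions of several models as inputs to a new model	Combines the strengths of different models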
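
Stacking, the third row above, can be sketched with scikit-learn's `StackingClassifier`. The base models, the logistic-regression meta-model, and the `make_moons` data are illustrative assumptions, not choices made in this post.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic two-class data (assumption for illustration)
X_demo, y_demo = make_moons(n_samples=300, noise=0.3, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svc", SVC(kernel="rbf", random_state=0)),
    ],
    # The meta-model is trained on out-of-fold predictions of the base models
    final_estimator=LogisticRegression(),
    cv=5,
)

stack_scores = cross_val_score(stack, X_demo, y_demo, cv=5)
print(f"Stacking CV accuracy: {stack_scores.mean():.4f}")
```

The inner `cv=5` matters: the meta-model only sees predictions made on data the base models were not trained on, which keeps it from simply learning to trust an overfit base model.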

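1. Bagging (Bootstrap Aggregating)

Core principle

Create multiple datasets from the original data by random sampling with replacement (bootstrap)
Train an independent model on each dataset
At prediction time, decide the final result by majority vote (classification) or averaging (regression)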

                 +-- Sample 1 --> Model 1 --+
                 |                          |
Original data ---+-- Sample 2 --> Model 2 --+--> Vote/average --> Final prediction
                 |                          |
                 +-- Sample 3 --> Model 3 --+
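
The three steps can be sketched by hand before reaching for a library class. This is a minimal illustration, assuming binary 0/1 labels, fully grown decision trees as the base model, and synthetic `make_moons` data; none of these choices come from the original text.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_bag, y_bag = make_moons(n_samples=500, noise=0.3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bag, y_bag, test_size=0.3, random_state=42
)

rng = np.random.default_rng(42)
n_models = 25  # odd, so a 0/1 majority vote can never tie
all_preds = []

for _ in range(n_models):
    # Step 1: bootstrap sample (same size as the training set, with replacement)
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Step 2: train an independent model on that sample
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    all_preds.append(tree.predict(X_te))

all_preds = np.array(all_preds)  # shape: (n_models, n_test)

# Step 3: majority vote across models
vote = (all_preds.mean(axis=0) > 0.5).astype(int)

single_acc = (all_preds[0] == y_te).mean()
bagged_acc = (vote == y_te).mean()
print(f"First tree alone: {single_acc:.4f}")
print(f"Bagged vote     : {bagged_acc:.4f}")
```

scikit-learn's `BaggingClassifier` wraps the same loop; the manual version is only meant to make the three steps visible.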

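Why does it work?

Individual models may overfit, but because each is trained on different data, they overfit in different directions
Averaging cancels out much of the individual models' overfitting
This makes bagging effective at reducing variance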
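
The variance argument can be made concrete under a simplifying assumption that is not in the original text: suppose the $$B$$ models $$f_1, \dots, f_B$$ each have variance $$\sigma^2$$ and every pair has correlation $$\rho$$. Then the variance of their average is

$$\mathrm{Var}\left(\frac{1}{B}\sum_{b=1}^{B} f_b(x)\right) = \frac{1}{B^2}\left(B\sigma^2 + B(B-1)\rho\sigma^2\right) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$$

The second term shrinks as more models are added, while the first depends only on how correlated the models are. Adding models helps up to a point; beyond that, the remaining variance can only be reduced by making the models less alike.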

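2. Random Forest

An algorithm that combines bagging with decision trees and adds random feature selection.

Core idea

Train multiple trees on bootstrap samples (bagging)
At each split, consider only a random subset of the features rather than all of them
This increases the **diversity** among the trees

Randomizing the feature selection as well makes the trees differ more from one another. It prevents the situation where "every tree splits on the income feature alone" because one feature dominates.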
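
One rough way to look at diversity is to measure how often two trees in the same forest disagree on held-out samples. The sketch below compares `max_features=None` (every split sees all features, i.e. plain bagging of trees) with `max_features="sqrt"`. The helper `mean_disagreement` and the synthetic dataset are hypothetical, introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (assumption for illustration; not the article's dataset)
X_div, y_div = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_div, y_div, test_size=0.3, random_state=0
)

def mean_disagreement(forest, X):
    """Average fraction of samples on which two trees in the forest disagree."""
    preds = np.array([tree.predict(X) for tree in forest.estimators_])
    n = len(preds)
    rates = [
        (preds[i] != preds[j]).mean()
        for i in range(n) for j in range(i + 1, n)
    ]
    return float(np.mean(rates))

disagreement = {}
for max_features in [None, "sqrt"]:  # None = all features at every split
    forest = RandomForestClassifier(
        n_estimators=30, max_features=max_features, random_state=0
    ).fit(X_fit, y_fit)
    disagreement[str(max_features)] = mean_disagreement(forest, X_hold)
    print(f"max_features={max_features}: "
          f"mean pairwise disagreement = {disagreement[str(max_features)]:.3f}")
```

A higher disagreement rate suggests more diverse trees; the exact numbers depend on the data and the random seed, so treat them as a qualitative check rather than a benchmark.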

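Key hyperparameters

Parameter	Meaning	Typical values
`n_estimators`	Number of trees	100-500
`max_depth`	Maximum tree depth	None (unlimited) or 10-30
`max_features`	Number of features considered at each split	`'sqrt'` (classification), `'log2'`
`min_samples_split`	Minimum samples required to split a node	2-10
`min_samples_leaf`	Minimum samples required at a leaf	1-5

Python practice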

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

# How performance changes with the number of trees
n_trees = [1, 5, 10, 50, 100, 200]
scores = []

for n in n_trees:
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    cv_score = cross_val_score(model, X, y, cv=5).mean()
    scores.append(cv_score)
    print(f"{n:>3} trees: CV accuracy = {cv_score:.4f}")

plt.plot(n_trees, scores, 'o-')
plt.xlabel('Number of trees')
plt.ylabel('CV accuracy')
plt.title('Random forest: performance vs. number of trees')
plt.grid(True, alpha=0.3)
plt.show()

# Decision boundary comparison: single tree vs. random forest
from sklearn.tree import DecisionTreeClassifier

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
models = [
    ('Decision tree (1 tree)', DecisionTreeClassifier(random_state=42)),
    ('Random forest (100 trees)', RandomForestClassifier(n_estimators=100, random_state=42)),
]

for ax, (name, model) in zip(axes, models):
    model.fit(X, y)
    xx, yy = np.meshgrid(
        np.linspace(X[:, 0].min()-0.5, X[:, 0].max()+0.5, 200),
        np.linspace(X[:, 1].min()-0.5, X[:, 1].max()+0.5, 200)
    )
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu', edgecolors='black', s=15)
    # Scored on the same data the model was fit on, so this is training accuracy
    ax.set_title(f'{name}\nTraining accuracy: {model.score(X, y):.4f}')

plt.tight_layout()
plt.show()

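3. Boosting

Core principle

Models are trained sequentially, and each new model focuses more on the data the previous models got wrong.

Data → Model 1 → increase weights on the mistakes
               → Model 2 → increase weights on the mistakes
                         → Model 3 → ...
Final prediction = weighted sum of all models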
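
The reweighting loop in the diagram can be sketched as a bare-bones discrete AdaBoost. Assumptions for this illustration: labels recoded to -1/+1, depth-1 decision trees (stumps) as weak learners, and synthetic `make_moons` data. scikit-learn's `AdaBoostClassifier` is the practical choice; this version only exposes the mechanics.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_ada, y01 = make_moons(n_samples=400, noise=0.25, random_state=1)
y_pm = 2 * y01 - 1  # recode labels from {0, 1} to {-1, +1}

n_rounds = 20
w = np.full(len(X_ada), 1 / len(X_ada))  # start with uniform sample weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1).fit(X_ada, y_pm, sample_weight=w)
    pred = stump.predict(X_ada)

    # Weighted error of this round's model (clipped to avoid log(0))
    err = np.clip(w[pred != y_pm].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # this model's say in the final vote

    # Misclassified samples (y * pred = -1) gain weight, correct ones lose it
    w *= np.exp(-alpha * y_pm * pred)
    w /= w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction = sign of the weighted sum of all models
F = sum(a * s.predict(X_ada) for a, s in zip(alphas, stumps))
ensemble_acc = (np.sign(F) == y_pm).mean()
first_acc = (stumps[0].predict(X_ada) == y_pm).mean()
print(f"First stump alone  : {first_acc:.4f}")
print(f"{n_rounds}-round ensemble: {ensemble_acc:.4f}")
```

Both accuracies are measured on the training data, so they show how boosting reduces bias, not how well the ensemble generalizes.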

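Comparison of boosting variants

Algorithm	Key feature	Speed	Strengths
AdaBoost	Increases the weights of misclassified samples	Slow	Simple and intuitive
Gradient Boosting	Each new model fits the residuals of the previous ones	Slow	Flexible choice of loss function
XGBoost	Regularization + parallel processing + missing-value handling	Fast	Frequently used in winning Kaggle solutions
LightGBM	Leaf-wise growth + histogram-based splits	Very fast	Well suited to large datasets
CatBoost	Automatic handling of categorical features	Fast	Strong on categorical data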

4. XGBoost (eXtreme Gradient Boosting)

Core idea

An algorithm that adds regularization and system-level optimizations to gradient boosting.

$$\mathrm{Obj} = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

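$$L$$: loss function (prediction error)
$$\Omega$$: regularization term (controls tree complexity)
Each round adds a new tree that fits the residuals of the current ensemble (more generally, the gradient of the loss)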
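
The "fit the residuals each round" step can be sketched for the simplest case: regression with squared-error loss, where the residual is exactly the negative gradient. This is plain gradient boosting on synthetic data (an assumption for illustration); it omits the regularization term $$\Omega$$ and the second-order information that XGBoost adds.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (assumption for illustration)
rng = np.random.default_rng(0)
X_gb = rng.uniform(-3, 3, size=(300, 1))
y_gb = np.sin(X_gb[:, 0]) + rng.normal(0, 0.1, size=300)

learning_rate = 0.1
n_rounds = 100

pred = np.full_like(y_gb, y_gb.mean())  # round 0: predict the mean everywhere
mse_history = [np.mean((y_gb - pred) ** 2)]

for _ in range(n_rounds):
    residual = y_gb - pred  # negative gradient of 0.5 * (y - pred)^2
    tree = DecisionTreeRegressor(max_depth=3).fit(X_gb, residual)
    pred += learning_rate * tree.predict(X_gb)  # shrink each tree's contribution
    mse_history.append(np.mean((y_gb - pred) ** 2))

print(f"Training MSE, round 0  : {mse_history[0]:.4f}")
print(f"Training MSE, round {n_rounds}: {mse_history[-1]:.4f}")
```

The `learning_rate` here plays the same role as the `learning_rate` hyperparameter in the table that follows: smaller values need more rounds but make each step more conservative.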

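Key hyperparameters

Parameter	Meaning	Typical values
`n_estimators`	Number of boosting rounds	100-1000
`learning_rate`	Learning rate (contribution of each tree)	0.01-0.3
`max_depth`	Maximum tree depth	3-10
`subsample`	Fraction of rows used for each tree	0.7-1.0
`colsample_bytree`	Fraction of features used for each tree	0.7-1.0
`reg_alpha`	L1 regularization	0-1
`reg_lambda`	L2 regularization	1-10

Python practice

# Install: pip install xgboost lightgbm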
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_redundant=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train XGBoost
xgb = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    eval_metric='logloss'
)
xgb.fit(X_train, y_train)

print(f"XGBoost accuracy: {xgb.score(X_test, y_test):.4f}")
print(f"Mean CV accuracy: {cross_val_score(xgb, X, y, cv=5).mean():.4f}")

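5. LightGBM

Differences from XGBoost

Feature	XGBoost	LightGBM
Tree growth	Level-wise (horizontal)	Leaf-wise (vertical)
Speed	Fast	Very fast (roughly 2-10x, depending on data and settings)
Memory	Moderate	Low (histogram-based)
Large-scale data	Good	Very good
Overfitting risk	Moderate	Somewhat higher because of leaf-wise growth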

from lightgbm import LGBMClassifier

lgbm = LGBMClassifier(
    n_estimators=100,
    learning_rate=0.1,
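    max_depth=-1,  # no limit (complexity is controlled by num_leaves)
    num_leaves=31,
    subsample=0.8,
    subsample_freq=1,  # row subsampling only takes effect when subsample_freq > 0
    colsample_bytree=0.8,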
    random_state=42,
    verbose=-1
)
lgbm.fit(X_train, y_train)

print(f"LightGBM accuracy: {lgbm.score(X_test, y_test):.4f}")

6. Hyperparameter Tuning

GridSearchCV

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
}

grid = GridSearchCV(
    XGBClassifier(random_state=42, eval_metric='logloss'),
    param_grid, cv=5,
    scoring='accuracy', n_jobs=-1
)
grid.fit(X_train, y_train)

print(f"Best parameters: {grid.best_params_}")
print(f"Best CV accuracy: {grid.best_score_:.4f}")
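
The cost of an exhaustive grid grows multiplicatively with each parameter added. A small sketch using scikit-learn's `ParameterGrid` counts the model fits for a grid of the same shape as the one above (`param_grid_demo` is a copy made for illustration):

```python
from sklearn.model_selection import ParameterGrid

param_grid_demo = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
}

n_candidates = len(ParameterGrid(param_grid_demo))  # 3 * 3 * 3 combinations
n_fits = n_candidates * 5                           # each evaluated with cv=5

print(f"Candidates: {n_candidates}, cross-validation fits: {n_fits}")
# Candidates: 27, cross-validation fits: 135
```

Adding a fourth parameter with two values would double this to 270 fits, which is the main motivation for search methods that sample the space instead of enumerating it.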

Optuna (a more efficient search)

# Install: pip install optuna
import optuna

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10.0, log=True),
    }

    model = XGBClassifier(**params, random_state=42, eval_metric='logloss')
    score = cross_val_score(model, X_train, y_train, cv=5).mean()
    return score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50, show_progress_bar=True)

print(f"Best parameters: {study.best_params}")
print(f"Best CV accuracy: {study.best_value:.4f}")

7. SHAP: Interpreting Black-Box Models

Part 1 described random forest and XGBoost as black-box models. With **SHAP (SHapley Additive exPlanations)**, you can interpret how much each feature contributed to a prediction.

# Install: pip install shap
import shap

# SHAP analysis based on the XGBoost model
explainer = shap.TreeExplainer(xgb)
shap_values = explainer.shap_values(X_test)

# Global feature importance
shap.summary_plot(shap_values, X_test, plot_type="bar", show=False)
plt.title('SHAP feature importance')
plt.tight_layout()
plt.show()

# Explain an individual prediction (first sample)
shap.force_plot(explainer.expected_value, shap_values[0], X_test[0],
                matplotlib=True, show=False)
plt.title('Individual prediction explanation (sample 1)')
plt.tight_layout()
plt.show()
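
SHAP builds on Shapley values from cooperative game theory: a feature's contribution is its average marginal effect over all subsets of the other features. The brute-force sketch below computes them for a hypothetical three-feature toy "model" (`toy_value` is invented for illustration). The `shap` library uses much faster tree-specific algorithms; this only shows the definition.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every subset (exponential cost)."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight of a subset of this size in the Shapley average
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                present = set(subset)
                marginal = value_fn(present | {i}) - value_fn(present)
                phi[i] += weight * marginal
    return phi

def toy_value(present):
    """Hypothetical model output given the set of 'present' feature indices."""
    v = 0.0
    if 0 in present:
        v += 10
    if 1 in present:
        v += 4
    if 0 in present and 1 in present:
        v += 6  # interaction between features 0 and 1
    return v    # feature 2 never changes the output

phi = shapley_values(toy_value, 3)
print(phi)  # approximately [13.0, 7.0, 0.0]
```

The interaction bonus of 6 is split evenly between features 0 and 1, the irrelevant feature 2 gets zero, and the values sum to the difference between the full output (20) and the empty-set output (0). That additivity is what lets SHAP plots decompose a single prediction.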

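8. Toy Problem: Predicting Titanic Survivors

This is the best-known introductory problem on Kaggle. It gives hands-on experience with missing values, categorical variables, and feature engineering on real data.

8-1. Loading and Exploring the Data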

import pandas as pd
import seaborn as sns

# Titanic dataset bundled with seaborn
df = sns.load_dataset('titanic')
print(f"Data shape: {df.shape}")
print()
print(df.head())
print()
print("=== Missing values ===")
print(df.isnull().sum()[df.isnull().sum() > 0])
print()
print(f"Survival rate: {df['survived'].mean():.2%}")

8-2. Exploratory Data Analysis (EDA)

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Survival rate by sex
sns.barplot(data=df, x='sex', y='survived', ax=axes[0, 0])
axes[0, 0].set_title('Survival rate by sex')

# Survival rate by passenger class
sns.barplot(data=df, x='pclass', y='survived', ax=axes[0, 1])
axes[0, 1].set_title('Survival rate by passenger class')

# Age distribution
sns.histplot(data=df, x='age', hue='survived', bins=30, ax=axes[0, 2])
axes[0, 2].set_title('Age distribution by survival')

# Fare distribution
sns.boxplot(data=df, x='survived', y='fare', ax=axes[1, 0])
axes[1, 0].set_title('Fare distribution by survival')

# Survival rate by port of embarkation
sns.barplot(data=df, x='embark_town', y='survived', ax=axes[1, 1])
axes[1, 1].set_title('Survival rate by port of embarkation')

# Survival rate by family size (passenger included)
df['family_size'] = df['sibsp'] + df['parch'] + 1
sns.barplot(data=df, x='family_size', y='survived', ax=axes[1, 2])
axes[1, 2].set_title('Survival rate by family size')

plt.tight_layout()
plt.show()

8-3. Feature Engineering and Preprocessing

from sklearn.preprocessing import LabelEncoder

def preprocess_titanic(df):
    data = df.copy()

    # Feature engineering
    data['family_size'] = data['sibsp'] + data['parch'] + 1
    data['is_alone'] = (data['family_size'] == 1).astype(int)

    # The seaborn dataset has no name column to extract a title from,
    # so the 'who' column (man, woman, child) stands in for it
    data['title'] = data['who']

    # Fill missing values. Assign the result back instead of calling
    # fillna(inplace=True) on a column, which newer pandas versions warn about
    data['age'] = data['age'].fillna(data['age'].median())
    data['fare'] = data['fare'].fillna(data['fare'].median())
    data['embark_town'] = data['embark_town'].fillna(data['embark_town'].mode()[0])

    # Categorical → numeric
    le = LabelEncoder()
    data['sex_encoded'] = le.fit_transform(data['sex'])
    data['embark_encoded'] = le.fit_transform(data['embark_town'])
    data['title_encoded'] = le.fit_transform(data['title'])

    # Select the features to use
    features = ['pclass', 'sex_encoded', 'age', 'fare', 'family_size',
                'is_alone', 'embark_encoded', 'title_encoded']

    return data[features], data['survived']

X, y = preprocess_titanic(df)
print(f"Number of features: {X.shape[1]}")
print(X.head())

8-4. Training and Comparing Models

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

from sklearn.pipeline import make_pipeline

# SVM and logistic regression need scaled inputs. Putting the scaler inside a
# pipeline refits it within every CV fold, so a fold's validation data never
# influences the scaling statistics.
models = {
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    'SVM': make_pipeline(StandardScaler(), SVC(kernel='rbf')),
    'Random Forest': RandomForestClassifier(n_estimators=200, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=200, random_state=42),
    'XGBoost': XGBClassifier(n_estimators=200, random_state=42, eval_metric='logloss'),
    'LightGBM': LGBMClassifier(n_estimators=200, random_state=42, verbose=-1),
}

print(f"{'Model':<22} {'Test accuracy':>14} {'CV mean (5-fold)':>18}")
print("-" * 58)

results = {}
for name, model in models.items():
    # Hold-out evaluation: fit on the training split, score on the test split
    model.fit(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    # 5-fold CV on the full dataset (cross_val_score clones and refits the model)
    cv_scores = cross_val_score(model, X, y, cv=5)

    results[name] = {'test': test_acc, 'cv_mean': cv_scores.mean(), 'cv_std': cv_scores.std()}
    print(f"{name:<22} {test_acc:>14.4f} {cv_scores.mean():>12.4f} ± {cv_scores.std():.4f}")

8-5. Tuning the Best Model

from sklearn.model_selection import GridSearchCV

# Tune XGBoost
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.8, 1.0],
}

grid = GridSearchCV(
    XGBClassifier(random_state=42, eval_metric='logloss'),
    param_grid, cv=5,
    scoring='accuracy', n_jobs=-1
)
grid.fit(X_train, y_train)

print(f"Best parameters: {grid.best_params_}")
print(f"Best CV accuracy: {grid.best_score_:.4f}")
print(f"Test accuracy: {grid.best_estimator_.score(X_test, y_test):.4f}")

8-6. Final Evaluation and Interpretation

from sklearn.metrics import classification_report, ConfusionMatrixDisplay

best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

# Confusion matrix
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred,
    display_labels=['Died', 'Survived'],
    cmap='Blues'
)
plt.title('XGBoost - Titanic survival prediction')
plt.show()

# Classification report
print(classification_report(y_test, y_pred, target_names=['Died', 'Survived']))

# Feature importance
feature_importance = pd.Series(
    best_model.feature_importances_,
    index=X.columns
).sort_values(ascending=True)

plt.figure(figsize=(8, 5))
feature_importance.plot(kind='barh')
plt.title('XGBoost feature importance')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()

# SHAP analysis
import shap

explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test)

shap.summary_plot(shap_values, X_test, feature_names=X.columns, show=False)
plt.title('SHAP Summary Plot')
plt.tight_layout()
plt.show()

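Interpretation: Sex (sex_encoded) is the most important feature, with women more likely to survive. Passenger class (pclass) and fare also matter. The "women and children first" rule and the priority given to upper-class passengers show up directly in the data.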

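Algorithm Selection Guide

Situation	Recommended algorithm
General tabular data	XGBoost or LightGBM
Large-scale data (1M+ rows)	LightGBM (speed advantage)
Data with many categorical features	CatBoost
When interpretability matters	Random forest + SHAP
Rapid prototyping	Random forest (low sensitivity to hyperparameters)
Kaggle competitions	Ensemble of XGBoost + LightGBM + CatBoost

The next post covers unsupervised learning, which discovers structure in data without ground-truth labels: clustering and dimensionality reduction.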
