A Complete Guide to Supervised Learning Regression Algorithms: From Theory to House Price Prediction

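What Is Regression?

The overview taxonomy in Part 1 split supervised learning into regression and classification. This post takes a deep dive into regression.

Regression is the task of predicting a continuous numeric value.

Problem type	Examples	Output
Regression	House price, stock price, and temperature prediction	Continuous value (3.5 million won, 25.3°C)
Classification	Spam detection, disease diagnosis	Category (spam/ham, positive/negative)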

1. Linear Regression

Core idea

The goal is to find the **line (or hyperplane)** that best fits the data.

$$\hat{y} = w_1x_1 + w_2x_2 + \cdots + w_nx_n + b$$

$$\hat{y}$$: predicted value
$$w_i$$: weight (how much each feature contributes)
$$b$$: bias (intercept)
$$x_i$$: input feature
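As a minimal sketch of the formula above, the prediction for many samples at once can be written as a matrix-vector product. The input values, weights, and bias below are made up purely for illustration.

```python
import numpy as np

# Hypothetical toy inputs: 3 samples, 2 features each.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -1.0])  # illustrative weights, one per feature
b = 2.0                    # illustrative bias

# Vectorized form of y_hat = w_1*x_1 + w_2*x_2 + b, applied to every row.
y_hat = X @ w + b
print(y_hat)  # values: 0.5, -0.5, -1.5
```

Each row of `X` is one sample, so `X @ w` computes the weighted sum for every sample in a single call.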

Loss function: mean squared error (MSE)

A function that measures how wrong the predictions are.

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

The goal is to find the $$w$$ and $$b$$ that minimize this MSE. Note that $$n$$ here counts the training samples, not the features.
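To see what the squaring does, here is a small sketch with two made-up error patterns that have the same total absolute error. Because each error is squared, a single large miss costs far more than several small ones.

```python
import numpy as np

y_true = np.zeros(4)  # made-up targets, all zero for simplicity

# Two hypothetical prediction sets with the same total absolute error (4).
pred_even = np.array([1.0, 1.0, 1.0, 1.0])   # four small misses
pred_spiky = np.array([0.0, 0.0, 0.0, 4.0])  # one large miss

mse_even = np.mean((y_true - pred_even) ** 2)    # (1 + 1 + 1 + 1) / 4
mse_spiky = np.mean((y_true - pred_spiky) ** 2)  # (0 + 0 + 0 + 16) / 4

print(f"even:  {mse_even:.1f}")   # 1.0
print(f"spiky: {mse_spiky:.1f}")  # 4.0
```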

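Minimization method 1: the normal equation

The optimal solution can be computed in closed form, in a single step. The formula below assumes $$\mathbf{X}$$ includes a column of ones, so that the bias $$b$$ is absorbed into $$\mathbf{w}$$.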

$$\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

Pros: computed in one step, with no iteration
Cons: inverting the matrix is slow when there are many features ($$O(n^3)$$ in the number of features)
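The normal equation can be sketched in a few lines of NumPy. The synthetic data (slope 3, intercept 5) is an assumption made for illustration, and the code solves the linear system with `np.linalg.solve` instead of forming the inverse explicitly, which is the usual numerically safer choice.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 0.5, size=200)

# Append a column of ones so the bias is learned as the last weight.
X_b = np.hstack([X, np.ones((X.shape[0], 1))])

# Solve (X^T X) w = X^T y rather than computing the inverse directly.
w = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
slope, intercept = w

print(f"slope: {slope:.3f}, intercept: {intercept:.3f}")  # close to 3 and 5
```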

Minimization method 2: gradient descent

Follow the gradient downhill in small steps until you reach the minimum.

$$w := w - \alpha \frac{\partial MSE}{\partial w}$$

$$\alpha$$: learning rate (how far to move per step)
If the learning rate is too large the updates diverge; if it is too small, convergence is slow
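The effect of the learning rate is easy to see on a one-dimensional toy function. The sketch below minimizes $$f(w) = w^2$$ (gradient $$2w$$) with three hypothetical learning rates. The function name and the chosen rates are illustrative assumptions.

```python
def run_gradient_descent(lr, steps=20, w0=1.0):
    """Minimize f(w) = w**2, whose gradient is 2*w, starting from w0."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w  # update rule: w := w - lr * gradient
    return w

w_good = run_gradient_descent(lr=0.1)     # each step multiplies w by 0.8
w_large = run_gradient_descent(lr=1.1)    # each step multiplies w by -1.2
w_small = run_gradient_descent(lr=0.001)  # each step multiplies w by 0.998

print(f"lr=0.1:   {w_good:.4f}")   # near 0: converged
print(f"lr=1.1:   {w_large:.4f}")  # magnitude grows: diverged
print(f"lr=0.001: {w_small:.4f}")  # still near 1: barely moved
```

With `lr=1.1` every step overshoots the minimum and lands farther away on the other side, which is what divergence looks like in practice.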

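# Gradient descent implemented from scratch
import numpy as np

def gradient_descent(X, y, lr=0.01, epochs=1000):
    """Fit y ≈ X @ w + b by minimizing MSE with batch gradient descent."""
    m, n = X.shape  # m samples, n features
    w = np.zeros(n)
    b = 0.0

    for _ in range(epochs):
        y_pred = X @ w + b
        error = y_pred - y

        # Compute the gradients of MSE with respect to w and b
        dw = (2/m) * (X.T @ error)
        db = (2/m) * np.sum(error)

        # Update the weights
        w -= lr * dw
        b -= lr * db

    return w, b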

Python practice: simple linear regression

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Generate data (y = 3x + 5 + noise)
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 3 * X.squeeze() + 5 + np.random.randn(100) * 2

# Train the model
model = LinearRegression()
model.fit(X, y)

# Inspect the result
print(f"Weight (slope): {model.coef_[0]:.4f}")  # ≈ 3
print(f"Bias (intercept): {model.intercept_:.4f}")  # ≈ 5

# Predict and evaluate
y_pred = model.predict(X)
print(f"MSE: {mean_squared_error(y, y_pred):.4f}")
print(f"R² Score: {r2_score(y, y_pred):.4f}")

# Visualize
plt.scatter(X, y, alpha=0.5, label='Actual data')
plt.plot(X, y_pred, color='red', linewidth=2, label='Regression line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Simple linear regression')
plt.legend()
plt.show()

2. Polynomial Regression

Core idea

When a straight line cannot describe the data, raise the input features to powers to create new features, then apply linear regression.

$$\hat{y} = w_1x + w_2x^2 + w_3x^3 + \cdots + b$$

In essence, it is linear regression in an expanded feature space.

Python practice: fitting curved data

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Generate nonlinear data (y = 0.5x² - 2x + 3 + noise)
np.random.seed(42)
X = np.random.rand(100, 1) * 6 - 3  # range -3 to 3
y = 0.5 * X.squeeze()**2 - 2 * X.squeeze() + 3 + np.random.randn(100) * 0.5

# Compare polynomial regression across degrees
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
degrees = [1, 2, 10]

for ax, degree in zip(axes, degrees):
    model = make_pipeline(
        PolynomialFeatures(degree=degree),
        LinearRegression()
    )
    model.fit(X, y)

    X_plot = np.linspace(-3, 3, 100).reshape(-1, 1)
    y_plot = model.predict(X_plot)

    ax.scatter(X, y, alpha=0.5, s=20)
    ax.plot(X_plot, y_plot, color='red', linewidth=2)
    ax.set_title(f'Degree = {degree} (R² = {r2_score(y, model.predict(X)):.3f})')
    ax.set_ylim(-2, 15)

plt.tight_layout()
plt.show()

Caution: if the degree is too high, **overfitting** occurs. A degree-10 polynomial tracks the training data more closely, noise included, and can predict poorly on new data.
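Overfitting only becomes visible on data the model has not seen. The following sketch regenerates similar quadratic data (with a different random generator, so the numbers will not match the plots above), holds out 30% of it, and compares training and held-out MSE per degree. The split ratio and seeds are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 - 2 * X[:, 0] + 3 + rng.normal(0, 0.5, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

train_mse, test_mse = {}, {}
for degree in (1, 2, 10):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[degree] = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse[degree]:.3f}  "
          f"test MSE={test_mse[degree]:.3f}")
```

Training MSE can only go down as the degree rises, because each higher-degree model contains the lower-degree ones. The held-out MSE is the number to watch when choosing the degree.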

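3. Regularized regression: guarding against overfitting

These methods prevent overfitting by adding a **penalty that keeps the weights from growing too large**.

Ridge Regression: L2 regularization

$$MSE_{Ridge} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \alpha\sum_{j=1}^{p}w_j^2$$

Shrinks all weights evenly (toward 0, but not exactly 0)
Prevents overfitting while keeping every feature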
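Like ordinary least squares, ridge has a closed-form solution: the penalty simply adds $$\alpha$$ to the diagonal of $$\mathbf{X}^T\mathbf{X}$$. The sketch below checks this against scikit-learn on made-up data. One assumption to keep in mind: scikit-learn's `Ridge` penalizes the sum of squared errors rather than the mean, so its `alpha` is scaled differently from the $$\alpha$$ in the formula above. The intercept is turned off to keep the comparison simple.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=50)
alpha = 1.0

# Closed form: w = (X^T X + alpha * I)^(-1) X^T y
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

# scikit-learn's solution for the same objective (no intercept)
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_

# Unregularized least squares, for comparison
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

print("closed form:", w_closed)
print("sklearn:    ", w_sklearn)
print("norm ridge < norm OLS:", np.linalg.norm(w_closed) < np.linalg.norm(w_ols))
```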

Lasso Regression: L1 regularization

$$MSE_{Lasso} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \alpha\sum_{j=1}^{p}|w_j|$$

Drives some weights to exactly 0 → automatic feature selection
Simplifies the model by removing unnecessary features
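Why does L1 produce exact zeros while L2 does not? One way to see it is through the soft-thresholding operator, the building block of the coordinate descent solvers typically used for lasso. The sketch below applies it to made-up values and contrasts it with a ridge-style rescaling. The threshold of 1.0 is an arbitrary choice.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward 0 by t; anything within [-t, t] becomes exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

z = np.array([3.0, -0.5, 0.2, -2.0])  # made-up unregularized weights

lasso_like = soft_threshold(z, 1.0)  # L1-style: subtract, then clamp at 0
ridge_like = z / (1.0 + 1.0)         # L2-style: rescale, never reaches 0

print(lasso_like)  # entries: 2, 0, 0, -1
print(ridge_like)  # entries: 1.5, -0.25, 0.1, -1
```

Small weights fall inside the threshold and are clamped to zero, which is the mechanism behind lasso's automatic feature selection.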

ElasticNet: L1 + L2

$$MSE_{Elastic} = MSE + \alpha_1\sum|w_j| + \alpha_2\sum w_j^2$$

It combines the strengths of L1 and L2.
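scikit-learn does not expose $$\alpha_1$$ and $$\alpha_2$$ directly. Its `ElasticNet` takes an overall strength `alpha` and a mixing ratio `l1_ratio`. The helper below is a hypothetical convenience function (not part of scikit-learn) that maps the two parameterizations, based on the objective given in the scikit-learn documentation, which also scales the squared-error term by $$\frac{1}{2n}$$ rather than $$\frac{1}{n}$$.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

def split_penalty(alpha, l1_ratio):
    """Hypothetical helper: convert (alpha, l1_ratio) to separate L1/L2 strengths."""
    l1_strength = alpha * l1_ratio
    l2_strength = 0.5 * alpha * (1.0 - l1_ratio)
    return l1_strength, l2_strength

l1_s, l2_s = split_penalty(alpha=0.01, l1_ratio=0.5)
print(l1_s, l2_s)  # 0.005 0.0025

# Sanity check on made-up data: l1_ratio=1.0 removes the L2 term,
# so ElasticNet should agree with Lasso.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(0, 0.3, size=80)

enet = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(np.allclose(enet.coef_, lasso.coef_))
```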

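Comparison table

Method	Penalty	Feature selection	Best suited for
Ridge (L2)	$$\sum w_j^2$$	No (keeps all features)	When all features are meaningful
Lasso (L1)	$$\sum |w_j|$$	Yes (automatic removal)	When many features are unnecessary
ElasticNet	L1 + L2	Yes (partial removal)	When many features are correlated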

Python practice: comparing regularization

from sklearn.linear_model import Ridge, Lasso, ElasticNet

# High-dimensional data (only 5 of the 20 features are actually informative)
np.random.seed(42)
n_samples, n_features = 100, 20
X = np.random.randn(n_samples, n_features)
true_w = np.zeros(n_features)
true_w[:5] = [3, -2, 1.5, -1, 0.5]  # only 5 are informative
y = X @ true_w + np.random.randn(n_samples) * 0.5

# Compare three models
models = {
    'Linear regression': LinearRegression(),
    'Ridge (α=1)': Ridge(alpha=1.0),
    'Lasso (α=0.1)': Lasso(alpha=0.1),
}

print(f"{'Model':<18} {'MSE':>8} {'Nonzero weights':>18}")
print("-" * 46)

for name, model in models.items():
    model.fit(X, y)
    y_pred = model.predict(X)
    mse = mean_squared_error(y, y_pred)
    # Treat weights with magnitude below 0.01 as effectively zero
    nonzero = np.sum(np.abs(model.coef_) > 0.01)
    print(f"{name:<18} {mse:>8.4f} {nonzero:>18}")

# Visualize the weights
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (name, model) in zip(axes, models.items()):
    ax.bar(range(n_features), model.coef_)
    ax.set_title(name)
    ax.set_xlabel('Feature index')
    ax.set_ylabel('Weight')
    ax.axhline(y=0, color='gray', linestyle='--')

plt.tight_layout()
plt.show()

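You can see that lasso drives the weights of the unnecessary features to 0, performing feature selection automatically.

4. Model evaluation metrics

These are the key metrics for judging how good a regression model is.

Metric	Formula	Meaning
MSE	$$\frac{1}{n}\sum(y - \hat{y})^2$$	Mean squared error (in squared units)
RMSE	$$\sqrt{MSE}$$	Error converted back to the original units
MAE	$$\frac{1}{n}\sum|y - \hat{y}|$$	Mean absolute error (more robust to outliers than MSE)
R² Score	$$1 - \frac{\sum(y - \hat{y})^2}{\sum(y - \bar{y})^2}$$	Explanatory power (closer to 1 is better)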

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def evaluate_model(y_true, y_pred, model_name="Model"):
    """Print and return the four regression metrics from the table above."""
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)

    print(f"=== {model_name} evaluation ===")
    print(f"MSE:  {mse:.4f}")
    print(f"RMSE: {rmse:.4f}")
    print(f"MAE:  {mae:.4f}")
    print(f"R²:   {r2:.4f}")
    return {'mse': mse, 'rmse': rmse, 'mae': mae, 'r2': r2}
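As a worked example, the four metrics can be computed by hand on three made-up data points and checked against scikit-learn. The errors are 1, 0, and -2, so the sum of squared errors is 5 and the total sum of squares around the mean (5) is 8.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.0])  # made-up targets (mean is 5)
y_pred = np.array([2.0, 5.0, 9.0])  # made-up predictions

errors = y_true - y_pred                         # 1, 0, -2
mse = np.mean(errors ** 2)                       # 5 / 3
rmse = np.sqrt(mse)
mae = np.mean(np.abs(errors))                    # 3 / 3
ss_res = np.sum(errors ** 2)                     # 5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 4 + 0 + 4 = 8
r2 = 1 - ss_res / ss_tot                         # 1 - 5/8

print(f"MSE={mse:.4f} RMSE={rmse:.4f} MAE={mae:.4f} R²={r2:.4f}")
# MSE=1.6667 RMSE=1.2910 MAE=1.0000 R²=0.3750
```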

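5. Overfitting vs. underfitting

	Underfitting	Good fit	Overfitting
Training performance	Low	High	Very high
Test performance	Low	High	Low
Cause	Model too simple	-	Model too complex
Fix	Increase complexity	-	Regularization, more data

Cross-validation

Split the data into K folds and take turns using each fold for validation. This gives a more reliable performance estimate than a single train/test split.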

from sklearn.model_selection import cross_val_score

model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='r2')

print(f"5-Fold R² scores: {scores}")
print(f"Mean R²: {scores.mean():.4f} ± {scores.std():.4f}")
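To show what `cross_val_score` does under the hood, here is a sketch that writes the K-fold loop by hand on made-up data and compares the result. It assumes the default behavior for a regressor with an integer `cv`, which is an unshuffled `KFold`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(0, 0.3, size=100)

manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X):
    # Train on 4 folds, validate on the remaining one.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(r2_score(y[val_idx], model.predict(X[val_idx])))

library_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')

print("manual: ", np.round(manual_scores, 4))
print("library:", np.round(library_scores, 4))
```

Every sample is used for validation exactly once, so the mean of the five scores reflects performance on the whole dataset rather than on one lucky or unlucky split.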

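6. Toy problem: California house price prediction

Now let's bring together everything covered so far and work through a realistic problem.

6-1. Loading and exploring the data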

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load the data
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name='MedHouseVal')

print("=== Data overview ===")
print(f"Samples: {X.shape[0]}, features: {X.shape[1]}")
print(f"Target range: {y.min():.2f} ~ {y.max():.2f} (unit: $100,000)")
print()
print("=== Feature descriptions ===")
feature_desc = {
    'MedInc': 'Median income in the block group',
    'HouseAge': 'Median house age in the block group',
    'AveRooms': 'Average number of rooms per household',
    'AveBedrms': 'Average number of bedrooms per household',
    'Population': 'Block group population',
    'AveOccup': 'Average number of occupants per household',
    'Latitude': 'Latitude',
    'Longitude': 'Longitude',
}
for feat, desc in feature_desc.items():
    print(f"  {feat:12s}: {desc}")

print()
print(X.describe().round(2))

6-2. Visualizing the data

fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.ravel()

for i, col in enumerate(X.columns):
    axes[i].scatter(X[col], y, alpha=0.1, s=1)
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('House value')
    axes[i].set_title(f'{col} vs house value')

plt.tight_layout()
plt.show()

# Correlation heatmap
import seaborn as sns

df = X.copy()
df['MedHouseVal'] = y

corr = df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='RdBu_r', center=0)
plt.title('Correlation between features')
plt.show()

print("\nCorrelation with house value (sorted by absolute value):")
print(corr['MedHouseVal'].drop('MedHouseVal').abs().sort_values(ascending=False))

6-3. Preprocessing the data

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# Feature scaling (important for regularized regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Why is scaling needed? Ridge and lasso penalize the size of the weights, so when features are on different scales the penalty does not treat them fairly. The gap in units between income (roughly 0 to 15) and population (roughly 0 to 35,000) feeds directly into the weights.
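The sketch below illustrates the scaling step on two made-up features whose ranges loosely mimic income and population. It also shows the detail that matters most in practice: the mean and standard deviation come from the training set only and are then reused for the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two made-up features on very different scales.
X_train = np.column_stack([rng.uniform(0, 15, 500), rng.uniform(0, 35000, 500)])
X_test = np.column_stack([rng.uniform(0, 15, 100), rng.uniform(0, 35000, 100)])

scaler = StandardScaler().fit(X_train)     # statistics from the training set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuses the training mean and std

print("train mean:", X_train_scaled.mean(axis=0).round(3))  # about 0 per column
print("train std: ", X_train_scaled.std(axis=0).round(3))   # about 1 per column
print("test mean: ", X_test_scaled.mean(axis=0).round(3))   # near 0, not exact
```

After scaling, both columns have comparable magnitude, so the penalty acts on each weight on an equal footing.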

6-4. Training and comparing models

from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score

# Define the models
models = {
    'Linear regression': LinearRegression(),
    'Ridge (α=1)': Ridge(alpha=1.0),
    'Ridge (α=10)': Ridge(alpha=10.0),
    'Lasso (α=0.01)': Lasso(alpha=0.01),
    'Lasso (α=0.1)': Lasso(alpha=0.1),
    'ElasticNet': ElasticNet(alpha=0.01, l1_ratio=0.5),
}

# Train and evaluate
results = []
for name, model in models.items():
    model.fit(X_train_scaled, y_train)

    train_pred = model.predict(X_train_scaled)
    test_pred = model.predict(X_test_scaled)

    results.append({
        'Model': name,
        'Train RMSE': np.sqrt(mean_squared_error(y_train, train_pred)),
        'Test RMSE': np.sqrt(mean_squared_error(y_test, test_pred)),
        'Train R²': r2_score(y_train, train_pred),
        'Test R²': r2_score(y_test, test_pred),
        'Nonzero features': np.sum(np.abs(model.coef_) > 0.001),
    })

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))

6-5. Hyperparameter tuning

from sklearn.model_selection import GridSearchCV

# Search for the best α for ridge regression
ridge = Ridge()
param_grid = {'alpha': [0.01, 0.1, 1, 10, 100]}

grid_search = GridSearchCV(
    ridge, param_grid, cv=5,
    scoring='neg_mean_squared_error',
    return_train_score=True
)
grid_search.fit(X_train_scaled, y_train)

print(f"Best α: {grid_search.best_params_['alpha']}")
print(f"Best CV RMSE: {np.sqrt(-grid_search.best_score_):.4f}")

# Plot performance as a function of α
cv_results = pd.DataFrame(grid_search.cv_results_)
alphas = param_grid['alpha']
train_scores = np.sqrt(-cv_results['mean_train_score'])
test_scores = np.sqrt(-cv_results['mean_test_score'])

plt.figure(figsize=(8, 5))
plt.plot(alphas, train_scores, 'o-', label='Train RMSE')
plt.plot(alphas, test_scores, 'o-', label='CV RMSE')
plt.xscale('log')
plt.xlabel('α (regularization strength)')
plt.ylabel('RMSE')
plt.title('Ridge regression: performance as a function of α')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
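One possible refinement: in the search above, the scaler was fit on the full training set before cross-validation, so each validation fold has already influenced the scaling statistics. Wrapping the scaler and the model in a `Pipeline` refits the scaler inside every fold. The sketch below uses synthetic data from `make_regression` so it runs without downloading anything. The step names `scaler` and `ridge` are arbitrary labels chosen here.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the housing dataset).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ('scaler', StandardScaler()),  # refit on the training folds of each split
    ('ridge', Ridge()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {'ridge__alpha': [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error')
search.fit(X_train, y_train)

best_alpha = search.best_params_['ridge__alpha']
test_r2 = search.best_estimator_.score(X_test, y_test)  # R² on held-out data
print(f"best alpha: {best_alpha}, test R²: {test_r2:.4f}")
```

For a dataset as large as California housing the difference is usually small, but the pipeline version keeps the cross-validation estimate free of this leak.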

6-6. Final model evaluation

# Final evaluation with the best model
from sklearn.metrics import mean_absolute_error  # not imported in 6-4

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test_scaled)

print("=== Final model performance (test set) ===")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"MAE:  {mean_absolute_error(y_test, y_pred):.4f}")
print(f"R²:   {r2_score(y_test, y_pred):.4f}")

# Feature importance (absolute weight values)
importance = pd.Series(
    np.abs(best_model.coef_),
    index=housing.feature_names
).sort_values(ascending=True)

plt.figure(figsize=(8, 5))
importance.plot(kind='barh')
plt.title('Feature importance (absolute ridge regression weights)')
plt.xlabel('|weight|')
plt.tight_layout()
plt.show()

6-7. Visualizing the predictions

# Actual vs. predicted values, plus residuals
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scatter plot
axes[0].scatter(y_test, y_pred, alpha=0.3, s=10)
axes[0].plot([0, 5], [0, 5], 'r--', linewidth=2)
axes[0].set_xlabel('Actual house value ($100,000)')
axes[0].set_ylabel('Predicted house value ($100,000)')
axes[0].set_title('Actual vs. predicted')

# Residual distribution
residuals = y_test - y_pred
axes[1].hist(residuals, bins=50, edgecolor='black', alpha=0.7)
axes[1].axvline(x=0, color='red', linestyle='--')
axes[1].set_xlabel('Residual (actual - predicted)')
axes[1].set_ylabel('Frequency')
axes[1].set_title(f'Residual distribution (mean: {residuals.mean():.4f})')

plt.tight_layout()
plt.show()

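Summary

Algorithm	When to use	Key point
Linear regression	When features and target are linearly related	Simplest and easiest to interpret
Polynomial regression	When a nonlinear relationship is visible	Choosing the degree is critical (watch for overfitting)
Ridge	When all features are meaningful	Shrinks all weights evenly
Lasso	When many features are unnecessary	Automatic feature selection
ElasticNet	When many features are correlated	Combines the strengths of ridge and lasso

The next post covers the other half of supervised learning: classification algorithms. It walks through logistic regression, SVMs, decision trees, and more, from theory to a hands-on iris classification exercise.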
