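Introduction to Deep Learning: Core Ideas from Neural Networks, CNNs, and RNNs to Transformers, with a Hands-On MNIST Example

What Is Deep Learning?

In the classification table from Part 1, neural networks, CNNs, RNNs, and Transformers were all listed as nonlinear black-box models. What they have in common is that they are built on deep neural network architectures.

Deep learning is a subfield of machine learning that stacks many layers of neurons so that complex patterns in data are learned automatically.

Why "Deep"?

Aspect	Traditional machine learning	Deep learning
Feature extraction	Designed by hand	Learned automatically
Model structure	Shallow (1-2 layers)	Deep (tens to hundreds of layers)
Data requirements	Low	High
Expressive power	Limited	Very high
Interpretability	Relatively high	Low (black box)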

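1. The Evolution of Neural Networks

Perceptron (1957) → MLP (1986) → RNN/LSTM (1997) → CNN (1998) → Transformer (2017)

Perceptron

The simplest neural network. It computes a weighted sum of its inputs and outputs 1 if the sum exceeds a threshold, and 0 otherwise.

$$y = \begin{cases} 1 & \text{if } \sum_i w_i x_i + b > 0 \\ 0 & \text{otherwise} \end{cases}$$

Limitation: a single perceptron can only draw a linear decision boundary, so it cannot solve problems that are not linearly separable, such as XOR. The multi-layer perceptron (MLP) addresses this.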
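
The XOR limitation can be checked with a few lines of NumPy. The sketch below trains one perceptron with the classic perceptron learning rule on AND (linearly separable) and on XOR (not linearly separable). The helper names `train_perceptron` and `accuracy`, the learning rate, and the epoch count are illustrative choices, not part of the original post.

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Train a single perceptron with the classic perceptron learning rule."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            # (target - pred) is 0 for a correct prediction, so only mistakes change w and b
            w += lr * (target - pred) * xi
            b += lr * (target - pred)
    return w, b

def accuracy(X, y, w, b):
    preds = (X @ w + b > 0).astype(int)
    return float((preds == y).mean())

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])
y_xor = np.array([0, 1, 1, 0])

acc_and = accuracy(X, y_and, *train_perceptron(X, y_and))
acc_xor = accuracy(X, y_xor, *train_perceptron(X, y_xor))
print(f"AND accuracy: {acc_and:.2f}")  # 1.00: a straight line separates the classes
print(f"XOR accuracy: {acc_xor:.2f}")  # stays below 1.00: no straight line can
```

No amount of extra training fixes the XOR case, because the model itself cannot represent the required boundary.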

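Multi-Layer Perceptron (MLP)

Stacking layers with nonlinear activation functions between them lets the network approximate any continuous function on a bounded input range to arbitrary accuracy, provided it has enough hidden units (the universal approximation theorem).

Input layer → [Hidden layer 1] → [Hidden layer 2] → ... → Output layer
                (activation)       (activation)           (activation)

2. Activation Functions

Activation functions introduce nonlinearity between layers. Without them, a stack of layers, however deep, collapses into a single linear (affine) transformation.

Function	Formula	Range	Strength	Weakness
Sigmoid	$$\frac{1}{1+e^{-x}}$$	(0, 1)	Interpretable as a probability	Vanishing gradients
Tanh	$$\frac{e^x-e^{-x}}{e^x+e^{-x}}$$	(-1, 1)	Zero-centered	Vanishing gradients
ReLU	$$\max(0, x)$$	[0, ∞)	Fast and simple	Dead neurons
Leaky ReLU	$$\max(0.01x, x)$$	(-∞, ∞)	Mitigates dead neurons	Adds a slope hyperparameter
Softmax	$$\frac{e^{x_i}}{\sum_j e^{x_j}}$$	(0, 1), sums to 1	Multi-class output	Mostly used at the output layer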
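
The claim that layers without activations collapse into one linear map can be verified numerically. This is a minimal NumPy sketch with random weights; the layer sizes and variable names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))          # 5 samples, 4 features

# Two "layers" with no activation in between
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)
two_layers = (x @ W1 + b1) @ W2 + b2

# The same mapping written as a single linear layer
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b
print(np.allclose(two_layers, one_layer))   # True

# With a ReLU in between, the collapse no longer holds
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
print(np.allclose(with_relu, one_layer))    # False
```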

Visualization in Python

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 200)

fig, axes = plt.subplots(1, 4, figsize=(16, 4))

# Sigmoid
axes[0].plot(x, 1 / (1 + np.exp(-x)), linewidth=2)
axes[0].set_title('Sigmoid')
axes[0].axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
axes[0].grid(True, alpha=0.3)

# Tanh
axes[1].plot(x, np.tanh(x), linewidth=2, color='orange')
axes[1].set_title('Tanh')
axes[1].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[1].grid(True, alpha=0.3)

# ReLU
axes[2].plot(x, np.maximum(0, x), linewidth=2, color='green')
axes[2].set_title('ReLU')
axes[2].grid(True, alpha=0.3)

# Leaky ReLU
axes[3].plot(x, np.where(x > 0, x, 0.01 * x), linewidth=2, color='red')
axes[3].set_title('Leaky ReLU')
axes[3].grid(True, alpha=0.3)

plt.suptitle('Activation function comparison')
plt.tight_layout()
plt.show()

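3. Backpropagation

The core algorithm for training neural networks. It uses the **chain rule** to propagate the gradient of the loss function from the output layer back toward the input layer.

Steps

Forward pass: input → through each layer → compute the prediction
Loss computation: measure the difference between the prediction and the target (the loss)
Backward pass: compute gradients from the output layer toward the input layer
Weight update: move each weight a small step in the direction opposite to its gradient

# PyTorch performs backpropagation automatically
import torch
import torch.nn as nn

# A simple example
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()  # y = x1² + x2² + x3²

y.backward()  # automatic differentiation

print(f"x = {x.detach()}")
print(f"dy/dx = {x.grad}")  # tensor([2., 4., 6.]) (2 * xi for each xi)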
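
To see what automatic differentiation does under the hood, the four steps can be worked through by hand for a single sigmoid neuron. The sketch below uses NumPy only; the neuron, the squared-error loss, and all numeric values are assumptions chosen for illustration. The hand-derived gradient is compared against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(w, b, x, t):
    """Squared error of a single sigmoid neuron."""
    return 0.5 * (sigmoid(w @ x + b) - t) ** 2

x = np.array([1.0, 2.0, 3.0])     # input
w = np.array([0.1, -0.2, 0.05])   # weights
b, t = 0.0, 1.0                   # bias and target

# 1-2. Forward pass and loss
z = w @ x + b
a = sigmoid(z)
loss = 0.5 * (a - t) ** 2

# 3. Backward pass via the chain rule: dL/dw = dL/da * da/dz * dz/dw
dL_da = a - t
da_dz = a * (1 - a)
grad_w = dL_da * da_dz * x
grad_b = dL_da * da_dz

def numeric_grad(w, eps=1e-6):
    """Central finite-difference estimate of dL/dw, used only as a check."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (loss_fn(w + step, b, x, t) - loss_fn(w - step, b, x, t)) / (2 * eps)
    return g

print(np.allclose(grad_w, numeric_grad(w), atol=1e-8))  # True

# 4. Weight update: step against the gradient
lr = 0.5
w_new, b_new = w - lr * grad_w, b - lr * grad_b
print(loss_fn(w_new, b_new, x, t) < loss)  # True: the loss went down
```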

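4. Optimizers

An optimizer decides how the weights are updated from the gradients.

Optimizer	Core idea	Strength
SGD	Plain (stochastic) gradient descent	Simple, well-understood theory
Momentum	Adds inertia (reuses the previous update direction)	Less oscillation, faster convergence
RMSprop	Scales the learning rate by recent gradient magnitudes	Adaptive learning rate
Adam	Combines Momentum and RMSprop	One of the most widely used, a robust default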

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)

# SGD with momentum
optimizer_sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam (a common general-purpose default)
optimizer_adam = optim.Adam(model.parameters(), lr=0.001)

# Usage inside a training loop
# optimizer.zero_grad()  # reset gradients
# loss.backward()        # backpropagation
# optimizer.step()       # update weights
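
The update rules behind these optimizers fit in a few lines each. The following NumPy sketch applies plain gradient descent, momentum, and Adam to a simple elongated quadratic bowl. The test function, the hyperparameters, and the function names (`run_sgd`, `run_momentum`, `run_adam`) are assumptions made for this illustration, and the code is a simplified rendering of the textbook rules rather than a copy of any library's implementation.

```python
import numpy as np

def grad(w):
    """Gradient of f(w) = 0.5 * (10 * w0^2 + w1^2), an elongated bowl with its minimum at 0."""
    return np.array([10.0 * w[0], w[1]])

def run_sgd(w, lr=0.05, steps=100):
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def run_momentum(w, lr=0.05, beta=0.9, steps=100):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)              # accumulate past gradients (inertia)
        w = w - lr * v
    return w

def run_adam(w, lr=0.05, b1=0.9, b2=0.999, eps=1e-8, steps=100):
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g           # first moment (momentum-like)
        v = b2 * v + (1 - b2) * g ** 2      # second moment (RMSprop-like)
        m_hat = m / (1 - b1 ** t)           # bias correction for the zero initialisation
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([1.0, 1.0])
results = {}
for name, fn in [("SGD", run_sgd), ("Momentum", run_momentum), ("Adam", run_adam)]:
    results[name] = fn(w0.copy())
    print(f"{name:8s} distance to optimum: {np.linalg.norm(results[name]):.4f}")
```

Which optimizer gets closest depends heavily on the problem and the hyperparameters, so the printed distances should not be read as a general ranking.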

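5. CNN (Convolutional Neural Network)

Why CNNs?

Feeding images into an MLP causes two problems:

A 28×28 image already means 784 inputs, and the number of parameters grows rapidly with image size and layer width
Flattening the image discards **spatial information (how pixels are positioned relative to each other)**

A CNN uses convolution filters to extract local patterns from the image.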
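
A quick parameter count makes the difference concrete. The helper functions below are hypothetical names introduced for this sketch, and the layer widths (256 units, 32 filters) are example values.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution layer."""
    return c_in * c_out * k * k + c_out

# Fully connected layer with 256 units on a flattened image
print(dense_params(28 * 28, 256))         # 200960   (28x28 grayscale)
print(dense_params(224 * 224 * 3, 256))   # 38535424 (224x224 RGB)

# 3x3 convolution with 32 filters: the count does not depend on image size
print(conv_params(1, 32, 3))              # 320
print(conv_params(3, 32, 3))              # 896
```

Because the same small filter is reused at every position, a convolution layer's parameter count depends only on the filter size and channel counts.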

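Key Building Blocks

Layer	Role	Characteristics
Convolution (Conv)	Extracts features with filters	Detects edges, textures, shapes
Pooling (Pool)	Shrinks the spatial size	Some tolerance to small shifts, less computation downstream
Fully connected (FC)	Final classification	Predicts from the extracted features

Intuition for the Convolution Operation

Input image (5×5)      Filter (3×3)      Output (3×3)
┌───────────┐          ┌───────┐         ┌───────┐
│ 1 0 1 0 1 │          │ 1 0 1 │         │ 5 0 5 │
│ 0 1 0 1 0 │    *     │ 0 1 0 │    =    │ 0 5 0 │
│ 1 0 1 0 1 │          │ 1 0 1 │         │ 5 0 5 │
│ 0 1 0 1 0 │          └───────┘         └───────┘
│ 1 0 1 0 1 │
└───────────┘

The filter slides across the image and computes a dot product at each position. For example, the top-left output is 1+1+1+1+1 = 5 because all five 1s in the filter line up with 1s in the image, whereas shifting the filter one pixel to the right lines them up with 0s and gives 0. Different filters detect different patterns (edges, corners, textures).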
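
The sliding dot product is short enough to write out directly. This NumPy sketch uses a 5×5 checkerboard input and an X-shaped 3×3 filter, with stride 1 and no padding. The function name `conv2d_valid` is a hypothetical helper; like deep learning frameworks, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and take dot products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]   # window under the filter
            out[i, j] = np.sum(patch * kernel)  # elementwise product, then sum
    return out

image = np.array([
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
])
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

feature_map = conv2d_valid(image, kernel)
print(feature_map)
# [[5. 0. 5.]
#  [0. 5. 0.]
#  [5. 0. 5.]]
```

The output is large exactly where the image patch matches the filter's X pattern and zero where it does not, which is what "detecting a pattern" means here.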

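What Each CNN Layer Learns

Input image
  ↓ [Conv 1] detects edges and color changes
  ↓ [Conv 2] detects textures and patterns
  ↓ [Conv 3] detects object parts (eyes, noses, wheels, etc.)
  ↓ [Conv 4+] recognizes whole objects (faces, cars, etc.)
  ↓ [FC] final classification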

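6. RNN (Recurrent Neural Network) and LSTM

Why RNNs?

Data where order matters, such as time series and text, is hard to handle with an ordinary feed-forward network. An RNN processes the sequence step by step while carrying forward information from earlier time steps.

$$h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t + b)$$

$$h_t$$: hidden state (memory) at time step t
$$x_t$$: input at time step t
The previous hidden state $$h_{t-1}$$ feeds into the next computation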
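
The recurrence can be unrolled as a plain loop. The NumPy sketch below runs the update over one short sequence; the sizes and the random parameter values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 4, 5

# Randomly initialised parameters (illustrative values, not trained)
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

xs = rng.normal(size=(seq_len, input_size))  # one sequence of 5 time steps
h = np.zeros(hidden_size)                    # initial hidden state h_0

hidden_states = []
for x_t in xs:
    # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)
    h = np.tanh(W_hh @ h + W_xh @ x_t + b)
    hidden_states.append(h)

hidden_states = np.stack(hidden_states)
print(hidden_states.shape)  # (5, 4): one hidden state per time step
```

Note that the same `W_hh` and `W_xh` are reused at every step, and each step has to wait for the previous one to finish.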

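Limitation of RNNs: Vanishing and Exploding Gradients

On long sequences, backpropagation multiplies the gradient by a similar factor at every time step, so it either shrinks toward 0 (vanishing) or grows without bound (exploding). In practice this means, for example, that the 100th word can hardly be influenced by the 1st word.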
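
A back-of-the-envelope calculation shows how quickly this happens. The sketch below treats the per-step gradient factor as a single constant number, which is a deliberate simplification (in a real RNN it is a matrix that changes at every step).

```python
def gradient_scale(factor, steps):
    """Size of a gradient after being multiplied by the same factor `steps` times."""
    return factor ** steps

for factor in (0.9, 1.0, 1.1):
    print(f"factor {factor}: after 100 steps -> {gradient_scale(factor, 100):.3e}")
```

A factor only slightly below 1 shrinks the gradient to roughly 2.7e-05 after 100 steps, while a factor slightly above 1 inflates it to roughly 1.4e+04.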

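LSTM (Long Short-Term Memory)

LSTM mitigates the vanishing gradient problem with a gating mechanism and a separate cell state.

Gate	Role
Forget gate	Decides which parts of the previous memory to discard
Input gate	Decides which parts of the new information to store
Output gate	Decides which parts of the memory to use for the current output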
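
The three gates map onto a handful of lines. This is a minimal NumPy sketch of one LSTM time step under the standard formulation; the function name `lstm_step`, the parameter shapes, and the order in which the four blocks are stacked are choices made for this illustration and need not match any library's internal layout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold the parameters of all four blocks, stacked."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b          # shape (4 * H,)
    f = sigmoid(z[0:H])                   # forget gate: what to drop from memory
    i = sigmoid(z[H:2 * H])               # input gate: what new information to store
    o = sigmoid(z[2 * H:3 * H])           # output gate: what part of memory to expose
    g = np.tanh(z[3 * H:4 * H])           # candidate values to write
    c_t = f * c_prev + i * g              # update the cell state (memory)
    h_t = o * np.tanh(c_t)                # hidden state / output
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 10, 4
W = rng.normal(scale=0.1, size=(4 * hidden_size, input_size))
U = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size))
b = np.zeros(4 * hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(20, input_size)):   # a sequence of 20 steps
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```

The cell state update `c_t = f * c_prev + i * g` is additive, which gives gradients a more direct path through time than the repeated `tanh` of a plain RNN.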

import torch
import torch.nn as nn

# LSTM usage example
lstm = nn.LSTM(input_size=10, hidden_size=64, num_layers=2, batch_first=True)

# Input: (batch, sequence length, features)
x = torch.randn(32, 20, 10)  # batch 32, sequence length 20, 10 features
output, (h_n, c_n) = lstm(x)

print(f"Output: {output.shape}")          # torch.Size([32, 20, 64])
print(f"Last hidden state: {h_n.shape}")  # torch.Size([2, 32, 64]) = (num_layers, batch, hidden)

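7. Transformer

Why Transformers?

RNNs and LSTMs process a sequence one step at a time, which makes them hard to parallelize and slow to train. The Transformer uses the self-attention mechanism to process the whole sequence at once.

Self-Attention in a Nutshell

"Every word computes its relevance to every other word, all at the same time"

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Q (Query): "What information am I looking for?"
K (Key): "What kind of information do I offer?"
V (Value): "The information I actually provide"
$$d_k$$: dimension of the key vectors; dividing by $$\sqrt{d_k}$$ keeps the dot products from growing too large before the softmax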
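
The attention formula translates almost symbol for symbol into NumPy. In the sketch below, the token embeddings and the projection matrices `W_q`, `W_k`, `W_v` are random placeholders (in a real model they are learned), and a single attention head without masking is assumed.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) relevance scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))          # 4 token embeddings

# Hypothetical projection matrices; random here, learned in practice
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)

print(output.shape)          # (4, 8)
print(weights.sum(axis=-1))  # every entry is 1 (up to rounding)
```

All rows of `weights` are computed in one matrix product, which is why attention parallelizes across the sequence in a way a recurrent loop cannot.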

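Impact of the Transformer

Model	Domain	Architecture
BERT	Natural language understanding	Encoder
GPT	Text generation	Decoder
ViT	Image classification	Encoder
DALL-E (original)	Image generation	Decoder
Whisper	Speech recognition	Encoder-Decoder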

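8. Key Techniques for Training Deep Networks

Preventing Overfitting

Technique	How it works
Dropout	Randomly deactivates neurons during training
Batch normalization (Batch Norm)	Normalizes each layer's inputs using mini-batch statistics
Data augmentation	Diversifies the data with image rotations, flips, and similar transforms
Early stopping	Stops training once the validation loss stops improving
Weight decay	L2 regularization (penalizes large weights)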
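
Dropout is the easiest of these techniques to write out. The sketch below implements the common "inverted dropout" variant in NumPy, where the surviving activations are rescaled during training so that nothing needs to change at evaluation time. The function signature is a hypothetical one for this example.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: zero each unit with probability p and rescale the rest."""
    if not training or p == 0.0:
        return x                        # evaluation mode: pass through unchanged
    mask = rng.random(x.shape) >= p     # keep each unit with probability 1 - p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones(10_000)

train_out = dropout(x, p=0.2, training=True, rng=rng)
eval_out = dropout(x, p=0.2, training=False, rng=rng)

print(f"fraction zeroed in training mode: {(train_out == 0).mean():.3f}")
print(f"mean in training mode: {train_out.mean():.3f}")
print(f"mean in eval mode: {eval_out.mean():.3f}")
```

About 20% of the units are zeroed during training, yet the average activation stays close to 1 because the kept units are scaled by 1 / (1 - p). This is also why the model has to be switched between training and evaluation modes.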

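Learning Rate Scheduling

optimizer = optimizer_adam  # any optimizer works; this one was created in section 4
# Option 1: multiply the learning rate by 0.1 every 10 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
# Option 2 (use instead of option 1): cosine decay of the learning rate over 50 epochs
# scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
# Call scheduler.step() once per epoch, after optimizer.step()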
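
The two schedules can be written as closed-form functions of the epoch number, which makes their shapes easy to inspect without a training loop. The functions below are a plain-Python sketch of the step-decay and cosine-annealing formulas as I understand them, not calls into PyTorch; `step_lr` and `cosine_lr` are hypothetical names and the base learning rate of 0.001 is an example value.

```python
import math

def step_lr(base_lr, epoch, step_size=10, gamma=0.1):
    """Learning rate after `epoch` epochs under step decay."""
    return base_lr * gamma ** (epoch // step_size)

def cosine_lr(base_lr, epoch, t_max=50, eta_min=0.0):
    """Learning rate after `epoch` epochs under cosine annealing (no restarts)."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

for epoch in (0, 10, 25, 50):
    print(epoch, step_lr(0.001, epoch), cosine_lr(0.001, epoch))
```

Step decay drops abruptly at fixed intervals, while cosine annealing decreases smoothly and reaches its minimum at `t_max`.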

9. Toy Problem: MNIST Handwritten Digit Classification

Classifying images of handwritten digits 0-9 is the "Hello World" of deep learning.

9-1. Loading the Data

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
import numpy as np

# Define the data transforms
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean and standard deviation
])

# Download the datasets
train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST('./data', train=False, download=True, transform=transform)

# Data loaders
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=1000, shuffle=False)

print(f"Training samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")
print(f"Image shape: {train_dataset[0][0].shape}")  # torch.Size([1, 28, 28])

# Visualize sample images
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.ravel()):
    img, label = train_dataset[i]
    ax.imshow(img.squeeze(), cmap='gray')
    ax.set_title(f'Label: {label}')
    ax.axis('off')
plt.suptitle('MNIST sample images')
plt.tight_layout()
plt.show()

9-2. Model Definitions (Three Models)

# Model 1: simple MLP
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28 * 28, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        x = self.flatten(x)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)
        return x

# Model 2: CNN
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)
        self.dropout = nn.Dropout(0.25)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))   # [B, 32, 14, 14]
        x = self.pool(F.relu(self.conv2(x)))   # [B, 64, 7, 7]
        x = x.view(-1, 64 * 7 * 7)
        x = self.dropout(x)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

# Model 3: improved CNN (adds batch normalization and a third conv block)
class ImprovedCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.conv3 = nn.Conv2d(64, 128, 3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(128 * 3 * 3, 256)
        self.fc2 = nn.Linear(256, 10)
        self.dropout = nn.Dropout(0.3)

    def forward(self, x):
        x = self.pool(F.relu(self.bn1(self.conv1(x))))   # [B, 32, 14, 14]
        x = self.pool(F.relu(self.bn2(self.conv2(x))))   # [B, 64, 7, 7]
        x = self.pool(F.relu(self.bn3(self.conv3(x))))   # [B, 128, 3, 3]
        x = x.view(-1, 128 * 3 * 3)
        x = self.dropout(F.relu(self.fc1(x)))
        x = self.fc2(x)
        return x
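
The shape comments in these models follow from the standard output-size formula for convolution and pooling layers, floor((size + 2 × padding - kernel) / stride) + 1. The short sketch below traces a 28×28 input through three blocks of "3×3 convolution with padding 1, then 2×2 max pooling"; `conv_out` is a hypothetical helper name.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28
sizes = []
for block in (1, 2, 3):
    size = conv_out(size, kernel=3, padding=1)   # 3x3 conv with padding=1 keeps the size
    size = conv_out(size, kernel=2, stride=2)    # 2x2 max pooling halves it (rounding down)
    sizes.append(size)
    print(f"after block {block}: {size} x {size}")
# after block 1: 14 x 14
# after block 2: 7 x 7
# after block 3: 3 x 3
```

This is where the `64 * 7 * 7` and `128 * 3 * 3` input sizes of the first fully connected layers come from. The last pooling step turns 7 into 3 because the odd row and column are dropped.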

9-3. Training and Evaluation Functions

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

def train_epoch(model, loader, optimizer, criterion):
    model.train()
    total_loss = 0
    correct = 0
    total = 0

    for data, target in loader:
        data, target = data.to(device), target.to(device)

        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * data.size(0)
        pred = output.argmax(dim=1)
        correct += pred.eq(target).sum().item()
        total += data.size(0)

    return total_loss / total, correct / total

def evaluate(model, loader, criterion):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            loss = criterion(output, target)

            total_loss += loss.item() * data.size(0)
            pred = output.argmax(dim=1)
            correct += pred.eq(target).sum().item()
            total += data.size(0)

    return total_loss / total, correct / total

9-4. Training and Comparing the Three Models

def train_model(model_class, model_name, epochs=10, lr=0.001):
    model = model_class().to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    train_losses, test_losses = [], []
    train_accs, test_accs = [], []

    for epoch in range(1, epochs + 1):
        train_loss, train_acc = train_epoch(model, train_loader, optimizer, criterion)
        test_loss, test_acc = evaluate(model, test_loader, criterion)

        train_losses.append(train_loss)
        test_losses.append(test_loss)
        train_accs.append(train_acc)
        test_accs.append(test_acc)

        if epoch % 2 == 0 or epoch == 1:
            print(f"[{model_name}] Epoch {epoch:2d} | "
                  f"Train Loss: {train_loss:.4f}, Acc: {train_acc:.4f} | "
                  f"Test Loss: {test_loss:.4f}, Acc: {test_acc:.4f}")

    return model, {
        'train_loss': train_losses, 'test_loss': test_losses,
        'train_acc': train_accs, 'test_acc': test_accs
    }

# Train the three models
print("=" * 70)
mlp_model, mlp_hist = train_model(MLP, "MLP", epochs=10)
print()
cnn_model, cnn_hist = train_model(CNN, "CNN", epochs=10)
print()
icnn_model, icnn_hist = train_model(ImprovedCNN, "Improved CNN", epochs=10)
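
None of the three models ends with a softmax layer. That works because `nn.CrossEntropyLoss` expects raw scores (logits) and applies log-softmax internally. The NumPy sketch below reproduces that computation on a few made-up logits to show what the loss measures; the function name and the example values are assumptions for illustration.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy from raw scores, computed via a numerically stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

logits = np.array([
    [2.0, 0.5, -1.0],   # confident and correct (target 0)
    [0.1, 0.2, 0.3],    # nearly uniform (target 2)
    [3.0, -2.0, 0.0],   # confident but wrong (target 1)
])
targets = np.array([0, 2, 1])

print(f"mean loss: {cross_entropy(logits, targets):.4f}")
print(f"loss of a uniform guess over 10 classes: {np.log(10):.4f}")  # 2.3026
```

The last line gives a useful reference point for MNIST: an untrained 10-class model that guesses uniformly has a loss of about 2.30, so training losses should fall well below that.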

9-5. Comparing Learning Curves

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

histories = [
    ('MLP', mlp_hist),
    ('CNN', cnn_hist),
    ('Improved CNN', icnn_hist),
]

# Loss curves
for name, hist in histories:
    axes[0].plot(hist['test_loss'], label=name, linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Test Loss')
axes[0].set_title('Test loss comparison')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Accuracy curves
for name, hist in histories:
    axes[1].plot(hist['test_acc'], label=name, linewidth=2)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Test Accuracy')
axes[1].set_title('Test accuracy comparison')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

9-6. Detailed Evaluation

from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

# Evaluate the Improved CNN (expected to score highest; confirm against the learning curves)
best_model = icnn_model
best_model.eval()

all_preds = []
all_targets = []

with torch.no_grad():
    for data, target in test_loader:
        data = data.to(device)
        output = best_model(data)
        preds = output.argmax(dim=1).cpu().numpy()
        all_preds.extend(preds)
        all_targets.extend(target.numpy())

all_preds = np.array(all_preds)
all_targets = np.array(all_targets)

# Classification report
print(classification_report(all_targets, all_preds,
                            target_names=[str(i) for i in range(10)]))

# Confusion matrix
cm = confusion_matrix(all_targets, all_preds)
fig, ax = plt.subplots(figsize=(10, 8))
disp = ConfusionMatrixDisplay(cm, display_labels=list(range(10)))
disp.plot(ax=ax, cmap='Blues')  # draw on our own axes so no empty extra figure is created
ax.set_title('Improved CNN - MNIST confusion matrix')
plt.show()
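
To read the report and the matrix with confidence, it helps to see how precision and recall come out of the confusion matrix. The NumPy sketch below uses a small made-up 3-class example rather than the MNIST predictions; `confusion_matrix_np` and `cm_small` are hypothetical names.

```python
import numpy as np

def confusion_matrix_np(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)   # count each (true, predicted) pair
    return cm

# Small made-up example with 3 classes
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 2, 0])

cm_small = confusion_matrix_np(y_true, y_pred, n_classes=3)
precision = np.diag(cm_small) / cm_small.sum(axis=0)   # correct / all predicted as that class
recall = np.diag(cm_small) / cm_small.sum(axis=1)      # correct / all truly in that class

print(cm_small)
# [[1 1 0]
#  [0 2 1]
#  [1 0 3]]
print(precision)  # approximately [0.5, 0.667, 0.75]
print(recall)     # approximately [0.5, 0.667, 0.75]
```

Off-diagonal cells show which pairs of classes get confused with each other, which is often more informative than a single accuracy figure.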

9-7. Analyzing Misclassified Examples

# Find the wrong predictions
wrong_idx = np.where(all_preds != all_targets)[0]
print(f"Misclassified: {len(wrong_idx)} / {len(all_targets)} ({len(wrong_idx)/len(all_targets):.2%})")

# Visualize misclassified samples
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
for i, ax in enumerate(axes.ravel()):
    ax.axis('off')  # hide every panel's axes, including panels left empty
    if i >= len(wrong_idx):
        continue
    idx = wrong_idx[i]
    img = test_dataset[idx][0].squeeze()
    ax.imshow(img, cmap='gray')
    ax.set_title(f'True: {all_targets[idx]}, Predicted: {all_preds[idx]}', color='red')

plt.suptitle('Misclassified samples')
plt.tight_layout()
plt.show()

9-8. Saving the Model

# Save the final model
torch.save(icnn_model.state_dict(), 'mnist_improved_cnn.pth')
print("Model saved: mnist_improved_cnn.pth")

# How to load it
# loaded_model = ImprovedCNN()
# loaded_model.load_state_dict(torch.load('mnist_improved_cnn.pth', map_location=device))
# loaded_model.to(device).eval()

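10. A Brief Introduction to Transfer Learning

Transfer learning takes a model pretrained on a large dataset and **fine-tunes it on your own data**. It can reach high accuracy even with little data.

# PyTorch transfer learning example (conceptual)
import torch.nn as nn
from torchvision import models

# Load a ResNet18 pretrained on ImageNet
# (the `weights` argument replaces the deprecated `pretrained=True` in torchvision 0.13+)
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained weights (optional)
for param in resnet.parameters():
    param.requires_grad = False

# Replace only the final classification layer; the new layer is trainable by default
resnet.fc = nn.Linear(resnet.fc.in_features, 10)  # 10 classes

In practice, transfer learning is the default starting point: when a pretrained model from a similar domain is available, it usually outperforms training from scratch, especially on small datasets.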

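Series Wrap-Up

This post concludes the machine learning series. Here is a look back at the whole arc.

Part	Topic	Highlights
Part 1	A Complete Taxonomy of Machine Learning	Classification by learning style, model structure, and interpretability
Part 2	TensorFlow.js	ML in the browser
Part 3	PyTorch Basics	Python ML framework
Part 4	Regression Algorithms	Linear/polynomial/regularized regression + house price prediction
Part 5	Classification Algorithms	Logistic regression/SVM/trees/k-NN + iris and breast cancer datasets
Part 6	Ensemble Learning	Random forest/XGBoost + Titanic
Part 7	Unsupervised Learning	Clustering/dimensionality reduction + customer segmentation and wine
Part 8	Introduction to Deep Learning	Neural networks/CNN/RNN/Transformer + MNIST

The series has now covered the algorithms from the Part 1 overview table in both theory and practice. If you want to go deeper from here:

Computer vision: CNN architectures (ResNet, EfficientNet), object detection (YOLO)
Natural language processing: Transformers in depth, working with BERT and GPT
Generative models: GAN, diffusion models, VAE
Reinforcement learning: hands-on Q-learning and PPO
MLOps: model deployment, monitoring, pipeline automation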