MarkTechPost • 111 days ago

Sigmoid vs ReLU: The Inference Cost of Losing Geometric Context

Key Summary

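In deep learning models, the Sigmoid activation function compresses inputs into the range between 0 and 1, which discards geometric context and limits the benefit of model depth. ReLU, by contrast, preserves the magnitude of positive inputs, so deep networks can stay expressive without excessive width or compute. This article analyzes, through an experiment, how the two activation functions differ in signal propagation and representation geometry, and explains what that means for a model's inference efficiency and scalability.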

Translated Body

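A deep neural network can be understood as a geometric system in which each layer reshapes the input space to form increasingly complex decision boundaries. For this to work, layers must preserve meaningful spatial information, in particular how far a data point lies from those boundaries, because that distance is what lets deeper layers build rich non-linear representations.

Sigmoid disrupts this process by compressing every input into the narrow range between 0 and 1. The farther values move from a decision boundary, the harder they are to tell apart, so geometric context is lost from layer to layer. The result is weaker representations and a limit on how much depth can help.

ReLU, on the other hand, preserves magnitude for positive inputs, so distance information can flow through the network. Deeper models can therefore remain expressive without requiring excessive width or compute.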
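The saturation effect described above can be seen on a handful of numbers. The following is a minimal sketch, not code from the article: the values in `z` are hypothetical pre-activations standing in for signed distances from a decision boundary, and the two functions are the textbook definitions of Sigmoid and ReLU.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

# Hypothetical pre-activations: points at increasing distance from a boundary
z = np.array([0.5, 2.0, 6.0, 12.0])

sig_out = sigmoid(z)
relu_out = relu(z)

# How far apart the two most distant points remain after each activation
sig_gap = sig_out[3] - sig_out[2]
relu_gap = relu_out[3] - relu_out[2]

print("sigmoid outputs:", np.round(sig_out, 4))
print("relu outputs:   ", relu_out)
print(f"gap between z=6 and z=12 -> sigmoid: {sig_gap:.4f}, relu: {relu_gap:.1f}")
```

After Sigmoid, the two far-away points differ by well under 0.01, so a later layer can barely tell them apart. ReLU passes the full difference of 6 through unchanged.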

This article focuses on that forward-pass behavior: using a two-moons experiment, it analyzes how Sigmoid and ReLU differ in signal propagation and representation geometry, and what that means for inference efficiency and scalability.

Setting up the dependencies

```python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
```

```python
# Global plot styling
plt.rcParams.update({
    "font.family": "monospace",
    "axes.spines.top": False,
    "axes.spines.right": False,
    "figure.facecolor": "white",
    "axes.facecolor": "#f7f7f7",
    "axes.grid": True,
    "grid.color": "#e0e0e0",
    "grid.linewidth": 0.6,
})

# Color theme shared by all figures
T = {
    "bg": "white",
    "panel": "#f7f7f7",
    "sig": "#e05c5c",
    "relu": "#3a7bd5",
    "c0": "#f4a261",
    "c1": "#2a9d8f",
    "text": "#1a1a1a",
    "muted": "#666666",
}
```

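Creating the dataset

To study the effect of activation functions in a controlled setting, we first generate a synthetic dataset using scikit-learn's make_moons. This creates a non-linear, two-class problem that simple linear boundaries cannot solve, which makes it well suited to testing how well neural networks learn complex decision surfaces.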
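As a quick check on the claim that a linear boundary is not enough here, one can fit a purely linear classifier to the same data. This is an illustrative sketch rather than part of the original experiment: `LogisticRegression` is an assumed stand-in for "a simple linear boundary", and the dataset parameters are copied from the article's code.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same dataset settings as in the article
X, y = make_moons(n_samples=400, noise=0.18, random_state=42)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# A single straight-line decision boundary (assumed baseline, not in the article)
linear_clf = LogisticRegression().fit(X_train, y_train)
linear_acc = linear_clf.score(X_test, y_test)

print(f"linear baseline test accuracy: {linear_acc:.3f}")
```

Because the two moons interleave, a straight line is expected to misclassify part of each class, so this baseline should fall short of perfect accuracy.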

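To make the task more realistic, we add a small amount of noise, then standardize the features with StandardScaler so both dimensions are on the same scale, which helps keep training stable. The dataset is then split into training and test sets to evaluate generalization. Finally, we visualize the data distribution.

This plot serves as the baseline geometry that both the Sigmoid and ReLU networks will attempt to model, which later lets us compare how each activation function transforms this space from layer to layer.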

```python
# Generate the two-moons data and standardize both features
X, y = make_moons(n_samples=400, noise=0.18, random_state=42)
X = StandardScaler().fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

fig, ax = plt.subplots(figsize=(7, 5))
fig.patch.set_facecolor(T["bg"])
ax.set_facecolor(T["panel"])

# One scatter call per class so each gets its own color and legend entry
ax.scatter(X[y == 0, 0], X[y == 0, 1], c=T["c0"], s=40,
           edgecolors="white", linewidths=0.5, label="Class 0", alpha=0.9)
ax.scatter(X[y == 1, 0], X[y == 1, 1], c=T["c1"], s=40,
           edgecolors="white", linewidths=0.5, label="Class 1", alpha=0.9)

ax.set_title("make_moons -- our dataset", color=T["text"], fontsize=13)
ax.set_xlabel("x₁", color=T["muted"])
ax.set_ylabel("x₂", color=T["muted"])
ax.tick_params(colors=T["muted"])
ax.legend(fontsize=10)

plt.tight_layout()
plt.savefig("moons_dataset.png", dpi=140, bbox_inches="tight")
plt.show()
```
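To confirm what the preprocessing does, the short sketch below recomputes the per-feature statistics before and after scaling and reports the split sizes. It reuses the article's parameters. The name `X_raw` is introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X_raw is a hypothetical name for the unscaled data
X_raw, y = make_moons(n_samples=400, noise=0.18, random_state=42)
X = StandardScaler().fit_transform(X_raw)

print("raw    mean/std:", X_raw.mean(axis=0).round(3), X_raw.std(axis=0).round(3))
print("scaled mean/std:", X.mean(axis=0).round(3), X.std(axis=0).round(3))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print("train/test sizes:", X_train.shape[0], X_test.shape[0])
```

After scaling, each feature has mean 0 and standard deviation 1 up to floating-point error, and the 400 samples split into 300 for training and 100 for testing.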

Creating the Network

Next, we implement a small, controlled neural network to isolate the effect of activation functions. The goal here is not to build a highly optimized model, but Sigmoid and Re

Original Text (English)

A deep neural network can be understood as a geometric system, where each layer reshapes the input space to form increasingly complex decision boundaries. For this to work effectively, layers must preserve meaningful spatial information, particularly how far a data point lies from these boundaries, since this distance enables deeper layers to build rich, non-linear representations.

Sigmoid disrupts this process by compressing all inputs into a narrow range between 0 and 1. As values move away from decision boundaries, they become indistinguishable, causing a loss of geometric context across layers. This leads to weaker representations and limits the effectiveness of depth.

ReLU, on the other hand, preserves magnitude for positive inputs, allowing distance information to flow through the network. This enables deeper models to remain expressive without requiring excessive width or compute.

In this article, we focus on this forward-pass behavior: analyzing how Sigmoid and ReLU differ in signal propagation and representation geometry using a two-moons experiment, and what that means for inference efficiency and scalability.
Setting up the dependencies

```python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
```

```python
# Global plot styling
plt.rcParams.update({
    "font.family": "monospace",
    "axes.spines.top": False,
    "axes.spines.right": False,
    "figure.facecolor": "white",
    "axes.facecolor": "#f7f7f7",
    "axes.grid": True,
    "grid.color": "#e0e0e0",
    "grid.linewidth": 0.6,
})

# Color theme shared by all figures
T = {
    "bg": "white",
    "panel": "#f7f7f7",
    "sig": "#e05c5c",
    "relu": "#3a7bd5",
    "c0": "#f4a261",
    "c1": "#2a9d8f",
    "text": "#1a1a1a",
    "muted": "#666666",
}
```

Creating the dataset

To study the effect of activation functions in a controlled setting, we first generate a synthetic dataset using scikit-learn's make_moons. This creates a non-linear, two-class problem where simple linear boundaries fail, making it ideal for testing how well neural networks learn complex decision surfaces.

We add a small amount of noise to make the task more realistic, then standardize the features using StandardScaler so both dimensions are on the same scale, which helps keep training stable. The dataset is then split into training and test sets to evaluate generalization. Finally, we visualize the data distribution.

This plot serves as the baseline geometry that both Sigmoid and ReLU networks will attempt to model, allowing us to later compare how each activation function transforms this space across layers.
```python
# Generate the two-moons data and standardize both features
X, y = make_moons(n_samples=400, noise=0.18, random_state=42)
X = StandardScaler().fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

fig, ax = plt.subplots(figsize=(7, 5))
fig.patch.set_facecolor(T["bg"])
ax.set_facecolor(T["panel"])

# One scatter call per class so each gets its own color and legend entry
ax.scatter(X[y == 0, 0], X[y == 0, 1], c=T["c0"], s=40,
           edgecolors="white", linewidths=0.5, label="Class 0", alpha=0.9)
ax.scatter(X[y == 1, 0], X[y == 1, 1], c=T["c1"], s=40,
           edgecolors="white", linewidths=0.5, label="Class 1", alpha=0.9)

ax.set_title("make_moons -- our dataset", color=T["text"], fontsize=13)
ax.set_xlabel("x₁", color=T["muted"])
ax.set_ylabel("x₂", color=T["muted"])
ax.tick_params(colors=T["muted"])
ax.legend(fontsize=10)

plt.tight_layout()
plt.savefig("moons_dataset.png", dpi=140, bbox_inches="tight")
plt.show()
```

Creating the Network

Next, we implement a small, controlled neural network to isolate the effect of activation functions. The goal here is not to build a highly optimized model, but to create a clean experimental setup where Sigmoid and ReLU can be compared under identical conditions.

We define both activation functions (Sigmoid and ReLU) along with their derivatives, and use binary cross-entropy as the loss since this is a binary classification task. The TwoLayerNet class represents a simple 3-layer feedforward network (2 hidden layers + output), where the only configurable component is the activation function.

A key detail is the initialization strategy: we use He initialization for ReLU and Xavier initialization for Sigmoid, ensuring that each network starts in a fair and stable regime based on its activation dynamics. The forward pass computes activations layer by layer, while the backward pass performs standard gradient descent updates.
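The initialization rule described above can be made concrete with a few numbers. This sketch is an illustration, not the article's class: `init_scale` is a hypothetical helper that applies sqrt(2 / fan_in) for ReLU and sqrt(1 / fan_in) for Sigmoid, and the layer shapes (2→8, 8→8, 8→1) are assumed to match the small network discussed here.

```python
import numpy as np

def init_scale(fan_in, activation):
    """Std-dev multiplier for a layer's weights (hypothetical helper)."""
    if activation == "relu":
        return np.sqrt(2.0 / fan_in)   # He-style scaling
    return np.sqrt(1.0 / fan_in)       # scaling used for the Sigmoid network

rng = np.random.default_rng(0)  # assumption: any seed works for this illustration

for fan_in, fan_out in [(2, 8), (8, 8), (8, 1)]:
    for act in ("relu", "sigmoid"):
        s = init_scale(fan_in, act)
        W = rng.standard_normal((fan_in, fan_out)) * s
        print(f"{act:7s} fan_in={fan_in}: scale={s:.3f}, W shape={W.shape}")
```

For the same layer, the ReLU scale is larger by a factor of sqrt(2). This factor compensates for ReLU zeroing out negative pre-activations, which would otherwise shrink the signal at each layer.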
Importantly, we also include diagnostic methods like get_hidden and get_z_trace, which allow us to inspect how signals evolve across layers; this is crucial for analyzing how much geometric information is preserved or lost.

By keeping architecture, data, and training setup constant, this implementation ensures that any difference in performance or internal representations can be directly attributed to the activation function itself, setting the stage for a clear comparison of their impact on signal propagation and expressiveness.

```python
def sigmoid(z):
    # Clip to avoid overflow in exp for extreme inputs
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def sigmoid_d(a):
    # Derivative expressed in terms of the activation a = sigmoid(z)
    return a * (1 - a)

def relu(z):
    return np.maximum(0, z)

def relu_d(z):
    # Derivative expressed in terms of the pre-activation z
    return (z > 0).astype(float)

def bce(y, yhat):
    return -np.mean(y * np.log(yhat + 1e-9) + (1 - y) * np.log(1 - yhat + 1e-9))


class TwoLayerNet:
    def __init__(self, activation="relu", seed=0):
        np.random.seed(seed)
        self.act_name = activation
        self.act = relu if activation == "relu" else sigmoid
        self.dact = relu_d if activation == "relu" else sigmoid_d

        # He init for ReLU, Xavier for Sigmoid
        scale = lambda fan_in: np.sqrt(2 / fan_in) if activation == "relu" else np.sqrt(1 / fan_in)
        self.W1 = np.random.randn(2, 8) * scale(2)
        self.b1 = np.zeros((1, 8))
        self.W2 = np.random.randn(8, 8) * scale(8)
        self.b2 = np.zeros((1, 8))
        self.W3 = np.random.randn(8, 1) * scale(8)
        self.b3 = np.zeros((1, 1))
        self.loss_history = []

    def forward(self, X, store=False):
        z1 = X @ self.W1 + self.b1; a1 = self.act(z1)
        z2 = a1 @ self.W2 + self.b2; a2 = self.act(z2)
        z3 = a2 @ self.W3 + self.b3; out = sigmoid(z3)
        if store:
            self._cache = (X, z1, a1, z2, a2, z3, out)
        return out

    def backward(self, lr=0.05):
        X, z1, a1, z2, a2, z3, out = self._cache
        n = X.shape[0]
        # Gradient of BCE with a sigmoid output layer
        dout = (out - self.y_cache) / n
        dW3 = a2.T @ dout; db3 = dout.sum(axis=0, keepdims=True)
        da2 = dout @ self.W3.T
        # relu_d takes z, sigmoid_d takes a
        dz2 = da2 * (self.dact(z2) if self.act_name == "relu" else self.dact(a2))
        dW2 = a1.T @ dz2; db2 = dz2.sum(axis=0, keepdims=True)
        da1 = dz2 @ self.W2.T
        dz1 = da1 * (self.dact(z1) if self.act_name == "relu" else self.dact(a1))
        dW1 = X.T @ dz1; db1 = dz1.sum(axis=0, keepdims=True)
        # In-place updates so the arrays stored on self are modified
        for p, g in [(self.W3, dW3), (self.b3, db3), (self.W2, dW2),
                     (self.b2, db2), (self.W1, dW1), (self.b1, db1)]:
            p -= lr * g

    def train_step(self, X, y, lr=0.05):
        self.y_cache = y.reshape(-1, 1)
        out = self.forward(X, store=True)
        loss = bce(self.y_cache, out)
        self.backward(lr)
        return loss

    def get_hidden(self, X, layer=1):
        """Return post-activation values for layer 1 or 2."""
        z1 = X @ self.W1 + self.b1; a1 = self.act(z1)
        if layer == 1:
            return a1
        z2 = a1 @ self.W2 + self.b2
        return self.act(z2)

    def get_z_trace(self, x_single):
        """Return mean absolute pre- and post-activation values per layer for ONE sample."""
        z1 = x_single @ self.W1 + self.b1
        a1 = self.act(z1)
        z2 = a1 @ self.W2 + self.b2
        a2 = self.act(z2)
        z3 = a2 @ self.W3 + self.b3
        return [np.abs(z1).mean(), np.abs(a1).mean(),
                np.abs(z2).mean(), np.abs(a2).mean(),
                np.abs(z3).mean()]
```

Training the Networks

Now we train both networks under identical conditions to ensure a fair comparison. We initialize two models, one using Sigmoid and the other using ReLU, with the same random seed so they start from equivalent weight configurations. The training loop runs for 800 epochs using mini-batch gr

Machine Learning, Activation Functions, Deep Learning, Neural Network Architecture