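2026-04-26 · Build Log #17

Building alerts with CloudWatch Alarms + SNS: operations that notice when something breaks

The scariest state for personally operated infrastructure is "broken, and nobody has noticed." This post records how I used CloudWatch Alarms + SNS to build a setup that notices when something breaks.

Choosing what to monitor

Monitoring everything drives up both cost and notification noise, so I limit alarms to the core resources that would hurt if they broke: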

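Resource	Metric	Threshold	Meaning
Lambda	Errors	More than 1 in 5 minutes	The function is failing
Lambda	Duration p95	Over 3 seconds	Latency has degraded
API Gateway	5xx	More than 5 in 5 minutes	Repeated server-side errors
CloudFront	4xxErrorRate	Over 5% (sustained for 10 minutes)	Abnormal client error rate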

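4xx responses point to things like malformed user requests, attacks, or broken links. They can never be driven to zero, but a sudden jump above 5% means something is wrong.

The observability module in this series

Resources this module creates:

aws_sns_topic: the hub for alarm notifications
aws_sns_topic_subscription: email subscription (optional)
Lambda Errors alarm
Lambda Duration p95 alarm
API Gateway 5xx alarm
AWS Budgets: alert when the monthly budget is exceeded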
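
The snippets in this post reference several input variables (var.name_prefix, var.tags, var.alarm_email, var.lambda_function_name, var.api_id, var.monthly_budget_usd). The post does not show their declarations, so the following is only a sketch of what the module's variables could look like. The types, defaults, and descriptions are assumptions.

```hcl
# Sketch of the module inputs referenced in this post.
# Types, defaults, and descriptions are assumptions, not the original file.

variable "name_prefix" {
  type        = string
  description = "Prefix for resource names, e.g. a project-environment string"
}

variable "tags" {
  type        = map(string)
  default     = {}
  description = "Tags applied to taggable resources"
}

variable "alarm_email" {
  type        = string
  default     = "" # empty string disables the email subscription and the budget
  description = "Email address that receives alarm and budget notifications"
}

variable "lambda_function_name" {
  type        = string
  description = "Name of the Lambda function to monitor"
}

variable "api_id" {
  type        = string
  description = "ID of the API Gateway HTTP API to monitor"
}

variable "monthly_budget_usd" {
  type        = number
  default     = 10 # assumed default; the post only says the series uses 10
  description = "Monthly cost budget in USD"
}
```

Defaulting alarm_email to an empty string matches the count expressions used in this post, which skip the resource when no address is given.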

The CloudFront 4xx alarm has a separate trap (it must be created in us-east-1), so it is covered in the next post.

SNS Topic: the notification hub

resource "aws_sns_topic" "alarms" {
  name = "${var.name_prefix}-alarms"
  tags = var.tags
}

resource "aws_sns_topic_subscription" "email" {
  count     = var.alarm_email == "" ? 0 : 1     # only when an email is set
  topic_arn = aws_sns_topic.alarms.arn
  protocol  = "email"
  endpoint  = var.alarm_email
}

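The design is to create an SNS Topic and subscribe an email address to it. Because SNS acts as a relay hub, the destination can later be switched to, or extended with, Slack or a Lambda function (for chatbot integration).

When an email address is subscribed to an SNS Topic, AWS sends a "Confirm subscription" email. No notifications are delivered until the link in that email is clicked.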
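
To illustrate the "relay hub" point, here is a minimal sketch of adding a Lambda function as a second subscriber on the same topic. The variable notifier_lambda_arn and both resource names are hypothetical and not part of the original module. The function itself (for example one that forwards messages to a chat tool) is assumed to exist elsewhere.

```hcl
# Hypothetical input: ARN of a forwarding Lambda that already exists.
variable "notifier_lambda_arn" {
  type        = string
  default     = "" # empty string disables the Lambda subscription
  description = "ARN of a Lambda function that forwards alarm messages (assumption)"
}

# Subscribe the Lambda to the same alarms topic used for email.
resource "aws_sns_topic_subscription" "notifier_lambda" {
  count     = var.notifier_lambda_arn == "" ? 0 : 1
  topic_arn = aws_sns_topic.alarms.arn
  protocol  = "lambda"
  endpoint  = var.notifier_lambda_arn
}

# SNS needs explicit permission to invoke the function.
resource "aws_lambda_permission" "allow_alarms_topic" {
  count         = var.notifier_lambda_arn == "" ? 0 : 1
  statement_id  = "AllowInvokeFromAlarmsTopic"
  action        = "lambda:InvokeFunction"
  function_name = var.notifier_lambda_arn
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.alarms.arn
}
```

The alarms themselves do not change: they keep publishing to aws_sns_topic.alarms, and the topic fans the message out to every subscriber.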

Lambda Errors alarm

resource "aws_cloudwatch_metric_alarm" "lambda_errors" {
  alarm_name          = "${var.name_prefix}-lambda-errors"
  alarm_description   = "Contact Lambda errors > 1 over 5 min"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1                      # a single breaching period fires the alarm
  threshold           = 1
  period              = 300                    # 5 minutes
  statistic           = "Sum"                  # total over the 5 minutes
  metric_name         = "Errors"
  namespace           = "AWS/Lambda"
  treat_missing_data  = "notBreaching"         # treat missing data as healthy

  dimensions = {
    FunctionName = var.lambda_function_name
  }

  alarm_actions = [aws_sns_topic.alarms.arn]   # when the alarm fires
  ok_actions    = [aws_sns_topic.alarms.arn]   # on recovery
}

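What the key parameters mean:

Parameter	Meaning
`comparison_operator`	Comparison against the threshold. `GreaterThanThreshold` means strictly above the threshold
`evaluation_periods`	How many consecutive periods must breach before the alarm fires. 1 fires on a single period; 2 requires two consecutive periods
`period`	Length of one period in seconds. 300 = 5 minutes
`statistic`	Aggregation method: `Sum` / `Average` / `Maximum` and so on (percentiles such as `p95` go in `extended_statistic` instead)
`treat_missing_data`	How missing data is handled. `notBreaching` treats it as healthy (recommended)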
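
As a variation on evaluation_periods, the aws_cloudwatch_metric_alarm resource also accepts datapoints_to_alarm, which turns "N consecutive periods" into "M out of the last N periods". The sketch below is not part of the original module. The resource name and the 2-of-3 setting are illustrative assumptions.

```hcl
# Illustrative variant: fire when 2 of the last 3 five-minute periods breach.
# Resource name and the 2-of-3 choice are assumptions for this sketch.
resource "aws_cloudwatch_metric_alarm" "lambda_errors_2_of_3" {
  alarm_name          = "${var.name_prefix}-lambda-errors-2of3"
  alarm_description   = "Lambda errors > 1 in 2 of the last 3 periods"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3 # look at the last 3 periods
  datapoints_to_alarm = 2 # alarm if at least 2 of them breach
  threshold           = 1
  period              = 300
  statistic           = "Sum"
  metric_name         = "Errors"
  namespace           = "AWS/Lambda"
  treat_missing_data  = "notBreaching"

  dimensions = {
    FunctionName = var.lambda_function_name
  }

  alarm_actions = [aws_sns_topic.alarms.arn]
}
```

This trades a slower first notification for fewer alarms on isolated spikes, which may or may not suit a low-traffic function.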

Lambda Duration p95 alarm

resource "aws_cloudwatch_metric_alarm" "lambda_duration" {
  alarm_name          = "${var.name_prefix}-lambda-duration-p95"
  alarm_description   = "Contact Lambda p95 duration > 3s"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2                       # fires after 2 consecutive periods
  threshold           = 3000                    # 3 seconds (in milliseconds)
  period              = 300
  extended_statistic  = "p95"                   # extended_statistic, not statistic
  metric_name         = "Duration"
  namespace           = "AWS/Lambda"
  treat_missing_data  = "notBreaching"

  dimensions = {
    FunctionName = var.lambda_function_name
  }

  alarm_actions = [aws_sns_topic.alarms.arn]
}

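Percentiles such as p95 go in extended_statistic, not statistic. This follows the Terraform / CloudWatch API, where the two are mutually exclusive.

The 3-second threshold is the bar for "clearly slow." The Lambda in this series normally runs at around 280 ms (warm) / 1.3 s (cold start), so exceeding 3 seconds is abnormal.

API Gateway 5xx alarm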

resource "aws_cloudwatch_metric_alarm" "apigw_5xx" {
  alarm_name          = "${var.name_prefix}-apigw-5xx"
  alarm_description   = "API Gateway 5xx errors > 5 over 5 min"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  threshold           = 5                       # more than 5 in 5 minutes
  period              = 300
  statistic           = "Sum"
  metric_name         = "5xx"
  namespace           = "AWS/ApiGateway"
  treat_missing_data  = "notBreaching"

  dimensions = {
    ApiId = var.api_id
  }

  alarm_actions = [aws_sns_topic.alarms.arn]
}

For HTTP APIs the metric name is 5xx (for REST APIs it is 5XXError). Both use the AWS/ApiGateway namespace.
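
For comparison, here is a sketch of how the same alarm could look for a REST API. Besides the metric name, REST API metrics are keyed by the ApiName dimension rather than ApiId. The variable rest_api_name and the resource name are hypothetical and not part of the original module.

```hcl
# Hypothetical input for the REST API variant (not in the original module).
variable "rest_api_name" {
  type        = string
  description = "Name of the API Gateway REST API to monitor (assumption)"
}

resource "aws_cloudwatch_metric_alarm" "rest_apigw_5xx" {
  alarm_name          = "${var.name_prefix}-rest-apigw-5xx"
  alarm_description   = "REST API 5XXError > 5 over 5 min"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  threshold           = 5
  period              = 300
  statistic           = "Sum"
  metric_name         = "5XXError" # REST API name; HTTP APIs use "5xx"
  namespace           = "AWS/ApiGateway"
  treat_missing_data  = "notBreaching"

  dimensions = {
    ApiName = var.rest_api_name # REST APIs are identified by name, not ApiId
  }

  alarm_actions = [aws_sns_topic.alarms.arn]
}
```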

AWS Budgets: monthly budget alerts

resource "aws_budgets_budget" "monthly" {
  count = var.alarm_email == "" ? 0 : 1

  name              = "${var.name_prefix}-monthly"
  budget_type       = "COST"
  limit_amount      = tostring(var.monthly_budget_usd)
  limit_unit        = "USD"
  time_unit         = "MONTHLY"
  time_period_start = "2026-04-01_00:00"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 50
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.alarm_email]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.alarm_email]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"     # forecast exceeds 80%
    subscriber_email_addresses = [var.alarm_email]
  }
}

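Notifications fire at three stages:

50% (actual): "half the budget is gone by mid-month," so watch the pace in the second half
100% (actual): over budget; act immediately
80% (forecast): AWS's machine-learning forecast says spend is on track to pass 80%, which gives early detection

This series uses monthly_budget_usd = 10 (about 1,500 yen), which is plenty for a personal portfolio.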
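
The three notification blocks differ only in threshold and type, so they could also be generated from a list with a dynamic block. This is an alternative way to write the same budget resource, not something to add alongside it. The local name budget_notifications is an assumption.

```hcl
# Alternative sketch of the same budget using a dynamic block.
# The local value name is an assumption; the settings mirror the
# three notification blocks shown in this post.
locals {
  budget_notifications = [
    { threshold = 50, type = "ACTUAL" },
    { threshold = 100, type = "ACTUAL" },
    { threshold = 80, type = "FORECASTED" },
  ]
}

resource "aws_budgets_budget" "monthly" {
  count = var.alarm_email == "" ? 0 : 1

  name              = "${var.name_prefix}-monthly"
  budget_type       = "COST"
  limit_amount      = tostring(var.monthly_budget_usd)
  limit_unit        = "USD"
  time_unit         = "MONTHLY"
  time_period_start = "2026-04-01_00:00"

  # One notification block is rendered per list entry.
  dynamic "notification" {
    for_each = local.budget_notifications
    content {
      comparison_operator        = "GREATER_THAN"
      threshold                  = notification.value.threshold
      threshold_type             = "PERCENTAGE"
      notification_type          = notification.value.type
      subscriber_email_addresses = [var.alarm_email]
    }
  }
}
```

With the default budget of 10 USD, these percentages correspond to 5 USD (actual), 10 USD (actual), and a forecast of 8 USD.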

The email sent when an alarm fires

Example (when the Lambda Errors alarm fires):

From: AWS Notifications <no-reply@sns.amazonaws.com>
Subject: ALARM: "iigtn-lab-prod-lambda-errors" in Asia Pacific (Tokyo)

You are receiving this email because your Amazon CloudWatch Alarm
"iigtn-lab-prod-lambda-errors" in the Asia Pacific (Tokyo) region has
entered the ALARM state, because "Threshold Crossed: 1 datapoint
[2.0 (...)] was greater than the threshold (1.0)."

State Change Time: 2026-04-26T14:23:10.000+0000
Reason for State Change: ...

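Because the mail also reaches the mail app on my phone, I can notice it even when I am out.

A missed alert in this series (hypothetical reconstruction)

A hypothetical case from my operations notes: "The Lambda Errors alarm fired at 03:42 JST and the email notification arrived, but phone notifications were off, so I did not notice until 8:30 the next morning."

Countermeasures:

Route only CRITICAL alarms to Slack or LINE as well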
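For the overnight hours, use an audible channel so the alert wakes me up (my notes say LINE Notify, but that service ended on March 31, 2025, so today this would mean the LINE Messaging API or a similar channel)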
Assume that email alone leaves time windows when nobody is looking
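
One way the "CRITICAL only" split could be expressed in Terraform is a second topic plus two action lists that alarms pick from. The topic name, the local names, and the decision about which alarms count as critical are all assumptions in this sketch. The subscriber on the critical topic (Slack, LINE, or anything audible) is left out.

```hcl
# Hypothetical second topic reserved for alarms that should wake someone up.
resource "aws_sns_topic" "critical" {
  name = "${var.name_prefix}-alarms-critical"
  tags = var.tags
}

locals {
  # Email only: suitable for alarms that can wait until morning.
  default_actions = [aws_sns_topic.alarms.arn]

  # Email plus the critical topic: for alarms that need immediate attention.
  critical_actions = [
    aws_sns_topic.alarms.arn,
    aws_sns_topic.critical.arn,
  ]
}
```

An alarm such as lambda_errors would then set alarm_actions = local.critical_actions, while lower-priority alarms keep local.default_actions.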

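This operational design is what makes "notice when something breaks" actually work. Designing the notification path gets overlooked more often than the technology itself.

Verifying it works

To fire an alarm deliberately as a test:

# Manually force the alarm into the ALARM state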
aws cloudwatch set-alarm-state \
  --alarm-name iigtn-lab-prod-lambda-errors \
  --state-value ALARM \
  --state-reason "Manual test"

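If the notification email arrives, the flow through SNS is confirmed to work. The forced state is temporary (CloudWatch overwrites it at the next evaluation), but you can also set it back to OK explicitly. Because this alarm has ok_actions set, the return to OK should produce a recovery email as well: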

aws cloudwatch set-alarm-state \
  --alarm-name iigtn-lab-prod-lambda-errors \
  --state-value OK \
  --state-reason "Test complete"
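
The commands above hard-code the alarm name. As a small convenience, the module could expose the name and the topic ARN as outputs. The output names below are assumptions, and if this is a child module the root module would need to re-export them.

```hcl
# Hypothetical outputs so test commands do not need hard-coded names.
output "lambda_errors_alarm_name" {
  description = "Name of the Lambda Errors alarm, for manual state tests"
  value       = aws_cloudwatch_metric_alarm.lambda_errors.alarm_name
}

output "alarms_topic_arn" {
  description = "ARN of the SNS topic that receives alarm notifications"
  value       = aws_sns_topic.alarms.arn
}
```

The value can then be read with `terraform output -raw lambda_errors_alarm_name` and passed to --alarm-name.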

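Monitoring is mostly in place, but the CloudFront 4xx alarm comes with its own peculiar trap: it only works in us-east-1. The next post covers that pitfall, along with the finer points of the AWS Budgets configuration.

📚 Glossary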

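CloudWatch Alarm: A mechanism that sets a threshold on an AWS metric and sends a notification when it is exceeded. Created with the aws_cloudwatch_metric_alarm resource.
SNS Topic: AWS's messaging hub. Messages published to a topic are delivered to multiple subscribers (Email / Lambda / SQS / HTTPS endpoint, etc.).
SNS Subscription: A delivery target attached to an SNS Topic. An email subscription does not become active until the confirmation link is clicked.
metric: Numeric time-series data for a monitored resource. Each service defines its own, such as Errors / Duration / Latency / Throttles.
namespace (CloudWatch): A grouping of metrics, such as AWS/Lambda / AWS/ApiGateway / AWS/CloudFront.
dimensions: Identifiers that pin down a specific metric, such as FunctionName / ApiId / DistributionId.
statistic: How a metric is aggregated. Sum / Average / Maximum / Minimum / SampleCount.
extended_statistic: Percentile statistics such as p95 / p99. Mutually exclusive with statistic.
p95 / p99: Percentiles. p95 means 95% of all data points are at or below this value. It gives a practical representative value rather than the worst case.
evaluation_periods: The number of consecutive periods the threshold must be breached before the alarm fires. Often set to 2 or 3 so that momentary spikes do not trigger the alarm.
period: Length of one metric aggregation period in seconds. 60 s / 300 s (5 minutes) are common for Lambda and similar services.
threshold: The value at which the alarm fires. Used together with a comparison operator (GreaterThanThreshold, etc.).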
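treat_missing_data: How periods with no metric data are handled. notBreaching (treated as healthy) / breaching (treated as a breach) / missing (the default; missing points are not counted, and the alarm moves to INSUFFICIENT_DATA when all points are missing) / ignore (the current alarm state is kept).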
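alarm_actions: List of ARNs (SNS Topics, etc.) to invoke when the alarm enters the ALARM state.
ok_actions: List of ARNs to invoke when the alarm recovers to the OK state. Used for recovery notifications.
insufficient_data_actions: Notification targets for when the alarm enters the INSUFFICIENT_DATA state. Rarely used.
AWS Budgets: The standard AWS service for setting a monthly budget and alert thresholds. Notifying at 50% / 80% / 100% is the usual pattern.
FORECASTED: The projected month-end amount as predicted by AWS using machine learning. It can detect anomalies earlier than waiting for actual spend.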
notBreaching: The treat_missing_data value meaning "treat missing data as healthy." Commonly used for alarms on low-traffic resources.