Skip to main content

Command Palette

Search for a command to run...

The Night Logging Saved a Production System — and Why print() Is Not Enough

Updated
12 min read
The Night Logging Saved a Production System — and Why print() Is Not Enough
R

Full-stack engineer | Python • PHP • React • Databases

At 2:07 AM, the payment system started failing.

Not with a crash. Not with a clear error. Just… silently dropping transactions.

At first, it was a trickle. Then retries. Then support tickets. Then alerts.

Within minutes, revenue was bleeding.

The system handled thousands of transactions per minute. Everything looked fine. CPU was normal. Memory was stable. No obvious outage. But something was wrong — and here's what made it worse: there were logs, but they were completely useless.


What Bad Logging Looks Like in Production

When the team opened the logs, this is what they found:

INFO Processing request
ERROR Something failed
INFO Retrying

That's it. No request ID. No user context. No indication of which service failed or which worker was affected. Debugging at that point wasn't engineering — it was guessing.


What Good Logging Would Have Looked Like

Now imagine the same incident with structured, intentional logging:

2026-06-12T02:07:03Z | INFO     | payment-service | pid=12345 | req=abc123 | payment_initiated   | user_id=7821 amount=500
2026-06-12T02:07:04Z | INFO     | payment-service | pid=12345 | req=abc123 | calling_gateway     | provider=stripe
2026-06-12T02:07:07Z | ERROR    | payment-service | pid=12345 | req=abc123 | gateway_timeout     | duration=3s
2026-06-12T02:07:07Z | WARNING  | payment-service | pid=12345 | req=abc123 | retry_scheduled     | attempt=1
2026-06-12T02:07:10Z | CRITICAL | payment-service | pid=12345 | req=abc123 | payment_failed      | reason=timeout

Now the story tells itself. A specific request failed at the gateway after a 3-second timeout. Retry logic kicked in but couldn't recover. Final failure recorded with full context. No guessing — just facts.


Logging Is Not Debugging. It's Reconstruction.

This is the mindset shift that separates junior and senior engineers.

Logging is not about printing messages. Logging is about reconstructing reality after things go wrong.

When systems scale — multiple services, async flows, retries, distributed workers — you can't step through production code with a debugger. By the time you're investigating, the process that failed is long gone. Logs become your only timeline of truth. They are the flight recorder of your application.


What to Log (And Why It Matters)

Good logs aren't random. They're intentional. Here's a practical framework.

1. Log the lifecycle of important flows

Every meaningful operation should leave a trail:

  • request received

  • validation result

  • external call made

  • response received

  • retry attempted

  • success or failure

This creates a narrative for every request — a story you can follow from start to finish even days later.

2. Log failures with context

Bad:

logger.error("Payment failed")

Good:

logger.error(
    "payment_failed | request_id=%s user_id=%s amount=%s error=%s",
    request_id, user_id, amount, error
)

The second version answers who, what, how much, and why — the four questions you'll always ask during an incident.

3. Log external dependencies

The majority of real production failures originate outside your code — in APIs, databases, or message queues. Always log the service name, latency, response code, and retry behaviour for every external call. If a third-party API starts degrading, this is where you'll see it first.

4. Log state transitions

For business workflows, log every state change:

order_created → payment_pending → payment_failed

Without these breadcrumbs, debugging business logic issues — the kind where the code technically worked but produced the wrong outcome — becomes extremely painful.


What NOT to Log

Equally important is knowing what to leave out:

  • passwords and auth tokens

  • sensitive personal data (card numbers, SSNs)

  • raw unfiltered request payloads

  • high-frequency noise that adds volume but no signal

A log entry should earn its place. If it doesn't help you debug a real problem, it's just noise that makes the real signals harder to find.


String Formatting: A Hidden Performance Detail

This is where many developers unknowingly hurt performance in high-throughput systems.

The f-string trap

logger.debug(f"User data: {get_user_data()}")

Even when DEBUG level is disabled and this line will never write anything, get_user_data() still gets called and evaluated. In a tight loop or a hot code path, this adds up.

Why logging uses % formatting

logger.debug("User data: %s", get_user_data())

When you pass arguments separately, Python's logging module uses lazy evaluation — it only formats the string if the message will actually be written. If the level is disabled, the function never runs. For debug-heavy code, this can make a measurable difference.

What about t-strings?

Python has two things that could be called "t-strings" and it's worth distinguishing them.

The older one is string.Template — it's been in the standard library since Python 2.4:

from string import Template
Template("User \(user triggered \)event").substitute(user=user, event=event)

It's safer than % or f-strings when handling untrusted input because it doesn't execute arbitrary expressions, only simple variable substitution. Rarely seen in logging pipelines in practice.

The newer one is Python 3.14's t"" literal — a genuine language-level feature that produces a Template object rather than a string, giving you access to the raw parts before interpolation happens:

# Python 3.14+
entry = t"payment_failed | user={user_id} amount={amount}"

This is genuinely interesting for structured logging because you could intercept the parts, sanitise values, or build a JSON payload — without the string ever being fully rendered as plain text. Still very new; ecosystem support is early. Worth watching, but not something you'd reach for today.

Rule of thumb: Use f-strings everywhere in your normal code. Use %s arguments inside logger calls. Keep an eye on t"" as structured logging tooling catches up.


Log Levels — Think in Incidents, Not Categories

Log levels are not just labels. They define how urgently your system is communicating with you.

  • DEBUG — granular internal detail, only useful during active development

  • INFO — the normal, expected flow of the application

  • WARNING — something unexpected happened, but the system recovered or continued

  • ERROR — a specific operation failed and needs attention

  • CRITICAL — the system itself is in danger; immediate action required

A common mistake is using ERROR for everything unexpected. If everything is an error, nothing is. Reserve CRITICAL for situations that genuinely threaten system integrity — and make sure those alerts actually wake someone up.


A Practical Logging Format

Consistency matters more than any particular style. A format that works well across services:

<timestamp> | <level> | <service> | pid=<pid> | req=<request_id> | <event> | key=value pairs

Real example:

2026-06-12T02:07:03Z | ERROR | payment-service | pid=12345 | req=abc123 | payment_failed | user_id=7821 amount=500

This format is scannable by humans and parseable by machines — important when you eventually pipe logs into a centralised system.


Why PID Matters in Multiprocessing Systems

In systems running multiple workers — Gunicorn, Celery, Python multiprocessing — logs from different processes interleave in the same output stream. Without process IDs, this happens:

INFO Processing request
INFO Processing request
ERROR Something failed

Which worker failed? No idea. With PIDs:

INFO  pid=12345 Processing request abc123
INFO  pid=12346 Processing request def456
ERROR pid=12346 Payment failed for request def456

Now you can instantly isolate which worker had the problem, whether failures are isolated to one worker or spreading across all of them, and whether a particular worker is consistently failing.

PID vs Request ID — these answer different questions. PID tells you which process handled something. Request ID tells you which flow you're following. In distributed systems, you need both.


Local vs Remote Logging

Scenario                  Recommended approach
─────────────────────────────────────────────
Small / single app        Local logging
Growing app               Hybrid (local + remote)
Distributed system        Remote / centralised

Local logging — logs written to files on the same machine. Simple to set up, fast, no infrastructure required. But logs are lost if the machine crashes, hard to search across multiple instances, and don't scale.

Remote / centralised logging — logs shipped to a dedicated system like ELK, Loki, or a managed service. Searchable, unified across services, supports alerting and dashboards. Adds setup complexity and cost, but essentially mandatory once you're running more than one service.

For growing applications, a sensible middle ground is to log locally in development and early production, then add a remote sink once you have multiple services or need cross-service correlation.


Monolith vs Distributed Systems

Monolith — file-based logging is perfectly fine. PIDs help. Log levels and a consistent format are usually sufficient to debug most issues.

Distributed system — the rules change. You must have request IDs that propagate across service boundaries (often called correlation IDs). Logs should be structured (JSON is preferred over plain text because it's machine-parseable). A centralised logging system isn't optional. Every log entry must include the service name or you'll have no idea where it originated.

Without these in a distributed system, logs from ten services tell ten disconnected stories instead of one coherent narrative.


Logging Libraries

logging (built-in) The standard library choice. Flexible, well-supported, integrates with everything. The configuration is verbose but gives you full control. Good default for most applications.

loguru A third-party library that dramatically simplifies the setup with sensible defaults and a clean API. Great for smaller projects or teams that want to move fast. Less granular control than the built-in module.

structlog Purpose-built for structured, machine-readable logs. Excellent for microservices where logs will be ingested by a centralised system. Has a learning curve but produces the cleanest output for large-scale systems.


Remote Logging Systems

ELK Stack (Elasticsearch, Logstash, Kibana) — the industry standard for self-hosted centralised logging. Powerful search, rich visualisation, highly configurable. Operationally heavy to run yourself.

Grafana Loki — a lighter-weight alternative designed to work alongside Prometheus. More efficient storage, simpler to operate, but less powerful full-text search than Elasticsearch.

Datadog / Splunk — fully managed, production-grade logging platforms. Rich alerting, anomaly detection, and integrations out of the box. Expensive at scale, but they eliminate all operational overhead.


Logging Execution Time

How long did that function take? That question comes up constantly — slow database queries, degrading third-party APIs, background jobs creeping past their SLA. If you're not logging duration, you'll only find out something is slow once it's already causing timeouts.

The simplest approach for a one-off:

import time

start = time.perf_counter()
result = call_payment_gateway(payload)
duration = time.perf_counter() - start

logger.info(
    "gateway_call_complete | provider=%s duration_ms=%.2f status=%s",
    provider, duration * 1000, result.status
)

Use time.perf_counter() rather than time.time() for measuring elapsed duration — it's higher resolution and not affected by system clock adjustments.

For anything you want to time repeatedly across your codebase, a decorator keeps things clean:

import time
import functools

def log_duration(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
            duration_ms = (time.perf_counter() - start) * 1000
            logger.info(
                "function_complete | name=%s duration_ms=%.2f",
                func.__name__, duration_ms
            )
            return result
        except Exception as e:
            duration_ms = (time.perf_counter() - start) * 1000
            logger.error(
                "function_failed | name=%s duration_ms=%.2f error=%s",
                func.__name__, duration_ms, str(e)
            )
            raise
    return wrapper

# Usage
@log_duration
def process_payment(user_id, amount):
    ...

This logs duration on both success and failure — which matters. A function that fails in 2ms failed for a different reason than one that fails after 30 seconds.

For timing a specific block of code rather than a whole function, a context manager is neater:

from contextlib import contextmanager

@contextmanager
def timed_block(label):
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        logger.info("timed_block | label=%s duration_ms=%.2f", label, duration_ms)

# Usage
with timed_block("db_fetch_user_orders"):
    orders = db.query("SELECT * FROM orders WHERE user_id = %s", user_id)

What to actually time depends on your application. A web API cares about p95 response times per endpoint. A data pipeline cares about per-stage throughput. A background worker cares about job duration trends over time. The rule of thumb: log duration wherever you call something external — a database, an API, a cache — because that's where latency problems almost always originate.


A Production-Ready Setup

Here's a minimal but solid starting point for any Python service:

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)-8s | payment-service | pid=%(process)d | %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%SZ"
)

logger = logging.getLogger(__name__)

# Usage
logger.info(
    "payment_initiated | request_id=%s user_id=%s amount=%s",
    request_id, user_id, amount
)

logger.error(
    "payment_failed | request_id=%s user_id=%s error=%s",
    request_id, user_id, str(error)
)

A few additions worth considering as you grow:

  • Log rotation — use RotatingFileHandler or TimedRotatingFileHandler to prevent log files from growing unbounded and filling your disk

  • JSON formatter — swap the plain-text format for a JSON one (e.g. via python-json-logger) when you're ready to ship logs to a centralised system

  • Correlation ID middleware — in web frameworks, inject a unique request ID at the entry point and thread it through every log call for that request


The Real Cost of Bad Logging

Logging is often treated as an afterthought — something you add when you have time. But bad logging doesn't just make debugging slower. It makes incidents longer, outages more expensive, and on-call shifts more stressful.

The difference between a 10-minute incident resolution and a 3-hour one often isn't the complexity of the bug. It's whether the logs told a clear story or left engineers guessing in the dark.

When production breaks, logs aren't helpful. They're everything. And how quickly you recover comes down to one thing: how well you logged before things went wrong.


Found this useful? Follow for more deep dives into backend engineering, production systems, and the things they don't teach in tutorials.

Python - Basics to Expert

Part 1 of 1

A complete, real-world journey from Python fundamentals to production-grade systems. This series goes beyond syntax to cover performance, debugging, logging, system design, and best practices used by experienced engineers. Whether you're starting out or leveling up, each article builds practical intuition with real examples.