If you run Wazuh seriously, you already know where it hurts.
Alert noise. Fragile upgrades. A stack that works… until it scales.
Wazuh 5 is interesting only if it improves that. Not because of “features”, but because of operations.
This post is a decision framework: what Wazuh 5 must prove before you call it production, and how to adopt it without creating a new source of debt.
Related posts (same philosophy: signal over theater):
- Build a proactive SOC in a homelab (Kubernetes, Wazuh, Trivy, Telegram)
- Prioritize critical patches in WSUS when Wazuh detects CVEs
- Detect DNS tunneling with Wazuh Lite (Spanish)
Thesis: Wazuh 5 won’t fix a bad design
There are two common ways to “fail” with Wazuh:
- Deploy it without deciding what you want to detect (and what you are willing to ignore), and end up in alert fatigue.
- Treat indexing/retention/queries as an afterthought, and hit performance walls as soon as you scale.
If you don’t have that nailed down today, changing versions won’t save you.
Production criteria (not a wishlist)
1) Upgrades without fear (and without version booby-traps)
- A repeatable upgrade procedure.
- Clear compatibility between components.
- A reasonable rollback path.
- No post-upgrade “manual fixes” to get back to green.
And a classic pain point: the UI piece (historically, the Wazuh App / dashboard plugin) has often been a compatibility trap.
If Wazuh 5 makes this predictable, that’s a win. If not, it goes straight into your risk checklist.
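What "repeatable" can look like in practice: a pre-flight script that compares the component versions you collect (from your packages or the API, however you get them) against the exact combination you validated in staging, and refuses to proceed otherwise. The matrix below is a placeholder, not a real Wazuh 5 compatibility statement.

```python
# Pre-flight check: refuse to upgrade unless every component matches the
# combination you have pinned and tested. Versions here are placeholders.
import sys

# Hypothetical pinned matrix: the combination you validated in staging.
PINNED = {
    "manager": "5.0.0",
    "indexer": "5.0.0",
    "dashboard": "5.0.0",   # includes the UI/App piece that historically broke
}

def check(installed: dict[str, str]) -> bool:
    """Return True only if every installed component matches the pinned matrix."""
    ok = True
    for component, expected in PINNED.items():
        found = installed.get(component, "missing")
        if found != expected:
            print(f"[FAIL] {component}: found {found}, pinned {expected}")
            ok = False
        else:
            print(f"[ OK ] {component}: {found}")
    return ok

if __name__ == "__main__":
    # In real life you would fill this dict from your package manager or API;
    # the values below only illustrate a drifted dashboard plugin.
    installed = {"manager": "5.0.0", "indexer": "5.0.0", "dashboard": "4.9.1"}
    sys.exit(0 if check(installed) else 1)
```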
2) Performance with margin (EPS, spikes, and latency)
- Stable ingestion under real spikes.
- Latency under control.
- “Normal” dashboards that don’t degrade the whole system.
Trench insight: the bottleneck is usually not the agent. It is the indexer when you mix long retention with heavy queries, and dashboards that quietly become BI in disguise.
If Wazuh 5 improves this (index rotation/management, hot/warm strategies, or simply saner defaults), great, but I only trust it after I see it hold real load.
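To make "margin" a number instead of a feeling, a sketch like this is enough: feed it per-second event counts sampled from your busiest window and compare p95 and peak against the capacity you believe you have. The capacity figure and the samples are yours to establish; nothing here comes from Wazuh.

```python
# EPS headroom check: compare observed ingestion rates against assumed capacity.
from statistics import median, quantiles

# Hypothetical samples: events per second measured during a busy window.
eps_samples = [850, 900, 1200, 4100, 950, 880, 3900, 910, 870, 930]

ASSUMED_CAPACITY_EPS = 5000   # what you believe the indexer sustains (verify it!)
SAFETY_MARGIN = 0.7           # only plan to use 70% of capacity

p50 = median(eps_samples)
p95 = quantiles(eps_samples, n=20)[18]   # 19th of 20 cut points, roughly p95
peak = max(eps_samples)

budget = ASSUMED_CAPACITY_EPS * SAFETY_MARGIN
print(f"p50={p50:.0f}  p95={p95:.0f}  peak={peak}  budget={budget:.0f} EPS")

if peak > budget:
    print("Spikes exceed the planned budget: expect latency or degraded queries.")
else:
    print("Headroom looks fine for the sampled window (keep sampling longer).")
```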
3) Signal > noise (operable detection)
- Base rules that are actually useful.
- Alerts with actionable context.
- Noise trimming without losing visibility.
The goal is not “more alerts”. It’s alerts that save you time.
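A concrete way to start trimming: tally alerts per rule straight from the manager's alerts.json (one JSON object per line) and look at the top offenders before touching anything. The path and field names below match what current Wazuh versions write; double-check them on your install.

```python
# Tally alerts per rule from the manager's JSON alert log to find the noisiest rules.
import json
from collections import Counter

ALERTS_FILE = "/var/ossec/logs/alerts/alerts.json"  # default on current Wazuh; verify on yours

counts: Counter[str] = Counter()
with open(ALERTS_FILE, encoding="utf-8") as fh:
    for line in fh:
        line = line.strip()
        if not line:
            continue
        alert = json.loads(line)
        rule = alert.get("rule", {})
        key = f'{rule.get("id", "?")} {rule.get("description", "")[:60]}'
        counts[key] += 1

total = sum(counts.values())
print(f"{total} alerts total")
for rule, n in counts.most_common(15):
    print(f"{n:>7}  ({n / total:5.1%})  {rule}")
```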
4) Vulnerabilities with real prioritization
- Reliable inventory (what is truly installed).
- Enrichment that helps you decide.
- Views built for operations, not infinite lists.
Trench insight: the problem is not the number of CVEs. Many don’t apply, others aren’t urgent, and the rest compete with real incidents for attention. If you don’t prioritize, you’ll end up ignoring it.
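A prioritization pass doesn’t need to be fancy. This sketch combines CVSS, how much you care about the asset, and exposure, then cuts the list at a threshold. The findings, field names, and weights are assumptions to illustrate the idea, not Wazuh output.

```python
# Toy prioritization: score = CVSS x asset criticality x exposure, then filter.
# The findings list is hypothetical; in practice you would export it from your
# vulnerability module / inventory of choice.
findings = [
    {"cve": "CVE-EXAMPLE-1", "cvss": 9.8, "asset": "dc01",   "criticality": 1.0, "exposed": True},
    {"cve": "CVE-EXAMPLE-2", "cvss": 7.5, "asset": "lab-vm", "criticality": 0.2, "exposed": False},
    {"cve": "CVE-EXAMPLE-3", "cvss": 8.1, "asset": "vpn-gw", "criticality": 0.9, "exposed": True},
]

def score(f: dict) -> float:
    exposure = 1.0 if f["exposed"] else 0.5   # arbitrary weights: tune to your reality
    return f["cvss"] * f["criticality"] * exposure

ACT_THRESHOLD = 5.0   # below this, it goes to the backlog, not the on-call queue

for f in sorted(findings, key=score, reverse=True):
    s = score(f)
    verdict = "ACT NOW" if s >= ACT_THRESHOLD else "backlog"
    print(f"{f['cve']}  {f['asset']:<8} score={s:4.1f}  -> {verdict}")
```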
5) Operability and stack security
- Clear health checks.
- Tested backups/restores.
- Sane defaults and reasonable limits.
- Consistent RBAC and TLS.
If it’s not operable, it’s not production.
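The bare minimum I want on a schedule: can I authenticate against the manager API, and how many agents are actually reporting. The sketch below uses the 4.x-style REST endpoints (/security/user/authenticate and /agents/summary/status); confirm that whatever Wazuh 5 ships still exposes them before trusting it.

```python
# Minimal health probe: authenticate against the manager API and read the
# agent status summary. Endpoints follow the Wazuh 4.x REST API; verify them
# against the version you actually run.
import requests

API = "https://wazuh-manager.example.local:55000"   # hypothetical host
USER, PASSWORD = "wazuh", "change-me"               # use a read-only API user

# 1) Get a token (basic auth exchanged for a JWT in the 4.x API).
resp = requests.post(f"{API}/security/user/authenticate",
                     auth=(USER, PASSWORD), verify="/path/to/ca.pem")
resp.raise_for_status()
token = resp.json()["data"]["token"]

# 2) Agent connection summary: a sudden jump in "disconnected" is the signal.
headers = {"Authorization": f"Bearer {token}"}
resp = requests.get(f"{API}/agents/summary/status", headers=headers,
                    verify="/path/to/ca.pem")
resp.raise_for_status()
print(resp.json()["data"])   # counts such as active / disconnected / never_connected
```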
What is worth it (if the above is true)
- Endpoints: auth, processes, common persistence, integrity.
- Critical sources: firewall and cloud IAM/audit without constant duct tape.
- Multi-environment: separation of data and permissions without chaos (if you need it).
- Minimal automation: evidence-rich tickets, useful notifications, block/isolate where supported.
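On "minimal automation": Wazuh's integrator can call a custom script per matching alert, passing the path to a file with the alert JSON as the first argument; from there, a webhook notification with the evidence attached is a few lines. The hook URL is hypothetical, and the argument convention should be verified against the integration docs for your version.

```python
#!/usr/bin/env python3
# Minimal custom integration sketch: read the alert Wazuh hands us and forward
# the useful bits to a webhook (ticket system, chat, etc.).
# Convention (as in current Wazuh custom integrations): argv[1] is the path to
# a file containing the alert JSON. Verify this against your version's docs.
import json
import sys

import requests

HOOK_URL = "https://chat.example.local/hooks/soc"   # hypothetical webhook

def main() -> None:
    with open(sys.argv[1], encoding="utf-8") as fh:
        alert = json.load(fh)

    rule = alert.get("rule", {})
    payload = {
        "title": f'[{rule.get("level")}] {rule.get("description")}',
        "agent": alert.get("agent", {}).get("name"),
        "evidence": alert.get("full_log", "")[:500],   # keep the raw evidence, trimmed
    }
    requests.post(HOOK_URL, json=payload, timeout=10).raise_for_status()

if __name__ == "__main__":
    main()
```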
What I would NOT do (even if it sounds tempting)
- No big-bang migration. Parallel or nothing.
- I would not enable vulnerability detection without a prioritization workflow: you’ll end up ignoring it.
- I would not trust default rules without seeing them fire in my environment and tuning them.
- I would not enable every module on day one: start with what reduces MTTR or prevents repeat incidents.
- I would not measure success by alert volume.
How I’d adopt it (for real)
Before the pilot
- 10–15 “top” detections (the ones you actually care about).
- Critical assets and key log sources.
- Baseline for EPS, retention, and cost/resources.
- 3–5 success metrics.
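Write the success metrics down with a baseline and a target before the pilot starts, so the go/no-go decision later is mechanical. A trivial structure like this is enough; the numbers are placeholders.

```python
# Pilot scorecard sketch: metric names, baselines and targets are placeholders.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Metric:
    name: str
    baseline: float     # where you are today
    target: float       # what the pilot must reach
    measured: Optional[float] = None   # filled in when the pilot ends

    def passed(self) -> Optional[bool]:
        if self.measured is None:
            return None
        # "Better" here means lower (minutes, alert counts); invert if needed.
        return self.measured <= self.target

scorecard = [
    Metric("MTTR for top detections (min)", baseline=240, target=120),
    Metric("Alerts/day needing a human", baseline=800, target=200),
    Metric("Noise in the top 15 rules (alerts/week)", baseline=300, target=60),
]

for m in scorecard:
    print(f"{m.name:<42} baseline={m.baseline:>6}  target={m.target:>6}  result={m.passed()}")
```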
Small pilot (with “top rules” examples)
- IT + 1 critical server + 1 source (firewall or IAM).
- 10–20 high-value rules, for example:
- user creation and privileged group changes (AD/LDAP)
- sudo usage on critical servers and “out-of-pattern” commands
- authentication anomalies (VPN/SSO): bursts of failures, odd geos, impossible hours
- new services/daemons on servers (classic persistence)
- changes to sensitive files (SSH, PAM, sudoers, scheduled tasks)
- binaries executed from suspicious paths (if your telemetry supports it)
- Weekly tuning with an explicit decision: what stays and what gets killed.
- Stop rule: if after 2–3 weeks you don’t reduce noise or improve MTTR, the pilot failed.
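The weekly keep/kill decision can be mechanical too: per rule, weigh how many alerts led to an action against how many only consumed attention. Rule names, volumes and thresholds below are arbitrary starting points, not recommendations.

```python
# Weekly keep/kill pass: decide per rule based on volume and actioned ratio.
# Input shape is hypothetical: (rule label, alerts this week, alerts that led to action).
weekly = [
    ("ssh brute-force attempts (example rule)", 1200, 3),
    ("user added to privileged group (example rule)", 4, 4),
    ("new service installed on server (example rule)", 35, 9),
]

KILL_VOLUME = 500      # above this weekly volume...
KILL_RATIO = 0.02      # ...with under 2% actioned, the rule gets tuned or killed

for rule, total, actioned in weekly:
    ratio = actioned / total if total else 0.0
    if total > KILL_VOLUME and ratio < KILL_RATIO:
        verdict = "tune or kill"
    elif actioned == 0:
        verdict = "review: never actioned"
    else:
        verdict = "keep"
    print(f"{rule:<48} {total:>5} alerts  {ratio:5.1%} actioned  -> {verdict}")
```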
Roll out in waves
- By criticality or business unit.
- With a clear rollback.
- Documenting “gotchas” before expanding coverage.
Quick checklist
- Repeatable upgrades (including the UI/App).
- EPS with margin (spikes included).
- Retention that doesn’t kill performance.
- Controllable signal/noise.
- Proven backup + restore.
- Coherent RBAC/TLS and hardening.
Final word
If you don’t know what you want to detect and how you will measure it, the version doesn’t matter.
Production is not installing. It’s operating without it eating you alive.
