Fix: ImpulseDB Postgres backups are failing

A customer's backups aren't refreshing, or the audit log shows repeated backup.failed entries for a service or host.

Last updated 20 days ago

Fix: ImpulseDB Postgres backups are failing

What you're seeing

  • A customer's most recent backup timestamp under Servers > (server) > Backups is days old.
  • The audit log shows repeated backup.failed entries for a service or shared host.
  • A customer support ticket: "your backup card says my last backup is from a week ago."
  • WAL-G base backups stopped landing in the S3 bucket.

Why it happens

ImpulseDB Postgres runs nightly logical dumps via pg_dump on every host, plus optional continuous WAL streaming via WAL-G to an S3-compatible target. A backup failure usually traces to one of:

  • S3 credentials revoked or rotated and the new values weren't pasted into Settings > Backups.
  • The destination bucket is out of quota or doesn't exist.
  • WAL-G binary missing on the Postgres host (a partial cloud-init that didn't include it) or out of date.
  • Network outage from the Postgres host to S3 (or to the ImpulseMinio admin proxy if using the one-click ImpulseMinio integration).
  • Disk on the Postgres host filled up, so the local dump can't write before upload.

Fix

  1. Filter the audit log to the failing service or server. Open Tools > Audit Log, filter to module=impulsedb-postgres and event=backup.failed. The most recent entry has the verbatim error from WAL-G or pg_dump β€” it names the failing step.
  2. For credential / S3 endpoint errors, open Settings > Backups and re-paste the access key, secret, endpoint, and bucket. If you're using the ImpulseMinio integration, confirm the region credentials in mod_impulseminio_regions are still valid by opening the ImpulseMinio admin tab and checking that the region row is green. Save in ImpulseDB Postgres and click Reconfigure backups on the affected server row β€” it re-runs the WAL-G config over SSH without rebooting Postgres.
  3. For bucket quota errors, check the S3 provider console for the destination bucket's usage. Either raise the quota, prune old base backups (wal-g delete retain FULL 7 on the host keeps the 7 most recent), or rotate to a new bucket.
  4. For "WAL-G not found" errors, SSH to the host:
    which wal-g wal-g --version
    If it's missing, the server was provisioned before WAL-G credentials were configured. Click Reconfigure backups on the server row to re-run the WAL-G install over SSH.
  5. For disk-full errors, SSH in and check df -h /var/backups/postgres. The shipped retention prune runs daily; if backups are accumulating because the prune step itself errored, drop the oldest dumps by hand, then re-run the backup task. The nightly cron is impulsepostgres_usage.php (runs at 02:00). The task is registered with ImpulseCore's runner, so you can also trigger it ad-hoc from Tools > Diagnostics.
  6. For network errors, test S3 reachability from the Postgres host:
    curl -I https://<s3-endpoint>/<bucket>
    A connection-refused or timeout points at firewall / outbound egress on the host's network.

How to confirm it worked

Trigger a one-off backup from the per-server Backup Now button (WAL-G base backup) or from the Tools > Diagnostics backup task. Within a few minutes:

  • The audit log appends a backup.completed event for the service or host.
  • The Servers > (server) > Backups card updates with a fresh timestamp and size.
  • For WAL-G, wal-g backup-list on the host shows the new base backup.

Related