Release-recovery runbook

This runbook restores Gad360/apothem to a working state when the release cutover sequence fails mid-flight. The cutover is a four-stage sequence intended to be atomic (Stage 1 verify-private-snapshot; Stage 2 delete-existing; Stage 3 create-fresh; Stage 4 force-push fresh history); a failure at Stage 2, 3, or 4 leaves the repository in a recoverable but inconsistent state. Recovery depends on the snapshot captured at Stage 1: with it, the pre-cutover state can be restored; without it, recovery means manual reconstruction from the operator's local machine and downstream caches.

The runbook is written for the maintainer working with gh, git, and the recovery-snapshot archive at _inputs/recovery-snapshot.tar.gz (the commands below use the full path plans/apothem-release/_inputs/recovery-snapshot.tar.gz). A competent new contributor can follow it without prior context, provided the recovery snapshot exists.

1. Stage taxonomy and failure surfaces

The four cutover stages and their failure modes:

| Stage | Action | Failure mode | Recovery path |
|---|---|---|---|
| 1 | Verify private snapshot | Snapshot capture fails or is incomplete | Re-run capture script before proceeding to Stage 2; document the gap in the snapshot manifest |
| 2 | Delete existing repository | gh repo delete returns non-204 OR the deletion partially completed (some metadata persists) | §2 Stage-2 recovery |
| 3 | Create fresh repository | gh repo create returns non-201 OR the new repository is created with wrong scope / topics / homepage | §3 Stage-3 recovery |
| 4 | Force-push fresh history | The push fails OR the push lands but post-push verification (Verified badge, branch protection, Pages, DNS) does not pass | §4 Stage-4 recovery |

Each stage's recovery path is idempotent — running it on a clean state is a no-op; running it on a partial-failure state restores the expected post-stage condition. The recovery paths assume the snapshot at _inputs/recovery-snapshot.tar.gz is present and valid.

2. Stage-2 recovery — Delete-existing failed

Stage 2 invokes gh repo delete Gad360/apothem --yes. The expected response is HTTP 204 (No Content). Failure modes:

  • The call returns 403 / 404 / 5xx.
  • The call returns 204 but a follow-up gh api repos/Gad360/apothem still returns 200 (deletion did not propagate or was reverted).
  • The deletion partially completed (the repository is gone, but a webhook / app-installation / outside-collaborator binding persists on the GitHub side).

2.1 Diagnose

# -i includes the HTTP status line, which the outcomes below key on
gh api -i repos/Gad360/apothem 2>&1 | head -1

Three outcomes:

  • HTTP 404. The repository is fully deleted; Stage 2 succeeded. Proceed to Stage 3 — no recovery needed.
  • HTTP 200. The deletion did not land; the repository is still there. Re-run gh repo delete Gad360/apothem --yes; if that fails again, the recovery path is the operator's GitHub UI ("Delete this repository" at the bottom of the Settings page) or escalation to GitHub Support if the UI fails the same way.
  • HTTP 5xx. GitHub-side outage; wait 5 minutes and re-check. The deletion call's 204 may have queued the action; verify after the outage clears.
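The three outcomes above can be sketched as a small lookup, which is handy when the diagnosis is repeated during an outage. This is a minimal sketch, not part of the cutover tooling: the function name classify_delete_status and its output labels are hypothetical, and the commented usage assumes gh api -i prints the HTTP status line first even for error responses.

```shell
# Hypothetical helper: map the HTTP status code of the repository probe
# to the matching outcome from the list above.
classify_delete_status() {
  case "$1" in
    404) echo "deleted: proceed to Stage 3" ;;
    200) echo "still-present: re-run gh repo delete" ;;
    5??) echo "outage: wait 5 minutes and re-check" ;;
    *)   echo "unexpected: inspect manually" ;;
  esac
}

# Usage (not executed here; assumes the status line looks like "HTTP/2.0 404 Not Found"):
# status=$(gh api -i repos/Gad360/apothem 2>/dev/null | head -1 | awk '{print $2}')
# classify_delete_status "$status"
```

Any status outside the three listed cases (for example 403) falls through to the manual-inspection branch rather than being guessed at.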

2.2 Restore from snapshot (if the deletion partially landed)

When the diagnosis at §2.1 shows a 404 (deleted) but the cutover sequence cannot continue (the maintainer aborts), restore the pre-cutover state from the snapshot:

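# Extract the snapshot; its entries sit under a top-level
# recovery-snapshot/ directory (see §5), so strip that component
# to land the files directly in /tmp/apothem-recovery
mkdir -p /tmp/apothem-recovery
tar -xzf plans/apothem-release/_inputs/recovery-snapshot.tar.gz \
  -C /tmp/apothem-recovery --strip-components=1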

# Re-create the repository from the snapshot's metadata
gh repo create Gad360/apothem --private \
  --description "$(cat /tmp/apothem-recovery/repo-description.txt)" \
  --homepage "$(cat /tmp/apothem-recovery/repo-homepage.txt)"

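# Push the snapshot's git bundle as the initial history.
# A bare clone keeps every branch in the bundle as a local ref, so
# --all pushes all of them. The clone already defines origin (pointing
# at the bundle), so repoint it rather than adding it.
cd /tmp/apothem-recovery
git clone --bare repo.bundle apothem.git
cd apothem.git
git remote set-url origin git@github.com:Gad360/apothem.git
git push --all origin
git push --tags origin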

# Re-apply the snapshot's branch protection
gh api -X PUT /repos/Gad360/apothem/branches/main/protection \
  --input /tmp/apothem-recovery/branch-protection.json

# Re-apply the snapshot's Pages config
gh api -X POST /repos/Gad360/apothem/pages \
  --input /tmp/apothem-recovery/pages-config.json

# Re-apply topics
gh api -X PUT /repos/Gad360/apothem/topics \
  --input /tmp/apothem-recovery/topics.json

The snapshot carries every per-repo binding the cutover sequence intended to preserve; the recovery script applies them in the order the snapshot manifest declares. Verify after each step:

gh api repos/Gad360/apothem --jq '.private, .description, .homepage'
gh api repos/Gad360/apothem/pages --jq '.cname, .https_enforced'
gh api repos/Gad360/apothem/branches/main/protection --jq '.required_status_checks.contexts | length'

3. Stage-3 recovery — Create-fresh failed

Stage 3 invokes gh repo create Gad360/apothem --private --description ... --homepage .... The expected response is HTTP 201. Failure modes:

  • The call returns 422 (the repository name is taken — Stage 2 did not complete).
  • The call returns 201 but the repository creates with wrong visibility, missing description, missing homepage, or missing topics.
  • The call returns 5xx (GitHub-side outage).

3.1 Diagnose

gh api repos/Gad360/apothem --jq '.private, .description, .homepage, .topics'

Four outcomes:

  • HTTP 404. The repository was not created; re-run the Stage 3 command after verifying Stage 2 fully completed.
  • private: true AND description / homepage / topics match the spec. Stage 3 succeeded; proceed to Stage 4.
  • private: false. The repository was created public; recover via gh api -X PATCH /repos/Gad360/apothem -F private=true (-F sends a typed boolean, whereas -f would send the string "true"). Re-verify that the visibility flipped.
  • Description / homepage / topics empty or mismatched. Apply patches:
gh api -X PATCH /repos/Gad360/apothem \
  -f description="<spec-grade description>" \
  -f homepage="https://apothem.ahmedgad.com"
gh api -X PUT /repos/Gad360/apothem/topics \
  -f "names[]=agents" -f "names[]=claude-code" \
  -f "names[]=conformity-gates" -f "names[]=framework"

(Replace the names array with the spec-ratified topic set.)
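The topics endpoint expects a JSON object whose names field is an array of strings. One way to avoid quoting mistakes is to build that body explicitly and pipe it to gh api. The sketch below assumes a hypothetical helper named topics_json; it does no JSON escaping, which is acceptable only because GitHub topics are restricted to lowercase letters, digits, and hyphens.

```shell
# Hypothetical helper: build the {"names":[...]} request body from a
# list of topic arguments. No escaping is performed (see lead-in).
topics_json() {
  out='{"names":['
  sep=''
  for topic in "$@"; do
    out="${out}${sep}\"${topic}\""
    sep=','
  done
  printf '%s]}\n' "$out"
}

# Usage (not executed here): --input - makes gh api read the body from stdin.
# topics_json agents claude-code conformity-gates framework \
#   | gh api -X PUT /repos/Gad360/apothem/topics --input -
```

The same body shape is what the snapshot's topics.json is expected to carry, so the output can be compared against that file before applying it.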

3.2 Idempotent re-creation

If the repository state is sufficiently broken that patching is more work than recreating, the operator deletes and re-creates:

gh repo delete Gad360/apothem --yes  # Stage 2 again
sleep 5
gh repo create Gad360/apothem --private \
  --description "<spec-grade description>" \
  --homepage "https://apothem.ahmedgad.com"

The 5-second sleep gives GitHub's deletion queue time to settle before the create call lands.
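A fixed sleep does not confirm that the deletion has actually settled. A hedged alternative is to poll until the repository probe reports 404 before issuing the create call. In the sketch below, wait_until_deleted and repo_status are hypothetical names, the attempt count and delay are arbitrary, and the probe is passed in as a parameter so the loop can be exercised without network access.

```shell
# Hypothetical helper: call a probe command until it prints 404,
# giving up after max_attempts tries.
wait_until_deleted() {
  probe=$1          # command that prints an HTTP status code
  max_attempts=$2   # how many probes to make before giving up
  delay=$3          # seconds to sleep between probes
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if [ "$($probe)" = "404" ]; then
      return 0
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  return 1
}

# Hypothetical probe; assumes gh api -i prints the status line first.
repo_status() {
  gh api -i repos/Gad360/apothem 2>/dev/null | head -1 | awk '{print $2}'
}

# Usage (not executed here):
# wait_until_deleted repo_status 10 3 && gh repo create Gad360/apothem --private
```

If the loop returns non-zero, the repository still exists after the last probe and the create call would be expected to fail with 422.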

4. Stage-4 recovery — Force-push or post-push verification failed

Stage 4 invokes git push --force-with-lease origin main against the fresh repository, then applies branch protection, re-enables Pages, and verifies the commit's GPG signature surfaces as Verified. Failure modes:

  • The push fails (network error, GPG signing failure, force-lease rejection).
  • The push lands but the post-push verification fails — Verified badge absent, branch protection not applied, Pages not enabled, DNS CNAME absent.

4.1 Diagnose push

gh api repos/Gad360/apothem/commits/main --jq '.commit.message, .commit.verification.verified, .commit.verification.reason'

Three outcomes:

  • HTTP 404. The push did not land; the repository is empty. Re-run the Stage 4 push from the operator's local machine. Verify the local main ref points at the signed initial commit before pushing.
  • verified: true AND commit message matches the spec's initial commit. Stage 4 push succeeded; proceed to §4.2 binding verification.
  • verified: false. The push landed but the GPG signature did not verify. Inspect .commit.verification.reason for the failure class:
      • unsigned: the commit was not signed at the push source. Re-sign locally (git commit --amend -S --no-edit) and force-push again.
      • unknown_key: the signing key is not registered with the operator's GitHub account. Add it via Settings → SSH and GPG keys → New GPG key.
      • bad_email: the committer email does not match the GPG key's UID. Fix the local git config (git config user.email me@ahmedgad.com) and re-sign.
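The reason-to-remediation mapping above can be sketched as a lookup keyed on the value of .commit.verification.reason. The function name remediation_for_reason is hypothetical. The valid case is the value GitHub reports when verification succeeds; any reason not listed in this runbook is deliberately left to manual inspection rather than guessed.

```shell
# Hypothetical helper: translate a verification reason into the
# remediation this runbook prescribes.
remediation_for_reason() {
  case "$1" in
    valid)       echo "no action: signature verified" ;;
    unsigned)    echo "re-sign locally (git commit --amend -S --no-edit) and force-push" ;;
    unknown_key) echo "register the GPG key under Settings > SSH and GPG keys" ;;
    bad_email)   echo "fix git config user.email to match the key UID and re-sign" ;;
    *)           echo "unlisted reason: inspect manually" ;;
  esac
}

# Usage (not executed here):
# reason=$(gh api repos/Gad360/apothem/commits/main --jq '.commit.verification.reason')
# remediation_for_reason "$reason"
```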

4.2 Apply branch protection, Pages, DNS

# Branch protection: re-apply the captured config
gh api -X PUT /repos/Gad360/apothem/branches/main/protection \
  --input plans/apothem-release/_inputs/branch-protection.json

# Pages — the Pages-enablement runbook is the canonical procedure;
# this stage re-applies the snapshot's Pages config
gh api -X POST /repos/Gad360/apothem/pages \
  --input plans/apothem-release/_inputs/pages-config.json

# DNS — the Pages-enablement runbook governs the registrar-side step;
# at this stage the maintainer confirms the CNAME still resolves
dig apothem.ahmedgad.com CNAME +short

Expected DNS output:

gad360.github.io.

If the DNS record is absent, the cause lies outside the cutover itself, because the registrar-side record is unaffected by GitHub-side operations; the maintainer follows the Pages-enablement runbook's provider appendix to re-add it.
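The DNS confirmation can be sketched as a comparison between the dig output and the expected target. The helper name cname_matches is hypothetical; it lowercases both values and ignores a trailing dot, since DNS names are case-insensitive and resolvers differ on whether the root dot is printed. An empty dig result (record absent) compares unequal.

```shell
# Hypothetical helper: succeed when the resolved CNAME equals the
# expected target, ignoring case and a trailing dot.
cname_matches() {
  actual=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  expected=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  [ "${actual%.}" = "${expected%.}" ]
}

# Usage (not executed here):
# cname_matches "$(dig apothem.ahmedgad.com CNAME +short)" "gad360.github.io." \
#   || echo "CNAME absent or pointing elsewhere"
```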

4.3 Recover from a fully botched Stage 4

When Stage 4 partially landed and the post-push state is too divergent to patch, the recovery path is:

  1. Delete the repository again (§2 recovery).
  2. Re-create the repository (§3 recovery).
  3. Re-push the fresh history from a known-clean local clone:
cd /tmp/apothem-fresh
git clone --bare "$(pwd)/<source-of-truth>" apothem.git
cd apothem.git
# The clone already defines origin (the local source); repoint it at GitHub
git remote set-url origin git@github.com:Gad360/apothem.git
git push --mirror origin
  4. Re-apply protection + Pages + DNS per §4.2.

The mirror push lands the full commit graph + tags + refs in one operation; the protection / Pages / DNS apply afterward.

5. Snapshot integrity

The recovery snapshot at _inputs/recovery-snapshot.tar.gz is the recovery path's binding constraint. A missing or corrupted snapshot forfeits §2.2's restore path; the operator falls back to manual reconstruction. Verify integrity before any recovery cycle:

tar -tzf plans/apothem-release/_inputs/recovery-snapshot.tar.gz | head -20

Expected entries (sample):

recovery-snapshot/
recovery-snapshot/repo.bundle
recovery-snapshot/repo-description.txt
recovery-snapshot/repo-homepage.txt
recovery-snapshot/branch-protection.json
recovery-snapshot/pages-config.json
recovery-snapshot/topics.json
recovery-snapshot/release-assets/
recovery-snapshot/issues.json

If any expected entry is missing, the snapshot is incomplete; the maintainer re-runs the Stage 1 capture script before proceeding with recovery. Re-running Stage 1 against a deleted repository (§2 already landed) produces an empty snapshot — the snapshot must be captured before Stage 2 fires.
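The completeness check can be sketched as a filter over the tar listing that prints every expected entry it does not find. The helper name missing_snapshot_entries is hypothetical, and the entry list is copied from the sample above, which the runbook itself labels as a sample; the authoritative list is the snapshot manifest. The match is exact per line, so an archive whose paths carry a ./ prefix would need the list adjusted.

```shell
# Hypothetical helper: read a tar listing on stdin and print each
# expected file entry that is absent from it.
missing_snapshot_entries() {
  listing=$(cat)
  for entry in \
    recovery-snapshot/repo.bundle \
    recovery-snapshot/repo-description.txt \
    recovery-snapshot/repo-homepage.txt \
    recovery-snapshot/branch-protection.json \
    recovery-snapshot/pages-config.json \
    recovery-snapshot/topics.json \
    recovery-snapshot/issues.json
  do
    # -x matches whole lines only, -F treats the entry as a fixed string
    printf '%s\n' "$listing" | grep -qxF "$entry" || printf '%s\n' "$entry"
  done
}

# Usage (not executed here):
# tar -tzf plans/apothem-release/_inputs/recovery-snapshot.tar.gz | missing_snapshot_entries
```

Empty output means every listed file is present; any printed line names an entry to recapture. The check covers presence only and says nothing about whether the files' contents are valid.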

6. Snapshot retention

The recovery snapshot is retained for 7 days post-cutover. After 7 days, the snapshot is retired via the per-file destructive-op confirmation surface. Inspect the file before retiring it:

ls -lh plans/apothem-release/_inputs/recovery-snapshot.tar.gz

The maintainer routes the retirement through the canonical confirmation channel (the AskUserQuestion invocation at /plan-execute Cluster 8 closure). The 7-day window covers DNS propagation maxima, CDN cache lifetimes, and the empirical window within which a downstream consumer surfaces a cutover-induced break.
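The 7-day window is simple arithmetic on timestamps: 7 × 24 × 60 × 60 = 604800 seconds. The sketch below assumes a hypothetical helper named snapshot_past_retention that takes the snapshot's timestamp and the current time as epoch seconds. It uses the file's modification time as a stand-in for the cutover time, which is an assumption; the runbook counts the window from the cutover itself.

```shell
# Hypothetical helper: succeed when more than 7 days separate the
# snapshot timestamp from the current time (both in epoch seconds).
snapshot_past_retention() {
  snapshot_epoch=$1
  now_epoch=$2
  retention=$((7 * 24 * 60 * 60))   # 604800 seconds
  [ $((now_epoch - snapshot_epoch)) -gt "$retention" ]
}

# Usage (not executed here). The stat flag differs by platform:
# GNU coreutils uses `stat -c %Y`, BSD/macOS uses `stat -f %m`.
# mtime=$(stat -c %Y plans/apothem-release/_inputs/recovery-snapshot.tar.gz)
# snapshot_past_retention "$mtime" "$(date +%s)" && echo "eligible for retirement"
```

The helper only reports eligibility; the retirement itself still goes through the confirmation channel described above.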

7. Failure recovery — meta

When the recovery cycle itself fails (the snapshot is corrupted, the GitHub API is offline, the maintainer's local clone is stale), the last-resort path is manual reconstruction from:

  • The maintainer's local main branch (assumed to mirror the public state at cutover time).
  • Downstream package manager caches (PyPI, Homebrew, Scoop, AUR all retain prior versions).
  • The maintainer's GPG key archive (signed tags can be recreated locally).
  • The DNS registrar's record history (most registrars retain a rollback window).

This path is escape-hatch only; the snapshot path at §2.2 + §4.2 is the canonical recovery surface. Manual reconstruction is logged as a high-severity finding in the post-cutover audit.

8. Cross-references

  • Spec source. Specification §2.7 enumerates the four cutover stages and the snapshot manifest contents.
  • Pages flow. The Pages-enablement runbook (docs/runbooks/pages-enablement.md) is the canonical procedure for the registrar-side and Pages-side configuration this runbook references during §4.2 recovery.
  • Release flow. The release-cycle runbook (docs/runbooks/release-cycle.md) governs the per-version release procedure after the cutover settles; this runbook governs the cutover itself.
  • Decisions. D-9 (cutover atomicity), D-29 (snapshot retention), D-31 (Pages re-application), D-32 (DNS continuity) are the decisions this runbook operationalises.