Run • Troubleshooting • Performance • Automation • Security

Technical documentation (Linux / HPC Ops)

A field-tested knowledge base: checklists, useful commands, diagnostic patterns, and deliverables. Goal: make the platform stable, predictable, and operable (runbook + observability + standardization).

Incident • SLURM diagnostics • I/O diagnostics • Hardening

Technical portfolio (impact examples)

Scheduler stabilization

Diagnosing saturation and deadlocks, queue hygiene, guardrails (limits), L2/L3 support procedures.

SLURM • Runbook • MTTR
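
As an illustration of queue diagnosis, here is a minimal sketch that ranks the reasons jobs are stuck in the pending state. `count_pending_reasons` is a hypothetical helper name, not part of SLURM. The `squeue` and `sinfo` options used are standard, but the script assumes the SLURM client tools are on the PATH and skips the live query otherwise.

```shell
#!/usr/bin/env bash
# Sketch: rank the reasons jobs are pending, most frequent first.
# count_pending_reasons is a hypothetical helper, not a SLURM command.
# It reads one reason per line on stdin and prints "<count> <reason>".
count_pending_reasons() {
    awk '{ count[$0]++ } END { for (r in count) print count[r], r }' | sort -rn
}

# Only query the scheduler when the SLURM client tools are installed.
if command -v squeue >/dev/null 2>&1; then
    # -h: no header, -t PENDING: pending jobs only, -o '%r': reason field
    squeue -h -t PENDING -o '%r' | count_pending_reasons
    # Nodes that are down or drained, with the recorded reason
    sinfo -R
fi
```

A queue dominated by `Resources` usually points to saturation, while a long tail of limit-related reasons suggests the guardrails are doing their job or need tuning.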

Industrializing configs

Ansible playbooks, systemd unit templates, Git versioning, config validation (lightweight CI).

Ansible • Git • Idempotent
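
One way a lightweight CI step could check idempotence is to apply a playbook, run it again, and require that the second PLAY RECAP reports nothing changed. The sketch below is an assumption about how such a check might look. `recap_is_idempotent`, `validate_playbook`, and the `site.yml` default are hypothetical names, and the inventory is whatever `ansible.cfg` provides.

```shell
#!/usr/bin/env bash
# Sketch: succeed only when an Ansible PLAY RECAP (read on stdin) shows
# no changed, failed, or unreachable hosts. Hypothetical helper.
recap_is_idempotent() {
    awk '
        /changed=/                           { seen = 1 }
        /(changed|failed|unreachable)=[1-9]/ { bad = 1 }
        END { exit !(seen && !bad) }
    '
}

# Hypothetical CI step: parse check, converge, then verify a no-op rerun.
validate_playbook() {
    local playbook="$1"
    # Parse-only validation, nothing is executed on the hosts
    ansible-playbook --syntax-check "$playbook" || return 1
    # First run converges the hosts to the desired state
    ansible-playbook "$playbook" >/dev/null || return 1
    # Second run must report changed=0 for every host
    ansible-playbook "$playbook" | recap_is_idempotent
}

PLAYBOOK="${PLAYBOOK:-site.yml}"   # assumed playbook name
if command -v ansible-playbook >/dev/null 2>&1 && [ -f "$PLAYBOOK" ]; then
    validate_playbook "$PLAYBOOK" && echo "idempotent"
fi
```

An empty recap is treated as a failure on purpose, so a run that dies before the recap cannot pass as idempotent.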

Storage & I/O performance

Latency and contention analysis, client tuning, best-practice recommendations (jobs + scratch).

Lustre • I/O • Tuning
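
A quick first look at write latency can be taken with a small synchronous write probe. This is a sketch under assumptions, not a benchmark: `probe_write` is a hypothetical helper, `bs=1M` and `conv=fsync` assume GNU `dd`, and the `lfs df -h` call only runs where the Lustre client tools are installed.

```shell
#!/usr/bin/env bash
# Sketch: time a small synchronous write in a directory (e.g. scratch).
# probe_write is a hypothetical helper; bs=1M and conv=fsync assume GNU dd.
probe_write() {
    local dir="$1" mib="${2:-64}" file out status
    file="$(mktemp "$dir/ioprobe.XXXXXX")" || return 1
    # conv=fsync forces data to storage before dd reports its timing
    out="$(dd if=/dev/zero of="$file" bs=1M count="$mib" conv=fsync 2>&1)"
    status=$?
    rm -f "$file"
    # The last line of dd output carries bytes, elapsed time, throughput
    printf '%s\n' "$out" | tail -n 1
    return "$status"
}

# Lustre-only view of per-target usage, when the client tools are present
if command -v lfs >/dev/null 2>&1; then
    lfs df -h
fi
```

Calling `probe_write` on a scratch directory and on a local disk, then comparing the two summary lines, gives a rough baseline before deeper contention analysis.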

Typical deliverables
Runbook, checklists, config standards, diagnostic scripts, KPI dashboards, post-mortems.

Principle
Every notable incident → RCA + durable fix + documentation. First put out the fire, then prevent it from recurring.

# Quick sanity checks
$ uptime; free -h; df -h
$ journalctl -p err..alert -n 50 --no-pager
$ systemctl --failed