Standard Operating Procedures

Essential Reading

All users must read and comply with these procedures before using the system.

1. Purpose

This SOP defines the procedures and responsibilities for authorized users of the Sapitwa HPC Cluster at the Malawi Liverpool Wellcome (MLW) Programme. It ensures users operate within acceptable norms to maintain data security, system performance, and fairness in resource utilization.

2. Scope

This SOP applies to all users accessing the Sapitwa HPC Cluster, including MLW staff, affiliated researchers, students, and collaborators granted access by MLW.

3. Roles and Responsibilities

Key Stakeholders

Users

  • Submit jobs efficiently using the SLURM job scheduler
  • Manage data responsibly
  • Follow security best practices

HPC Administrators

  • Maintain and monitor the system
  • Create and manage user accounts
  • Provide technical support and training

Principal Investigators (PIs)

  • Endorse user account requests
  • Oversee usage within research groups

4. Access and Accounts

Account Management

  • Access is granted through institutional SSO and MFA
  • Users must submit a formal request form approved by a PI
  • Accounts remain active for the duration of a project and are reviewed annually
  • Accounts inactive for 12 months may be disabled after formal notification; to reactivate, the user's PI or designated supervisor must submit a valid reactivation request, after which access is restored within 48 hours
  • User home directories are located at /head/NFS/$USER, where $USER is the system username

5. Code of Conduct & Acceptable Use

Usage Policies

  • Use resources only for MLW-approved academic/research purposes
  • Do not run cryptocurrency mining, unauthorized commercial work, or system stress tests
  • Do not share accounts
  • Respect fair use and shared resource policies

6. System Overview

Infrastructure Details

  • Hardware: Dell PowerEdge R940xa (virtualized nodes)
  • Nodes: 2 CPU nodes, 2 GPU nodes (NVIDIA A100 in 7x 10GB MIG mode)
  • OS: Rocky Linux 8
  • Job Scheduler: SLURM
  • Environment Management: Lmod and EasyBuild
  • Storage: NFS-mounted scratch and home directories on Dell PowerVault ME5024
  • Access Methods: SSH and Open OnDemand (https://sapitwa.mlw.mw)
  • Monitoring: Metrics and performance dashboards at https://metrics.mlw.mw

7. Using the System

7.1 Login Instructions

Access Methods

SSH (MLW network/VPN only):

ssh username@sapitwa.mlw.mw

Web Portal (no SSH client needed): https://sapitwa.mlw.mw

7.2 Environment Modules

Common Module Commands

List all available modules:

module avail

Show detailed module information:

module show module_name

Search all modules, including those not shown by module avail:

module spider

Search for a specific module:

module spider module_name

Load a module:

module load module_name

Unload a module:

module unload module_name

List currently loaded modules:

module list

Unload all modules:

module purge
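
As a worked example, a typical session might purge the environment, search for a package, and load it. The module name and version below are illustrative; use module avail or module spider to see what is actually installed on Sapitwa:

module purge
module spider R          # search for an R module (name is illustrative)
module load R/4.2.1      # version shown is illustrative
module list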

7.3 Job Submission

SLURM Commands

Submit a job:

sbatch jobscript.sh
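
A minimal jobscript.sh might look like the following sketch. The resource values are examples, and the loaded module and script name are placeholders; adjust them for your workload:

#!/bin/bash
#SBATCH --job-name=example_job
#SBATCH --output=example_job_%j.out
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

# Load required software (module name is a placeholder)
module load Python

# Run the analysis script (placeholder name)
python my_analysis.py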

View job queue:

squeue

View your jobs only:

squeue -u $USER

View detailed job history:

sacct

Cancel a job:

scancel job_id

Show cluster status:

sinfo

7.4 Interactive Jobs

Interactive Session Commands

Request an interactive terminal session:

srun --pty bash

Request a session with specific resources:

srun --pty --cpus-per-task=4 --mem=8G bash

Request a GPU session:

srun --pty --gres=gpu:1 bash

Allocate resources without starting a session:

salloc

Allocate specific resources:

salloc --cpus-per-task=4 --mem=8G
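
After salloc grants an allocation, commands can be run inside it with srun, and exiting the shell releases the allocation. A short illustrative workflow:

salloc --cpus-per-task=4 --mem=8G   # request the allocation
srun hostname                        # runs inside the allocation
exit                                 # releases the allocation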

7.5 Software

  • Pre-installed packages are available via EasyBuild and modules
  • Users may use Conda, pip, or Spack in /head/NFS/$USER, where $USER is your cluster username (see the example below)
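
For example, a personal Conda environment can be created under the home directory. This assumes Conda is available (via a module or a user installation); the environment path, Python version, and package are illustrative:

conda create --prefix /head/NFS/$USER/envs/myenv python=3.11   # path and version are illustrative
conda activate /head/NFS/$USER/envs/myenv
pip install numpy                                              # package is illustrative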

8. Data Management

Storage Guidelines

  • /head/NFS/$USER: User home directory, quota-limited to 50 GB; a special request is required to increase the quota ($USER is your cluster username)
  • /scratch: High-speed storage (periodically purged)
  • Use rsync, scp, or sftp for data transfers (see the example after this list)
  • Do not store sensitive data on the HPC
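
A typical transfer from a local machine to the cluster might look like this. The source directory, username, and the destination under /scratch are illustrative; adjust paths to your own layout:

# copy a local directory to scratch space on the cluster (paths and username are illustrative)
rsync -avz --progress ./my_dataset/ username@sapitwa.mlw.mw:/scratch/username/my_dataset/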

Data Policies

  • Keep source data in project space
  • Use scratch space for temporary files
  • Clean up completed job files
  • Compress unused data (see the example below)
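
For example, completed job output can be archived and compressed before the original directory is removed (the directory name is illustrative; verify the archive before deleting anything):

tar -czf results_2024.tar.gz results_2024/   # directory name is illustrative
tar -tzf results_2024.tar.gz                 # verify the archive contents
rm -r results_2024/                          # remove the original only after verification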

File Management Commands

Check your storage quotas:

quota -s

Check directory size:

du -sh directory_name

Find files larger than 100MB:

find . -type f -size +100M -exec ls -lh {} \;

9. Security and Compliance

Security Requirements

  • Follow institutional IT security policies
  • Use SSH keys, strong passwords, and MFA (see the key setup example after this list)
  • Do not store plaintext passwords or share credentials
  • Report suspicious activity immediately
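
For example, an SSH key pair can be generated on your local machine and the public key copied to the cluster with standard OpenSSH tools (the email comment and username are placeholders):

ssh-keygen -t ed25519 -C "your_email@mlw.mw"   # accept the default location, set a passphrase when prompted
ssh-copy-id username@sapitwa.mlw.mw            # copies your public key to the cluster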

MFA Setup

Configure your authenticator app:

  1. Install one of the supported authenticator apps:
     • Microsoft Authenticator (recommended)
     • Google Authenticator
     • FreeOTP
  2. Visit https://sapitwa.mlw.mw
  3. Follow the MFA setup instructions

10. Acknowledgement and Publications

Standardized Acknowledgement Format

"Computational resources were provided by the Sapitwa High Performance Computing (HPC) Cluster at the Malawi Liverpool Wellcome Research Programme (MLW), funded through the Wellcome core award to MLW (206545/Z/17/Z) and institutional funding support from the Liverpool School of Tropical Medicine."

Notify HPC administrators of publications for impact reporting.

11. Support and Training

Getting Help

Primary Support

  • Email: hpc-support@mlw.mw
  • Onboarding training provided for new users
  • Microsoft Teams: HPC Community Channel
  • Hours: Monday-Friday, 8:00 AM - 5:00 PM

Emergency Support

For after-hours emergencies affecting critical research workflows, use the Microsoft Teams HPC Community Channel.

12. Incident Reporting

Reporting Requirements

  • Report outages and failures to the HPC support team
  • Report security incidents to MLW IT security

13. Account Termination and Data Retention

End of Service

  • Accounts are terminated upon project end or staff departure
  • Data in /head/NFS/$USER is retained for 30 days post-termination

System Maintenance

Maintenance Schedule

  • Regular Updates: Last Monday of each month
  • Emergency Maintenance: 24-hour notice when possible
  • System Status: Check the Teams channel