Managed to get access to Meluxina; the SLURM batch job system takes some getting used to. I am trying to get axolotl.ai installed, but keep running out of disk space. Will try some more in the evening and report back / share instructions on how to get a batch job running.
For reference, here is how to configure a batch job on Meluxina with axolotl for fine-tuning models (note: I could not test sharding across nodes yet since the disk was full):
#!/bin/bash -l
#SBATCH --job-name=neurocti-hunting-meluxina
# --account takes the project ID, not your user ID (this is confusing!)
#SBATCH --account=<project_id>
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=10:00:00
#SBATCH --output=logs/neurocti-hunting-%j.out
#SBATCH --error=logs/neurocti-hunting-%j.err
#SBATCH --qos=default
set -e
# --------------------------
# conda
# --------------------------
source $HOME/miniconda3/etc/profile.d/conda.sh
# one-time environment setup (run once, then keep commented out):
#conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
#conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
#conda create -n axo312 python=3.12 -y
conda activate axo312
# --------------------------
# Parameters for the trainer
# --------------------------
export WANDB_API_KEY="<your WANDB API KEY>"
export HF_TOKEN="<YOUR HF TOKEN>"
PRECISION="bf16"
LORA_RANK=32
EPOCHS=5
LEARNING_RATE=0.0002
SEQUENCE_LENGTH=8192
VERSION="v2-meluxina" # bump this with every proper release
PARAM_STR="${PRECISION}-r${LORA_RANK}-lr${LEARNING_RATE}-sl${SEQUENCE_LENGTH}-e${EPOCHS}-${VERSION}"
# bash string interpolation, not Python f-strings
MODELNAME="neurocti-qwen3-32b-orion10k-instruct-${PARAM_STR}"
SHORT_MODELNAME="neurocti-qwen3-32b-orion10k-instruct-${VERSION}"
HF_MODELNAME="ctitools/${SHORT_MODELNAME}"
YAML_CONFIG="${MODELNAME}.yaml"
export AXOLOTL_DO_NOT_TRACK=1
# --------------------------
# NCCL configuration
# --------------------------
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1
export NCCL_P2P_LEVEL=NVL
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MASTER_ADDR=$(hostname)
# Optional but recommended
export HF_HOME=${SLURM_TMPDIR:-$HOME}/hf_cache
mkdir -p "$HF_HOME"
# ------------------------
# pip install all
# ------------------------
python -V
python -m pip --version
module load GCC
gcc --version
python -m pip install -U pip setuptools wheel packaging ninja
# install torch first, matching the cluster's CUDA version; example only,
# e.g. add --index-url https://download.pytorch.org/whl/cu121 for CUDA 12.1 wheels:
python -m pip install torch torchvision
python -m pip install --no-build-isolation "axolotl"
hash -r
axolotl fetch examples
# ------------------------
# train
# -------------------------
mkdir -p ctitools
axolotl preprocess "$YAML_CONFIG"
axolotl train "$YAML_CONFIG"
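The batch script assumes the axolotl YAML config ($YAML_CONFIG) already exists in the working directory; I haven't shown mine yet. As a hedged sketch only: a minimal LoRA config could look like the following, with field names taken from axolotl's published examples but every value (base model id, filename, alpha, batch sizes, dataset placeholder) an untested assumption of mine, not the actual run config:

```shell
# Write a minimal axolotl LoRA config sketch; rename the file to match
# whatever $YAML_CONFIG resolves to in the batch script above.
cat > neurocti-example.yaml <<'EOF'
base_model: Qwen/Qwen3-32B    # assumed Hub id for the base model
adapter: lora
lora_r: 32                    # mirrors LORA_RANK in the batch script
lora_alpha: 64                # illustrative; commonly set to 2x lora_r
lora_dropout: 0.05
lora_target_linear: true
sequence_len: 8192            # mirrors SEQUENCE_LENGTH
num_epochs: 5                 # mirrors EPOCHS
learning_rate: 0.0002         # mirrors LEARNING_RATE
bf16: true                    # mirrors PRECISION
micro_batch_size: 1
gradient_accumulation_steps: 4
datasets:
  - path: <your dataset>      # placeholder, fill in your dataset id
    type: chat_template
output_dir: ./ctitools
EOF
```

One practical note on submission: run mkdir -p logs before sbatch, since sbatch does not create the logs/ directory named in --output/--error, and then watch the job with squeue -u $USER.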