Managed to get access to Meluxina; the SLURM batch job system takes some getting used to. I am trying to get axolotl.ai installed, but keep running out of disk space. Will try some more in the evening and report back / share instructions on how to get a batch job running.
For reference, here is how to configure a batch job on Meluxina with axolotl for fine-tuning models (note: I could not test sharding across nodes yet since the disk was full):
#!/bin/bash -l
#SBATCH --job-name=neurocti-hunting-meluxina
# --account takes the project ID, not your user ID (this is confusing!)
#SBATCH --account=<project_id>
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=10:00:00
#SBATCH --output=logs/neurocti-hunting-%j.out
#SBATCH --error=logs/neurocti-hunting-%j.err
#SBATCH --qos=default
set -e
# --------------------------
# conda
# --------------------------
source $HOME/miniconda3/etc/profile.d/conda.sh
# one-time environment setup (run once, then keep commented out):
#conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
#conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
#conda create -n axo312 python=3.12 -y
conda activate axo312
# --------------------------
# Parameters for the trainer
# --------------------------
export WANDB_API_KEY="<your WANDB API KEY>"
export HF_TOKEN="<YOUR HF TOKEN>"
PRECISION="bf16"
LORA_RANK=32
EPOCHS=5
LEARNING_RATE=0.0002
SEQUENCE_LENGTH=8192
VERSION="v2-meluxina" # bump this with every proper release
PARAM_STR="${PRECISION}-r${LORA_RANK}-lr${LEARNING_RATE}-sl${SEQUENCE_LENGTH}-e${EPOCHS}-${VERSION}"
# bash string interpolation, not Python f-strings
MODELNAME="neurocti-qwen3-32b-orion10k-instruct-${PARAM_STR}"
SHORT_MODELNAME="neurocti-qwen3-32b-orion10k-instruct-${VERSION}"
HF_MODELNAME="ctitools/${SHORT_MODELNAME}"
YAML_CONFIG="${MODELNAME}.yaml"
export AXOLOTL_DO_NOT_TRACK=1
# --------------------------
# NCCL configuration
# --------------------------
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1
export NCCL_P2P_LEVEL=NVL
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MASTER_ADDR=$(hostname)
# Optional but recommended
export HF_HOME=${SLURM_TMPDIR:-$HOME}/hf_cache
mkdir -p "$HF_HOME"
# ------------------------
# pip install all
# ------------------------
python -V
python -m pip --version
module load GCC
gcc --version
python -m pip install -U pip setuptools wheel packaging ninja
# install torch first, matching the cluster's CUDA version; example only,
# e.g. add --index-url https://download.pytorch.org/whl/cu121 for CUDA 12.1 wheels:
python -m pip install torch torchvision
python -m pip install --no-build-isolation "axolotl"
hash -r
axolotl fetch examples
# ------------------------
# train
# -------------------------
mkdir -p ctitools
axolotl preprocess "$YAML_CONFIG"
axolotl train "$YAML_CONFIG"
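The batch script assumes the axolotl YAML config ($YAML_CONFIG) already exists in the working directory; I haven't shown mine yet. As a hedged sketch only: a minimal LoRA config could look like the following, with field names taken from axolotl's published examples but every value (base model id, filename, alpha, batch sizes, dataset placeholder) an untested assumption of mine, not the actual run config:

```shell
# Write a minimal axolotl LoRA config sketch; rename the file to match
# whatever $YAML_CONFIG resolves to in the batch script above.
cat > neurocti-example.yaml <<'EOF'
base_model: Qwen/Qwen3-32B    # assumed Hub id for the base model
adapter: lora
lora_r: 32                    # mirrors LORA_RANK in the batch script
lora_alpha: 64                # illustrative; commonly set to 2x lora_r
lora_dropout: 0.05
lora_target_linear: true
sequence_len: 8192            # mirrors SEQUENCE_LENGTH
num_epochs: 5                 # mirrors EPOCHS
learning_rate: 0.0002         # mirrors LEARNING_RATE
bf16: true                    # mirrors PRECISION
micro_batch_size: 1
gradient_accumulation_steps: 4
datasets:
  - path: <your dataset>      # placeholder, fill in your dataset id
    type: chat_template
output_dir: ./ctitools
EOF
```

One practical note on submission: run mkdir -p logs before sbatch, since sbatch does not create the logs/ directory named in --output/--error, and then watch the job with squeue -u $USER.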