This is the setup I use for my Vast machines. It may not work on every system, especially where GRUB settings or motherboard options vary by platform.
For performance and VM compatibility, set the following:
Required for motherboards using MCIO ports and risers:
x16 for each pair of ports supporting GPUs.
x4x4/x4x4 for pairs supporting NVMe drives.
x8/x8, or whatever is required, for ports supporting NICs.
Other recommended settings, if your BIOS exposes them:
Vast now supports Ubuntu 24.04, but it still has issues and Vast is not fully optimized for it yet.
I recommend Ubuntu 22.04.3 Server LTS rather than 22.04.5. See the kernel notes below for the reasoning.
Format the data drive or RAID array as xfs and mount it at /var/lib/docker.
80 GB for the ext4 OS partition.
Kernel 5.15 on Ubuntu 22.04. Most people use HWE / 6.8, but that has not worked reliably for me on Genoa.
With kernel 5.15, the NVML fix is required.
Kernel 6.8 (HWE) works for most configurations as long as you install Ubuntu 22.04.3. Something changed in 22.04.4, causing NVIDIA driver issues and CUDA version conflicts. It is fine to upgrade via apt after install, but start from 22.04.3.
Edit the GRUB config:
sudo nano /etc/default/grub
Replace the first section with the following. This includes the performance-related settings I use and preserves the ability to switch kernels from the boot menu. This is intended for AMD Genoa systems and may not work on all hardware combinations.
GRUB_DEFAULT=saved
GRUB_SAVEDEFAULT=true
GRUB_TIMEOUT_STYLE=menu
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on iommu=pt pci=pcie_bus_perf nvidia_drm.modeset=0 systemd.unified_cgroup_hierarchy=false pcie_aspm=off rd.driver.blacklist=nouveau modprobe.blacklist=nouveau"
GRUB_CMDLINE_LINUX=""
Then run:
sudo update-grub
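After rebooting, it can help to confirm that the parameters actually reached the running kernel. The sketch below is one way to do that. `missing_cmdline_params` is a hypothetical helper name, not part of GRUB or Ubuntu, and the parameters in the usage comment are taken from the GRUB_CMDLINE_LINUX_DEFAULT line above.

```shell
# Hypothetical helper: print every expected parameter that is missing
# from the kernel command line passed as the first argument.
missing_cmdline_params() {
    local cmdline=" $1 "   # pad with spaces so whole words can be matched
    shift
    local param
    for param in "$@"; do
        case "$cmdline" in
            *" $param "*) ;;                 # present, nothing to report
            *) printf '%s\n' "$param" ;;     # absent, report it
        esac
    done
}

# Usage after a reboot (prints nothing if everything is active):
# missing_cmdline_params "$(cat /proc/cmdline)" amd_iommu=on iommu=pt pcie_aspm=off
```

Any parameter it prints was not picked up, which usually means update-grub was not run or a different boot entry was selected.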
Do this if you created your RAID array during Ubuntu install or set up a separate data drive during install.
Run:
sudo nano /etc/fstab
Ensure the xfs RAID array or separate data drive has the following options, specifically prjquota:
UUID="cf5ada4e-a7b6-4682-a071-6d0bc4c5ac78" /var/lib/docker xfs rw,auto,prjquota 0 0
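Because a missing prjquota option is easy to overlook, a small check like the following may be useful before rebooting. It is only a sketch: `docker_mount_has_prjquota` is a hypothetical helper name, and it assumes the entry uses the /var/lib/docker mount point and xfs type shown above.

```shell
# Hypothetical helper: succeed only if the given fstab file contains an
# uncommented xfs entry for /var/lib/docker whose options include prjquota.
docker_mount_has_prjquota() {
    awk '$1 !~ /^#/ && $2 == "/var/lib/docker" && $3 == "xfs" {
             n = split($4, opts, ",")          # options are comma-separated
             for (i = 1; i <= n; i++)
                 if (opts[i] == "prjquota") found = 1
         }
         END { exit (found ? 0 : 1) }' "$1"
}

# Usage:
# docker_mount_has_prjquota /etc/fstab && echo "prjquota is set"
```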
Do this if you installed Ubuntu without formatting your separate data drive during setup.
Identify the drive.
lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT
Example output:
nvme0n1 7.68T
nvme1n1 7.68T
nvme2n1 7.68T
nvme3n1 7.68T
nvme4n1 953.9G LVM2_member
In this example, nvme4n1 is the OS disk, and one of the 7.68 TB drives is unused. The commands below use nvme0n1 as the data drive.
Partition the drive.
sudo parted /dev/nvme0n1
(parted) mklabel gpt
(parted) mkpart primary xfs 0% 100%
(parted) quit
Format the partition.
sudo mkfs.xfs -f -b size=4096 -m reflink=1 /dev/nvme0n1p1
Create the mount point.
sudo mkdir -p /var/lib/docker
Get the UUID.
sudo blkid /dev/nvme0n1p1
# Example: /dev/nvme0n1p1: UUID="abcd-1234-5678-efgh" TYPE="xfs"
Update fstab.
sudo nano /etc/fstab
Add the following line, replacing the UUID with the one from blkid:
UUID="abcd-1234-5678-efgh" /var/lib/docker xfs rw,auto,prjquota 0 0
Mount and verify.
sudo mount -a
df -h | grep /var/lib/docker
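To avoid copying the UUID by hand, the fstab line can be generated instead. This is a minimal sketch: `docker_fstab_line` is a hypothetical helper name, and the usage comment assumes the partition is /dev/nvme0n1p1 as in the steps above.

```shell
# Hypothetical helper: print the fstab entry for the Docker data
# filesystem, given its UUID.
docker_fstab_line() {
    printf 'UUID="%s" /var/lib/docker xfs rw,auto,prjquota 0 0\n' "$1"
}

# Usage (review the printed line before appending it to /etc/fstab):
# uuid=$(sudo blkid -s UUID -o value /dev/nvme0n1p1)
# docker_fstab_line "$uuid"
# docker_fstab_line "$uuid" | sudo tee -a /etc/fstab
```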
If your OS disk was not fully allocated during install:
Check the current disk layout.
lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT
Example output:
nvme4n1 931.5G
├─nvme4n1p1 1G vfat /boot/efi
└─nvme4n1p2 200G ext4 /
Double-check that it is the expected root partition:
df -h /
Example output:
Filesystem Size Used Avail Use% Mounted on
/dev/nvme4n1p2 196G 30G 157G 16% /
Expand the partition, replacing nvme4n1 and 2 with your actual disk and root partition number. growpart takes the disk and the partition number as separate arguments.
sudo growpart /dev/nvme4n1 2
If growpart is not available:
sudo apt install cloud-guest-utils
Resize the filesystem.
For an ext4 root filesystem:
sudo resize2fs /dev/nvme4n1p2
Verify the expansion.
df -h /
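growpart wants the disk and the partition number separately, while df reports them as one device name. The sketch below splits one into the other. `split_partition` is a hypothetical helper name, and it assumes the root filesystem sits directly on a partition (not on LVM, where the device would be under /dev/mapper).

```shell
# Hypothetical helper: split a partition device into "disk number".
# NVMe, MMC, md and loop devices separate the number with a "p";
# SCSI/SATA style names (sda1) do not.
split_partition() {
    case "$1" in
        /dev/nvme*|/dev/mmcblk*|/dev/md*|/dev/loop*)
            printf '%s\n' "$1" | sed -E 's/^(.*[0-9])p([0-9]+)$/\1 \2/' ;;
        *)
            printf '%s\n' "$1" | sed -E 's/^(.*[^0-9])([0-9]+)$/\1 \2/' ;;
    esac
}

# Usage:
# split_partition /dev/nvme4n1p2    # disk and number to pass to growpart
```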
Do this if you did not set up your RAID array during Ubuntu install.
First, install a convenience tool:
sudo apt install nvme-cli
List the drives and identify which are data drives and which is the OS drive:
sudo nvme list
Example output:
Node SN Model Namespace Usage Format FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 S526NY0MA00246 VO007680KWVMV 1 7.68 TB / 7.68 TB 4 KiB + 0 B HPK3
/dev/nvme1n1 S526NY0MA00245 VO007680KWVMV 1 7.68 TB / 7.68 TB 4 KiB + 0 B HPK3
/dev/nvme2n1 S4FVNY0M300461 VO007680KWVMV 1 7.68 TB / 7.68 TB 4 KiB + 0 B HPK3
/dev/nvme3n1 S4FVNY0M300371 VO007680KWVMV 1 7.68 TB / 7.68 TB 4 KiB + 0 B HPK3
/dev/nvme4n1 20091710240508 PCIe SSD 1 1.02 TB / 1.02 TB 512 B + 0 B ECFM22.7
Optional: check the LBA formats for your drives and switch to the one marked best.
sudo nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"
Example output:
[3:0] : 0x2 Current LBA Format Selected
LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0x1 Better
LBA Format 1 : Metadata Size: 8 bytes - Data Size: 512 bytes - Relative Performance: 0x3 Degraded
LBA Format 2 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best (in use)
LBA Format 3 : Metadata Size: 8 bytes - Data Size: 4096 bytes - Relative Performance: 0x2 Good
Optional: reformat the drive with the LBA format ID marked best in the output above (2 in this example). This erases all data on the drive.
sudo nvme format -l 2 -f /dev/nvme0n1
Repeat for each data drive.
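When repeating this across several drives, the format ID can be read from the id-ns output rather than by eye. This is a rough sketch that assumes the output looks like the example above, with one line per format and the word "Best" on the preferred one. `best_lba_format` is a hypothetical helper name.

```shell
# Hypothetical helper: read "nvme id-ns -H" output on stdin and print the
# ID of the first LBA format whose relative performance is marked "Best".
best_lba_format() {
    awk '/^LBA Format/ && /Best/ { print $3; exit }'
}

# Usage (only prints the ID; it does not format anything):
# sudo nvme id-ns -H /dev/nvme0n1 | best_lba_format
```

If it prints nothing, the drive's output does not match this layout and the format should be chosen manually.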
Manually increase the mdadm sync speed limit:
echo 2000000 | sudo tee /proc/sys/dev/raid/speed_limit_max
Create the RAID array, changing [0-3] to the correct drive numbers. This example creates a RAID 10 array. Adjust --level and --raid-devices for your setup.
sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme[0-3]n1
Wait for the array to build. This may take a while depending on drive size. To check status:
cat /proc/mdstat
Example output:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid10 nvme3n1[3] nvme2n1[2] nvme1n1[1] nvme0n1[0]
15002667520 blocks super 1.2 128K chunks 2 near-copies [4/4] [UUUU]
[>....................] resync = 0.0% (3200256/15002667520) finish=1249.8min speed=200016K/sec
bitmap: 112/112 pages [448KB], 65536KB chunk
unused devices: <none>
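Rather than re-running cat by hand, the wait can be scripted. The following is a sketch that assumes an in-progress build shows a "resync =" (or recovery, reshape, check) line as in the output above. `md_sync_in_progress` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed while the given mdstat file still shows a
# resync, recovery, reshape or check progress line.
md_sync_in_progress() {
    grep -Eq '(resync|recovery|reshape|check) *=' "$1"
}

# Usage: poll once a minute until the array has finished building.
# while md_sync_in_progress /proc/mdstat; do sleep 60; done
# echo "array sync finished"
```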
Once the array finishes building, create the xfs filesystem:
sudo mkfs.xfs -f -b size=4096 -m reflink=1 /dev/md0
Create the mount directory:
sudo mkdir -p /var/lib/docker
Get the array UUID:
sudo blkid /dev/md0
Example output:
/dev/md0: UUID="abcd-1234-5678-efgh" TYPE="xfs"
Update fstab:
sudo nano /etc/fstab
Add the following line, replacing the UUID with the one from the previous step:
UUID="cf5ada4e-a7b6-4682-a071-6d0bc4c5ac78" /var/lib/docker xfs rw,auto,prjquota 0 0
Mount and verify:
sudo mount -a
mount | grep /var/lib/docker
Example output:
/dev/md0 on /var/lib/docker type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=2048,prjquota)
Confirm the mount size:
df -h
Example output:
Filesystem Size Used Avail Use% Mounted on
tmpfs 38G 51M 38G 1% /run
/dev/mapper/ubuntu--vg-ubuntu--lv 935G 36G 852G 5% /
tmpfs 189G 372K 189G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 4.0M 0 4.0M 0% /sys/fs/cgroup
/dev/nvme4n1p2 2.0G 131M 1.7G 8% /boot
/dev/nvme4n1p1 1.1G 6.1M 1.1G 1% /boot/efi
tmpfs 38G 4.0K 38G 1% /run/user/1000
/dev/md0 14T 104G 14T 1% /var/lib/docker
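The 14T figure can look low next to four 7.68 TB drives. RAID 10 keeps two copies of everything, and df -h reports binary units (TiB) while drive labels use decimal TB. The arithmetic, as a small sketch with a hypothetical helper name `raid10_usable_tib`:

```shell
# Hypothetical helper: usable RAID 10 capacity in TiB for N drives of a
# given size in decimal TB. Half of the raw space holds the mirror copies.
raid10_usable_tib() {
    LC_ALL=C awk -v n="$1" -v tb="$2" 'BEGIN {
        bytes = n * tb * 1e12 / 2                 # usable bytes after mirroring
        printf "%.2f\n", bytes / 1099511627776    # 2^40 bytes per TiB
    }'
}

# Usage with the four 7.68 TB drives from the example:
# raid10_usable_tib 4 7.68    # prints 13.97
```

That is roughly the 14T shown by df, before filesystem overhead.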
Run ip a and note the MAC addresses for your NICs.
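On machines with many NICs it may be easier to list just the names and MAC addresses. This sketch filters the one-line-per-interface output of `ip -o link show`. `nic_macs` is a hypothetical helper name, and the parsing assumes the usual "link/ether <mac>" layout.

```shell
# Hypothetical helper: read "ip -o link show" output on stdin and print
# "<interface> <mac>" for every Ethernet interface.
nic_macs() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "link/ether") {
                name = $2
                sub(/:$/, "", name)    # drop the trailing colon after the name
                print name, $(i + 1)
            }
    }'
}

# Usage:
# ip -o link show | nic_macs
```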
Edit the netplan config:
sudo nano /etc/netplan/01-network.yaml
Example configuration:
network:
  version: 2
  ethernets:
    eth0:
      match:
        macaddress: 9c:6b:00:74:12:e8 # from "ip a"
      set-name: eth0
      addresses:
        - 192.168.30.3/24 # Machine IP and subnet
      nameservers:
        addresses:
          - 192.168.30.1 # DNS server, or gateway if it forwards DNS
      routes:
        - to: 0.0.0.0/0
          via: 192.168.30.1 # Gateway IP
      dhcp4: false
      dhcp6: false
    eth1: # Secondary interface; duplicate as needed for additional NICs
      match:
        macaddress: 9c:6b:00:74:12:e9
      set-name: eth1
      dhcp4: true
      dhcp6: false
      optional: true # Continue boot without delay if this interface is disconnected
Replace the eth names, MAC addresses, and IPs as needed.
Update permissions for the new netplan file:
sudo chmod 600 /etc/netplan/01-network.yaml
Delete the old cloud-init file:
sudo rm /etc/netplan/50-cloud-init.yaml
Apply netplan:
sudo netplan generate
sudo netplan apply
Install the prerequisites:
sudo apt-get install build-essential gcc-multilib dkms pkg-config libglvnd-dev
Download the NVIDIA driver version you want to use. Example:
mkdir drivers
cd drivers
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/595.58.03/NVIDIA-Linux-x86_64-595.58.03.run
sudo chmod +x NVIDIA-Linux-x86_64-595.58.03.run
Install the driver:
sudo ./NVIDIA-Linux-x86_64-595.58.03.run
Choose the MIT/Open option for the kernel modules if that is what you intend to use, and answer Yes to the remaining prompts as appropriate for your setup.
When the install completes, run nvidia-smi and make sure the GPU or GPUs appear.
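On multi-GPU machines it is easy to miss one card that failed to initialize. The sketch below compares the number of GPUs listed by `nvidia-smi -L` with the number you expect. `gpu_count_matches` is a hypothetical helper name, and the count of 8 in the usage comment is only an example.

```shell
# Hypothetical helper: read "nvidia-smi -L" output on stdin and succeed
# only if the number of listed GPUs equals the expected count.
gpu_count_matches() {
    local expected=$1 found
    # grep -c prints 0 and exits non-zero when nothing matches, so ignore
    # its exit status and rely on the printed count.
    found=$(grep -c '^GPU [0-9][0-9]*:' || true)
    [ "$found" -eq "$expected" ]
}

# Usage:
# nvidia-smi -L | gpu_count_matches 8 || echo "GPU count mismatch"
```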
Note: CoolerControl is no longer recommended if you are enabling VMs, which is now Vast's default.
Installation instructions:
https://docs.coolercontrol.org/installation/debian.html
Enable the web UI:
sudo systemctl stop coolercontrold
sudo nano /etc/coolercontrol/config.toml
Find the IP and port settings near the bottom, uncomment them, and either enter your machine IP or 0.0.0.0 to listen on all interfaces:
port = 11987
ipv4_address = "0.0.0.0"
Start the service:
sudo systemctl start coolercontrold
Access the dashboard at http://[ip]:11987, or whatever port you chose.
Note: If you do not click the copy icon, the install may fail.
If you are using a RAID array for Docker data, add --raidgpt to the install command. Example:
wget https://console.vast.ai/install -O install; sudo python3 install [APIKEY] --raidgpt; history -d $((HISTCMD-1));
Run the install command.
If you are using kernel 5.15, run the NVML fix:
sudo wget https://raw.githubusercontent.com/jjziets/vasttools/main/nvml_fix.py
sudo python3 nvml_fix.py
Rerun the installer.
Confirm that the machine appears in the Vast console.
Add the port range to the Vast service, replacing it with your intended ports:
sudo bash -c 'echo -n "16384-32768" > /var/lib/vastai_kaalia/host_port_range'
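A typo in the range is easy to make and hard to spot later, so it may be worth validating the string before writing it. This is a sketch only: `valid_port_range` is a hypothetical helper name, and it assumes the file expects the "start-end" form used in the command above.

```shell
# Hypothetical helper: succeed only if the argument looks like "start-end"
# with 1 <= start <= end <= 65535.
valid_port_range() {
    case "$1" in
        ''|*[!0-9-]*|-*|*-|*-*-*) return 1 ;;   # empty, stray characters, or misplaced dashes
        *-*) ;;                                  # exactly one dash between two numbers
        *) return 1 ;;                           # a single number, no range
    esac
    local start=${1%-*} end=${1#*-}
    [ "$start" -ge 1 ] && [ "$end" -le 65535 ] && [ "$start" -le "$end" ]
}

# Usage:
# range="16384-32768"
# valid_port_range "$range" && sudo bash -c "echo -n '$range' > /var/lib/vastai_kaalia/host_port_range"
```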
If you used the single-GPU troubleshooting path, shut down and reattach the remaining GPUs.
Start the server.
Run:
sudo python3 /var/lib/vastai_kaalia/send_mach_info.py
If the speedtest shows incorrect values (often because it hit a slow server), run the following a few times to refresh the average:
sudo python3 /var/lib/vastai_kaalia/send_mach_info.py --speedtest