Running GPU-GPAW with A100 nodes on Niflheim
To get started, make sure you have built a virtual environment with the gpaw_venv.py script described on the page Compiling GPAW on Niflheim.
This will automatically build the latest GPU version of GPAW.
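Once the environment is built, a quick way to confirm that the GPU stack works is to check that CuPy (which GPAW's GPU backend relies on) can see a device. This is only a sketch; run it inside the activated venv on an a100 node:

  # Minimal sanity check: the CUDA runtime should report at least one device.
  # Run inside the activated virtual environment on an a100 node.
  import cupy

  ngpus = cupy.cuda.runtime.getDeviceCount()
  print(f'CuPy sees {ngpus} GPU(s)')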
Add the A100 nodes to your myqueue configuration. Type

  mq info

to locate the myqueue root directory, open the config.py file in there, and add the following items to the list of nodes:

  ('a100G1', {'nodename': 'a100', 'cores': 128, 'memory': '512000M',
              'extra_args': ['--gpus-per-node=1', '--mem=100G']}),
  ('a100G2', {'nodename': 'a100', 'cores': 128, 'memory': '512000M',
              'extra_args': ['--gpus-per-node=2', '--mem=200G']}),
  ('a100G4', {'nodename': 'a100', 'cores': 128, 'memory': '512000M',
              'extra_args': ['--gpus-per-node=4', '--mem=400G']}),
This will allow you to run calculations with 1, 2 or 4 GPUs (4 GPUs equals a full node).
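For orientation, here is a sketch of where these entries go, assuming the usual myqueue config.py layout with a top-level config dictionary; your file will already contain other settings and node definitions:

  # config.py in the myqueue root (path reported by mq info) -- sketch only
  config = {
      'scheduler': 'slurm',
      'nodes': [
          # ... existing node definitions (epyc96, xeon nodes, ...) ...
          ('a100G4', {'nodename': 'a100', 'cores': 128, 'memory': '512000M',
                      'extra_args': ['--gpus-per-node=4', '--mem=400G']}),
          # ... plus the a100G1 and a100G2 entries listed above ...
      ],
  }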
Use the latest gpaw_venv.py script to build the environment. It will automatically set the GPAW_NEW=1 and GPAW_USE_GPUS=1 environment variables when running on a100 nodes, so you only have to submit to the a100 nodes and the GPUs should be used automatically. Check the wave-function memory section of the output; it should show storage: GPU.

If you do not want to use the latest gpaw_venv.py, you need to prepare your script for GPUs yourself. Running on GPUs requires the new GPAW code, so either add export GPAW_NEW=1 to your virtual environment or import the calculator from the new GPAW interface directly. In addition, pass parallel={'gpu': True, ...} to GPAW, or set the GPAW_USE_GPUS=1
environment variable. Here is an example script to relax a nanostructure (note that you can keep the absolute path, if you want to rerun this test):

  def gpaw_test(atoms, gpu=False):
      kpts = {'density': 4}
      params = {'convergence': {'density': 1e-06},
                'kpts': kpts,
                'random': True,
                'mode': {'ecut': 800, 'name': 'pw', 'force_complex_dtype': True},
                'occupations': {'name': 'fermi-dirac', 'width': 0.05},
                'mixer': {'method': 'fullspin', 'backend': 'pulay'},
                'txt': f'relax_gpu_{gpu}.txt',
                'parallel': {'gpu': gpu},
                'xc': 'LDA'}
      from gpaw.new.ase_interface import GPAW
      atoms.calc = GPAW(**params)
      E = atoms.get_potential_energy()
      F = atoms.get_forces()
      return E, F


  from ase.io import read
  atoms = read('/home/niflheim/kuisma/benchmarks/bilayer_example.json')
  import sys
  gpaw_test(atoms, gpu=eval(sys.argv[1]))
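After the GPU run has finished, you can verify that the wave functions were indeed kept in GPU memory by looking for the storage: GPU line mentioned above. A minimal sketch (the surrounding wording of the memory section may differ slightly between GPAW versions):

  # Print all storage lines from the GPU run's output file;
  # the wave-function entry should read 'storage: GPU'.
  with open('relax_gpu_True.txt') as fd:
      for line in fd:
          if 'storage' in line.lower():
              print(line.rstrip())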
Submit with myqueue. We will submit two calculations: one on a full A100 node (4 GPUs) and one on the fastest CPU node type (epyc96). Always request a number of cores equal to the total number of GPUs:
  mq submit -R 4:a100G4:1h 'gpaw python gpu_example.py True'
  mq submit -R 96:epyc96:1h 'gpaw python gpu_example.py False'
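The smaller A100 resources defined above work the same way; for a single-GPU run you would request, for example, -R 1:a100G1:1h (again one core per GPU).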
The system is a 256-atom bilayer. The expected runtime for this system is 5 minutes on an A100 node and 20 minutes on a full epyc96 node:
  mq ls
  id      folder name args                           info res.         age   state time
  ─────── ────── ──── ────────────────────────────── ──── ──────────── ───── ───── ─────
  7839868 ./     gpaw python gpu_example.py False    +3   96:epyc96:1h 21:05 done  19:41
  7839870 ./     gpaw python gpu_example.py True     +3   4:a100G4:1h  20:56 done   4:59
For reference, the runtime is 40 minutes on the second-fastest node type, xeon56, so the current master version of GPAW provides a node-to-node speedup of a factor of 4 to 8, depending on the point of comparison. Note that the XC corrections are currently performed on the CPU in master, so the GPU code should become a further 20-60% faster, depending on the system, once certain merge requests are merged.
You may investigate the outputs of the calculations yourself, or you can look at the files already present on Niflheim:
  sdiff /home/niflheim/kuisma/benchmarks/relax_gpu_True.txt /home/niflheim/kuisma/benchmarks/relax_gpu_False.txt | less
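If you prefer a programmatic comparison over sdiff, a small sketch like the following reads the final total energy of each run (this assumes the standard GPAW text output, where the last Extrapolated: line holds the total energy in eV):

  # Compare the final extrapolated energies of the GPU and CPU runs.
  def final_energy(filename):
      energy = None
      with open(filename) as fd:
          for line in fd:
              if line.strip().startswith('Extrapolated:'):
                  energy = float(line.split()[-1])
      return energy

  e_gpu = final_energy('relax_gpu_True.txt')
  e_cpu = final_energy('relax_gpu_False.txt')
  print(f'GPU: {e_gpu:.6f} eV  CPU: {e_cpu:.6f} eV  diff: {abs(e_gpu - e_cpu):.2e} eV')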