Transition Guide

We are in a transition period that will make hundreds of GPU nodes on TX-GAIA available to the Supercloud community. This guide summarizes recent changes and what you need to do for a smooth transition. Below we describe the two major changes to the system, Changes to Nodes and Job Submission and Changes to Modules, as well as a few easily fixed FAQs that we've received since the transition. If nothing on this page answers your question, send us an email at supercloud@mit.edu.

Changes to Nodes and Job Submission

As of the February downtime, we no longer have the 25 Xeon E5-2650 nodes, and as of the March downtime the AMD Opteron nodes are no longer part of the normal Slurm partition. We have added ~200 of the new GAIA nodes to the system. These will be the default nodes going forward. The GAIA nodes have a different operating system than the login node and the Opteron nodes (this should not affect most users). You can see the stats for these new nodes on the Systems and Software page. These nodes are on both the normal and gpu partitions; we are okay with you using these nodes for CPU-only jobs. During this transition period, we highly recommend you test your code on these new nodes, as they will be the bulk of the system going forward.

The CPU type for these new nodes is xeon-g6 (Intel Xeon Gold 6248). It is the default for Slurm and for the LLsub commands, and can be requested directly by selecting “xeon-g6” for the CPU type.

If you would like to request a GPU on the new nodes, for now we recommend that you request 20 CPU cores to go with 1 GPU. This may change in the future. For example:

LLsub -i -s 20 -g volta:1

If you are running a batch job using Slurm options and you do not set the CPU type (--constraint), your job will be submitted to the new nodes. These new nodes will also be the default for the GPU partition.
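
For reference, here is a minimal batch-script sketch for requesting one GPU and 20 CPU cores on the new nodes with Slurm options. The script contents, module, and program name are placeholders, and the --gres syntax shown is the standard Slurm way of requesting a GPU rather than something prescribed by this guide:

#!/bin/bash
# Request 1 Volta GPU and 20 CPU cores on the new nodes
#SBATCH --gres=gpu:volta:1
#SBATCH -c 20
# Optional: request the new CPU type explicitly (it is already the default)
#SBATCH --constraint=xeon-g6

source /etc/profile
module load anaconda/2020a   # placeholder module
python my_gpu_script.py      # placeholder command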

UPDATE: As of the August downtime, we no longer have a separate partition called "gpu". We still have the same Xeon-G6 nodes with GPUs; they are all on the "normal" partition. You do not need to specify a partition when you submit a job.

Changes to Modules

We are moving toward a new module naming scheme.

The new module names are roughly of the format softwarename/version. For example, we have anaconda/2019b and anaconda/2020a available on the new GAIA nodes.
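
As a quick illustration of the new naming scheme (the exact list of modules on your system may differ), you can list and load modules like this:

# List the modules available on the node
module avail

# Load a module using the new softwarename/version format
module load anaconda/2020a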

If there is a package that you need that is not yet available on the new nodes, send us an email at supercloud@mit.edu.

FAQ

We've had a few questions regarding some changes and have some fixes or temporary workarounds for many of them. Here are a few:

The new nodes don't have software/library ______ that was available before.

If there is a missing library or piece of software, please send us an email at supercloud@mit.edu. Please be as specific as you can.

Software I've built previously is no longer working. What should I do?

You may need to rebuild any software you've built in your home directory; this may include some Julia or Python packages.
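
As a rough sketch of what that rebuild can look like (the package names below are placeholders, and your own workflow may differ), reinstalling a user-installed Python package and rebuilding a Julia package would look something like this:

# Reinstall a Python package that was installed into your home directory
pip install --user --force-reinstall mypackage

# Rebuild a Julia package from the command line
julia -e 'using Pkg; Pkg.build("MyPackage")'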

In an Interactive Job I get the error "bash: module: command not found" when I try to use a module command. How can I use modules?

This has to do with the way the OS on the new nodes sets up the environment when it starts. Until we have this fixed, run source /etc/profile at the command line when you start your interactive job (just as you would at the top of a submission script); after that the module command will work.
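
For example, using the interactive command from above (the module loaded here is just an illustration):

LLsub -i -s 20 -g volta:1
source /etc/profile
module load anaconda/2020a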

My batch jobs are no longer running as expected and I get errors saying "source: not found" and "module: not found" at the top of my log file. What can I do to fix this?

This is a very simple fix. You should change the first line of your submission script to be: #!/bin/bash.
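
A submission script that starts this way (the module and command below are placeholders) should no longer produce those errors:

#!/bin/bash
source /etc/profile
module load anaconda/2020a
python myscript.py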

How can I get the Julia-1.1.1 kernel to show up in Jupyter?

If your Julia 1.1 kernel is missing in Jupyter, it's not too hard to get it back. Open a terminal window and log into supercloud. Load the julia/1.1.1 module and start Julia. Run using IJulia. Then run the line kernelpath = IJulia.installkernel("Julia", "--project=@."). You can check if this worked by running readdir(".local/share/jupyter/kernels/"). You should see "julia-1.1" listed there. Finally, restart Jupyter and your Julia 1.1 kernel should be available. An example of this is below.

[studentx@login-1 ~]$ module load julia/1.1.1 
[studentx@login-1 ~]$ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.1.1 (2019-05-16)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using IJulia

julia> kernelpath = IJulia.installkernel("Julia", "--project=@.")

[ Info: Installing Julia kernelspec in /home/gridsan/studentx/.local/share/jupyter/kernels/julia-1.1
"/home/gridsan/studentx/.local/share/jupyter/kernels/julia-1.1"

julia> readdir(".local/share/jupyter/kernels/")
2-element Array{String,1}:
 "julia-0.6"
 "julia-1.1"

What is this message I get when I try to run MPI with the new OpenMPI module?

If you get something like the following message whenever you run MPI:

--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be used on a specific port. As such, the openib BTL (OpenFabrics support) will be disabled for this port.
Local host: d-7-13-2
Local device: mlx5_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
[d-7-13-2:77333] 11 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[d-7-13-2:77333] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

You can prevent this by adding some extra options to your mpirun command. The easiest way to do this is to set an environment variable with the options and then pass that variable to mpirun:

OPENMPI_OPTS="${OPENMPI_OPTS} --mca pml ob1 --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm --mca btl_openib_rroce_enable 1 "
OPENMPI_OPTS="${OPENMPI_OPTS} --mca btl_openib_receive_queues P,128,64,32,32,32:S,2048,1024,128,32:S,12288,1024,128,32:S,65536,1024,128,32 "

mpirun $OPENMPI_OPTS cmd

Here cmd is the command you are running with mpirun.
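
Put together in a submission script, this could look roughly like the following. The node and task counts, the OpenMPI module name, and the program name are placeholders; adjust them for your own job:

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=20

source /etc/profile
module load openmpi   # placeholder; use the OpenMPI module name shown by module avail

# Extra options to silence the OpenFabrics warning
OPENMPI_OPTS="${OPENMPI_OPTS} --mca pml ob1 --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm --mca btl_openib_rroce_enable 1 "
OPENMPI_OPTS="${OPENMPI_OPTS} --mca btl_openib_receive_queues P,128,64,32,32,32:S,2048,1024,128,32:S,12288,1024,128,32:S,65536,1024,128,32 "

mpirun $OPENMPI_OPTS ./my_mpi_program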