
Debugging

A collection of issues I've consistently encountered over time, and the fixes for them.

Exit code 137

Usually indicates some sort of Out-Of-Memory (OOM) error. This was the case on my HPC job when my np.load call was allocating close to 100GB of memory and getting the process killed. The tricky thing here is that, because the process is killed, there is very little execution trace left for debugging.

On the Swing cluster, while total node memory might be large (~1TB), job memory is allocated in proportion to the fraction of cores requested, so requesting ncpus=32 only gets you around 120GB. You can check with:

qstat -f $PBS_JOBID | grep mem
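If the OOM comes from a giant np.load, one workaround is to memory-map the array instead of reading it all into RAM at once. A minimal sketch (the file name and slice size here are hypothetical):

import numpy as np

# memory-map the array instead of allocating ~100GB of RAM up front;
# pages are read from disk lazily as slices are accessed
patches = np.load("train_patches.npy", mmap_mode="r")  # hypothetical file

batch = np.array(patches[:64])  # only this slice is materialized in memory
print(batch.shape, batch.dtype)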

channel 3: open failed

While working with Jupyter Notebooks on HPC clusters, I've gotten this confusing message many times after a kernel dies. I start seeing annoying lines consistently printed to my ssh terminal reading:

channel 3: open failed: connect failed: Connection refused

Without any other information, it can be hard to know how to get this error to stop printing, which can clutter the terminal.

Recently I learned what is happening: Jupyter notebooks in VSCode must connect to a Jupyter server, i.e. a Python process running jupyter notebook or jupyter lab. If you run locally, VSCode starts a local server in the background. In HPC environments, you do this by pasting the remote Jupyter server URL you get from running jupyter lab --no-browser --port=8866 into VSCode, e.g. http://localhost:8866/?token=abcd. VSCode saves the URL to a remote server list so it can quickly reconnect. However, if the URL becomes stale, VSCode does not know that and will keep attempting to connect, resulting in the error.

The easy fix is to open up the command palette Ctrl + Shift + P and click the option Jupyter: Clear Remote Server List.
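For reference, the workflow that produces these saved URLs looks roughly like this (the hostname is a placeholder; the port just has to match on both ends):

# on the cluster: start the Jupyter server and note the tokenized URL it prints
jupyter lab --no-browser --port=8866

# on the local machine: forward the port so http://localhost:8866 reaches the cluster,
# then paste the printed URL into VSCode when it asks for an existing Jupyter server
ssh -L 8866:localhost:8866 user@cluster-login-node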

$PATH Altered

When certain commands like qsub or qstat suddenly stop working because they cannot be found, the likely culprit is an altered $PATH environment variable. $PATH is a list of directories that the system searches for executable programs when you type a command. For example, if you type ls, the system looks through $PATH and finds it in /usr/bin.

Check by running echo $PATH. For PBS commands like pbsnodes specifically, the cause is usually that /opt/pbs/bin has been removed from the path.

In a clean shell on LCRC, my path list is:

PATH=/home/dma/miniconda3/bin:/home/dma/miniconda3/condabin:/usr/lpp/mmfs/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/opt/pbs/bin:/home/dma/.local/bin:/home/dma/bin
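If /opt/pbs/bin has dropped out of that list, a quick check and a temporary fix for the current shell session (assuming PBS lives at the usual /opt/pbs/bin location):

# confirm the command is really missing from the search path
which qsub || echo "qsub not found in PATH"

# re-append the PBS bin directory for this shell session only
export PATH="$PATH:/opt/pbs/bin"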

PyTorch CPU Contention

By default, torch sets its number of threads to the number of system cores in every process. This can unknowingly lead to massive CPU contention if you have multiple torch processes running.

I found out about this bug while trying to parallelize torch inference with multiprocessing. It turns out that torch operations in worker processes default num_threads to the total number of cores on the system. Consequently, the time for model inference actually increases across all processes as they end up fighting over CPU resources.

Specifically, my job was allocated 128 ncpus. Each process I created then set the torch number of threads to 128 by default, so each of, say, four processes attempted to perform multithreaded torch operations with 128 threads at the same time. The result was a massive slowdown of roughly 40x.

The fix was simply to run the following at worker process initialization:

torch.set_num_threads(8) # a small number
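For example, with a multiprocessing pool the thread count can be capped in a worker initializer so every worker starts with a sane value. This is a sketch; the pool size, thread count, and run_inference placeholder are illustrative, not my actual inference code:

import multiprocessing as mp
import torch

def init_worker():
    # cap intra-op threads per worker instead of the default (= all system cores)
    torch.set_num_threads(8)

def run_inference(item):
    # placeholder for the actual per-item torch inference
    ...

if __name__ == "__main__":
    with mp.Pool(processes=4, initializer=init_worker) as pool:
        results = pool.map(run_inference, range(16))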

Slow HPC GPFS Writes

For the preprocessing script for the SAR floodmapping dataset with 8000 tiles, I needed to make it robust and efficient for hundreds of gigabytes of patches. The strategy I devised was to have parallel workers load and patch the tiles in chunks, and save each chunk to file. The chunks would then be streamed into a final memory-mapped array. This strategy allowed for producing 300GB+ .npy files with limited RAM.

While the parallel patching and chunking was fast, I encountered a bottleneck with subsequently writing the chunks to a large memmapped array. The writes would happen sequentially. The problematic code would take close to two hours to finish for two chunk files with a total combined size of 30GB:

final_arr = np.lib.format.open_memmap(
    output_file, mode="w+", dtype=np.float32,
    shape=(total_patches, S2_DATASET_CHANNELS, size, size)
)

patch_offset = 0
for tmp_file in chunk_files:
    arr = np.load(tmp_file, mmap_mode="r")  # memory-map each chunk to avoid loading it fully into RAM
    n_patches = arr.shape[0]

    # this part is incredibly slow for large 10GB+ chunk files!!!!
    final_arr[patch_offset:patch_offset + n_patches, ...] = arr
    patch_offset += n_patches

One possibility for this bottleneck is the network bandwidth of GPFS on the Argonne clusters. I was making page-by-page writes to the filesystem through my project directory /lcrc/project/hydrosm/dma/data/preprocess. Instead, I realized I could greatly reduce the latency by using the local scratch space (~1TB in size) attached to the compute node. Since the hard drive is directly attached to the compute node and does not go through GPFS, it is much, much faster. With findmnt I learned that /scratch is mounted as a high performance XFS filesystem.
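A rough sketch of the staging pattern, assuming a PBS job where /scratch is node-local (the per-job subdirectory, file name, and toy shape are my own illustration, not the exact script):

import os
import shutil
import numpy as np

# write the big memmap on the node-local scratch disk instead of GPFS
scratch_dir = f"/scratch/{os.environ.get('PBS_JOBID', 'local')}"
os.makedirs(scratch_dir, exist_ok=True)
scratch_file = os.path.join(scratch_dir, "train_patches.npy")

final_arr = np.lib.format.open_memmap(
    scratch_file, mode="w+", dtype=np.float32, shape=(1000, 14, 64, 64)  # toy shape
)
# ... fill final_arr chunk by chunk as before ...
final_arr.flush()

# copy the finished array back to project storage on GPFS once, sequentially
shutil.copy(scratch_file, "/lcrc/project/hydrosm/dma/data/preprocess/train_patches.npy")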

The difference was staggering. While the original implementation for around 30GB of data without using the scratch disk took roughly 1200s (20 minutes) of runtime, the new implementation using the scratch disk took roughly 20s of runtime. This was a speedup of 60x! I managed to save myself 72 hours of potential wait time processing the final large dataset.

Key Idea

The point of the scratch directory is to have a locally mounted hard drive on the node for fast job-related data storage. Since it is physically attached, it has much better performance than the GPFS storage, which is separate from the node and lives on a dedicated storage server cluster like /lcrc. The network latency of communicating between servers adds a lot of overhead compared to simply storing files temporarily on the scratch disk.

Here is a good summary:

All nodes have their own local storage mounted as /scratch. The /scratch storage is fast - faster than system-wide storage such as home folders and lab storage - which makes it ideal for holding intermediate data files. This will also lower the load on the system-wide storage and the local network. Using local /scratch is a win-win for everyone.

With scratch directories, you do not need to worry about deleting files and cleaning up during the job, as it is automatically cleared once the job ends.

I asked Jeremy whether there is a specific HPC doc that would teach you these things, but he said he learned it all through trial and error. The docs might only allude to them: they will mention the scratch directory, but only by digging into why it exists do you learn about these problems.

At this point I realize I need to learn more about the HPC system, specifically the scratch space and the GPFS filesystem.

No Speedup from Multiple GPUs

Do Not Forget

One of the most headache-inducing bugs I have encountered, with a simple one-line fix...

This serves as a strong anecdote for pursuing bugs through source code. It took me many hours of profiling and isolation to find an implicit spawn (instead of fork) multiprocessing context in conjunction with torch dataloader workers, leading to unwanted memory duplication and OOM.

While investigating why my distributed training runs for my SAR flood segmentation model were not speeding up training relative to single GPU runs, I learned a couple of things about memory management. First, I found that in my Dataset initialization I was copying the memmapped array fully into RAM with np.ascontiguousarray, which led to some obscure SIGKILL OOM errors. As you can see, I intended it to work well in the fully in-RAM use case:

class FloodSampleSARDataset(Dataset):
    def __init__(self, sample_dir, channels=[True] * 8, typ="train", random_flip=False,
                 transform=None, seed=3200, mmap_mode=None):
        self.sample_dir = Path(sample_dir)
        self.channels = channels + [True] * 6  # always keep label, tci, nlcd, scl
        self.typ = typ
        self.random_flip = random_flip
        self.transform = transform
        self.seed = seed
        self.mmap_mode = mmap_mode
        base = np.load(self.sample_dir / f"{typ}_patches.npy", mmap_mode=mmap_mode)

        # One-time channel selection to avoid per-sample advanced indexing copies
        if not all(self.channels):
            # copies entire dataset as contiguous block in memory!
            base = np.ascontiguousarray(base[:, self.channels, :, :])
        self.dataset = base

What I forgot was the case where mmap_mode='r': the resulting memory-mapped array would still be loaded fully into memory by np.ascontiguousarray(base[:, self.channels, :, :]).
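One way to guard against that, as a sketch: only materialize the contiguous copy when the data is already fully in RAM, and otherwise keep the memmap and do channel selection per sample. The standalone function name here is made up for illustration:

import numpy as np

def load_patches(path, channels, mmap_mode=None):
    base = np.load(path, mmap_mode=mmap_mode)
    if not all(channels) and mmap_mode is None:
        # safe: the array is already fully resident in RAM
        base = np.ascontiguousarray(base[:, channels, :, :])
    # when mmap_mode is set, keep the memmap; select channels lazily
    # in __getitem__ instead of copying the whole dataset here
    return base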

Second, my mem-mapped approach for large datasets greater than RAM was problematic: it was extremely slow and unusable when mem-mapped from GPFS. When I switched to reading from the scratch disk, the improved latency made it match the performance of RAM (or come only marginally slower). I just needed to incur an up-front cost of copying the data over to scratch at the start of dataset initialization.

Finally, I found the main reason why there was no speedup from distributed training: a counteracting explosion in the dataloading times from accidentally spawning rather than forking worker processes.

Key Idea

Random mem-mapped array reads from GPFS are expensive due to higher latency; switching to mem-mapping on the scratch disk can lead to a 10x speedup and make mem-mapped training only marginally slower than in-memory training.

The dataloading bottleneck in the distributed case was discovered by profiling the single GPU versus the multi GPU runs. On a single GPU the forward and backward passes were the dominant bottleneck, while with multiple GPUs the data loading was the bottleneck. The distributed training, as expected, reduced the forward and backward pass latency by a factor of $\frac{1}{N}$, but the dataloading time exploded. The reason for the dataloading slowdown? Using num_workers=0 as opposed to num_workers=4 meant there was no prefetching of batches in the background while the GPUs did the forward and backward passes.

Why can we not use num_workers>0 in a DDP setting? Setting the num_workers parameter to any positive integer crashed the job due to OOM. In my particular case, dataset initialization puts 60GB of data into RAM. This is fine with num_workers=0, as 4 python processes for 4 GPUs incur 60 * 4 = 240GB of total RAM consumption. However, when num_workers>0, child worker processes are spawned at iterator creation time, each needing access to the underlying data:

# at iterator instantiation child workers are spawned
for batch_i, (X, y, _) in enumerate(train_loader):
    # we crash before we even get to here
    ...

At dataloader instantiation DataLoader(dataset, num_workers=4) no workers are spawned, but at iter(dataloader) they are spawned. Note that iter(dataloader) is called implicitly in for batch in dataloader.

What was happening was that within each rank process (1 through 4), the child worker processes for dataloading must also have access to the in-memory array from the dataset class, but processes have separate memory spaces. In the single GPU case I was running inside the default multiprocessing context of mp.get_context("fork"), and hence the dataloader defaulted to forking the worker processes. In Linux, child processes are forked with copy-on-write so that no pages are copied unnecessarily from the parent, hence the processes shared the 60GB. However, in a DDP setting we are running inside a process spawned by mp.spawn, and the multiprocessing context by default is set to mp.get_context("spawn"). The drawback is that spawned processes do not inherit the parent process image. In this case, each worker process must initialize the dataset from scratch, copying all 60GB into its own memory. This was confirmed with Python process monitoring in htop -u dma: I observed a massive explosion of memory to 480GB+ as each spawned worker process attempted to load the full 60GB array into memory at startup, before the processes were killed (my job was only allocated 480GB of memory).
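A small standalone demo of the difference, assuming a Linux host (the array here is tiny; imagine it being the 60GB dataset):

import multiprocessing as mp
import numpy as np

# stand-in for the huge in-memory dataset array
big = np.zeros((1000, 1000), dtype=np.float32)

def child():
    # under "fork" this sees the parent's pages via copy-on-write (no extra copy);
    # under "spawn" the module is re-imported, so `big` is re-created from scratch
    print("child sum:", big.sum())

if __name__ == "__main__":
    for method in ("fork", "spawn"):
        print("start method:", method)
        ctx = mp.get_context(method)
        p = ctx.Process(target=child)
        p.start()
        p.join()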

After a lot of head scratching, the fix was actually just inserting a multiprocessing_context kwarg to switch the multiprocessing start method from spawn (which it was fetching from the current context) back to fork:

train_loader = DataLoader(train_set,
                          batch_size=cfg.train.batch_size,
                          num_workers=4,  # >0 is now safe since workers are forked, not spawned
                          persistent_workers=False,
                          pin_memory=True,
                          sampler=DistributedSampler(train_set, seed=cfg.seed, drop_last=True),
                          multiprocessing_context='fork',
                          shuffle=False)

Now, by properly setting the batch size so that processes * num_workers * prefetch_factor batches fit in the 8GB shared memory, the distributed runs worked without memory bloat. With some adjustments to the batch size and the number of workers to optimize the dataloading and forward/backward times (the dominant bottlenecks), I got the expected 4x speedup with 4 GPUs - from the original ~58s per training epoch to ~12s per training epoch for each rank.
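A back-of-the-envelope way to size this (all numbers below are hypothetical; prefetch_factor defaults to 2 in PyTorch's DataLoader when num_workers > 0):

import numpy as np

batch_size = 64                # hypothetical
num_workers = 4
prefetch_factor = 2            # PyTorch DataLoader default when num_workers > 0
n_processes = 4                # one rank per GPU
patch_shape = (14, 64, 64)     # hypothetical channels x H x W, float32

bytes_per_batch = batch_size * np.prod(patch_shape) * 4
total = n_processes * num_workers * prefetch_factor * bytes_per_batch
print(f"prefetched batches: ~{total / 1e9:.2f} GB (must fit in shared memory)")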

Key Idea

Worker processes created with the fork start method share the parent's memory, while worker processes created with the spawn start method must reinitialize the dataset in memory. Check the multiprocessing context to make sure you are using the right method for memory efficiency.

See Job Memory Allocations

A common issue in HPC workflows is OOM errors, which can be hard to debug. For example, I was in the middle of making a distributed training script work with a single node and multiple GPUs. I kept getting my spawned processes killed by SIGKILL with zero traceback, which was frustrating. Presumably this could be caused by an OOM error in one of the processes.

The first thing to do is use qstat to see the job allocation info including memory allocated, ncpus, and ngpus.

# -f flag provides detailed resource info
qstat -f

# grab resources for your specific job
qstat -f JOBID | grep Resource_List

From within the job, you can also do:

qstat -f $PBS_JOBID | grep Resource_List
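To compare what was requested against what the job has actually consumed so far, the resources_used fields in the same qstat -f output are useful (PBS reports these for running jobs, e.g. resources_used.mem):

# actual consumption so far, as opposed to the Resource_List request
qstat -f $PBS_JOBID | grep resources_used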