Folding@home GPU Container
Folding@home is a distributed computing project for simulating protein
dynamics, including the process of protein folding and the movements of
proteins implicated in a variety of diseases. It brings together citizen
scientists who volunteer to run simulations of protein dynamics on their
computers. Insights from this data are helping scientists to better
understand biology, and providing new opportunities for developing
therapeutics.
Overview
Running the Folding@home container is straightforward; however, special care
must be taken to manage and return Work Units on time.
Familiarity with Linux and containers is assumed. Due to the prerequisites
and setup complexity, this does not make an ideal “hello-world” container –
the standard Folding@home Linux clients work great and have slightly less
overhead.
The Folding@home container is similar to a database container: it needs
persistent storage mounted into /fah and careful lifecycle management to
avoid losing or wasting work. The config.xml file also holds client state,
so it must be preserved and managed along with that storage.
CUDA 9.2 is used as a base for greater compatibility – for the details, see:
CUDA Compatibility
This document uses Docker as the example runtime but others are also
supported. Read the Other Runtimes section for Singularity
and other runtimes.
Operating the Folding@home Container
Each of these requirements is explained in more detail below, but they are
listed here for clarity. The key words MUST, MUST NOT, SHOULD, and SHOULD NOT
are to be interpreted as described in RFC 2119.
- MUST mount read-writable persistent storage to /fah of the running
container. Running containers MUST NOT share the same mounted directory,
but directories SHOULD be reused to avoid lost Work Units.
- MUST create and preload a tuned config.xml in each persistent storage
directory before running the container for the first time.
- MUST run the container as a uid:gid, specified with --user or equivalent,
so that the running container has read-write permissions to the persistent
storage mounted in /fah (see the sketch after this list).
- SHOULD NOT run containers as root.
- SHOULD NOT expose ports to the internet without firewall rules, encryption,
and strong passwords.
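For example, a minimal sketch of preparing storage that satisfies the
permission requirements above; the path and uid:gid shown are placeholders:
# Create one storage directory per container, owned by the unprivileged
# uid:gid that will later be passed to --user (path and ids are examples)
sudo mkdir -p /data/fah0
sudo chown 1000:1000 /data/fah0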
Folding@home Websites
- Folding@home project: https://foldingathome.org/
- Folding@home stats: https://stats.foldingathome.org/
Feedback and Issues
Read the README and CONTRIBUTING at
https://github.com/foldingathome/containers/ for design goals,
architecture, guidelines for contributing, and other information.
Please raise any bugs or issues with the containers on GitHub:
https://github.com/foldingathome/containers/issues
Prerequisites
Setup User Configuration
- Pick your Username – FAQ.
- Set up your Passkey. This will give bonus points after completing 10 Work
Units on time.
- Join a team or create your own – FAQ.
- If running inside of a company, make sure management has signed off on both
your participation with company resources and the team/user names used.
These values will be used in your config.xml later.
Technical Requirements
- Docker 19.03 or later (for single node).
- Persistent storage for each running container.
For NVIDIA GPUs
- Updated NVIDIA GPU Driver – v396 or later, with 440+ recommended. Avoid .run file installers.
- NVIDIA Container Runtime –
https://github.com/NVIDIA/nvidia-container-runtime
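As a quick sanity check that the driver and container runtime work together,
something along these lines should print the GPU table; the CUDA image tag is
only an example, use any CUDA base image that matches your installed driver:
# Verify the NVIDIA Container Runtime can pass GPUs into containers
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi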
Running on Single Machines
Before scaling up containers on a cluster or cloud, it's important to be
familiar with the /fah storage requirements, lifecycle, and usage patterns
that will complete Work Units and help the research on Folding@home.
That starts with one machine.
Single Machine Setup
Once the prerequisites are met, it’s time to run the
container.
See example config files and be sure to
set your user/passkey/team.
# Make a directory for persistent storage
mkdir $HOME/fah
# Edit config.xml based on an example config below, use vi or other editor.
vi $HOME/fah/config.xml
Over time config.xml will also have client state, and will be rewritten by
the client.
Start Folding on a Single Machine
# Run container with GPUs, name it "fah0", map user and /fah volume
docker run --gpus all --name fah0 -d --user "$(id -u):$(id -g)" \
--volume $HOME/fah:/fah fah-gpu:VERSION
Monitoring Logs on a Single Machine
# Dump output so far this run
docker logs fah0
# Tail the log
docker logs -f fah0
Stopping Container on a Single Machine
# Stop container once Work Units finish (preferred), may take hours
docker exec fah0 FAHClient --send-command finish
# Stop container after checkpoint, usually under 30 seconds.
# Be sure to start it again to let it finish before the Work Units expire.
docker exec fah0 FAHClient --send-command shutdown
# The container can also just be killed, but that's not as nice.
Running on Clusters
There are a lot of container orchestrators, so the requirements are as
simple as possible:
- The container orchestration needs to be able to allocate and manage GPUs.
- Run one container per machine/VM – each client can manage many GPUs and
CPU cores, and should have a config.xml tuned for the host/VM size.
- Each running container must have its own separate persistent storage
directory mounted into the /fah directory of the container. Directories
should be reused, but two containers should never use the same directory at
the same time.
Cluster Storage Setup
Create a root folder on the cluster storage, e.g. .../root-dir/, and create
subdirectories based on one of these methods:
Method 1: For smaller clusters, having one directory per host is simple. When
run, the containers can mount .../root-dir/$hostname/ to /fah for the job
running on hostname.
Method 2: For larger clusters, create a pool of directories that can be reused
based on how many clients are run. Running them takes more careful management,
but the general idea is to mount .../root-dir/$jobname/ to the /fah folder of
jobs named fold00 … fold99.
Before running any clients, make sure to copy your customized config.xml
to all the subfolders.
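A minimal sketch for Method 2, assuming a pool of 100 job directories; the
root path variable and the fold00 … fold99 naming are just the examples used
above:
# Create the directory pool and preload each with the tuned config.xml
ROOT_DIR=/path/to/root-dir   # adjust to your cluster storage
for i in $(seq -w 0 99); do
    mkdir -p "$ROOT_DIR/fold$i"
    cp config.xml "$ROOT_DIR/fold$i/config.xml"
done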
Other methods are valid, as long as they meet the requirements above.
Start Folding on a Cluster
Based on the storage setup, run one container per subfolder, mounting it
into /fah.
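As an illustration, with plain Docker and the per-host layout from Method 1,
the run command looks much like the single machine case; the path is an
example, and your orchestrator's equivalent run command follows the same
pattern:
# One container per host, mounting that host's subdirectory to /fah
docker run --gpus all --name fah0 -d --user "$(id -u):$(id -g)" \
    --volume /path/to/root-dir/$(hostname)/:/fah fah-gpu:VERSION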
Monitoring Logs on a Cluster
Your container orchestrator should have commands equivalent to
docker logs ...
and docker exec ...
to perform the same functions.
# See how many Work Units have been returned by all clients
grep points .../root-dir/*/log.txt .../root-dir/*/logs/*.txt
Stopping Container on a Cluster
How containers are stopped on the cluster will affect how many Work Units are
late or lost.
# Preferred: stop once Work Units finish, may take hours
command exec container-id FAHClient --send-command finish
# Stop container after checkpoint, usually under 30 seconds.
command exec container-id FAHClient --send-command shutdown
# The container can also just be killed, but that's not as nice.
The goal is to avoid accumulating a lot of subdirectories with unfinished
Work Units.
Running the Folding@home container with low priority on a cluster where it
gets preempted and resumed will work fine. The max-units
configuration
option may also be useful in combination with low priority to use idle
capacity where preemption is not available.
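For example, assuming extra container arguments are passed straight through to
FAHClient (as the --help invocation under Example Config Files suggests), the
number of Work Units per run could be capped roughly like this; check --help
for the exact flag syntax:
# Hypothetical example: cap this client at 10 Work Units for the run
docker run --gpus all --name fah0 -d --user "$(id -u):$(id -g)" \
    --volume $HOME/fah:/fah fah-gpu:VERSION --max-units=10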
Example Config Files
For the latest example config files see:
https://github.com/FoldingAtHome/containers/tree/master/fah-gpu#example-config-files
The config options used for running the client in containers are slightly
different from the ones used in a standalone install.
These are the interesting ones:
- user, passkey, team – user and team identity. Set them to your values.
- exit-when-done – have the container exit once a finish command is sent to it.
- power – run 100% of the time but at idle priority.
- web-enable, disable-viz, gui-enabled – disable unnecessary features.
- slots … – SMP and GPU slots. Each GPU slot also takes 1 CPU core.
Each SMP slot can be set to use N cores. The “cpus” tag can be left out on
1-CPU, low core count machines and it will autoconfigure.
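Putting these together, a rough sketch of writing a config.xml from the shell
might look like the following; the option names come from the list above and
the attribute style follows the FAHClient config format, but treat the values
as placeholders and prefer the example files linked at the start of this
section:
# Write an illustrative config.xml (placeholder values; verify against the examples)
cat > $HOME/fah/config.xml <<'EOF'
<config>
  <!-- identity: set to your own values -->
  <user value="your_username"/>
  <passkey value="your_passkey"/>
  <team value="0"/>
  <!-- container-friendly behaviour -->
  <exit-when-done value="true"/>
  <power value="full"/>
  <web-enable value="false"/>
  <disable-viz value="true"/>
  <gui-enabled value="false"/>
  <!-- one SMP (CPU) slot and one GPU slot -->
  <slot id="0" type="CPU"/>
  <slot id="1" type="GPU"/>
</config>
EOF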
Client help on all the options is available with:
docker run --rm fah-gpu:VERSION --help
Other Runtimes
While this README focuses on Docker, it is not the only supported container runtime.
Singularity
A full Singularity HOWTO is currently beyond the scope of this document. These
commands should help someone familiar with Singularity get started on a single machine:
mkdir fah && cd fah
# Create/Copy config.xml as described above
singularity build fah.sif docker://nvcr.io/hpc/foldingathome/fah-gpu:7.6.13
singularity instance start --nv -B$(pwd):/fah fah.sif fah_instance
singularity exec instance://fah_instance /bin/bash -c "coproc /usr/bin/FAHClient"
tail -f log.txt
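When it is time to stop, the same lifecycle as the Docker examples applies; a
sketch assuming the instance name used above:
# Preferred: ask the client to finish its current Work Units (may take hours)
singularity exec instance://fah_instance /usr/bin/FAHClient --send-command finish
# Once the work has been returned and the client has exited, stop the instance
singularity instance stop fah_instance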