Hardware requirements#

Server / PC installations#

See "Appendix A. Specifications" for information about hardware used for performance measurements.

General considerations#

Note that not all algorithms in the SDK have GPU or NPU implementations. If the desired algorithm lacks a GPU or NPU implementation, a fallback to the CPU implementation has to be made. In this case, one should account for the resulting memory transfers and the latency they introduce. Please see the algorithm implementation matrix below for details; a sketch of an explicit CPU fallback follows the matrix.

Neural network | CPU | CPU AVX2 | NPU Atlas | GPU
FaceDet_v1_first.plan | yes | yes | - | -
FaceDet_v1_second.plan | yes | yes | - | yes
FaceDet_v1_third.plan | yes | yes | - | yes
FaceDet_v2_first.plan | yes | yes | - | -
FaceDet_v2_second.plan | yes | yes | - | yes
FaceDet_v2_third.plan | yes | yes | - | yes
FaceDet_v3__.plan | yes | yes | yes | yes
FaceDet_v3_redetect__.plan | yes | yes | - | yes
model_subjective_quality__.plan | yes | yes | - | yes
ags_angle_estimation_flwr_.plan | yes | yes | - | yes
angle_estimation_flwr_.plan | yes | yes | yes | yes
ags_estimation_flwr_.plan | yes | yes | yes | yes
attributes_estimation_.plan | yes | yes | - | yes
childnet_estimation_flwr_.plan | yes | yes | - | yes
portrait_style__.plan | yes | yes | - | yes
background__.plan | yes | yes | - | yes
emotion_recognition__.plan | yes | yes | - | yes
glasses_estimation_flwr_.plan | yes | yes | - | yes
eyes_estimation_flwr8_.plan | yes | yes | - | yes
eye_status_estimation_flwr_.plan | yes | yes | - | yes
eyes_estimation_ir_.plan | yes | yes | - | yes
gaze__.plan | yes | yes | - | yes
red_eye__.plan | yes | yes | - | yes
gaze_ir__.plan | yes | yes | - | yes
overlap_estimation_flwr_.plan | yes | yes | - | yes
mouth_estimation__.plan | yes | yes | - | yes
mask_clf__.plan | yes | yes | - | yes
ppe_estimation__.plan | yes | yes | - | yes
orientation_.plan | yes | yes | - | yes
LNet_precise__.plan | yes | yes | - | yes
LNet_ir_precise__.plan | yes | yes | - | yes
slnet__.plan | yes | yes | - | yes
liveness_model__.plan | yes | yes | - | yes
depth_estimation_.plan | yes | yes | - | yes
faceflow_model_1_.plan | yes | yes | - | yes
faceflow_model_2_.plan | yes | yes | - | yes
ir_liveness_universal_.plan | yes | yes | - | yes
ir_liveness_ambarella_.plan | yes | yes | - | yes
hs_shoulders_liveness_estimation_flwr_.plan | yes | yes | - | yes
hs_head_liveness_estimation_flwr_.plan | yes | yes | - | yes
eyebrow_estimation__.plan | yes | yes | - | yes
flying_faces_liveness__.plan | yes | yes | - | yes
rgbm_liveness_.plan | yes | yes | - | yes
rgbm_liveness_pp_hand_frg_.plan | yes | yes | - | yes
natural_light_.plan | yes | yes | - | yes
head_wear__.plan | yes | yes | - | yes
fisheye__.plan | yes | yes | - | yes
human_keypoints__.plan | yes | yes | - | yes
human__.plan | yes | yes | - | yes
human_redetect_.plan | yes | yes | - | yes
human_attributes__.plan | yes | yes | - | yes
reid102_.plan | yes | yes | - | yes
reid103_.plan | yes | yes | - | yes
reid104_.plan | yes | yes | - | yes
reid105_.plan | yes | yes | - | yes
reid106_.plan | yes | yes | - | yes
reid107_.plan | yes | yes | - | yes
cnn54b_.plan | yes | yes | - | yes
cnn54m_.plan | yes | yes | - | yes
cnn56b_.plan | yes | yes | - | yes
cnn56m_.plan | yes | yes | - | yes
cnn57b_.plan | yes | yes | yes | yes
cnn58b_.plan | yes | yes | - | yes
cnn59b_.plan | yes | yes | - | yes
cnn59m_.plan | yes | yes | yes | yes
cnn60b_.plan | yes | yes | - | yes
oneshot_rgb_liveness_model_1.plan | yes | yes | - | yes
oneshot_rgb_liveness_model_2.plan | yes | yes | - | yes
oneshot_rgb_liveness_model_3.plan | yes | yes | - | yes
oneshot_rgb_liveness_model_4.plan | yes | yes | - | yes
oneshot_rgb_liveness_model_5.plan | yes | yes | - | yes
oneshot_rgb_liveness_model_6.plan | yes | yes | - | yes
oneshot_rgb_liveness_model_7.plan | yes | yes | - | yes
oneshot_rgb_liveness_model_8.plan | yes | yes | - | yes
oneshot_rgb_liveness_model_9.plan | yes | yes | - | yes
crowd__.plan | yes | yes | - | yes
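
A minimal sketch of such a CPU fallback. The DeviceClass enumerators mirror the device-binding example later in this section; the algorithmHasGpuPlan flag is a hypothetical stand-in for a lookup in the matrix above, so verify the exact names against your SDK version:

// Request GPU execution only when the model has a GPU plan;
// otherwise create the algorithm on the CPU (AVX2) implementation.
fsdk::LaunchOptions options;
options.deviceClass = algorithmHasGpuPlan
    ? fsdk::DeviceClass::GPU
    : fsdk::DeviceClass::CPU_AVX2;

auto res = faceEngine->createDetector(
    detectorType, fsdk::SensorType::Visible, &options);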

CPU requirements#

For NNs with "*_cpu.plan" in their names, the CPU must support at least the SSE4.2 instruction set.

For NNs with "*_cpu-avx2.plan" in their names, AVX2 instruction set support is required for the best performance.

Only 64-bit CPUs are supported.

If in doubt, check the supported instruction sets in your CPU's specifications on the manufacturer's website.
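
A minimal sketch of a runtime check using GCC/Clang builtins (these builtins are compiler features, not part of the SDK):

#include <cstdio>

int main() {
    __builtin_cpu_init();
    // SSE4.2 is the minimum requirement for "*_cpu.plan" models.
    std::printf("SSE4.2: %s\n", __builtin_cpu_supports("sse4.2") ? "yes" : "no");
    // AVX2 is needed for the best performance with "*_cpu-avx2.plan" models.
    std::printf("AVX2:   %s\n", __builtin_cpu_supports("avx2") ? "yes" : "no");
    return 0;
}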

GPU requirements#

For GPU acceleration, an NVIDIA GPU is required. The following architectures are supported:

  • Pascal or newer
  • Compute Capability - 6.1 or higher

A minimum of 6 GB of dedicated video RAM is required; 8 GB or more of VRAM is recommended.

Number of threads created when using a GPU#

The total number of threads can be calculated with the following expression:

totalNumberOfThreads = numThreads + 2*numGpuDevices + 1 (+ 1 optional),

where

  • numThreads is the value of the setting <param name="numThreads" type="Value::Int1" x="12" />; its description can be found in "Configuration Guide - Runtime settings";
  • numGpuDevices is the number of GPU devices;
  • one thread is used by the CUDA runtime;
  • one optional thread may be created depending on internal LUNA SDK settings.

Example: if numThreads==4 and there are 2 GPU devices in the system, the total number of threads is 9: 4 worker threads (numThreads), 2 threads for each GPU (4 in total), and 1 thread for the CUDA runtime.

To decrease the number of threads, set the environment variable CUDA_VISIBLE_DEVICES=-1; this hides all GPU devices from the process, so the corresponding threads are never created.
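
A minimal sketch of setting this variable from code on Linux (setenv is POSIX; any method works as long as the variable is set before the SDK and the CUDA runtime initialize):

#include <cstdlib>

int main() {
    // Hide all CUDA devices from this process so the 2*numGpuDevices + 1
    // GPU-related threads are never created.
    setenv("CUDA_VISIBLE_DEVICES", "-1", /*overwrite=*/1);
    // ... initialize the FaceEngine and the rest of the pipeline here ...
    return 0;
}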

NPU requirements#

The Huawei Atlas NPU was used with the following drivers and additional software installed:

Drivers:

  • Version = 20.2.0
  • ascendhal_version = 4.0.0
  • aicpu_version = 1.0
  • tdt_version = 1.0
  • log_version = 1.0
  • prof_version = 2.0
  • dvppkernels_version = 1.1
  • tsfw_version = 1.0
  • required_firmware_firmware_version = 1.0

Firmware:

  • Version = 1.76.22.3.220
  • firmware_version = 1.0

Toolkit:

  • Version = 1.76.22.3.220

RAM requirements#

System memory consumption differs depending on the usage scenario and is proportional to the number of worker threads. This applies to both CPU (system RAM) and GPU (VRAM) execution modes.

For example, in CPU execution mode 1 GB of RAM is enough for a typical pipeline, which consists of a face detector and a face descriptor extractor running on a single core (one worker thread) and processing 1080p input images with 10-12 faces on average. If this setup is scaled up to 8 worker threads, overall memory consumption grows to about 8 GB.

It is recommended to assume at least 1GB of free RAM per worker thread.

Storage requirements#

FaceEngine requires 1 GB of free space to install. This includes model data for both CPU and GPU execution modes, which should be redistributed with your application. If only one execution mode is planned, the space requirements can be halved.

Approaches to software design targeting different hardware#

When performing inference on different hardware, several key differences should be taken into account to reach maximum possible performance:

CPU#

Key points:

  • Memory used by the inference engine is physically located on the same chips where OS and business logic data reside. Source data (images/video frames) also reside there.
  • The CPU is general-purpose hardware, not tailored for many operations specific to NN inference.

Implications:

  • No memory transfers are ever performed and memory access latency is low. The CPU is easily saturated with work.
  • Both memory and CPU may receive additional pressure from background processes.

Recommendations:

  • Don’t expect gains from batching. If the software isn’t expected to ever run on or support GPU or NPU, don’t implement batching at all. Instead, consider culling computation-heavy algorithms early (e.g. check head pose and the AGS score before attempting to extract a descriptor, to avoid the extraction for bad faces); see the sketch after this list.
  • Use tools like taskset to isolate different types of workload at the process level on servers.
  • Consider running a separate SDK process per NUMA node on NUMA systems. Note that the SDK itself is not NUMA-aware.
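
A minimal sketch of such early culling. The threshold values and the helper name are illustrative assumptions; the fsdk::HeadPoseEstimation fields follow the SDK handbook, but verify them against your version:

#include <cmath>

// Decide whether a face is worth the descriptor extraction cost.
// Thresholds below are hypothetical; tune them for your data.
bool worthExtracting(const fsdk::HeadPoseEstimation& pose, float agsScore) {
    const float maxAngleDeg = 30.f; // hypothetical head pose limit
    const float minAgs = 0.5f;      // hypothetical AGS threshold
    if (std::abs(pose.yaw) > maxAngleDeg || std::abs(pose.pitch) > maxAngleDeg)
        return false;
    return agsScore >= minAgs;
}

// Usage: run the cheap estimators first, then extract only when it pays off.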

GPU/NPU#

Key points:

  • Memory used by the inference engine is physically located on the device, while source data (images/video frames) resides in host memory.
  • While servers typically use DDR memory, GPU/NPU devices prefer GDDR, which offers higher throughput at the cost of higher latency.
  • GPU/NPU devices process massive amounts of data in hundreds/thousands of threads without external interference. In addition, they implement specialized instructions for many typical NN inference operations.
  • GPU/NPU are fed with work by the CPU.

Implications:

  • Memory transfers should be taken into account. Such transfers typically take place over the PCI-e bus, and the bus may become the performance bottleneck. A GPU/NPU generally needs much more input data to be saturated with work.

Recommendations:

  • Batch multiple source images together and run inference for the entire batch at once. This helps to saturate both the bus and the device. See recommended batch sizes in chapter "Appendix A. Specifications"; a batched-call sketch follows this list.
  • Take care of memory residence. While the SDK will do an implicit memory transfer for you, in some cases it is beneficial to do this yourself. For example, both Tesla and Atlas cards implement on-board hardware-accelerated decoders for JPEG and H.264 formats. If your software utilizes these decoders, don’t transfer the decoder output to host memory. Instead, pass the device pointer to the SDK directly. Note that the SDK Image class can wrap an existing memory pointer at no cost.
  • Take care of device work scheduling. The general rule of thumb:

    • Don’t access the same device from multiple threads/processes; this may involve kernel-level locks or be unsupported at all.
    • Access different devices from different threads/processes. This way work scheduling is less likely to be CPU-bound.
    • Workload isolation recommendations for the CPU also apply here.
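
A minimal sketch of the batched detector call mentioned above. The exact detect() overload, the grabFrames helper, and maxFacesPerFrame are assumptions; consult the handbook of your SDK version for the precise signature:

// Submit a whole batch of frames in a single call to keep the device busy.
std::vector<fsdk::Image> frames = grabFrames(batchSize); // hypothetical helper
std::vector<fsdk::Rect> rects(frames.size());
for (size_t i = 0; i < frames.size(); ++i)
    rects[i] = frames[i].getRect(); // detect over each full frame

auto res = detector->detect(
    fsdk::Span<const fsdk::Image>(frames.data(), frames.size()),
    fsdk::Span<const fsdk::Rect>(rects.data(), rects.size()),
    maxFacesPerFrame,   // hypothetical per-frame detection cap
    fsdk::DT_BBOX);     // bounding boxes only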

SDK algorithms are device-bound. To support multiple devices in one process, you are required to create each algorithm implementation you need on a per-device basis and bind it to the corresponding device as shown in the example below:

// Bind the algorithm to the second NPU device (device indices are zero-based).
int32_t npuDeviceIndex = 1;
fsdk::LaunchOptions launchOptions;
launchOptions.deviceClass = fsdk::DeviceClass::NPU_ASCEND;
launchOptions.npuDevice = npuDeviceIndex;

// A detector created with these launch options runs on that device only.
auto result = faceEngine->createDetector(
      detectorType,
      fsdk::SensorType::Visible,
      &launchOptions
    );
ASSERT_TRUE(result.isOk());

auto detector = result.getValue();

GPU-specific recommendations

GPUs tend to be harder to saturate with work. Consider bigger batches.

NPU-specific recommendations

The Atlas 300I NPU is designed such that there are 4 separate NPU devices per accelerator card. This means you have to design your software for multi-device scenarios from the ground up to achieve the best performance. The card has a PCI-e x8 bus connector, and each NPU device consumes x2 lanes from it; the bus is likely to become the bottleneck. The Atlas 300I NPU is saturated with work quite easily; batching mostly makes sense for particularly lightweight NNs. Memory operations on the device (copies, clears) are particularly slow.
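
A minimal sketch of per-device algorithm creation on one Atlas 300I card, mirroring the device-binding example above (the detectors container and error handling are illustrative):

// Create one detector per NPU device and serve each from its own thread,
// following the scheduling recommendations above.
std::vector<fsdk::IDetectorPtr> detectors;
for (int32_t device = 0; device < 4; ++device) {
    fsdk::LaunchOptions options;
    options.deviceClass = fsdk::DeviceClass::NPU_ASCEND;
    options.npuDevice = device;
    auto res = faceEngine->createDetector(
        detectorType, fsdk::SensorType::Visible, &options);
    if (res.isOk())
        detectors.push_back(res.getValue());
}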

Requirements for GPU acceleration#

Recommended versions of CUDA

The most current version of these release notes can be found online at http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html.

Note 1: For Win64 and Linux (Ubuntu, CentOS) there is an additional requirement: Compute Capability 6.1 or higher.

The CUDA version on Linux can be found using the command below:

$ nvidia-smi

The CUDA version on Windows can be found in Control Panel\Programs\Programs and Features, as shown in the figure below.

[Figure: CUDA version on Windows]

We recommend using the suggested version of CUDA for your operating system. If your version is older than required, we cannot guarantee that it will work successfully. More details about CUDA compatibility can be found online at https://docs.nvidia.com/deploy/cuda-compatibility/index.html.

Embedded installations#

CPU requirements#

Supported CPU architectures:

  • ARMv7-A;
  • ARMv8-A (ARM64).

Android for embedded#

In addition to the common online activation steps described in the VisionLabs LUNA SDK Licensing document (paragraph "License activation"), Android for embedded systems requires one extra step: execute a native licensed binary for Android for embedded with root permissions at least once.