
Hardware requirements#

Server / PC installations#

See "Appendix A. Specifications" for information about hardware used for performance measurements.

General considerations#

Note that not all algorithms in the SDK have GPU or NPU implementations. If the desired algorithm doesn’t have a GPU or NPU implementation, you have to fall back to the CPU implementation. In this case, account for the memory transfers involved and the latency they cause. See the algorithm implementation matrix for details.

Neural network CPU CPU AVX2 NPU Atlas GPU
FaceDet_v1_first.plan yes yes
FaceDet_v1_second.plan yes yes yes
FaceDet_v1_third.plan yes yes yes
FaceDet_v2_first.plan yes yes
FaceDet_v2_second.plan yes yes yes
FaceDet_v2_third.plan yes yes yes
FaceDet_v3__.plan yes yes yes yes
FaceDet_v3_redetect__.plan yes yes yes
model_subjective_quality__.plan yes yes yes
headpose_v3_.plan yes yes yes yes
ags_v3_.plan yes yes yes yes
attributes_estimation_.plan yes yes yes
portrait_style__.plan yes yes yes
background__.plan yes yes yes
emotion_recognition__.plan yes yes yes
glasses_estimation_v2_.plan yes yes yes
eyes_estimation_flwr8_.plan yes yes yes
eye_status_estimation_flwr_.plan yes yes yes
eyes_estimation_ir_.plan yes yes yes
gaze__.plan yes yes yes
red_eye__.plan yes yes yes
gaze_ir__.plan yes yes yes
overlap_estimation_v1_.plan yes yes yes
mouth_estimation__.plan yes yes yes
face_occlusion_v1_.plan yes yes yes
mask_clf__.plan yes yes yes
ppe_estimation__.plan yes yes yes
orientation_.plan yes yes yes
LNet_precise__.plan yes yes yes
LNet_ir_precise__.plan yes yes yes
slnet__.plan yes yes yes
liveness_model__.plan yes yes yes
depth_estimation_.plan yes yes yes
ir_liveness_universal_.plan yes yes yes
ir_liveness_ambarella_.plan yes yes yes
eyebrow_estimation__.plan yes yes yes
flying_faces_liveness__.plan yes yes yes
rgbm_liveness_.plan yes yes yes
rgbm_liveness_pp_hand_frg_.plan yes yes yes
natural_light_.plan yes yes yes
head_wear__.plan yes yes yes
fisheye__.plan yes yes yes
human__.plan yes yes yes
human_redetect_.plan yes yes yes
human_attributes__.plan yes yes yes
reid_102.plan (deprecated) yes yes yes
reid_103.plan (deprecated) yes yes yes
reid_104.plan (deprecated) yes yes yes
reid_105.plan yes yes yes
reid_106.plan yes yes yes
reid_107.plan yes yes yes
reid_108.plan yes yes yes
reid_109.plan yes yes yes
reid_110.plan yes yes yes
reid_112.plan yes yes yes
reid_113.plan yes yes yes
cnn54b_.plan yes yes yes
cnn54m_.plan yes yes yes
cnn56b_.plan yes yes yes
cnn56m_.plan yes yes yes
cnn57b_.plan yes yes yes yes
cnn58b_.plan yes yes yes
cnn59m_.plan yes yes yes yes
cnn60b_.plan yes yes yes
cnn62b_.plan yes yes yes
cnn65b_.plan yes yes yes
oneshot_rgb_liveness_model_1.plan yes yes yes
oneshot_rgb_liveness_model_2.plan yes yes yes
oneshot_rgb_liveness_model_3.plan yes yes yes
oneshot_rgb_liveness_model_4.plan yes yes yes
crowd__.plan yes yes yes
depth_liveness_v2_.plan yes yes yes yes
vlTracker_detection_.plan yes yes yes yes
vlTracker_template_.plan yes yes yes yes
vlTracker_update_.plan yes yes yes yes

CPU requirements#

For neural networks with "*_cpu.plan" in their names, the CPU must support at least the SSE4.2 instruction set.

For neural networks with "*_cpu-avx2.plan" in their names, AVX2 instruction set support is required for the best performance.

Only 64-bit CPUs are supported.

If in doubt, consider checking your CPU specifications at the following websites:

GPU requirements#

For GPU acceleration an NVIDIA GPU is required. The following architectures are supported:

A minimum of 6 GB of dedicated video RAM is required. 8 GB or more of VRAM is recommended.

The number of actually created threads while using GPU#

The total number of threads can be calculated with the following expression:

totalNumberOfThreads = numThreads + 2*numGpuDevices + 1 (and 1 optional),

where

  • numThreads is the value of the setting <param name="numThreads" type="Value::Int1" x="12" />, described in "Configuration Guide - Runtime settings";
  • numGpuDevices is the number of GPU devices;
  • 1 is the thread used by the CUDA runtime;
  • 1 more optional thread may be created, depending on the internal settings of the LUNA SDK API.

Example: if numThreads == 4 and there are 2 GPU devices in the system, the total number of threads is 9: 4 threads from numThreads, 2 threads for each of the 2 GPUs (4 in total), and 1 thread for the CUDA runtime.
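The expression above can be sketched as a small helper. This is only an illustration of the arithmetic: the function name and the withOptionalThread flag are hypothetical and are not part of the SDK API.

```cpp
#include <cassert>

// Estimates the number of threads created when running on GPU:
//   numThreads + 2 * numGpuDevices + 1 (+ 1 optional).
// `withOptionalThread` is a hypothetical flag that models the optional
// internal thread mentioned in the text.
int totalNumberOfThreads(int numThreads, int numGpuDevices,
                         bool withOptionalThread = false) {
    assert(numThreads >= 0 && numGpuDevices >= 0);
    const int perDeviceThreads = 2 * numGpuDevices;  // two threads per GPU device
    const int cudaRuntimeThreads = 1;                // one thread for the CUDA runtime
    const int optionalThreads = withOptionalThread ? 1 : 0;
    return numThreads + perDeviceThreads + cudaRuntimeThreads + optionalThreads;
}
```

For numThreads == 4 and 2 GPU devices this yields 9, matching the example above, or 10 if the optional thread is created.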

To reduce the number of threads, set the environment variable CUDA_VISIBLE_DEVICES=-1. Note that this hides all GPU devices from CUDA.

NPU requirements#

Huawei Atlas NPU was used with the following drivers and additional software installed:

Drivers:

  • Version = 20.2.0
  • ascendhal_version = 4.0.0
  • aicpu_version = 1.0
  • tdt_version = 1.0
  • log_version = 1.0
  • prof_version = 2.0
  • dvppkernels_version = 1.1
  • tsfw_version = 1.0
  • required_firmware_firmware_version = 1.0

Firmware:

  • Version = 1.76.22.3.220
  • firmware_version = 1.0

Toolkit:

  • Version = 1.76.22.3.220

RAM requirements#

System memory consumption differs depending on a usage scenario and is proportional to the number of worker threads. This is true for both CPU (think system RAM) and GPU (think VRAM) execution modes.

For example, in CPU execution mode 1GB RAM is enough for a typical pipeline, which consists of a face detector and a face descriptor extractor running on a single core (one worker thread) and processing 1080p input images with 10-12 faces on average. If this setup is scaled up to 8 worker threads, overall memory consumption grows up to 8GB.

It is recommended to assume at least 1GB of free RAM per worker thread.
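As a rough illustration of this rule of thumb, the helper below estimates the free RAM to reserve for a given number of worker threads. The function names are hypothetical, and treating 1 GB as 2^30 bytes is an assumption made for the sketch.

```cpp
#include <cstdint>

// Assumption: "1 GB" is treated as 2^30 bytes.
constexpr std::uint64_t kBytesPerGb = 1024ull * 1024ull * 1024ull;

// Rule of thumb from the text: at least 1 GB of free RAM per worker thread.
std::uint64_t recommendedFreeRamBytes(unsigned workerThreads,
                                      std::uint64_t bytesPerWorker = kBytesPerGb) {
    return static_cast<std::uint64_t>(workerThreads) * bytesPerWorker;
}

// Returns true if the reported amount of free memory covers the estimate.
bool hasEnoughFreeRam(std::uint64_t freeBytes, unsigned workerThreads) {
    return freeBytes >= recommendedFreeRamBytes(workerThreads);
}
```

The same estimate can be applied to VRAM in GPU execution mode, since memory consumption there is also proportional to the number of worker threads.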

Storage requirements#

FaceEngine requires 1GB of free space to install. This includes model data for both CPU and GPU execution modes that should be redistributed with your application. If only one execution mode is planned, reduce space requirements by half.

Approaches to software design targeting different hardware#

When performing inference on different hardware, several key differences should be taken into account to reach maximum possible performance:

CPU#

Key points:

  • Memory used by the inference engine is physically located on the same chips where OS and business logic data reside. Source data (images/video frames) also reside there.
  • The CPU is general-purpose hardware, not tailored for many operations specific to NN inference.

Implications:

  • No memory transfers are ever performed and memory access latency is low. The CPU is easily saturated with work.
  • Both memory and CPU may receive additional pressure from background processes.

Recommendations:

  • Don’t expect gains from batching. If the software isn’t expected to ever run on or support GPU or NPU, don’t implement batching at all. Instead, consider culling computation-heavy algorithms early (e.g. check head pose and AGS score before attempting to extract a descriptor in order to avoid the extraction for bad faces).
  • Use tools like taskset to isolate different types of workload at the process level on servers.
  • Consider running a separate SDK process per node on NUMA systems. Note that the SDK itself is not NUMA-aware.
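The early-culling recommendation above can be sketched as follows. The FaceQuality and CullingThresholds types, the field names and the threshold values are all hypothetical stand-ins for whatever estimates your pipeline produces; they are not SDK types, and the thresholds should be tuned for your data.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for cheap-to-compute estimates of one face.
struct FaceQuality {
    float yawDeg;
    float pitchDeg;
    float rollDeg;
    float agsScore;  // assumed: higher means better quality
};

// Hypothetical thresholds, chosen only for illustration.
struct CullingThresholds {
    float maxAbsAngleDeg = 30.0f;
    float minAgsScore = 0.2f;
};

// Returns true if the face passes the cheap checks and is worth
// the expensive descriptor extraction.
bool worthExtracting(const FaceQuality& q,
                     const CullingThresholds& t = CullingThresholds()) {
    const bool poseOk = std::fabs(q.yawDeg) <= t.maxAbsAngleDeg &&
                        std::fabs(q.pitchDeg) <= t.maxAbsAngleDeg &&
                        std::fabs(q.rollDeg) <= t.maxAbsAngleDeg;
    return poseOk && q.agsScore >= t.minAgsScore;
}

// Returns indices of the faces that should be passed on to extraction.
std::vector<std::size_t> selectForExtraction(const std::vector<FaceQuality>& faces) {
    std::vector<std::size_t> selected;
    for (std::size_t i = 0; i < faces.size(); ++i) {
        if (worthExtracting(faces[i])) {
            selected.push_back(i);
        }
    }
    return selected;
}
```

Only the faces returned by selectForExtraction would then be passed to the descriptor extractor, so the CPU spends no time on faces that would be rejected anyway.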

GPU/NPU#

Key points:

  • Memory used by the inference engine is physically located on the device, while source data (images/video frames) resides in host memory.
  • While servers typically use DDR memory, GPU/NPU devices prefer GDDR, which offers higher throughput at the cost of higher latency.
  • GPU/NPU devices process large amounts of data in hundreds/thousands of threads without external interference. In addition, they implement specialized instructions for many typical NN inference operations.
  • GPU/NPU are fed with work by the CPU.

Implications:

  • Memory transfers should be taken into account. Such transfers typically take place by means of the PCI-e bus and the bus may become the performance bottleneck. GPU/NPU generally needs much more input data to saturate it with work.

Recommendations:

  • Batch multiple source images together and do inference for the entire batch at once. This helps to saturate both the bus and the device. See recommended batch sizes in "Appendix A. Specifications".
  • Take care of memory residence. While the SDK will do an implicit memory transfer for you, in some cases it is beneficial to do this yourself. For example, both Tesla and Atlas cards implement on-board hardware-accelerated decoders for JPEG and h264 formats. If your software utilizes these decoders, don’t transfer the decoder output to the host memory. Instead, pass the device pointer to the SDK directly. Note that the SDK Image class can wrap an existing memory pointer at no cost.
  • Take care of device work scheduling. The general rule of thumb:

    • Don’t access the same device from multiple threads/processes. This may involve kernel-level locks or be unsupported at all.
    • Access different devices from different threads/processes. This way work scheduling is less likely to be CPU-bound.
    • Workload isolation recommendations for the CPU also apply here.
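The batching recommendation above can be sketched as a helper that splits a list of source images into fixed-size batches. The function name is hypothetical and the batch size is left as a parameter; see "Appendix A. Specifications" for recommended values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Splits the index range [0, itemCount) into consecutive [begin, end) ranges
// that each contain at most batchSize items. The last batch may be smaller.
std::vector<std::pair<std::size_t, std::size_t>>
makeBatches(std::size_t itemCount, std::size_t batchSize) {
    assert(batchSize > 0);
    std::vector<std::pair<std::size_t, std::size_t>> batches;
    for (std::size_t begin = 0; begin < itemCount; begin += batchSize) {
        batches.emplace_back(begin, std::min(begin + batchSize, itemCount));
    }
    return batches;
}
```

Each [begin, end) range would then be submitted to the device as a single inference call, so one memory transfer and one launch cover several images instead of one.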

SDK algorithms are device-bound. To support multiple devices in one process, you are required to create each algorithm implementation you need on a per-device basis and bind it to the corresponding device as shown in the example below:

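// Index of the NPU device this detector instance will be bound to.
int32_t npuDeviceIndex = 1;

// Launch options select the device class and the concrete device.
fsdk::LaunchOptions launchOptions;
launchOptions.deviceClass = fsdk::DeviceClass::NPU_ASCEND;
launchOptions.npuDevice = npuDeviceIndex;

// faceEngine and detectorType are assumed to be defined earlier.
auto result = faceEngine->createDetector(
    detectorType,
    fsdk::SensorType::Visible,
    &launchOptions
);
ASSERT_TRUE(result.isOk());

// The detector is now bound to the NPU device selected above.
auto detector = result.getValue();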

GPU specific recommendations

GPUs tend to be harder to saturate with work. Consider bigger batches.

NPU specific recommendations

Atlas 300I NPU is designed such that there are 4 different NPU devices per accelerator card. This means that you have to design your software for multi-device scenarios from the ground up to achieve the best performance. The card has a PCI-e x8 bus connector and each NPU device consumes x2 lanes from it, so the bus is likely to become the bottleneck. Atlas 300I NPU is saturated with work quite easily, so batching mostly makes sense only for some particularly lightweight NNs. Memory operations on the device (copies, clears) are particularly slow.
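One simple way to plan work for a multi-device card is to distribute work items (for example, batches) across the devices in round-robin order. The sketch below shows only this distribution step. The function name is hypothetical, and round-robin is just one possible policy, not the one prescribed by the SDK.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assigns work items 0..itemCount-1 to devices in round-robin order.
// Element d of the result lists the items that device d should process.
std::vector<std::vector<std::size_t>>
assignRoundRobin(std::size_t itemCount, std::size_t deviceCount) {
    assert(deviceCount > 0);
    std::vector<std::vector<std::size_t>> perDevice(deviceCount);
    for (std::size_t item = 0; item < itemCount; ++item) {
        perDevice[item % deviceCount].push_back(item);
    }
    return perDevice;
}
```

With 4 devices per card, each per-device list would then be processed by its own thread or process, using algorithm instances created for and bound to that device.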

Requirements for GPU acceleration#

Recommended versions of CUDA

The most current version of these release notes can be found online at http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html.

Note 1: For Win64 and Linux (Ubuntu, CentOS) there is an additional requirement: Compute Capability 6.1 or higher.

The CUDA version on Linux can be found using the command below:

$ nvidia-smi

The CUDA version on Windows can be found in Control Panel\Programs\Programs and Features, as shown in the figure below.

CUDA version on Win

We recommend using the suggested version of CUDA for your operating system. If your version is older than required, we can't guarantee that it will work. More details about CUDA compatibility can be found online at https://docs.nvidia.com/deploy/cuda-compatibility/index.html.

Embedded installations#

CPU requirements#

Supported CPU architectures:

  • ARMv7-A;
  • ARMv8-A (ARM64).

Android for embedded#

Online activation on Android for embedded systems requires one more step in addition to those described in "VisionLabs LUNA SDK Licensing", paragraph "License activation".

Besides the common online activation steps described in the document "VisionLabs LUNA SDK Licensing", on Android for embedded systems you must execute a native licensed binary for Android for embedded with root permissions at least once.