Constructing High-Load Installation
On this page, you will find an overview of the product architecture and principles for constructing a high-load installation. We will discuss the general SDK architecture and components, show how to implement vertical and horizontal scaling, and provide environment recommendations.
- Overview
- Vertical Scaling (Scaling Up)
- Horizontal Scaling (Scaling Out)
- Mixing Strategies
- Environment Recommendations
Overview
Basic Architecture
The backend of Face SDK consists of a Web Service and Face Core.
The Web Service accepts incoming HTTP requests and returns HTTP responses; Face Core performs the actual processing of each request.
This combination of the Web Service and Face Core is called a worker. By default, one worker is launched on a machine with the Face SDK.
A worker processes requests in a single-threaded mode, meaning it handles one request at a time. If multiple HTTP requests are sent to one worker simultaneously, they will queue up in a FIFO (First In, First Out) order.
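To see this queueing behavior in practice, you can fire several concurrent requests at a single-worker installation and compare the observed latencies. The sketch below is illustrative only: the endpoint URL and payload are placeholders, not the actual Face SDK API.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder endpoint of a single-worker installation; substitute the real
# host, port, and API route of your deployment.
URL = "http://localhost:8000/api/process"

def timed_request(i: int) -> float:
    start = time.perf_counter()
    requests.post(URL, json={"id": i}, timeout=60)  # placeholder payload
    return time.perf_counter() - start

# With one single-threaded worker, 4 simultaneous requests are served in FIFO
# order, so the observed latencies grow roughly as 1x, 2x, 3x, 4x the
# single-request processing time.
with ThreadPoolExecutor(max_workers=4) as pool:
    for i, latency in enumerate(pool.map(timed_request, range(4))):
        print(f"request {i}: {latency:.2f}s")
```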
CPU and GPU
The Web Service always uses CPU resources, while Face Core can use both CPU and GPU. The GPU offers significantly higher performance due to better parallel processing capabilities.
Worker performance can be increased by selecting more powerful CPU/GPU units or by adding units, since a single worker can utilize multiple units at once.
The resource requirements for one worker are the following:
Mode | RAM Required | Additional Requirements |
---|---|---|
CPU | 3 GB | - |
GPU | 3 GB | 3 GB GPU memory |
External Components
For the Liveness and Identification features, the following additional components are required:
- Database
- File storage
- Milvus vector database
For more details, see the Liveness and Identification architectures.
For database and storage scaling recommendations, refer to the documentation of the respective components.
Vertical Scaling (Scaling Up)
By default, one worker is launched, and it processes requests in a single-threaded mode. If there are many incoming requests, they will queue up.
To parallelize processing on a single server, you can launch multiple workers; see the `workers` parameter of the `webServer` configuration. In this case, a master process will handle incoming requests, distribute them to workers, collect responses, and return them. It will also manage the workers.
This setup allows parallel processing of incoming HTTP requests: the number of simultaneously processed requests equals the number of launched workers. However, since the workers share the computational resources of one instance, the processing time for a single request may increase; refer to Performance Results.
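Conceptually, the master/worker setup behaves like the following minimal Python sketch. It is illustrative only; the actual master process is built into the product and is configured, not implemented by you.

```python
import multiprocessing as mp

def handle_request(payload: str) -> str:
    # Stand-in for one worker (Web Service + Face Core) processing a request.
    return f"processed:{payload}"

if __name__ == "__main__":
    incoming = [f"req-{i}" for i in range(8)]
    # The master distributes incoming requests across worker processes,
    # collects the responses, and returns them in order.
    with mp.Pool(processes=4) as pool:
        for response in pool.map(handle_request, incoming):
            print(response)
```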
The number of workers that can be launched on one instance is limited by RAM for CPU usage and by RAM and GPU memory for GPU usage, see the resource requirements.
Here are example AWS instances for CPU mode:

Instance Size | Memory (GiB) | Max Worker Count |
---|---|---|
`c7.large` | 4 | 1 |
`c7.2xlarge` | 16 | 5 |

And for GPU mode:

Instance Size | Memory (GiB) | GPU Memory (GiB) | Max Worker Count |
---|---|---|---|
`g4dn.xlarge` | 16 | 16 | 5 |
`g4dn.2xlarge` | 32 | 16 | 5 |
Since the GPU memory size is the same for both instances, the maximum number of workers does not change with the instance size. Note that in GPU mode some operations still run on the CPU, so a low-performance CPU can become a bottleneck: in this example, the CPU of the `g4dn.xlarge` is insufficient for handling 5 workers, which makes the `g4dn.2xlarge` preferable despite the identical worker count.
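The per-instance worker limits in the tables above follow directly from the per-worker resource requirements. Here is a rough sizing sketch, assuming the 3 GB RAM and 3 GB GPU memory figures from the requirements table:

```python
RAM_PER_WORKER_GIB = 3      # per-worker RAM requirement (see table above)
GPU_MEM_PER_WORKER_GIB = 3  # per-worker GPU memory requirement (GPU mode)

def max_workers(ram_gib: float, gpu_mem_gib: float | None = None) -> int:
    """Upper bound on workers for a single instance; leave some headroom
    for the OS and any co-hosted components."""
    limit = int(ram_gib // RAM_PER_WORKER_GIB)
    if gpu_mem_gib is not None:
        limit = min(limit, int(gpu_mem_gib // GPU_MEM_PER_WORKER_GIB))
    return limit

print(max_workers(16))      # c7.2xlarge: 5
print(max_workers(32, 16))  # g4dn.2xlarge: 5 (GPU memory is the limit)
```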
To determine the optimal number of workers, both for CPU and GPU usage, we recommend testing with a load profile matching your business scenario, see the Testing Framework page for details.
If using only one instance, it's possible to host external components on it as well. This is done in the Advanced: Full installation type. However, external components will occupy RAM and consume CPU resources. We recommend testing to ensure desired performance and stable operation of the chosen setup.
Horizontal Scaling (Scaling Out)
If the performance of a single instance is insufficient for processing the required request flow, horizontal scaling can be used by adding more instances. In this case, you will need a load balancer to distribute requests across instances.
We recommend installing external components on separate instances and carefully following the scaling recommendations from the component manufacturers.
Mixing Strategies
In some cases, you might need to combine both vertical and horizontal scaling to achieve the desired performance and efficiency.
You can start with scaling up a single instance by increasing the number of workers based on the memory and CPU/GPU capabilities as detailed in the Vertical Scaling section.
Once the vertical scaling limit is reached, you can add more instances to handle the additional load. Use a load balancer to distribute incoming requests evenly across all instances, ensuring no single instance becomes a bottleneck. Install external components on separate instances.
Environment Recommendations
1. Determine the desired processing time for a single request.
2. Select an instance that allows achieving the desired processing time. For selection, we recommend testing with typical requests matching your business scenario.
3. Determine the load profile, as this will influence the required number of workers and, consequently, the resources and scaling scheme.
Consider peak values and their duration, as averaging over a long period may underestimate the number of requests per unit of time that need to be processed. As a result, a configuration based on the average value may not handle peak loads.
4. Determine the number of workers required to process the desired request flow (target throughput) at a known processing time for a single request (latency) by the formula below, rounding up to a whole number (see the sketch after the example):
Worker count = target throughput × latency
Info
For example, to process 15 requests per second with a processing time of 0.8 seconds per request, the number of workers is calculated as follows:
15 requests per second × 0.8 seconds per request = 12 workers
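This is Little's law applied to worker concurrency. In code, with rounding up for fractional results:

```python
import math

def required_workers(target_rps: float, latency_s: float) -> int:
    """Workers needed to sustain `target_rps` requests per second when one
    request takes `latency_s` seconds to process (rounded up)."""
    return math.ceil(target_rps * latency_s)

print(required_workers(15, 0.8))  # -> 12, matching the example above
```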
Depending on the number of workers, you can choose the appropriate scaling scheme.
The load profile is likely to vary cyclically over time. For example, the main request flow may occur during the day, with significantly fewer requests at night. Or there may be peak hours, for example, at the beginning of the working day, when a significantly larger flow of requests arrives than at other times. In such cases, with horizontal scaling, it might be beneficial to adjust the number of instances according to the current load. A monitoring system for load and instance management is necessary.
We recommend monitoring CPU and GPU utilization. As a rule of thumb, if utilization is 80% or higher, a new instance should be launched. However, consider the instance launch time and the rate of load increase. It may be necessary to lower the threshold.
If the load distribution over time is known, the necessary number of instances can be launched according to a schedule. For example, if the peak is at 8 AM and the launch takes 10 minutes, start the deployment at 7:45 AM.
Similar ideas apply to reducing the number of instances when the load decreases.
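Here is a minimal sketch of such a threshold-based scale-out check, assuming your monitoring system reports average CPU/GPU utilization as fractions; the inverse check for scaling in is symmetric.

```python
SCALE_OUT_THRESHOLD = 0.80  # rule of thumb from above; lower it if instances
                            # start slowly or the load ramps up quickly

def should_scale_out(cpu_util: float, gpu_util: float | None = None) -> bool:
    """Return True when the busiest resource is at or above the threshold."""
    busiest = max(cpu_util, gpu_util if gpu_util is not None else 0.0)
    return busiest >= SCALE_OUT_THRESHOLD

if should_scale_out(cpu_util=0.85):
    print("launch a new instance")  # e.g., via your cloud provider's API
```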
You can find an example implementation of a scaling scheme for AWS on GitHub:
Make sure to check the allowed number of database connections, as the default setting is often not very high.