By default, the web service starts with one worker process. Each worker handles documents in single-threaded mode, which means it can process only one request at a time. So, if you submit many requests at once, they queue up and are processed one by one.
Your infrastructure planning should be based on your requirements for:
- The processing speed for one request.
- The number of parallel requests per unit of time.
The request processing speed depends on:
- CPU/GPU performance.
- The number of CPUs/GPUs allocated per worker. Query stages are parallelized by design, so one worker running on 4 CPUs/GPUs executes a request faster than one worker running on 1 CPU/GPU.
The number of parallel requests defines the required number of running workers: if the number of workers is significantly lower than the number of parallel requests per unit of time, the requests queue up and wait to be executed. This directly affects the request processing speed.
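The relationship above can be turned into a back-of-the-envelope sizing formula based on Little's law: the expected number of concurrent requests equals the arrival rate multiplied by the time one request takes. The function name, the headroom factor, and the numbers below are illustrative assumptions, not measured values for your deployment:

```python
import math

def workers_needed(requests_per_second: float, seconds_per_request: float,
                   headroom: float = 1.5) -> int:
    """Little's law: concurrent requests = arrival rate x service time.

    Each worker is single-threaded and handles one request at a time,
    so the worker count must cover the expected concurrency, with some
    headroom for bursts.
    """
    return math.ceil(requests_per_second * seconds_per_request * headroom)

# Illustrative load: 4 requests/s, ~2 s of face processing per request.
print(workers_needed(4, 2))  # 4 * 2 * 1.5 = 12 workers
```

Measure your real per-request latency and peak arrival rate before fixing the worker count; the headroom factor absorbs short bursts without requests piling up in the queue.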
The FaceAPI SDK is not a general-purpose HTTP web server that handles hundreds of requests per second. Typical face processing takes up to a few seconds, so the workload is CPU-intensive. Since a typical instance runs 1-4 workers, the workers' time must be managed carefully.
One of the main ways workers' processing time gets wasted is the slow clients problem. When the web service receives a request from a client with a slow internet connection, the request is assigned to a free worker. That worker is then bottlenecked by the speed of the client connection and stays blocked until the slow client finishes sending, for example, a large ID image. While blocked, the worker process cannot handle any other request: it sits idle, waiting to receive the entire request body before it can start the actual processing.
We recommend running the SDK behind a reverse proxy server, for example, nginx, Traefik, or Envoy. This separates duties: load balancing, TLS management, and slow client buffering.
Typical load balancers from cloud providers, such as AWS ELB/ALB, do not buffer slow clients. So, you still need a proxy server between the load balancer and the FaceAPI.
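As a minimal sketch of such a proxy, the nginx fragment below buffers the full request body before forwarding it, so a slow upload never occupies a worker. The upstream address, ports, certificate paths, and buffer sizes are assumptions to adapt to your setup:

```nginx
http {
    upstream faceapi {
        server 127.0.0.1:41101;  # assumed address of the FaceAPI instance
    }

    server {
        listen 443 ssl;
        ssl_certificate     /etc/nginx/tls/server.crt;
        ssl_certificate_key /etc/nginx/tls/server.key;

        client_max_body_size    20m;  # allow large ID images
        client_body_buffer_size 20m;  # buffer the body before proxying

        location / {
            proxy_pass http://faceapi;
            proxy_request_buffering on;  # nginx default, shown explicitly:
                                         # read the whole request first,
                                         # then hand it to a worker
        }
    }
}
```

With `proxy_request_buffering on`, nginx absorbs the slow connection itself and contacts the worker only once the request is fully received, so the worker spends its time on face processing rather than on network I/O.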
To improve throughput, launch more workers according to the planned load. The web service can spawn multiple workers under one service instance; set the number of workers based on your hardware limitations. You can also run N web service instances, each with multiple workers, behind the load balancer to increase performance even further.
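The multi-instance setup can be sketched with an nginx upstream block that spreads requests across several FaceAPI instances. The instance addresses and the choice of balancing method are illustrative assumptions:

```nginx
# Spread requests across several FaceAPI instances,
# each running multiple workers.
upstream faceapi_cluster {
    least_conn;               # route to the instance with the fewest
                              # active requests - a reasonable fit for
                              # long, CPU-bound requests
    server 10.0.0.11:41101;   # assumed instance addresses
    server 10.0.0.12:41101;
    server 10.0.0.13:41101;
}

server {
    listen 80;

    location / {
        proxy_pass http://faceapi_cluster;
    }
}
```

`least_conn` tends to behave better than round-robin here because individual face processing requests can take several seconds, so instance load is uneven at any given moment.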