Performance Optimization
The backend of Document Reader SDK consists of a Web Service and Document Reader Core.
The Web Service handles incoming HTTP requests and returns HTTP responses, while Document Reader Core performs the actual processing of each request.
This combination of the Web Service and Document Reader Core is called a worker. By default, one worker is launched on a machine with the Document Reader SDK.
For Server-Side Verification, the following additional components are required:
- Database
- File storage
For more details, see the Architectures page.
A worker processes requests in a single-threaded mode, meaning it handles one request at a time. If multiple HTTP requests are sent to one worker simultaneously, they will queue up in a FIFO (First In, First Out) order and are processed one by one.
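To observe this behavior, you can send several requests to a single worker at once. Below is a minimal sketch, assuming the service listens on `localhost:8080` and accepts a standard `/api/process` request; replace the image placeholder with real data:

```python
import concurrent.futures
import time

import requests

# Assumption: the service listens on localhost:8080 with the default single
# worker; REQUEST_BODY must be a valid /api/process request for your images.
URL = "http://localhost:8080/api/process"
REQUEST_BODY = {
    "processParam": {"scenario": "Mrz"},
    "List": [{"ImageData": {"image": "<base64-encoded document image>"}}],
}

def timed_request(i: int) -> tuple[int, float]:
    """Send one request and return its index and wall-clock latency."""
    start = time.perf_counter()
    requests.post(URL, json=REQUEST_BODY, timeout=120)
    return i, time.perf_counter() - start

# Fire 4 requests simultaneously. With one worker they are served FIFO, so
# the observed latencies grow roughly as 1x, 2x, 3x, 4x the single-request
# time; with 4 workers the same test completes in parallel.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for i, latency in pool.map(timed_request, range(4)):
        print(f"request {i}: {latency:.2f} s")
```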
Your infrastructure planning should be based on your requirements for:
- The processing speed for one request.
- The number of parallel requests per unit of time.
The request processing speed depends on:
- The document image quality.
- The processing scenario: for example, `FullProcess` takes longer than `Mrz` (see the sketch after this list).
- CPU performance.
- The number of CPUs allocated per worker. Request processing stages are parallelized by design, so one worker running on 4 CPUs executes a request faster than one worker running on 1 CPU.
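As a quick illustration of the scenario's impact, the sketch below times the same image under two scenarios. It assumes the service runs on `localhost:8080`, exposes the standard `/api/process` endpoint, and that `document.jpg` is a sample image; adjust these to your deployment:

```python
import base64
import time

import requests

# Assumptions: the web service listens on localhost:8080 and exposes the
# standard /api/process endpoint; document.jpg is a sample document image.
URL = "http://localhost:8080/api/process"

with open("document.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

def process(scenario: str) -> float:
    """Send one processing request and return its wall-clock time in seconds."""
    body = {
        "processParam": {"scenario": scenario},
        "List": [{"ImageData": {"image": image_b64}}],
    }
    start = time.perf_counter()
    requests.post(URL, json=body, timeout=120).raise_for_status()
    return time.perf_counter() - start

# A lightweight scenario such as Mrz typically returns faster than FullProcess.
for scenario in ("Mrz", "FullProcess"):
    print(f"{scenario}: {process(scenario):.2f} s")
```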
The number of parallel requests defines the number of running workers: if there are significantly fewer workers than parallel requests per unit of time, the requests queue up and wait to be executed, which directly affects the request processing speed.
Vertical Scaling (Scaling Up)
By default, one worker is launched, and it processes requests in a single-threaded mode. In case there are many incoming requests, they will queue up.
To parallelize processing on a single server, you can launch multiple workers; see the `workers` parameter. In this case, a master process handles incoming requests, distributes them across the workers, collects their responses, and returns them. It also manages the workers.
With Server-Side Verification implemented, the setup looks the following way:
This setup allows parallel processing of incoming HTTP requests. The number of simultaneously processed requests will equal the number of launched workers. However, since workers share computational resources of one instance, the processing time for a single request may increase.
The number of workers that can be launched on one instance is limited by the available RAM and CPU; see the resource requirements.
Here are the AWS CPU Instance examples:
| Instance Size | Memory (GiB) | Max Worker Count |
|---|---|---|
| `c7.large` | 4 | 1 |
| `c7.2xlarge` | 16 | 5 |
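For a rough estimate on other instance sizes, a helper like the one below may be useful. The per-worker memory figure and the system reserve are illustrative assumptions inferred from the table above, not official numbers; take the actual values from the resource requirements:

```python
import math

# Illustrative sizing helper. The 3 GiB per worker and the 1 GiB system
# reserve are assumptions inferred from the table above; use the actual
# figures from the resource requirements.
GIB_PER_WORKER = 3.0
SYSTEM_RESERVE_GIB = 1.0

def max_workers(instance_memory_gib: float) -> int:
    """Estimate how many workers fit into an instance's memory."""
    usable = instance_memory_gib - SYSTEM_RESERVE_GIB
    return max(1, math.floor(usable / GIB_PER_WORKER))

print(max_workers(4))   # c7.large   -> 1
print(max_workers(16))  # c7.2xlarge -> 5
```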
To determine the optimal number of workers, we recommend testing with a load profile matching your business scenario.
Horizontal Scaling (Scaling Out)
If the performance of a single instance is insufficient for processing the required request flow, horizontal scaling can be used by adding more instances. In this case, you will need a load balancer to distribute requests across instances.
We recommend installing external components on separate instances and carefully following the scaling recommendations from the component vendors. With Server-Side Verification implemented, the setup looks the following way:
Mixing Strategies
In some cases, you might need to combine both vertical and horizontal scaling to achieve the desired performance and efficiency.
You can start with scaling up a single instance by increasing the number of workers based on the memory and CPU capabilities as detailed in the Vertical Scaling section.
Once the vertical scaling limit is reached, you can add more instances to handle the additional load. Use a load balancer to distribute incoming requests evenly across all instances, ensuring no single instance becomes a bottleneck. Install external components on separate instances.
Environment Recommendations
1. Determine the desired processing time for a single request.
2. Select an instance that allows achieving the desired processing time. For selection, we recommend testing with typical requests matching your business scenario.
3. Determine the load profile, as this will influence the required number of workers and, consequently, the resources and scaling scheme.
Consider peak values and their duration, as averaging over a long period may underestimate the number of requests per unit of time that need to be processed. As a result, a configuration based on the average value may not handle peak loads.
4. Determine the number of workers required to process the desired request flow (target throughput) at a known processing time for a single request (latency) by the formula:
Worker count = target throughput × latency
For example, to process 15 requests per second with a processing time of 0.8 seconds per request, the number of workers is calculated as follows:
15 requests per second × 0.8 seconds per request = 12 workers
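The same calculation in code, as a small sketch; the result is rounded up because workers are whole units, and feeding in the peak throughput rather than a long-period average avoids the undersizing described in step 3:

```python
import math

def worker_count(target_throughput_rps: float, latency_s: float) -> int:
    """Workers needed to sustain a target throughput at a given latency;
    rounded up, since workers are whole units."""
    return math.ceil(target_throughput_rps * latency_s)

print(worker_count(15, 0.8))  # the example above: 12 workers
# Size for the peak, not the average: a peak of 40 req/s at the same
# latency needs 32 workers, far more than the average-based 12.
print(worker_count(40, 0.8))  # 32
```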
Depending on the number of workers, you can choose the appropriate scaling scheme.
The load profile is likely to vary cyclically over time. For example, the main request flow may occur during the day, with significantly fewer requests at night. Or there may be peak hours, for example, at the beginning of the working day, when a significantly larger flow of requests arrives than at other times. In such cases, with horizontal scaling, it might be beneficial to adjust the number of instances according to the current load. A monitoring system for load and instance management is necessary.
We recommend monitoring CPU utilization. As a rule of thumb, if utilization is 80% or higher, a new instance should be launched. However, consider the instance launch time and the rate of load increase. It may be necessary to lower the threshold.
If the load distribution over time is known, the necessary number of instances can be launched according to a schedule. For example, if the peak is at 8 AM and the launch takes 10 minutes, start the deployment at 7:45 AM.
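On AWS, such schedule-based scaling can be expressed as scheduled actions on an Auto Scaling group. Below is a sketch using `boto3`; the group name and capacity values are placeholders for your environment:

```python
import boto3

# Sketch of schedule-based scaling for an AWS Auto Scaling group.
# "docreader-asg" and the capacity values are placeholders for your setup.
autoscaling = boto3.client("autoscaling")

# Scale out at 07:45 UTC so instances are ready before the 8 AM peak...
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="docreader-asg",
    ScheduledActionName="pre-peak-scale-out",
    Recurrence="45 7 * * *",  # cron syntax, every day at 07:45
    DesiredCapacity=6,
)

# ...and scale back in after the peak subsides.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="docreader-asg",
    ScheduledActionName="post-peak-scale-in",
    Recurrence="0 11 * * *",
    DesiredCapacity=2,
)
```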
Similar ideas apply to reducing the number of instances when the load decreases.
You can find an example implementation of a scaling scheme for AWS on GitHub: AWS EC2 Regula Forensics Demo.
Make sure to check the allowed number of database connections, as the default setting is often not very high.
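For example, if the database is PostgreSQL (an assumption; use the equivalent check for your database engine), the current limit can be inspected as follows:

```python
import psycopg2

# Assuming the Server-Side Verification database is PostgreSQL; the DSN is a
# placeholder. Each worker and instance holds its own connections, so their
# total must stay below the server's limit.
conn = psycopg2.connect("postgresql://user:password@db-host:5432/docreader")
with conn.cursor() as cur:
    cur.execute("SHOW max_connections;")
    print("max_connections:", cur.fetchone()[0])
conn.close()
```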