Constructing High-Load Installation
This page describes the principles of constructing a high-load Document Reader SDK Web Service installation. We discuss the general SDK architecture and components, show how to implement vertical and horizontal scaling, and provide environment recommendations.
Overview
The backend of Document Reader SDK consists of the Web Service and Document Reader Core.
The Web Service handles incoming HTTP requests, processes them via the Document Reader Core, and returns HTTP responses. The Document Reader Core directly processes the requests.
This combination of the Web Service and Document Reader Core is called a worker. By default, a single worker is launched on a machine with the Document Reader SDK. See the following scheme.
For the Server-Side Verification feature support, the following additional components are required:
- Database
- File storage
For more details, see the Architecture page.
A worker processes requests in a single-threaded mode, meaning it handles one request at a time. If multiple HTTP requests are sent to the same worker simultaneously, they will queue up in a FIFO (First In, First Out) order and be processed one by one.
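This behavior can be illustrated with a self-contained Python sketch that models a worker as a single-threaded executor. This is a model, not the actual service, and the service time value is an assumption chosen for illustration.

```python
# A self-contained model of FIFO queueing on one worker: the worker is a
# single-threaded executor, so simultaneous requests are processed one by one,
# and each request waits for all earlier ones. The service time is illustrative.
import time
from concurrent.futures import ThreadPoolExecutor

SERVICE_TIME_S = 0.8  # assumed processing time for one request

def worker_process(request_id: int) -> str:
    time.sleep(SERVICE_TIME_S)  # stands in for Document Reader Core processing
    return f"request {request_id} done"

# One worker = one thread; four simultaneous requests queue up (FIFO).
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=1) as worker:
    futures = [worker.submit(worker_process, i) for i in range(4)]
    for future in futures:
        print(f"{future.result()} after {time.perf_counter() - start:.1f}s")
```

The printed completion times grow roughly linearly (0.8 s, 1.6 s, 2.4 s, 3.2 s): the last request spends most of its response time waiting in the queue.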
Your infrastructure planning should be based on your requirements for:
- The processing speed for one request.
- The number of parallel requests per unit of time.
The request processing speed depends on:
- The document image quality.
- The processing scenario, for example, `FullProcess` takes longer than `Mrz`.
- CPU performance.
- The number of CPUs allocated per worker. Request stages are parallelized by design, and a worker always tries to occupy all available hardware resources, so 1 worker running on 4 CPUs will execute a request faster than 1 worker running on 1 CPU.
The number of parallel requests defines the number of running workers: if the number of workers is significantly lower than the number of parallel requests per unit of time, the requests queue up and wait for their execution. This directly affects the request processing speed.
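As a quick sanity check, you can estimate whether a given worker pool keeps up with an expected arrival rate. The sketch below is a back-of-the-envelope calculation derived from the throughput and latency relationship described on this page; all names and numbers are illustrative.

```python
# Back-of-the-envelope saturation check: `workers` parallel workers with a mean
# service time of `latency_s` seconds sustain at most workers / latency_s
# requests per second. If the arrival rate exceeds that, the queue grows and
# observed response times climb.
def is_saturated(arrival_rate_rps: float, latency_s: float, workers: int) -> bool:
    max_throughput_rps = workers / latency_s
    return arrival_rate_rps > max_throughput_rps

# Example: 20 requests/s arriving, 0.8 s per request, 12 workers ->
# the pool absorbs at most 15 requests/s, so requests start to queue up.
print(is_saturated(20, 0.8, 12))  # True
```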
Scaling Types
There are several scaling strategies that allow you to increase the system performance:
Vertical Scaling (Scaling Up)
By default, a single worker is launched, and it processes all requests in a single-threaded mode. In case there are many incoming requests, they will queue up.
To parallelize processing on a single server, you can launch multiple workers, see the `workers` Web Server parameter. In this case, a single master process will handle all incoming requests, distribute them to separate workers, collect responses, and return them to the client. The master process will also manage the workers.
With Server-Side Verification implemented, it looks the following way:
This setup allows parallel processing of incoming HTTP requests. The number of simultaneously processed requests will equal the number of launched workers. However, since workers share computational resources of one server instance, the processing time for a single request may increase.
The number of workers that can be launched on one instance is limited by the available RAM and CPU. For details, see the resource requirements.
See the sample values for the AWS CPU Instance:
| Instance Size | Memory (GiB) | Max Worker Count |
|---|---|---|
| c7.large | 4 | 1 |
| c7.2xlarge | 16 | 5 |
To determine the optimal number of workers, we recommend testing with a load profile matching your business scenario.
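A minimal closed-loop load test can look as follows. This is a sketch, not a complete benchmarking tool: the URL is a placeholder for your deployment, and the empty payload must be replaced with a representative request body for your scenario.

```python
# A minimal closed-loop load test: replay a typical request at a fixed
# concurrency and report throughput and latency percentiles. Vary the worker
# count between runs and compare the numbers to find the optimal configuration.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # pip install requests

SERVICE_URL = "http://localhost:8080/api/process"  # placeholder; adjust
PAYLOAD: dict = {}  # fill in a representative request body for your scenario

def one_request(_: int) -> float:
    start = time.perf_counter()
    requests.post(SERVICE_URL, json=PAYLOAD, timeout=120)
    return time.perf_counter() - start

CONCURRENCY, TOTAL = 8, 200
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(one_request, range(TOTAL)))
elapsed = time.perf_counter() - start

print(f"throughput: {TOTAL / elapsed:.1f} req/s")
print(f"p50 latency: {statistics.median(latencies):.2f}s")
print(f"p95 latency: {statistics.quantiles(latencies, n=20)[18]:.2f}s")
```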
Horizontal Scaling (Scaling Out)
If the performance of a single instance is insufficient for processing the required request flow, the horizontal scaling method can be used by adding more instances to the same setup. In this case, you will need a properly set up load balancer to distribute requests across individual instances.
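For illustration, the sketch below shows the round-robin distribution that a load balancer performs, implemented client-side in Python; the instance addresses and endpoint path are hypothetical. In production, use a dedicated load balancer (for example, nginx or a cloud load balancer) rather than client-side logic, as it also handles health checks and failover.

```python
# An illustrative client-side round-robin over service instances -- this is the
# role a load balancer plays in front of the Web Service. Instance addresses
# are placeholders for your deployment.
import itertools

import requests  # pip install requests

INSTANCES = [
    "http://10.0.0.11:8080",
    "http://10.0.0.12:8080",
    "http://10.0.0.13:8080",
]
_ring = itertools.cycle(INSTANCES)

def send(path: str, body: dict) -> requests.Response:
    # Each call targets the next instance in the ring, spreading the load.
    return requests.post(f"{next(_ring)}{path}", json=body, timeout=120)
```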
We recommend installing the external components (the Database and File storage) on separate instances and carefully following the scaling recommendations from the component vendors. With Server-Side Verification implemented, it looks the following way:
Mixing Strategies
In some cases, you might need to combine both vertical and horizontal scaling to achieve the desired performance and efficiency.
You can start with scaling up a single instance by increasing the number of workers based on the memory and CPU capabilities as detailed in the Vertical Scaling section.
Once the vertical scaling limit is reached, you can add more instances to handle the additional load. Use a load balancer to distribute incoming requests evenly across all instances, ensuring no single instance becomes a bottleneck. Install external components on separate instances.
Environment Recommendations
1. Determine the desired processing time for a single request.
2. Select an instance that allows you to achieve the desired processing time. To make the selection, we recommend testing with typical requests matching your business scenario.
3. Determine the load profile, as this will influence the required number of workers and, consequently, the resource and scaling scheme.
Consider peak values and their duration, as averaging over a long period may underestimate the number of requests per unit of time that need to be processed. As a result, a configuration based on the average value may not handle peak loads.
4. Determine the number of workers required to process the desired request flow (target throughput) at a known processing time for a single request (latency) by the formula below; a small calculation sketch follows this list:
Worker count = target throughput × latency
For example, to process 15 requests per second with a processing time of 0.8 seconds per request, the number of workers is calculated as follows:
15 requests per second × 0.8 seconds per request = 12 workers
Depending on the number of workers, you can choose the appropriate scaling scheme.
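The same calculation as a small Python helper; fractional results are rounded up, since a partial worker cannot be launched:

```python
# Worker count = target throughput (requests/s) x latency (s per request),
# rounded up to the next whole worker.
import math

def worker_count(target_throughput_rps: float, latency_s: float) -> int:
    return math.ceil(target_throughput_rps * latency_s)

print(worker_count(15, 0.8))  # 12, matching the example above
```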
The load profile is likely to vary cyclically over time. For example, the main request flow may occur during the day, with significantly fewer requests at night. Or there may be peak hours, for example, at the beginning of the working day, when a significantly larger flow of requests arrives than at other times. In such cases, with horizontal scaling, it might be beneficial to adjust the number of instances according to the current load. A monitoring system for the load and instance management is necessary.
We recommend monitoring CPU utilization. As a rule of thumb, if utilization is 80% or higher, a new instance should be launched. However, consider the instance launch time and the rate of load increase. It may be necessary to lower the threshold.
If the load distribution over time is known, the necessary number of instances can be launched according to a schedule. For example, if the peak is at 8 AM and the launch takes about 10 minutes, start the deployment beforehand at 7:45 AM.
Similar ideas apply to reducing the number of instances when the load decreases.
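As an illustration, schedule-based scaling on AWS can be sketched with boto3 scheduled actions. The Auto Scaling group name, capacities, and schedule below are hypothetical, and AWS credentials are assumed to be configured; the demo project linked below shows a complete implementation.

```python
# A sketch of schedule-based scaling with AWS Auto Scaling (boto3), assuming
# the service instances run in an existing Auto Scaling group.
import boto3  # pip install boto3

autoscaling = boto3.client("autoscaling")

# Scale out ahead of the 8 AM peak, leaving time for instances to start...
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="docreader-asg",        # hypothetical group name
    ScheduledActionName="scale-out-before-peak",
    Recurrence="45 7 * * *",                     # 7:45 AM, cron syntax (UTC by default)
    DesiredCapacity=6,
)

# ...and scale back in once the peak has passed.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="docreader-asg",
    ScheduledActionName="scale-in-after-peak",
    Recurrence="0 11 * * *",
    DesiredCapacity=2,
)
```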
You can find an example implementation of a scaling scheme for AWS on GitHub: AWS EC2 Regula Forensics Demo.
Make sure to check the allowed number of database connections, as the default setting is often not very high.
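For example, assuming a PostgreSQL database (adjust the driver and query for your database), the current limit can be checked as follows; the connection string is a placeholder. Each worker may hold its own connection(s), so the limit must cover workers × instances plus headroom for other clients.

```python
# Check the PostgreSQL connection limit; the DSN is a placeholder.
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("postgresql://user:password@db-host:5432/docreader")
with conn, conn.cursor() as cur:
    cur.execute("SHOW max_connections;")
    print("max_connections:", cur.fetchone()[0])
conn.close()
```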
Performance Testing
To define the optimal server configuration and specific hardware/software parameters, you need to evaluate the system performance periodically. To facilitate the capacity testing process, Regula provides a dedicated GitHub sample project and a testing techniques guide.