Constructing High-Load Installation
This page describes the principles of constructing a high-load Document Reader SDK Web Service installation. We discuss the general SDK architecture and components, show how to implement vertical and horizontal scaling, and provide environment recommendations.
Overview
The backend of Document Reader SDK consists of the Web Service and Document Reader Core.
The Web Service handles incoming HTTP requests, processes them via the Document Reader Core, and returns HTTP responses. The Document Reader Core directly processes the requests.
This combination of the Web Service and Document Reader Core is called a worker. By default, a single worker is launched on a machine with the Document Reader SDK. See the following scheme.
For the Server-Side Verification feature support, the following additional components are required:
- Database
- File storage
For more details, see the Architecture page.
A worker processes requests in a single-threaded mode, meaning it handles one request at a time. If multiple HTTP requests are sent to the same worker simultaneously, they will queue up in a FIFO (First In, First Out) order and be processed one by one.
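This behavior can be illustrated with a self-contained Python sketch that models a worker as a single-threaded executor. This is a model, not the actual service, and the service time value is an assumption chosen for illustration.

```python
# A self-contained model of FIFO queueing on one worker: the worker is a
# single-threaded executor, so simultaneous requests are processed one by one,
# and each request waits for all earlier ones. The service time is illustrative.
import time
from concurrent.futures import ThreadPoolExecutor

SERVICE_TIME_S = 0.8  # assumed processing time for one request

def worker_process(request_id: int) -> str:
    time.sleep(SERVICE_TIME_S)  # stands in for Document Reader Core processing
    return f"request {request_id} done"

# One worker = one thread; four simultaneous requests queue up (FIFO).
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=1) as worker:
    futures = [worker.submit(worker_process, i) for i in range(4)]
    for future in futures:
        print(f"{future.result()} after {time.perf_counter() - start:.1f}s")
```

The printed completion times grow roughly linearly (0.8 s, 1.6 s, 2.4 s, 3.2 s): the last request spends most of its response time waiting in the queue.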
Your infrastructure planning should be based on your requirements for:
- The processing speed for one request.
- The number of parallel requests per unit of time.
The request processing speed depends on:
- The document image quality.
- The processing scenario, for example, `FullProcess` takes longer than `Mrz`.
- CPU performance.
- The number of CPUs allocated per worker. Request stages are parallelized by design, and a worker always tries to occupy all available hardware resources, so 1 worker running on 4 CPUs will execute a request faster than 1 worker running on 1 CPU.
The number of parallel requests defines the number of running workers: if the number of workers is significantly lower than the number of parallel requests per unit of time, the requests queue up and wait for their execution. This directly affects the request processing speed.
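As a quick sanity check, you can estimate whether a given worker pool keeps up with an expected arrival rate. The sketch below is a back-of-the-envelope calculation derived from the throughput and latency relationship described on this page; all names and numbers are illustrative.

```python
# Back-of-the-envelope saturation check: `workers` parallel workers with a mean
# service time of `latency_s` seconds sustain at most workers / latency_s
# requests per second. If the arrival rate exceeds that, the queue grows and
# observed response times climb.
def is_saturated(arrival_rate_rps: float, latency_s: float, workers: int) -> bool:
    max_throughput_rps = workers / latency_s
    return arrival_rate_rps > max_throughput_rps

# Example: 20 requests/s arriving, 0.8 s per request, 12 workers ->
# the pool absorbs at most 15 requests/s, so requests start to queue up.
print(is_saturated(20, 0.8, 12))  # True
```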
Scaling Types
There are several scaling strategies that allow you to increase the system performance:
Vertical Scaling (Scaling Up)
By default, a single worker is launched, and it processes all requests in a single-threaded mode. In case there are many incoming requests, they will queue up.
To parallelize processing on a single server, you can launch multiple workers, see the `workers` Web Server parameter. In this case, a single master process will handle all incoming requests, distribute them to separate workers, collect responses, and return them to the client. The master process will also manage the workers.
With Server-Side Verification implemented, it looks the following way:
This setup allows parallel processing of incoming HTTP requests. The number of simultaneously processed requests will equal the number of launched workers. However, since workers share computational resources of one server instance, the processing time for a single request may increase.
The number of workers that can be launched on one instance is limited by the available RAM and CPU. For details, see the resource requirements.
See the sample values for the AWS CPU Instance:
| Instance Size | Memory (GiB) | Max Worker Count |
|---|---|---|
| c7.large | 4 | 1 |
| c7.2xlarge | 16 | 5 |
To determine the optimal number of workers, we recommend testing with a load profile matching your business scenario.
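A minimal closed-loop load test can look as follows. This is a sketch, not a complete benchmarking tool: the URL is a placeholder for your deployment, and the empty payload must be replaced with a representative request body for your scenario.

```python
# A minimal closed-loop load test: replay a typical request at a fixed
# concurrency and report throughput and latency percentiles. Vary the worker
# count between runs and compare the numbers to find the optimal configuration.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # pip install requests

SERVICE_URL = "http://localhost:8080/api/process"  # placeholder; adjust
PAYLOAD: dict = {}  # fill in a representative request body for your scenario

def one_request(_: int) -> float:
    start = time.perf_counter()
    requests.post(SERVICE_URL, json=PAYLOAD, timeout=120)
    return time.perf_counter() - start

CONCURRENCY, TOTAL = 8, 200
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(one_request, range(TOTAL)))
elapsed = time.perf_counter() - start

print(f"throughput: {TOTAL / elapsed:.1f} req/s")
print(f"p50 latency: {statistics.median(latencies):.2f}s")
print(f"p95 latency: {statistics.quantiles(latencies, n=20)[18]:.2f}s")
```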
Horizontal Scaling (Scaling Out)
If the performance of a single instance is insufficient for processing the required request flow, the horizontal scaling method can be used by adding more instances to the same setup. In this case, you will need a properly set up load balancer to distribute requests across individual instances.
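For illustration, the sketch below shows the round-robin distribution that a load balancer performs, implemented client-side in Python; the instance addresses and endpoint path are hypothetical. In production, use a dedicated load balancer (for example, nginx or a cloud load balancer) rather than client-side logic, as it also handles health checks and failover.

```python
# An illustrative client-side round-robin over service instances -- this is the
# role a load balancer plays in front of the Web Service. Instance addresses
# are placeholders for your deployment.
import itertools

import requests  # pip install requests

INSTANCES = [
    "http://10.0.0.11:8080",
    "http://10.0.0.12:8080",
    "http://10.0.0.13:8080",
]
_ring = itertools.cycle(INSTANCES)

def send(path: str, body: dict) -> requests.Response:
    # Each call targets the next instance in the ring, spreading the load.
    return requests.post(f"{next(_ring)}{path}", json=body, timeout=120)
```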
We recommend installing the external components (the Database and File storage) on separate instances and carefully following the scaling recommendations from the component vendors. With Server-Side Verification implemented, it looks the following way:
Mixing Strategies
In some cases, you might need to combine both vertical and horizontal scaling to achieve the desired performance and efficiency.
You can start with scaling up a single instance by increasing the number of workers based on the memory and CPU capabilities as detailed in the Vertical Scaling section.
Once the vertical scaling limit is reached, you can add more instances to handle the additional load. Use a load balancer to distribute incoming requests evenly across all instances, ensuring no single instance becomes a bottleneck. Install external components on separate instances.
Environment Recommendations
1. Determine the desired processing time for a single request.
2. Select an instance that allows you to achieve the desired processing time. To make the selection, we recommend testing with typical requests matching your business scenario.
3. Determine the load profile, as this will influence the required number of workers and, consequently, the resource and scaling scheme.
Consider peak values and their duration, as averaging over a long period may underestimate the number of requests per unit of time that need to be processed. As a result, a configuration based on the average value may not handle peak loads.
4. Determine the number of workers required to process the desired request flow (target throughput) at a known processing time for a single request (latency) by the formula below; a small calculation sketch follows this list:
Worker count = target throughput × latency
For example, to process 15 requests per second with a processing time of 0.8 seconds per request, the number of workers is calculated as follows:
15 requests per second × 0.8 seconds per request = 12 workers
Depending on the number of workers, you can choose the appropriate scaling scheme.
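The same calculation as a small Python helper; fractional results are rounded up, since a partial worker cannot be launched:

```python
# Worker count = target throughput (requests/s) x latency (s per request),
# rounded up to the next whole worker.
import math

def worker_count(target_throughput_rps: float, latency_s: float) -> int:
    return math.ceil(target_throughput_rps * latency_s)

print(worker_count(15, 0.8))  # 12, matching the example above
```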
The load profile is likely to vary cyclically over time. For example, the main request flow may occur during the day, with significantly fewer requests at night. Or there may be peak hours, for example, at the beginning of the working day, when a significantly larger flow of requests arrives than at other times. In such cases, with horizontal scaling, it might be beneficial to adjust the number of instances according to the current load. A monitoring system for the load and instance management is necessary.
We recommend monitoring CPU utilization. As a rule of thumb, if utilization is 80% or higher, a new instance should be launched. However, consider the instance launch time and the rate of load increase. It may be necessary to lower the threshold.
If the load distribution over time is known, the necessary number of instances can be launched according to a schedule. For example, if the peak is at 8 AM and the launch takes about 10 minutes, start the deployment beforehand at 7:45 AM.
Similar ideas apply to reducing the number of instances when the load decreases.
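As an illustration, schedule-based scaling on AWS can be sketched with boto3 scheduled actions. The Auto Scaling group name, capacities, and schedule below are hypothetical, and AWS credentials are assumed to be configured; the demo project linked below shows a complete implementation.

```python
# A sketch of schedule-based scaling with AWS Auto Scaling (boto3), assuming
# the service instances run in an existing Auto Scaling group.
import boto3  # pip install boto3

autoscaling = boto3.client("autoscaling")

# Scale out ahead of the 8 AM peak, leaving time for instances to start...
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="docreader-asg",        # hypothetical group name
    ScheduledActionName="scale-out-before-peak",
    Recurrence="45 7 * * *",                     # 7:45 AM, cron syntax (UTC by default)
    DesiredCapacity=6,
)

# ...and scale back in once the peak has passed.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="docreader-asg",
    ScheduledActionName="scale-in-after-peak",
    Recurrence="0 11 * * *",
    DesiredCapacity=2,
)
```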
You can find an example implementation of a scaling scheme for AWS on GitHub: AWS EC2 Regula Forensics Demo.
Make sure to check the allowed number of database connections, as the default setting is often not very high.
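For example, assuming a PostgreSQL database (adjust the driver and query for your database), the current limit can be checked as follows; the connection string is a placeholder. Each worker may hold its own connection(s), so the limit must cover workers × instances plus headroom for other clients.

```python
# Check the PostgreSQL connection limit; the DSN is a placeholder.
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("postgresql://user:password@db-host:5432/docreader")
with conn, conn.cursor() as cur:
    cur.execute("SHOW max_connections;")
    print("max_connections:", cur.fetchone()[0])
conn.close()
```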
Performance Testing
To define the optimal server configuration and specific hardware/software parameters, you need to evaluate the system performance periodically. To facilitate the capacity testing process, Regula provides a dedicated GitHub sample project and a testing techniques guide.