AntiFragile
Overview
The purpose of the tests was to evaluate the system's response to various failure scenarios and its ability to restore full operation after an incident.
Various problems can occur during system operation; we limited ourselves to those that are most representative:
- Sudden stoppage of a step-executor (a non-core connector service to an external system) for 1 minute
- Planned stoppage of a step-executor
- Three sudden stoppages of a step-executor followed by a shutdown
- Stopping and restoring RabbitMQ operation
- Stopping and restoring MongoDB operation
- Simulation of node (EC2) failure in an EKS cluster
For the test environment specification, please refer to the performance test page.
Results
- System handled failures with minimal disruption
- Queued commands and orders were mostly processed after the affected services were restored
- Errors due to unavailability were typical and recoverable
- Auto-scaling effectively managed node failures
Meat & Bones
Step-executor failure
Overview
Sudden stoppage (kill) of the step-executor (kubectl delete --now) for 1 minute (from 15:02:30 to 15:05:07), restart at 15:06:20.
Purpose
Simulation of system behavior when one service stops functioning and cannot restart.
Result
No disruptions occurred in system operation, except for delays in the relevant execution steps due to the unavailability of the step-executor service.
The queue of commands that actuate the step-executor started to grow, but after the restart, all the messages were processed:

Conclusions
The service was stopped precisely between two consecutive starts of this execution step (triggered at 1-second intervals), and the commands for subsequent starts began to queue up.
The lack of redelivery via RabbitMQ indicates that the in-flight operation was not interrupted and the command that triggered the step-executor did not return to the queue.
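The report does not state how the step-executor consumes its start commands; the minimal sketch below assumes manual acknowledgments with the RabbitMQ Java client (queue name and broker address are hypothetical) and illustrates why the absence of redelivery implies the in-flight command had completed before the kill.

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;
import java.nio.charset.StandardCharsets;

// Sketch of a manually-acknowledging consumer. The queue name
// "step-executor.start-commands" and broker host are hypothetical.
public class StartCommandConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // placeholder broker address

        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        DeliverCallback onDeliver = (consumerTag, delivery) -> {
            executeStep(new String(delivery.getBody(), StandardCharsets.UTF_8)); // do the work first
            // Ack only after the step has finished. If the process is killed
            // before this line, the unacknowledged message returns to the queue
            // and is redelivered - which is exactly what was NOT observed here,
            // so the in-flight command had already completed.
            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
        };

        // autoAck = false: the broker keeps the message until basicAck is called.
        channel.basicConsume("step-executor.start-commands", false, onDeliver, consumerTag -> {});
        // connection intentionally left open so the consumer keeps running
    }

    private static void executeStep(String command) {
        // placeholder for the actual execution-step logic
    }
}
```

With auto-acknowledgment disabled, a consumer killed between receiving a message and acknowledging it would trigger a redelivery; its absence in the logs supports the conclusion above.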
Step-executor shutdown
Overview
Graceful stoppage of the IMS (Inventory Management System) step-executor (kubectl scale --replicas=0) from 15:20:50 to 15:21:30.
Purpose
Simulation of system behavior during planned unavailability of a step-executor.
Result
Three orders got stuck at various steps of the fulfillment process:

The asset was created, but the callback did not return to COM:

Despite the callback not being successfully executed, the asset was saved in IMS:

Execution-step Configure CFS x2 (ID: o-7bb18d29-9da2-4aac-a2b7-a9accaf2e233):

Configure CFS x1 (ID: o-d3639f4c-6caa-4d36-98f0-443ade50e024):

Observation of the dashboard indicates that the orders are stuck in the FULFILLING status (i.e., the fulfillment of the order was not completed):

Conclusions
To verify the final result of the test, the IMS state should be checked and the actions compensated appropriately, as follows:
- Manual correction of the payload and skipping the step.
- Repeating the step to recreate the product and remove the old, unused asset.
- Potential fixes on the IMS side: retrying the callback or adding a DLQ to the callback queue (see the sketch below).
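As an illustration of the DLQ option, a minimal topology sketch with the RabbitMQ Java client is shown below; all exchange and queue names are hypothetical, since the report does not give the real ones.

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import java.util.Map;

// Sketch: declare the callback queue with a dead-letter exchange so that
// rejected or expired callback messages are captured instead of being lost.
public class CallbackQueueTopology {
    public static void main(String[] args) throws Exception {
        try (Connection conn = new ConnectionFactory().newConnection();
             Channel channel = conn.createChannel()) {

            // Dead-letter exchange and its queue (names are placeholders).
            channel.exchangeDeclare("ims.callback.dlx", "fanout", true);
            channel.queueDeclare("ims.callback.dlq", true, false, false, null);
            channel.queueBind("ims.callback.dlq", "ims.callback.dlx", "");

            // Messages rejected without requeue (or expired) on the callback
            // queue are routed to ims.callback.dlx and land in ims.callback.dlq.
            channel.queueDeclare("ims.callback", true, false, false,
                    Map.<String, Object>of("x-dead-letter-exchange", "ims.callback.dlx"));
        }
    }
}
```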
Step-executor recurring failure
Overview
Three sudden stoppages (kubectl delete --now) of the IMS step-executor, followed by scaling it down to zero (kill 1: 10:20:40, kill 2: 10:22:08, kill 3 + scale down: 10:22:59).
Purpose
Simulation of system behavior when a service suddenly stops functioning and the administrator decides to restart it (e.g., after a failed deployment).
Result
The mock reported errors about being unable to send callbacks while the IMS step-executor was unavailable.
After restarting the IMS step-executor, commands from the queue began to create assets, but the callbacks arrived too quickly, causing errors.
No messages were observed to be stuck in the queue of commands actuating the IMS step-executor.
Screenshot of the queue that commissions the execution of callbacks meant to report the creation of a new asset:

Order stuck at the ConfigureCFS execution step because the step-executor did not receive the callback:

Callback sending started (highlighted TraceID):

Callback sending ended with Connection Refused error (same TraceID):

The IMS step-executor was brought back to life after an hour, after which it immediately processed the pending execution-step start commands:

Conclusions
Restarting the IMS step-executor resulted in the consumption of all pending execution-step start commands that had accumulated during the service's unavailability.
At the same time, after the restart, the IMS step-executor was not yet ready to accept incoming HTTP traffic, which caused the callbacks to be rejected.
Potential fix: the IMS step-executor should wait for its HTTP server (listening for callbacks) to start before beginning to execute start commands from the RabbitMQ queue - this needs to be corrected on the step-executor side.
Expected behavior: readiness for communication using HTTP should be simultaneous with readiness for communication via RabbitMQ.
Introducing consistent mechanisms to compensate for callback errors (e.g., a retry policy, a DLQ, monitoring) could also mitigate the above problem.
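The step-executor's technology stack is not stated in this report; assuming a Spring Boot / Spring AMQP service, the ordering fix could look like the sketch below, where the start-command listener (hypothetical listener id and queue name) stays stopped until the application, including its embedded HTTP server, is fully up.

```java
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.amqp.rabbit.listener.RabbitListenerEndpointRegistry;
import org.springframework.boot.context.event.ApplicationReadyEvent;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;

// Sketch only: listener id and queue name are hypothetical.
@Component
public class StartCommandListener {

    private final RabbitListenerEndpointRegistry registry;

    public StartCommandListener(RabbitListenerEndpointRegistry registry) {
        this.registry = registry;
    }

    // autoStartup = "false": do not consume start commands the moment the
    // AMQP infrastructure is ready.
    @RabbitListener(id = "start-commands",
                    queues = "step-executor.start-commands",
                    autoStartup = "false")
    public void onStartCommand(String command) {
        // execute the step; by now the callback HTTP endpoint is reachable
    }

    @EventListener(ApplicationReadyEvent.class)
    public void startConsuming() {
        // ApplicationReadyEvent fires after the embedded web server has started,
        // so start commands are consumed no earlier than HTTP readiness.
        registry.getListenerContainer("start-commands").start();
    }
}
```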
RabbitMQ failure
Overview
Stopping and restarting RabbitMQ according to the following specification:
- Test start (beginning of JMeter operation): 13:28:20
- RabbitMQ stop: 13:30:43
- Sending rate: 200 orders/min
- Stopping was done by suspending RabbitMQ communication with consumers, i.e.:
- rabbitmqctl suspend_listeners
- rabbitmqctl close_all_connections --vhost / "Closed by request"
- Resuming RabbitMQ operation: 13:32:56 (time of the last error reported by the incoming-orders service)
- Test end: 13:33:20
Purpose
Simulation of system behavior during a major infrastructure failure.
Result
1001 orders were sent and 557 were received; 13 orders got stuck in the FULFILLING status and 544 were completed (status COMPLETED). The performance report is attached below.
Rejected orders are present – the system did not accept them because they could not be saved in the input buffer queue.

All orders accepted by the incoming-orders service were recorded in the system as orders visible from the GUI.
Screenshot of the incoming-to-orders queue, which uses the publisher-confirms mechanism to guarantee writes in case of failure. With this mechanism, the sender receives an acknowledgment only after the broker guarantees that the message has been successfully delivered or persisted:

Overview of all queues:

Conclusions
Operations requiring synchronous communication through RabbitMQ (e.g., updating orders) are sensitive to infrastructure unavailability.
Orders that did not require synchronous communication patiently waited for the restoration of service connections to RabbitMQ.
Due to the interruptions in infrastructure availability, the HTTP callback requests coming to the IMS step-executor could not be processed, so the execution step could not be completed. Repeating the ConfigureCFS step resulted in the creation of another process on the step-executor side (with a different ID).
Services that lost connections to the message broker automatically regained them after its availability was restored. No service restarts were required.
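As an illustration of the two mechanisms mentioned in this section (publisher confirms on the order intake path and automatic reconnection after the broker returns), here is a minimal producer sketch with the RabbitMQ Java client; the broker address and the routing to the incoming-to-orders queue are placeholders, not the services' actual configuration.

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;
import java.nio.charset.StandardCharsets;

// Sketch of a publisher that (a) survives a broker outage via automatic
// connection recovery and (b) confirms every publish before acknowledging
// the caller. Queue name and host are placeholders.
public class OrderPublisher {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbitmq");                 // placeholder broker address
        factory.setAutomaticRecoveryEnabled(true);   // reconnect when the broker returns
        factory.setNetworkRecoveryInterval(5000);    // retry every 5 s

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {

            channel.confirmSelect(); // enable publisher confirms on this channel

            channel.basicPublish("", "incoming-to-orders",
                    MessageProperties.PERSISTENT_TEXT_PLAIN,
                    "{\"orderId\":\"example\"}".getBytes(StandardCharsets.UTF_8));

            // Blocks until the broker confirms the message was routed and
            // persisted; throws if the confirm does not arrive in time, so the
            // caller can report the order as rejected instead of losing it.
            channel.waitForConfirmsOrDie(5_000);
        }
    }
}
```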
MongoDB failure
Overview
Simulation of the sudden stoppage and restart of MongoDB (permissions for connections to all services were removed via the MongoAtlas admin panel, and then restored after a moment).
Purpose
Simulation of system behavior during a major infrastructure failure.
Results
- Test duration: 16:41:35 - 16:46:30
- MongoDB unavailability: 16:43:00 - 16:44:15
- Sending rate: 200 orders/min
- During the test period, 1001 orders were sent. The test concluded with the following results:
- Completed: 996
- Fulfilling: 5 (current errors: 6)
- Messages appeared in the DLQ as a result of write/read errors during processing in the core elements.
- Entries related to processing errors due to database unavailability appeared in the fulfillment error collection. Since the database is accessed directly either by core services or by step-executors, these errors are handled in the standard way and end up in a separate error-handling domain.

Reported error cases had one of two causes:
- Timeout exception during order update - stalled transaction
- Timeout exception during order fetching
During the unavailability of consumers on the core order service side, the input queue (order-capture) began to accumulate incoming orders. After MongoDB was brought back into operation, they were processed at a very rapid pace.

Conclusions
- Execution step start commands were not lost.
- The fulfillment service was not observed to send messages to the DLQ.
- The unavailability of the MongoDB database suspends order processing in the services that use it. This results from threads blocking while waiting to obtain a connection to the database (see the sketch after this list).
- Threads that were started at the moment database access was interrupted end with a write or read error or, in the case of core services, their messages are returned to the queue and processed again.
- Services that lost connections to the database automatically regained them after its availability was restored. No service restarts were required.
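The blocking and timeout behavior described above is governed by driver-side settings; as a hedged illustration (the services' actual stack and values are not given in this report), the MongoDB Java driver allows these waits to be bounded explicitly.

```java
import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import java.util.concurrent.TimeUnit;

// Sketch: explicit bounds on how long a thread may block waiting for the
// database. Connection string and timeout values are illustrative only.
public class MongoTimeouts {
    public static MongoClient build() {
        MongoClientSettings settings = MongoClientSettings.builder()
                .applyConnectionString(new ConnectionString("mongodb://mongo:27017")) // placeholder
                .applyToClusterSettings(b ->
                        // how long to wait for a reachable server before failing
                        b.serverSelectionTimeout(10, TimeUnit.SECONDS))
                .applyToConnectionPoolSettings(b ->
                        // how long a thread may wait for a pooled connection
                        b.maxWaitTime(5, TimeUnit.SECONDS))
                .applyToSocketSettings(b ->
                        // read timeout on an established connection
                        b.readTimeout(10, TimeUnit.SECONDS))
                .build();
        return MongoClients.create(settings);
    }
}
```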
K8s cluster node failure
Overview
- Number of nodes (EC2) in the cluster: 4
- Setting the number of service pod replicas to two (2)
- Running the performance test (traffic rate: 200 req/min, duration: 5 min)
- Stopping the EC2 machine - terminate command
- Auto Scaling Group (EKS configuration) creates a new EC2 instance after detecting the node stop in the cluster
Purpose
Simulate node failure by terminating it and triggering the cluster auto-scaling mechanism.
Result
- Number of replicas: 2 pods each
- Test: 200 orders/min for 5 min (disaster-recovery-infra.jmx)
- Queue state: all empty
- Database state: empty
- Test start: 10:20:30
- Node kill: 10:21:00
- Pod termination start: 10:21:50
- Test stop: 10:25:26
- Queue discharge time after test completion: 10:31:00

Main order status counts
| Status | Count |
|---|---|
| Completed | 893 |
| Error | 0 |
| Fulfilling | 24 |
Other order status counts
| Status | Count |
|---|---|
| CREATED | 1 |
| PREPROCESSING | 1 |
| VALIDATING | 2 |
| PLAN BUILDING | 1 |
Error types & counts
| Step | Error | Count |
|---|---|---|
| ConfigureCfsIMS | Connection timed out executing POST request | 4 |
| ConfigureCfsIMS | WriteConflict to MongoDB | 1 |
| CreateProductIMS | Connection timed out executing POST request | 6 |
| CreateProductIMS | WriteConflict to MongoDB | 1 |
| UpdateProductIMS | WriteConflict to MongoDB | 6 |
| UpdateProductIMS | Unexpected end of file from server | 1 |
| UpdateProductIMS | Connection timed out executing PUT request | 1 |
Description of rejected orders (not accepted by the system) and potential reasons

Conclusions
- The cluster auto-scaling mechanism initiated a new instance after detecting the node failure
- The cluster's reaction and full restoration of availability took approximately 3 minutes
- Pods that were not affected by the failure continued standard processing
- Newly created pods took over processing
- Pods that failed introduced typical errors resulting from unavailability, e.g., timeouts and unexpected end of file. These errors can be repaired from the CDOM or the administration console