Som's Tech blog

I plan to use this platform to share my knowledge, experiences, and technical expertise with the aim of motivating other software engineers. Whether you’re new to coding or an experienced developer looking for new ideas, I welcome you to join me in this journey of discovery and creativity.

Optimize, Retry, Succeed: Enhancing AWS SQS Message Processing with Java

In this blog, I’ll delve into asynchronous retry mechanisms and illustrate the process of implementing retries through the inherent functions of a broker, leveraging AWS SQS. This method will optimize the message processing in success and failure cases.

While crafting our integration points with remote APIs, it’s crucial to account for potential error scenarios, categorized as functional and non-functional. Non-functional errors, such as temporary network outages, infrastructure glitches, or sluggish resources, are often retriable.

To address these failures, we can implement two types of retry mechanisms: synchronous and asynchronous.

Synchronous retries block the calling thread, attempting the operation again after "sleep" intervals in between. It is common either to implement a custom algorithm or to rely on a library such as Spring Retry, which ships with advanced policies like exponential backoff (e.g., retry after 3 seconds, then after 6 seconds).
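As a rough illustration, a hand-rolled synchronous retry with exponential backoff might look like the following sketch. The retryWithBackoff helper and its parameter names are my own for this example, not part of any library:

```java
import java.util.concurrent.Callable;

public class SyncRetry {

    // Retries the given operation up to maxAttempts times, sleeping
    // between attempts with an exponentially growing delay
    // (baseDelayMs, 2*baseDelayMs, 4*baseDelayMs, ...).
    public static <T> T retryWithBackoff(Callable<T> operation,
                                         int maxAttempts,
                                         long baseDelayMs) throws Exception {
        long delay = baseDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay); // blocks the calling thread
                    delay *= 2;          // exponential backoff
                }
            }
        }
        throw last;
    }
}
```

With baseDelayMs = 3000 this reproduces the 3-then-6-second schedule mentioned above; Spring Retry offers the same behavior declaratively.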

Although synchronous retries can be costly, they are sometimes necessary. For instance, if the second operation depends on input from the first, blocking the flow with synchronous retries becomes unavoidable.

In cases where the remote API operates asynchronously, and the sequence of API operations is less critical, opting for asynchronous retry mechanisms enhances scalability. There are various ways to implement asynchronous retries, and in this article, I’ll focus on leveraging Message Queues.

Queueing is a straightforward concept: publishers send messages to a queue, and consumers retrieve them. However, modern queue brokers (such as RabbitMQ, ActiveMQ, certain JMS providers like JBoss, AWS SQS, etc.) enhance this concept with advanced features like TTLs, Dead Letter Exchanges or Queues, Routing, and others. These features allow us to address non-functional requirements without the need for additional code.

AWS Simple Queue Service (SQS) comes with additional features, and one particularly useful feature we’ll explore is the ‘Visibility Timeout.’ This timeout setting plays a crucial role in managing how messages are processed within your application.

Determining the appropriate visibility timeout depends on the duration your application requires to process and delete a message. For instance, if your application takes 10 seconds to process a message and the visibility timeout is set to 15 minutes, you face a significant delay before attempting to process the message again in case of a previous failure. On the other hand, if the visibility timeout is set to only 2 seconds, a potential issue arises where another consumer might receive a duplicate message while the original consumer is still actively processing it.

To ensure an optimal balance, consider the following strategies:

  1. If you have a clear understanding or can reasonably estimate the processing time for a message, set the visibility timeout to the maximum time required for processing and deleting the message. Refer to the ‘Configuring the Visibility Timeout’ section for more details.

  2. If the exact processing time is unknown, implement a heartbeat mechanism for your consumer process. Start with an initial visibility timeout (e.g., 2 minutes), and as long as your consumer is actively working on the message, continually extend the visibility timeout by 2 minutes every minute.
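One way to sketch such a heartbeat is with a scheduled task. Here the actual visibility extension (in practice a wrapper around sqsClient.changeMessageVisibility for the in-flight message) is injected as a Runnable; the class name and constructor shape are assumptions for this sketch:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class VisibilityHeartbeat implements AutoCloseable {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final ScheduledFuture<?> task;

    // extendVisibility would wrap sqsClient.changeMessageVisibility(...)
    // for the message being processed. For the strategy above, period
    // would be 1 minute, and each call would extend the timeout by 2 minutes.
    public VisibilityHeartbeat(Runnable extendVisibility, long period, TimeUnit unit) {
        this.task = scheduler.scheduleAtFixedRate(extendVisibility, period, period, unit);
    }

    // Stop extending once the message has been processed and deleted.
    @Override
    public void close() {
        task.cancel(false);
        scheduler.shutdown();
    }
}
```

The consumer would open the heartbeat when it receives a message and close it in a finally block once processing finishes.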

We can harness this mechanism within a hypothetical retry scenario, comprising the following stages:

  1. A producer places a message onto the queue.
  2. The consumer retrieves and initiates processing of the message.
  3. Due to external factors like a web service or database failure, the consumer encounters processing issues.
  4. The consumer proceeds to the next message in the queue, leaving the failed one intact.
  5. After the visibility timeout elapses, the unresolved message becomes visible once more to the same consumer and other consumers.
  6. This initiates a recurring cycle, leading to the retry of the same message as the process begins anew.

An important consideration in our setup is the potential for infinite loops, a situation we must proactively avoid. To address this concern, the implementation should incorporate a ‘maximum retries’ feature. Tracking the re-delivery count of each message can be achieved either through message headers or an external system. Upon each consumption, the re-delivery counter is incremented by 1. When the consumer identifies that a message has reached its maximum retries, it intelligently redirects the message to an alternate queue, often referred to as a ‘dead letter queue.’ This enables the potential replaying of failed messages at a later time.

While JMS 2.0 introduces a managed header attribute for broker re-deliveries, it’s crucial to note that this functionality may not be universally available across all non-JMS broker implementations. In such cases, maintaining a Map of message_id versus delivery_count within the consumer application becomes necessary. For distributed consumer fleets, this Map may need relocation to a centralized repository, such as Redis or an RDBMS. Fortunately, AWS SQS conveniently provides a managed attribute at the message level, known as ‘ApproximateReceiveCount,’ which we leverage in our sample retry setup.
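For brokers without such a managed attribute, a minimal in-process delivery counter could look like the sketch below (the class and method names are mine). As noted above, a distributed consumer fleet would move this state to Redis or an RDBMS instead:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class DeliveryCounter {

    // message_id -> number of times this message has been received
    private final ConcurrentMap<String, Integer> counts = new ConcurrentHashMap<>();

    // Called on every consumption; returns the new delivery count.
    public int recordDelivery(String messageId) {
        return counts.merge(messageId, 1, Integer::sum);
    }

    // Called after successful processing or after routing to the DLQ,
    // so the map does not grow without bound.
    public void clear(String messageId) {
        counts.remove(messageId);
    }
}
```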

SQS further offers a Redrive Policy that incorporates a ‘Maximum Receives’ attribute, specifying the maximum number of receives before a message is redirected to a dead letter queue. This policy serves as an effective tool to prevent loops without the need for managing custom counters. However, this article takes a more generic approach to address the problem.
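For reference, a redrive policy is attached to the source queue as a JSON attribute along these lines (the target ARN below is a placeholder; note that AWS expresses the count as a string):

```json
{
  "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:my-dead-letter-queue",
  "maxReceiveCount": "5"
}
```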

Refer to the figure below for a visual representation of the message flow in our sample scenario.

The visibility timeout for an AWS SQS queue can be configured to range from 0 seconds up to 12 hours, providing a broad spectrum of retry frequencies. For instance, setting the visibility timeout (which here acts as a retry TTL) to 1 minute results in a retry occurring roughly every minute.

In this configuration, there are no inherent retry mechanisms built into the broker, such as exponential backoffs. However, consumers can calculate and apply backoffs to the message TTL after its consumption. The AWS API facilitates the dynamic adjustment of the visibility timeout on a per-message basis.
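For example, a consumer could derive a capped exponential backoff from the receive count and pass the result to changeMessageVisibility. The base of 30 seconds and the doubling factor are arbitrary choices for this sketch:

```java
public class Backoff {

    // SQS caps the visibility timeout at 12 hours.
    private static final int MAX_VISIBILITY_SECONDS = 12 * 60 * 60;

    // baseSeconds, 2*baseSeconds, 4*baseSeconds, ... capped at the SQS limit.
    public static int visibilityTimeoutFor(int receiveCount, int baseSeconds) {
        long timeout = (long) baseSeconds << Math.min(receiveCount - 1, 30);
        return (int) Math.min(timeout, MAX_VISIBILITY_SECONDS);
    }
}
```

The consumer would then call something like sqsClient.changeMessageVisibility(queueUrl, receiptHandle, Backoff.visibilityTimeoutFor(receiveCount, 30)) after a failed processing attempt.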

The following Java code illustrates a consumer handling a message fetched from an SQS queue. Note that the consumer runs in an endless polling loop; long polling is enabled on the queue by setting the 'Receive Message Wait Time' parameter to 5 seconds, so empty responses are kept to a minimum. If long polling is not used and the consumer polls continuously, it is advisable to add scheduling mechanisms in the code, such as sleeps or timers, to prevent unnecessary costs.

private static final int maxRetries = 5;
private static final int queueTtl = 30; // seconds

try {
    // ... your service call and business logic ...
    // on success, delete the message from the queue
    sqsClient.deleteMessage(sqsReceiveQueueUrl, message.getReceiptHandle());
} catch (WebApplicationException e) { // e.g. thrown by a JAX-RS client call
    if (e.getResponse().getStatusInfo().getFamily() == Response.Status.Family.SERVER_ERROR) {
        // retry only for 5XX errors
        int previousDeliveries = Integer.parseInt(
                message.getAttributes().get("ApproximateReceiveCount"));
        if (previousDeliveries > maxRetries) {
            LOGGER.error("Cannot reach the service. Max retries reached.");
            String error = e.getResponse().toString(); // e.g. include in the DLQ message

            // move the message to the dead letter queue ...
            SendMessageRequest sendToDlq = new SendMessageRequest()
                    .withQueueUrl(sqsDLQueueUrl)
                    .withMessageBody(dlqMessage)
                    .withMessageAttributes(messageAttributes);
            sqsClient.sendMessage(sendToDlq);
            // ... and remove it from the source queue
            sqsClient.deleteMessage(sqsReceiveQueueUrl, message.getReceiptHandle());
        } else {
            // make the message visible again after a delay that grows
            // slightly with each delivery (a simple backoff)
            sqsClient.changeMessageVisibility(sqsReceiveQueueUrl,
                    message.getReceiptHandle(), previousDeliveries + queueTtl);
        }
    }
}

In this article, I have endeavored to elucidate asynchronous retry mechanisms and illustrate the process of implementing retries through a broker’s native functions. While I showcased the demonstration using the serverless AWS SQS product, it’s essential to note that these concepts are applicable to other brokers as well.
