Retry

1. Introduction

There are currently three types of retries: Exception Retry, Backup Request and Connection Failed Retry. Among them, Connection Failed Retry is a network-level problem, since the request is not sent, the framework will retry by default. Here we only present the use of the first two types of retries:

  • Exception Retry: Improve the overall success rate of the service.
  • Backup Request: Reduce delay jitter of request.

Because many requests are not idempotent, these two types of retries are not used as the default policy.

1.1 Attention

  • Confirm that your service is idempotent before enable retry.
  • Exception Retry will increase overall latency.

2. Retry Policy

Only one of the Exception Retry and Backup Request policies can be configured at method granularity.

  • Exception Retry

The default is for timeout retry only, and it can be configured to support specific exception or Resp retry.

Configuration ItemDefault valueDescriptionLimit
MaxRetryTimes2The first request is not included. If it is configured as 0, it means to stop retrying.Value: [0-5]
MaxDurationMS0Including the time-consuming of the first failed request and the retry request. If the limit is reached, the subsequent retry will be stopped. 0 means unlimited. Note: if configured, the configuration item must be greater than the request timeout.
EERThreshold10%If the method-level request error rate exceeds the threshold, retry stops.Value: (0-30%]
ChainStop-Chain Stop is enabled by default. If the upstream request is a retry request, it will not be retried.>= v0.0.5 as the default policy.
DDLStopfalseIf the timeout period of overall request chain is reached, the retry request won’t be sent with this policy. Notice, Kitex doesn’t provide build-in implementation, use retry.RegisterDDLStop(ddlStopFunc) to register is needed.
BackOffNoneRetry waiting strategy, NoneBackOff by default. Optional: FixedBackOff, RandomBackOff.
RetrySameNodefalseBy default, Kitex selects another node to retry. If you want to retry on the same node, set this parameter to true.
  • Backup Request
Configuration ItemDefault valueDescriptionLimit
RetryDelayMS-Duration of waiting for initiating a Backup Requset when the first request is not returned. This parameter must be set manually. It is suggested to set as TP99.
MaxRetryTimes1The first request is not included. If it is configured as 0, it means to stop retrying.Value: [0-2]
EERThreshold10%If the method-level request error rate exceeds the threshold, retry stops.Value: (0-30%]
ChainStopfalseChain Stop is enabled by default. If the upstream request is a retry request, it will not be retried after timeout.>= v0.0.5 as the default policy.
RetrySameNodefalseBy default, Kitex selects another node to retry. If you want to retry on the same node, set this parameter to true.

3. How to use

3.1 Enable by Code Configuration

Note: Dynamic configuration (see 3.3) cannot take effect if retry is enabled by code configuration.

3.1.1 Exception Retry Configuration

  • Configuration e.g.
// import "github.com/cloudwego/kitex/pkg/retry"
fp := retry.NewFailurePolicy()
fp.WithMaxRetryTimes(3) // set the maximum number of retries to 3
xxxCli := xxxservice.NewClient("destServiceName", client.WithFailureRetry(fp))
  • Strategy selection:
fp := retry.NewFailurePolicy()

// Number of retries. The default value is 2, excluding the first request.
fp.WithMaxRetryTimes(xxx)

// Total time consuming. Including the time-consuming for the first failed request and retry request. If the duration limit is reached, the subsequent retries are stopped.
fp.WithMaxDurationMS(xxx)

// Disable `Chain Stop`
fp.DisableChainRetryStop()

// Enable DDL abort
fp.WithDDLStop()

// Backoff policy. No backoff strategy by default.
fp.WithFixedBackOff(fixMS int) // Fixed backoff
fp.WithRandomBackOff(minMS int, maxMS int) // Random backoff

// Set errRate for retry circuit breaker
fp.WithRetryBreaker(errRate float64)

// Retry on the same node
fp.WithRetrySameNode()
3.1.1.1 Retry with Specific Result(Exception/Resp)

v0.4.0 is supported.

It can be configured to support specific result to retry, and the result can be request failure or Resp. Because the business may set status information in Resp, and return retry for a certain type. This is collectively referred to as exception retry.

  • Configuration e.g.
// import "github.com/cloudwego/kitex/pkg/retry"

var opts []client.Option
opts = append(opts, client.WithSpecifiedResultRetry(yourResultRetry))

xxxCli := xxxservice.NewClient(targetService, opts...)
  • retry.ShouldResultRetry Definition

In order to judge error and resp at specific method granularity, rpcinfo is provided as an input argument. The method can be obtained through ri.To().Method().

// ShouldResultRetryit is used for specifying which error or resp need to be retried
type ShouldResultRetry struct {
   ErrorRetry func(err error, ri rpcinfo.RPCInfo) bool
   RespRetry  func(resp interface{}, ri rpcinfo.RPCInfo) bool
}
  • Specific Exception/Resp Implementation e.g.

    • Resp:

      Resp of Thrift and KitexProtobuf protocol correspond to *XXXResult in the generated code, not the real business Resp. To get the real Resp, you need to assert interface{ GetResult() interface{} }.

    • Error:

      The error returned by the peer, kitex will be uniformly encapsulated as kerrors.ErrRemoteOrNetwork. For Thrift and KitexProtobuf, the following examples can get the Error Msg returned by the peer. For gRPC, if the peer returns an error constructed by status.Error, and the local can use status.FromError(err) to get *status.Status. Pay attention to Status needs to be provided by Kitex, and the package path is github.com/cloudwego/kitex/pkg/remote/trans/nphttp2/status.

// retry with specify Resp for one method
respRetry := func(resp interface{}, ri rpcinfo.RPCInfo) bool {
   if ri.To().Method() == "mock" {
        // Notice: you should test with your code, this is only a demo, thrift gen-code of Kitex has GetResult() interface{}
        if respI, ok1 := resp.(interface{ GetResult() interface{} }); ok1 {
            if r, ok2 := respI.GetResult().(*xxx.YourResp); ok2 && r.Msg == retryMsg {
              return true
            }
			}
   }
   return false
}
// retry with specify Error for one method
errorRetry := func(err error, ri rpcinfo.RPCInfo) bool {
   if ri.To().Method() == "mock" {
      if te, ok := errors.Unwrap(err).(*remote.TransError); ok && te.TypeID() == -100 {
         return true
      }
   }
   return false
}
// client option
yourResultRetry := &retry.ShouldResultRetry{ErrorRetry:errorRetry , RespRetry: respRetry}
opts = append(opts, client.WithSpecifiedResultRetry(yourResultRetry))

In particular, for Thrift’s Exception, although the rpc call layer returns an error, the internal processing of the framework is actually regarded as a one-time cost RPC request (because there is an actual return). If you want to judge it, you need to pay attention to two points:

  1. Judge by resp instead of error.
  2. If the method retry is successful, namely, GetSuccess() != nil, you need to reset Exception to nil. Because the retry uses the XXXResult, and the Resp and Exception correspond to the two fields of XXXResult. Exception has been set for the first, and the second successfully set to Resp. However, the framework layer will not reset Exception, and the user needs to reset it by himself.

e.g.

respRetry := func(resp interface{}, ri rpcinfo.RPCInfo) bool {
    if ri.To().Method() == "testException" {
        teResult := resp.(*stability.TestExceptionResult)
        if teResult.GetSuccess() != nil {
           teResult.SetStException(nil)
        } else if teResult.IsSetXXException() && teResult.XxException.Message == xxx {
           return true
        }
    }
    return false
}

3.1.2 Backup Request Configuration

  • Retry Delay recommendations

It is recommended to configure as TP99, then 1% request will trigger Backup Request.

  • Configuration e.g.
// If the first request is not returned after XXX ms, the backup request will be initiated and the `Chain Retry Stop` is enabled
bp := retry.NewBackupPolicy(xxx)
xxxCli := xxxservice.NewClient("targetService", client.WithBackupRequest(bp))
  • Strategy selection:
bp := retry.NewBackupPolicy(xxx)

// Number of retries. The default value is 1, excluding the first request.
bp.WithMaxRetryTimes(xxx)

// Disable `Chain Stop`
bp.DisableChainRetryStop()

// Set errRate for retry circuit breaker
bp.WithRetryBreaker(errRate float64)

// Retry on the same node
bp.WithRetrySameNode()

3.1.3 Method Granularity Configuration Retry

v0.4.0 is supported.

The sample configuration of 3.1.1, 3.1.2 will take effect for all methods. If you want to configure retry only for some methods, or configure on Failure Retry or BackupRequest for different methods respectively, configure as follows:

  • Configuration e.g.
// import "github.com/cloudwego/kitex/pkg/retry"
methodPolicies := client.WithRetryMethodPolicies(map[string]retry.Policy{
   "method1":        retry.BuildFailurePolicy(retry.NewFailurePolicy()),
   "method2":       retry.BuildFailurePolicy(retry.NewFailurePolicyWithResultRetry(yourResultRetry))})

// other methods do backup request except above methods
otherMethodPolicy := client.WithBackupRequest(retry.NewBackupPolicy(10))
var opts []client.Option
opts = append(opts, methodPolicies, otherMethodPolicy)

xxxCli := xxxservice.NewClient(targetService, opts...)

If both WithFailureRetry or WithBackupRequest are configured, methods not configured in WithRetryMethodPolicies will be executed according to the WithFailureRetry or WithBackupRequest policy. But WithFailureRetry and WithBackupRequest cannot be configured at the same time because they will take effect on all client methods.

3.1.4 Request Level Configuration Retry(callopt)

v0.4.0 is supported.

  • Configuration e.g.
import (
    "github.com/cloudwego/kitex/pkg/retry"
)
// demo1: call with failure retry policy, default retry error is Timeout
resp, err := cli.Mock(ctx, req, callopt.WithRetryPolicy(retry.BuildFailurePolicy(retry.NewFailurePolicy())))

// demo2: call with customized failure retry policy
resp, err := cli.Mock(ctx, req, callopt.WithRetryPolicy(retry.BuildFailurePolicy(retry. NewFailurePolicyWithResultRetry(retry.AllErrorRetry()))))

// demo3: call with backup request policy
bp := retry.NewBackupPolicy(10)
bp.WithMaxRetryTimes(1)
resp, err := cli.Mock(ctx, req, callopt.WithRetryPolicy(retry.BuildBackupRequest(bp)))

3.2 Circuit Breaker Reuse

When circuit breaker is enabled for a service, you can reuse the breaker’s statistics to reduce additional CPU consumption. Note that the error rate threshold for retries must be lower than the threshold for a service.

  • Configuration e.g.
// 1. Initialize kitex's built-in cbsuite
cbs := circuitbreak.NewCBSuite(circuitbreak.RPCInfo2Key)// 2. Initialize retryContainer, passing in ServiceControl and ServicePanel
retryC := retry.NewRetryContainerWithCB(cs.cbs.ServiceControl(), cs.cbs.ServicePanel())

var opts []client.Option
// 3. Set retryContainer
opts = append(opts, client.WithRetryContainer(retryC))
// 4. Set Service circuit breaker
opts = append(opts, client.WithMiddleware(cbs.ServiceCBMW()))

// 5. Initialize Client and pass in the configuration option
cli, err := xxxservice.NewClient(targetService, opts...)

3.3 Dynamic open or adjust strategy

If you want to adjust the policy in combination with remote configuration, dynamic open retry, or runtime, you can take effect through the NotifyPolicyChange method of retryContainer. Currently, the open source version of Kitex does not provide a remote configuration module, and users can integrate their own configuration center. Note: If it is turned on through code configuration, dynamic configuration cannot take effect.

  • Configuration e.g.
retryC := retry.NewRetryContainer()
// demo
// 1. define your change func
// 2. exec yourChangeFunc in your config module
yourChangeFunc := func(key string, oldData, newData interface{}) {
    newConf := newData.(*retry.Policy)
    method := parseMethod(key)
    retryC.NotifyPolicyChange(method, policy)
}


// configure retryContainer
cli, err := xxxservice.NewClient(targetService, client.WithRetryContainer(retryC))

4. Tracking

Kitex records the retry times and previous request time in rpcInfo. You can report or output a retry request based on the retry Tag in Client’s metric or log through:

var retryCount string
var lastCosts string

toInfo := rpcinfo.GetRPCInfo(ctx).To()
if retryTag, ok := toInfo.Tag(rpcinfo.RetryTag); ok {
   retryCount = retryTag
   if lastCostTag, ok := toInfo.Tag(rpcinfo.RetryLastCostTag); ok {
      lastCosts = lastCostTag
   }
}

5. Downstream identification

If using TTHeader as the transport protocol, you can determine if the downstream handler is currently a retry request and decide whether to continue processing.

retryReqCount, exist := metainfo.GetPersistentValue(ctx,retry.TransitKey)

For example, retryReqCount = 2, which means the second retry request (excluding the first request), then the business degradation strategy can be adopted(non-retry requests do not have this information).

Question: Chain Stop is enabled by default, is it necessary for services to identify retry requests?

Answer:Chain Stop means that the retry request on the chain will not be retried. Assuming that there is a request chain A->B->C, A sends a retry request to B, while during B->C, if a timeout occurs or Backup is configured, B will not send a retry request to C. If the service can identify the retry request, it can directly decide whether to continue the request to C. In short, Chain Stop avoids retry amplification caused by B sending a retry request to C. The service’s own control can completely avoid requests from B to C.


Last modified December 2, 2022 : chore: add lark footer link (#460) (f41a884)