Currently, there are three types of retries: Failure Retry, Backup Request, and Mixed Retry.
This document introduces the usage of these three types of retries. Because many business requests are not idempotent, none of them is enabled as a default strategy.
To improve the overall success rate of the service, retries are applied by default only to timeout errors. However, users can also specify retries for particular errors or responses. The specific usage is explained below.
To reduce service latency jitter: if a request does not return within the configured time, another request is sent. The process ends as soon as any request finishes, whether successfully or not.
Mixed Retry combines the Failure Retry and Backup Request capabilities. Its advantages over the first two types of retries:
By default, only timeout errors are retried; it can be configured to also retry on a specific exception or Resp.
Configuration Item | Default value | Description | Limit |
---|---|---|---|
MaxRetryTimes | 2 | The maximum number of retries, excluding the first request. Configuring 0 means no retry. | Value: [0-5] |
MaxDurationMS | 0 | The cumulative maximum time consumed, including the first failed request and the retry requests. Once the limit is reached, subsequent retries stop. 0 means no limit. Note: if configured, this item must be greater than the request timeout. | |
EERThreshold | 10% | If the method-level request error rate exceeds the threshold, retrying stops. | Value: (0-30%] |
ChainStop | - | Chain Stop is enabled by default: if the upstream request is itself a retry request, it will not be retried. | >= v0.0.5 as the default policy. |
DDLStop | false | If the timeout of the overall request chain is reached, the retry request is not sent under this policy. Note: Kitex does not provide a built-in implementation; you need to register one via retry.RegisterDDLStop(ddlStopFunc). | |
BackOff | None | Retry waiting strategy, NoneBackOff by default. Optional: FixedBackOff, RandomBackOff. | |
RetrySameNode | false | By default, Kitex selects a different node for retries. To retry on the same node, set this to true. | |
To allow method-level judgment of the error or resp, rpcinfo is provided as a parameter; the method name can be obtained through ri.To().Method().
// ShouldResultRetry is used to specify which error or resp should be retried
type ShouldResultRetry struct {
    ErrorRetry func(err error, ri rpcinfo.RPCInfo) bool
    RespRetry  func(resp interface{}, ri rpcinfo.RPCInfo) bool
    // disable the default timeout retry in specific scenarios (e.g. the requests are non-idempotent)
    NotRetryForTimeout bool
}
Configuration Item | Default value | Description | Limit |
---|---|---|---|
RetryDelayMS | - | Duration to wait before initiating a Backup Request when the first request has not returned. This parameter must be set manually; TP99 is suggested. | |
MaxRetryTimes | 1 | The maximum number of retries, excluding the first request. Configuring 0 means no retry. | Value: [0-2] |
EERThreshold | 10% | If the method-level request error rate exceeds the threshold, retrying stops. | Value: (0-30%] |
ChainStop | false | Chain Stop is enabled by default: if the upstream request is itself a retry request, it will not be retried after timeout. | >= v0.0.5 as the default policy. |
RetrySameNode | false | By default, Kitex selects a different node for retries. To retry on the same node, set this to true. | |
Configuration Item | Default Value | Description | Legal Value | vs. Failure Retry | vs. Backup Request |
---|---|---|---|---|---|
RetryDelayMS | - | Duration to wait before initiating a Backup Request when the first request has not returned. This parameter must be set manually; TP99 is suggested. | - | —— | Backup Request sends no further requests once any result arrives, regardless of the result; Mixed Retry keeps waiting for a valid result. |
MaxRetryTimes | 1 | The maximum number of retries for both Backup Request and Failure Retry, excluding the first request. Configuring 0 means no retry. | [0-3] | Legal value: [0-5], Default: 2 | Legal value: [0-2], Default: 1 |
MaxDurationMS | 0 | The cumulative maximum time consumed (ms), including the first failed request and the retry requests. Once the limit is reached, subsequent retries stop. 0 means no limit. Note: if configured, this item must be greater than the request timeout. | Minimum: RPCTimeout; theoretical maximum: RPCTimeout * (MaxRetryTimes+1) | Default: 0 | —— |
StopPolicy.CBPolicy | 10% | Retry percentage limit. Note: single-instance, method-level granularity; if the threshold is exceeded, retrying stops. | (0-30%] | Same | Same |
ChainStop | true | Chain Stop is enabled by default: if the upstream request is itself a retry request, it will not be retried. | - | Same | Same |
DDLStop | false | If the timeout of the overall request chain is reached, the retry request is not sent under this policy. Note: Kitex does not provide a built-in implementation; you need to register one via retry.RegisterDDLStop(ddlStopFunc). | - | Same | —— |
BackOff | None | Retry waiting strategy, NoneBackOff by default. Optional: FixedBackOff, RandomBackOff. | - | Same | —— |
It is the same as for Failure Retry; see 2.1.1.
Note: Dynamic configuration (refer to “Dynamic open or adjust strategy”) won’t take effect if retry is enabled by code configuration, since code configuration has higher priority.
The retry layer is implemented before middlewares, so every retry executes all middlewares. When `next` returns an error in a middleware (such as a timeout or network exception), it does NOT mean the entire method call has failed: only this retry attempt failed, and another one may succeed. Please handle it appropriately according to business needs.
The “Custom Fallback” feature can be used to handle responses (fallback policy is executed after all middlewares).
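If you want to turn a call whose attempts are all exhausted into a degraded result, a fallback can be configured on the client. Below is a minimal sketch, assuming the pkg/fallback API (fallback.NewFallbackPolicy, fallback.UnwrapHelper, client.WithFallback) available in newer Kitex versions; buildDegradedResp is an illustrative helper:

import (
    "context"

    "github.com/cloudwego/kitex/client"
    "github.com/cloudwego/kitex/pkg/fallback"
)

fb := fallback.NewFallbackPolicy(
    fallback.UnwrapHelper(func(ctx context.Context, req, resp interface{}, err error) (fbResp interface{}, fbErr error) {
        if err == nil {
            // some attempt succeeded: keep the real response
            return resp, nil
        }
        // all attempts failed: return degraded data instead of the error
        // (buildDegradedResp is an illustrative helper you would implement yourself)
        return buildDegradedResp(req), nil
    }))
xxxCli := xxxservice.NewClient("psm", client.WithFallback(fb))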
// import "github.com/cloudwego/kitex/pkg/retry"
fp := retry.NewFailurePolicy()
fp.WithMaxRetryTimes(3) // retry at most 3 times
xxxCli := xxxservice.NewClient("psm", client.WithFailureRetry(fp))
fp := retry.NewFailurePolicy()
// Number of retries. The default value is 2, excluding the first request.
fp.WithMaxRetryTimes(xxx)
// The cumulative maximum time consumption, including the first failed request and the retry requests. If time consumption reaches the limit, the subsequent retries will be stopped
fp.WithMaxDurationMS(xxx)
// Disable `Chain Stop`
fp.DisableChainRetryStop()
// Enable DDL Stop
fp.WithDDLStop()
// Backoff policy. No backoff policy by default.
fp.WithFixedBackOff(fixMS int) // Fixed backoff
fp.WithRandomBackOff(minMS int, maxMS int) // Random backoff
// Set errRate for retry circuit breaker
fp.WithRetryBreaker(errRate float64)
// Retry on the same node
fp.WithRetrySameNode()
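WithDDLStop only takes effect after a DDL stop function has been registered. A minimal registration sketch, assuming the retry.DDLStopFunc signature func(ctx context.Context, policy retry.StopPolicy) (bool, string) (please verify against your Kitex version); the deadline check and the 10ms margin below are illustrative:

import (
    "context"
    "time"

    "github.com/cloudwego/kitex/pkg/retry"
)

func init() {
    retry.RegisterDDLStop(func(ctx context.Context, policy retry.StopPolicy) (bool, string) {
        // stop retrying when too little of the request chain's time budget is left
        // (the 10ms margin is an illustrative choice)
        if ddl, ok := ctx.Deadline(); ok && time.Until(ddl) < 10*time.Millisecond {
            return true, "request chain deadline reached, stop retrying"
        }
        return false, ""
    })
}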
Supported since v0.4.0 (github.com/cloudwego/kitex). Retry on a specified result can be configured, where the result can be an error or a Resp. Because a business may set status information in the Resp and want to retry on certain statuses, retry on a specified Resp is supported as well; both are collectively referred to as failure retry here.
There are two ways to specify exception/Resp retry:
import "github.com/cloudwego/kitex/pkg/retry"
// Method 1: Configure through WithFailureRetry option.
var opts []client.Option
opts = append(opts, client.WithFailureRetry(
retry.NewFailurePolicyWithResultRetry(yourShouldResultRetry)))
xxxCli := xxxservice.NewClient(TargetServiceName, opts...)
Resp: the Thrift gen-code of Kitex provides interface{ GetResult() interface{} }, through which the real response can be obtained.
Error: the error returned by the peer is uniformly wrapped by Kitex as kerrors.ErrRemoteOrNetwork. For Thrift and KitexProtobuf, the examples below show how to get the error message returned by the peer. For gRPC, if the peer returns an error constructed by status.Error, the local side can use status.FromError(err) to get *status.Status. Note that the Status must be the one provided by Kitex, whose package path is github.com/cloudwego/kitex/pkg/remote/trans/nphttp2/status.
// retry with specified Resp for one method
respRetry := func(resp interface{}, ri rpcinfo.RPCInfo) bool {
    if ri.To().Method() == "mock" {
        // Notice: this is only a demo, test it with your own code; the thrift gen-code of Kitex provides GetResult() interface{}
        if respI, ok1 := resp.(interface{ GetResult() interface{} }); ok1 {
            if r, ok2 := respI.GetResult().(*xxx.YourResp); ok2 && r.Msg == retryMsg {
                return true
            }
        }
    }
    return false
}
// retry with specified Error for one method
errorRetry := func(err error, ri rpcinfo.RPCInfo) bool {
    if ri.To().Method() == "mock" {
        if te, ok := errors.Unwrap(err).(*remote.TransError); ok && te.TypeID() == -100 {
            return true
        }
    }
    return false
}
// Method 2: Configure through the WithSpecifiedResultRetry client option.
yourResultRetry := &retry.ShouldResultRetry{ErrorRetry: errorRetry, RespRetry: respRetry}
opts = append(opts, client.WithSpecifiedResultRetry(yourResultRetry))
In particular, for cases where a Thrift Exception is defined (you define an Exception for the IDL method), although the RPC call returns an error at the call layer, the framework internally treats it as a successful RPC request (because there is an actual return). To judge it, do so in RespRetry and pay attention to the following: if the retried request succeeds (GetSuccess() != nil), you need to reset the Exception to nil. The retry reuses the XXXResult, whose two fields correspond to the Resp and the Exception: the Exception was set by the first attempt and the second attempt sets the Resp on success, but the framework layer does not reset the Exception, so you must reset it yourself. Example:
respRetry := func(resp interface{}, ri rpcinfo.RPCInfo) bool {
    if ri.To().Method() == "testException" {
        teResult := resp.(*stability.TestExceptionResult)
        if teResult.GetSuccess() != nil {
            // the retried request succeeded: reset the Exception set by the first attempt
            teResult.SetStException(nil)
        } else if teResult.IsSetXXException() && teResult.XxException.Message == xxx {
            return true
        }
    }
    return false
}
Supported since v0.5.2 (github.com/cloudwego/kitex).
After failure retry is enabled, Kitex retries on timeout by default, and it also does so after specified Error/Resp retry is configured. However, some business requests are non-idempotent and should not be retried on timeout, so when specifying Error/Resp retry you can choose to disable timeout retry.
Why is the timeout error not simply handed over to the user-defined judgment? Most users need timeout retry, and only some non-idempotent requests must not be retried on timeout. Considering the majority of user scenarios, a configuration to disable it is provided instead.
// set NotRetryForTimeout as true
yourResultRetry := &retry.ShouldResultRetry{NotRetryForTimeout: true, ErrorRetry: errorRetry}
opts = append(opts, client.WithSpecifiedResultRetry(yourResultRetry))
The retry layer is implemented before middlewares, so every retry executes all middlewares.
Do pay attention to avoid concurrent read/write on the request/response in middlewares:
- Before calling `next`: if you're going to modify the request, introduce a lock (sync.Mutex, sync.Once, etc.) or make a copy to avoid concurrent read/write.
- Check the error returned by `next`:
  - If the error is `kerrors.ErrRPCFinish`, the result of this attempt has been discarded (another attempt is already decoding the response); the middleware should return directly.
  - If it is another error (such as a timeout or network exception), it does NOT mean the entire method call has failed: only this retry attempt failed, and another one may succeed. Handle it appropriately according to business needs.
The “Custom Fallback” feature can be used to handle responses (the fallback policy is executed after all middlewares).
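For instance, a retry-aware client middleware could look like the following sketch (retryAwareMW is an illustrative name; it assumes kerrors.ErrRPCFinish can be matched with errors.Is):

import (
    "context"
    "errors"

    "github.com/cloudwego/kitex/pkg/endpoint"
    "github.com/cloudwego/kitex/pkg/kerrors"
)

func retryAwareMW(next endpoint.Endpoint) endpoint.Endpoint {
    return func(ctx context.Context, req, resp interface{}) error {
        err := next(ctx, req, resp)
        if errors.Is(err, kerrors.ErrRPCFinish) {
            // this attempt's result was discarded (another attempt is decoding the response): return directly
            return err
        }
        // any other error only means this attempt failed; another attempt may
        // still succeed, so don't treat it as a failure of the whole method call
        return err
    }
}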
The recommended configuration is TP99; then about 1% of requests will trigger a Backup Request.
// If the request does not return within XXX ms, a backup request will be sent
bp := retry.NewBackupPolicy(xxx)
var opts []client.Option
opts = append(opts, client.WithBackupRequest(bp))
xxxCli := xxxservice.NewClient("psm", opts...)
bp := retry.NewBackupPolicy(xxx)
// Number of retries. The default value is 1, excluding the first request.
bp.WithMaxRetryTimes(xxx)
// Disable `Chain Stop`
bp.DisableChainRetryStop()
// Set errRate for retry circuit breaker
bp.WithRetryBreaker(errRate float64)
// Retry on the same node
bp.WithRetrySameNode()
Supported since v0.12.0 (github.com/cloudwego/kitex).
The retry layer is implemented before middlewares, so every retry executes all middlewares.
Do pay attention to avoid concurrent read/write on the request/response in middlewares:
- Before calling `next`: if you're going to modify the request, introduce a lock (sync.Mutex, sync.Once, etc.) or make a copy to avoid concurrent read/write.
- Check the error returned by `next`:
  - If the error is `kerrors.ErrRPCFinish`, the result of this attempt has been discarded (another attempt is already decoding the response); the middleware should return directly.
  - If it is another error (such as a timeout or network exception), it does NOT mean the entire method call has failed: only this retry attempt failed, and another one may succeed. Handle it appropriately according to business needs.
The “Custom Fallback” feature can be used to handle responses (the fallback policy is executed after all middlewares).
The recommended configuration for the retry delay is TP99; then about 1% of requests will trigger a backup request.
import "github.com/cloudwego/kitex/pkg/retry"
// If the request does not return within XXX ms, a backup request will be sent
mp := retry.NewMixedPolicy(xxx)
mp.WithMaxRetryTimes(3)
xxxCli := xxxservice.NewClient("psm", client.WithMixedRetry(mp))
mp := retry.NewMixedPolicy(xxx)
// Number of retries. The default value is 1, excluding the first request.
mp.WithMaxRetryTimes(xxx)
// The cumulative maximum time consumption, including the first failed request and the retry requests. If time consumption reaches the limit, the subsequent retries will be stopped
mp.WithMaxDurationMS(xxx)
// Disable `Chain Stop`
mp.DisableChainRetryStop()
// Enable DDL Stop
mp.WithDDLStop()
// Backoff policy. No backoff policy by default.
mp.WithFixedBackOff(fixMS int) // Fixed backoff
mp.WithRandomBackOff(minMS int, maxMS int) // Random backoff
// Set errRate for retry circuit breaker
mp.WithRetryBreaker(errRate float64)
ShouldResultRetry is the same as in failure retry. Below is a configuration example (ShouldResultRetry is defined in 3.1.1.3):
import "github.com/cloudwego/kitex/pkg/retry"
// Method 1: Configure through WithSpecifiedResultRetry option.
xxxCli := xxxservice.NewClient(TargetServiceName, client.WithMixedRetry(retry.NewMixedPolicy(100)), client.WithSpecifiedResultRetry(errRetry))
// Method 2: Configure through NewMixedPolicyWithResultRetry.
var opts []client.Option
opts = append(opts, client.WithMixedRetry(
retry.NewMixedPolicyWithResultRetry(100, yourShouldResultRetry)))
xxxCli := xxxservice.NewClient(TargetServiceName, opts...)
Supported since v0.4.0 (github.com/cloudwego/kitex).
Failure Retry, Backup Request, or Mixed Retry can be configured separately for different methods.
Note: if WithFailureRetry, WithBackupRequest, or WithMixedRetry is configured at the same time, methods not covered by WithRetryMethodPolicies will follow the WithFailureRetry, WithBackupRequest, or WithMixedRetry policy.
import "github.com/cloudwego/kitex/pkg/retry"
methodPolicies := client.WithRetryMethodPolicies(map[string]retry.Policy{
"method1": retry.BuildFailurePolicy(retry.NewFailurePolicy()),
"method2": retry.BuildFailurePolicy(retry.NewFailurePolicyWithResultRetry(yourResultRetry))})
// other methods do backup request except above methods
otherMethodPolicy := client.WithBackupRequest(retry.NewBackupPolicy(10))
var opts []client.Option
opts = append(opts, methodPolicies, otherMethodPolicy)
xxxCli := xxxservice.NewClient(targetService, opts...)
import (
"github.com/cloudwego/kitex/pkg/retry"
)
// demo1: call with failure retry policy, default retry error is Timeout
resp, err := cli.Mock(ctx, req, callopt.WithRetryPolicy(retry.BuildFailurePolicy(retry.NewFailurePolicy())))
// demo2: call with customized failure retry policy
resp, err := cli.Mock(ctx, req, callopt.WithRetryPolicy(retry.BuildFailurePolicy(retry.NewFailurePolicyWithResultRetry(retry.AllErrorRetry()))))
// demo3: call with backup request policy
bp := retry.NewBackupPolicy(10)
bp.WithMaxRetryTimes(1)
resp, err := cli.Mock(ctx, req, callopt.WithRetryPolicy(retry.BuildBackupRequest(bp)))
List of supported config sources: https://github.com/cloudwego/kitex/issues/973
If you want to combine remote configuration to dynamically enable retry or adjust the policy at runtime, it can take effect through the NotifyPolicyChange method of retryContainer. Users can integrate their own configuration center based on Kitex's remote configuration interface. Note: if retry has already been enabled through code configuration, dynamic configuration cannot take effect.
// Available since v0.7.2 [Recommended]
retryC := retry.NewRetryContainerWithPercentageLimit()
// Available for Kitex < 0.7.2
retryC := retry.NewRetryContainer()
// demo
// 1. define your change func
// 2. exec yourChangeFunc in your config module
yourChangeFunc := func(key string, oldData, newData interface{}) {
    newConf := newData.(*retry.Policy)
    method := parseMethod(key) // parseMethod is your own helper that extracts the method name from the config key
    retryC.NotifyPolicyChange(method, *newConf)
}
// configure retryContainer
cli, err := xxxservice.NewClient(targetService, client.WithRetryContainer(retryC))
When the circuit breaker of the service is enabled, its statistics can be reused to reduce additional CPU consumption.
Note:
The circuit breaker threshold for retry must be lower than the circuit breaker threshold for the service.
When reusing a circuit breaker, you can strictly restrict the percentage of retry requests.
// 1. Initialize the built-in cbsuite of Kitex.
cbs := circuitbreak.NewCBSuite(circuitbreak.RPCInfo2Key)
// 2. Initialize retryContainer, passing in ServiceControl and ServicePanel.
retryC := retry.NewRetryContainerWithCB(cbs.ServiceControl(), cbs.ServicePanel())
var opts []client.Option
// 3. Configure retryContainer.
opts = append(opts, client.WithRetryContainer(retryC))
// 4. Configure the service circuit breaker.
opts = append(opts, client.WithMiddleware(cbs.ServiceCBMW()))
// 5. Initialize the client, passing in the configuration options.
cli, err := xxxservice.NewClient(targetService, opts...)
When a request is retried, Kitex records the number of retries and the cost of the previous requests in the rpcinfo of the first request; they can be reported or output through the retry tag in client-side metrics or logs. Note that they are not available in custom client middleware, because this information exists only in the rpcinfo of the first request.
How to get them:
var retryCount string
var lastCosts string

toInfo := rpcinfo.GetRPCInfo(ctx).To()
if retryTag, ok := toInfo.Tag(rpcinfo.RetryTag); ok {
    retryCount = retryTag
    if lastCostTag, ok := toInfo.Tag(rpcinfo.RetryLastCostTag); ok {
        lastCosts = lastCostTag
    }
}
If TTHeader is used as the transport protocol, the downstream handler can determine whether the current request is a retry request and decide whether to continue processing. Note that this judgment should be placed before the client call is made, not in custom client-side middleware (custom server-side middleware is fine):
retryReqCount, exist := metainfo.GetPersistentValue(ctx, retry.TransitKey)
For example, retryReqCount = 2 means this is the second retry request (excluding the first request); a business degradation strategy can then be adopted to return partial or mock data (non-retry requests do not carry this information).
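For example, a downstream handler could degrade retry requests like the following sketch (the service, request/response types, and helpers are illustrative):

import (
    "context"

    "github.com/bytedance/gopkg/cloud/metainfo"
    "github.com/cloudwego/kitex/pkg/retry"
)

func (s *MockServiceImpl) Mock(ctx context.Context, req *mock.Request) (*mock.Response, error) {
    if _, exist := metainfo.GetPersistentValue(ctx, retry.TransitKey); exist {
        // this is a retry request sent by the upstream: return partial or mock
        // data instead of doing the full processing
        return buildMockResponse(req), nil // buildMockResponse is an illustrative helper
    }
    return s.process(ctx, req) // normal processing, illustrative helper
}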
Chain Stop means chained retry requests are not retried. For example, with A -> B -> C, if A sends a retry request to B and B -> C times out (or B has Backup configured), B will not send a retry request to C. If users recognize retry requests themselves, they can directly decide whether to continue the request to C. In short, Chain Stop avoids the retry amplification caused by B sending retry requests to C, and users can additionally avoid B's requests to C entirely.
In client middleware, the retry information obtained through metainfo.GetPersistentValue(ctx, retry.TransitKey) cannot distinguish an upstream retry request from a local retry request. If you want to determine whether it is a local retry request in custom client middleware, judge as follows:
localRetryReqCount, exist := rpcinfo.GetRPCInfo(ctx).To().Tag(rpcinfo.RetryTag)
Unrestricted retries can cause request avalanches, so circuit breaking must be enabled for retries. When the proportion of retry requests reaches the threshold, the circuit breaker is triggered. The default threshold is 10%; the statistic is the retry rate (retry requests / total requests) of a single method on a single instance within 10 seconds.
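If the default 10% threshold does not fit your traffic, it can be adjusted within (0-30%] via WithRetryBreaker; a short sketch (the 20% value is illustrative):

// import "github.com/cloudwego/kitex/pkg/retry"
fp := retry.NewFailurePolicy()
fp.WithRetryBreaker(0.2) // trigger the retry circuit breaker once retry requests exceed 20% of total requests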
Why is retry circuit-broken even though the observed retry rate is below 10%? The circuit breaker statistics are based on the current client's requests to a specific downstream method, not on the granularity of the whole service and all methods, so you need to check at the same granularity.
The circuit breaker looks at the failure rate of a single method on a single client instance within a period of time; combine it with monitoring to determine whether the circuit breaking meets expectations.
In addition, in early versions, triggering circuit breaking also required the per-instance QPS of the corresponding method to exceed 20; when QPS was low, retry circuit breaking never took effect, so the request-volume check was removed and only the error rate was considered. However, since the statistical window of the circuit breaker error rate is 10s, when QPS is very low or the first request within a 10s window times out, the failure rate in the window becomes high and circuit breaking is triggered. To avoid this problem, the strategy was adjusted in v0.0.8 so that circuit breaking is triggered only when the request volume within 10s exceeds 10. Versions above 0.0.8 are therefore recommended.
The failure rate here is that of a single method on a single client instance within the 10s window.