Project architecture:
Some of the components are listed below:
Spring Cloud Alibaba (Nacos + Gateway + OpenFeign) + Spring Boot 2.x + Redis
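Background:
As the user base grew recently, the user service occasionally hit Redis connection timeouts during peak hours.
For example, reading an SMS verification code from Redis, storing the token in Redis after a successful login, and any other code path that uses Redis could throw RedisConnectionFailureException.
Exception log: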
2021-03-02 17:24:42.595 ERROR [d03f845825644cee8753539f24d840ad] [http-nio-7122-exec-32] c.l.c.b.e.GlobalExceptionHandler -java.net.SocketTimeoutException: Read timed out; nested exception is redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketTimeoutException: Read timed out
org.springframework.data.redis.RedisConnectionFailureException: java.net.SocketTimeoutException: Read timed out; nested exception is redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketTimeoutException: Read timed out
    at org.springframework.data.redis.connection.jedis.JedisExceptionConverter.convert(JedisExceptionConverter.java:65)
    at org.springframework.data.redis.connection.jedis.JedisExceptionConverter.convert(JedisExceptionConverter.java:42)
    at org.springframework.data.redis.PassThroughExceptionTranslationStrategy.translate(PassThroughExceptionTranslationStrategy.java:44)
    at org.springframework.data.redis.FallbackExceptionTranslationStrategy.translate(FallbackExceptionTranslationStrategy.java:42)
    at org.springframework.data.redis.connection.jedis.JedisConnection.convertJedisAccessException(JedisConnection.java:135)
    at org.springframework.data.redis.connection.jedis.JedisStringCommands.convertJedisAccessException(JedisStringCommands.java:751)
    at org.springframework.data.redis.connection.jedis.JedisStringCommands.get(JedisStringCommands.java:67)
    at org.springframework.data.redis.connection.DefaultedRedisConnection.get(DefaultedRedisConnection.java:260)
    at org.springframework.data.redis.connection.DefaultStringRedisConnection.get(DefaultStringRedisConnection.java:398)
    at org.springframework.data.redis.core.DefaultValueOperations$1.inRedis(DefaultValueOperations.java:57)
    at org.springframework.data.redis.core.AbstractOperations$ValueDeserializingRedisCallback.doInRedis(AbstractOperations.java:60)
    at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:228)
    at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:188)
    at org.springframework.data.redis.core.AbstractOperations.execute(AbstractOperations.java:96)
    at org.springframework.data.redis.core.DefaultValueOperations.get(DefaultValueOperations.java:53)
    at com.xxxx.xxx.xxx.utils.RedisUtil.get(RedisUtil.java:242)
Redis-related Maven dependencies:
<!-- redis -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-redis</artifactId>
    <exclusions>
        <exclusion>
            <groupId>io.lettuce</groupId>
            <artifactId>lettuce-core</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>redis.clients</groupId>
    <artifactId>jedis</artifactId>
</dependency>
Redis configuration (single node, no distributed deployment):
spring:
  redis:
    pool:
      maxActive: 300
      maxIdle: 100
      maxWait: 1000
    host: xxxxxxxxx
    port: 6379
    password:
    timeout: 2000
    database: 5
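Investigation:
The possible causes considered were:
Cause 1. The code may contain queries like keys *. Because Redis executes commands on a single thread, such a command over a large data set runs for a long time and makes other client requests time out; keys *-style queries are very expensive for Redis.
Cause 2. The timeout in the Redis configuration is too short: the previous request has not finished, the next one cannot be executed, and the request eventually times out and fails.
Cause 3. The Redis connection pool is configured with too few connections. Prometheus monitoring showed that the user service peaked at 180 requests, so the question was whether a pool that is too small left requests unable to obtain a Redis connection, causing them to fail.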
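Cause 3 can be illustrated with a small model. The sketch below is not Jedis or commons-pool2 code: BoundedPool is a hypothetical stand-in built on a Semaphore, and the caller counts, wait times and hold times are arbitrary illustration values. It only shows that when more callers than maxTotal hold connections at once, the remaining callers fail to borrow within the wait limit.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolExhaustionSketch {

    /** Hypothetical bounded pool; a Semaphore stands in for the pooled connections. */
    static final class BoundedPool {
        private final Semaphore permits;
        private final long maxWaitMillis;

        BoundedPool(int maxTotal, long maxWaitMillis) {
            this.permits = new Semaphore(maxTotal);
            this.maxWaitMillis = maxWaitMillis;
        }

        /** Returns true if a connection became available within maxWaitMillis. */
        boolean borrow() throws InterruptedException {
            return permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        }

        void giveBack() {
            permits.release();
        }
    }

    /** Starts `callers` concurrent requests that each hold a connection for holdMillis; returns the failed borrows. */
    static int countFailedBorrows(int maxTotal, int callers, long maxWaitMillis, long holdMillis) {
        BoundedPool pool = new BoundedPool(maxTotal, maxWaitMillis);
        AtomicInteger failures = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(callers);
        ExecutorService executor = Executors.newFixedThreadPool(callers);
        for (int i = 0; i < callers; i++) {
            executor.submit(() -> {
                try {
                    start.await(); // release all callers at the same moment
                    if (pool.borrow()) {
                        try {
                            Thread.sleep(holdMillis); // simulate the Redis round trip
                        } finally {
                            pool.giveBack();
                        }
                    } else {
                        failures.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            });
        }
        start.countDown();
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
        return failures.get();
    }

    public static void main(String[] args) {
        System.out.println("maxTotal=8:   failed borrows = " + countFailedBorrows(8, 40, 50, 300));
        System.out.println("maxTotal=300: failed borrows = " + countFailedBorrows(300, 40, 50, 300));
    }
}
```

With a pool larger than the number of concurrent callers no borrow fails, which is why the effective pool size, and not only the configured one, is worth checking.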
For cause 1:
I searched the project code and found no keys *-style queries, so this possibility was ruled out.
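For context on why a keys *-style command was the first suspect, here is a minimal model of single-threaded command execution. It does not talk to Redis: a single-thread executor plays the role of the server, and the method name and timings are hypothetical illustration values. A fast command queued behind a slow one cannot answer before the client-side timeout expires.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SlowCommandSketch {

    /**
     * Returns true if a fast command, queued behind a command that takes slowMillis,
     * does not answer within clientTimeoutMillis.
     */
    static boolean fastCommandTimesOut(long slowMillis, long clientTimeoutMillis) {
        // One worker thread models a server that executes commands one at a time.
        ExecutorService server = Executors.newSingleThreadExecutor();
        try {
            server.submit(() -> sleepQuietly(slowMillis));      // stands in for a long-running command
            Future<String> fast = server.submit(() -> "value"); // stands in for a simple GET
            try {
                fast.get(clientTimeoutMillis, TimeUnit.MILLISECONDS);
                return false;
            } catch (TimeoutException e) {
                return true; // a real client would surface this as a read timeout
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        } finally {
            server.shutdownNow();
        }
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("behind a 400 ms command, 50 ms timeout: timed out = " + fastCommandTimesOut(400, 50));
        System.out.println("no slow command, 1000 ms timeout:       timed out = " + fastCommandTimesOut(0, 1000));
    }
}
```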
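For cause 2:
While RedisConnectionFailureException was occurring, I confirmed that the peak number of Redis connections on the server was 15 and that the timeout in the configuration file is 2000 ms. Since the check for cause 1 had established that there are no very slow queries, this possibility was ruled out as well.
With causes 1 and 2 ruled out, cause 3 remained: a problem with the number of connections.
The configuration showed a maximum of 300 connections, well above the peak of 180, so the configured values looked fine.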
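So I tested this configuration in the development environment. The project uses the Jedis connection pool rather than Lettuce (note: with Spring Boot 2.x, spring-boot-starter-data-redis uses Lettuce by default; to use Jedis, exclude the default Lettuce dependency and add Jedis, as in the Maven dependencies above).
Tracing further into the source code, the class that holds the connection-count settings is: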
package org.apache.commons.pool2.impl;

public class GenericObjectPoolConfig<T> extends BaseObjectPoolConfig<T> {
    public static final int DEFAULT_MAX_TOTAL = 8;
    public static final int DEFAULT_MAX_IDLE = 8;
    public static final int DEFAULT_MIN_IDLE = 0;
    private int maxTotal = 8;
    private int maxIdle = 8;
    private int minIdle = 0;
    ...
}
This configuration class is loaded at startup, when the connection pool is initialized:
package org.springframework.data.redis.connection.jedis;

import java.time.Duration;
import java.util.Optional;

import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.SSLParameters;
import javax.net.ssl.SSLSocketFactory;

import org.apache.commons.pool2.impl.GenericObjectPoolConfig;
import org.springframework.lang.Nullable;

/**
 * Default implementation of {@literal JedisClientConfiguration}.
 *
 * @author Mark Paluch
 * @author Christoph Strobl
 * @since 2.0
 */
class DefaultJedisClientConfiguration implements JedisClientConfiguration {

    private final boolean useSsl;
    private final Optional<SSLSocketFactory> sslSocketFactory;
    private final Optional<SSLParameters> sslParameters;
    private final Optional<HostnameVerifier> hostnameVerifier;
    private final boolean usePooling;
    private final Optional<GenericObjectPoolConfig> poolConfig;
    private final Optional<String> clientName;
    private final Duration readTimeout;
    private final Duration connectTimeout;

    DefaultJedisClientConfiguration(boolean useSsl, @Nullable SSLSocketFactory sslSocketFactory,
            @Nullable SSLParameters sslParameters, @Nullable HostnameVerifier hostnameVerifier,
            boolean usePooling, @Nullable GenericObjectPoolConfig poolConfig,
            @Nullable String clientName, Duration readTimeout, Duration connectTimeout) {
        this.useSsl = useSsl;
        this.sslSocketFactory = Optional.ofNullable(sslSocketFactory);
        this.sslParameters = Optional.ofNullable(sslParameters);
        this.hostnameVerifier = Optional.ofNullable(hostnameVerifier);
        this.usePooling = usePooling;
        this.poolConfig = Optional.ofNullable(poolConfig);
        this.clientName = Optional.ofNullable(clientName);
        this.readTimeout = readTimeout;
        this.connectTimeout = connectTimeout;
    }
    // ... (rest of the class not shown)
}
Debugging showed that, after loading, the pool was still using the default connection counts:
public static final int DEFAULT_MAX_TOTAL = 8;
public static final int DEFAULT_MAX_IDLE = 8;
public static final int DEFAULT_MIN_IDLE = 0;
private int maxTotal = 8;
private int maxIdle = 8;
private int minIdle = 0;
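This was likely the problem: the maximum connection count set in the configuration file had not taken effect. It turned out that this block of the configuration is no longer recognized: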
redis:
  pool:
    maxActive: 300
    maxIdle: 100
    maxWait: 1000
It needs to be changed to:
redis:
  jedis:
    pool:
      maxActive: 300
      maxIdle: 100
      max-wait: 1000ms
After the change and a restart, the settings took effect and the pool values matched the configuration.
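The effect of the misplaced key can be sketched without Spring. The code below is a simplified model, not Spring Boot's actual binder: canonical and resolveMaxTotal are hypothetical helpers, and the key normalisation only approximates relaxed binding. It assumes the pool size is read from the spring.redis.jedis.pool prefix and that a key under any other prefix is silently ignored, so the default of 8 applies.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class PoolPropertyLookupSketch {

    // Same value as DEFAULT_MAX_TOTAL in GenericObjectPoolConfig.
    static final int DEFAULT_MAX_TOTAL = 8;

    /** Rough stand-in for relaxed binding: maxActive, max-active and max_active compare equal. */
    static String canonical(String key) {
        return key.toLowerCase(Locale.ROOT).replace("-", "").replace("_", "");
    }

    /** Reads the pool size from the jedis-specific prefix only; keys elsewhere are ignored. */
    static int resolveMaxTotal(Map<String, String> properties) {
        String wanted = canonical("spring.redis.jedis.pool.max-active");
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            if (canonical(entry.getKey()).equals(wanted)) {
                return Integer.parseInt(entry.getValue().trim());
            }
        }
        return DEFAULT_MAX_TOTAL; // nothing matched, so the default stays in place
    }

    public static void main(String[] args) {
        Map<String, String> oldStyle = new HashMap<>();
        oldStyle.put("spring.redis.pool.maxActive", "300");

        Map<String, String> newStyle = new HashMap<>();
        newStyle.put("spring.redis.jedis.pool.maxActive", "300");

        System.out.println("old key -> maxTotal = " + resolveMaxTotal(oldStyle));
        System.out.println("new key -> maxTotal = " + resolveMaxTotal(newStyle));
    }
}
```

In a running application, the same check can be made the way the post did it: set a breakpoint where the pool configuration is built and compare maxTotal with the configured value.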