K8s Container Service Startup Warm-Up: Optimization Approach and Summary
Project in Practice: Service Startup Warm-Up
1. Background
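authority-api was released to production at 16:40 on February 9, 2023. During the release, business teams reported timeouts when calling the auth interface. The figure below shows the 499 (timed-out) requests for the authority project at that time.
Investigation showed that all of the timed-out requests had been served by newly started Pods. After a short while the service stabilized and interface response times returned to normal.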
authority-api-v1-7bdbf7d9c-j6hrp
authority-api-v1-7bdbf7d9c-d97tk
Taking authority-api-v1-7bdbf7d9c-d97tk as an example:
[082d24b4e4644e8d8706e2ba5b2e4d9b] 2023-02-09 16:46:18 - [INFO] [SlowLogAspect:67 logController] 请求开始 controller HealthController.healthCheck []
[082d24b4e4644e8d8706e2ba5b2e4d9b] 2023-02-09 16:46:18 - [INFO] [SlowLogAspect:73 logController] 请求结束,controller response {"errMsg":"ok","errorMsg":"ok","status":0,"ts":1675932378516,"version":0}, elapse[16ms]
The health check passed at 2023-02-09 16:46:18.
nginx log:
work@authority-api-v1-7bdbf7d9c-d97tk:~$ less /mnt/logs/nginx/access-2023-02-09.log
remote_addr=[172.17.3.120] http_x_forward=[172.20.240.117, 172.20.240.117, 10.170.20.168,10.178.25.154] time=[2023-02-09T16:46:19+08:00] request=[POST /users/search?with=roles.permissions&_octo=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiI5OGE2YmI3MmJlYmM0YjY0YjIyNWViYjJlYzI0YjAxOSIsImlhdCI6MTY3NTkzMTE5NywiZXhwIjoxNjc1OTM0Nzk3fQ.ZoBDoNVPVZYmvhZWEnc551LP4RCHFQlgCtMpeUZlBtE HTTP/1.1] status=[200] byte=[179] elapsed=[0.103] refer=[-] body=[search=] ua=[Java/1.8.0_74] cookie=[-] gzip=[-] log_id=[authority-api-v1-7bdbf7d9c-d97tk71675932379.6971457172.20.240.117, 172.20.240.117, 10.170.20.168,10.178.25.15447] msec=[1675932379.697] http_host=[authority-api.production.svc.renrenaiche.cn] http_accept=[application/json|-|-] upstream_response_time=[0.102] sent_http_set_cookie=[-] session_id=[-] rrc_tg=[-] x-request-id=[f4f9de521b231417c8480aa32ee905b3]
remote_addr=[172.17.10.26] http_x_forward=[-] time=[2023-02-09T16:46:22+08:00] request=[POST /api/v1/user/list HTTP/1.1] status=[200] byte=[437] elapsed=[3.517] refer=[-] body=[[59089]] ua=[okhttp/3.8.1] cookie=[-] gzip=[-] log_id=[authority-api-v1-7bdbf7d9c-d97tk71675932382.935650-35] msec=[1675932382.935] http_host=[authority-api.production.svc.renrenaiche.cn] http_accept=[*/*|gzip|-] upstream_response_time=[3.518] sent_http_set_cookie=[-] session_id=[-] rrc_tg=[-] x-request-id=[71804753-1dd8-9336-a4b0-f558e2661c79]
remote_addr=[172.17.3.120] http_x_forward=[172.20.240.117, 172.20.240.117, 10.170.20.172,10.178.25.136] time=[2023-02-09T16:46:22+08:00] request=[POST /users/search?with=roles.permissions&_octo=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiI5OGE2YmI3MmJlYmM0YjY0YjIyNWViYjJlYzI0YjAxOSIsImlhdCI6MTY3NTkzMTE5NywiZXhwIjoxNjc1OTM0Nzk3fQ.ZoBDoNVPVZYmvhZWEnc551LP4RCHFQlgCtMpeUZlBtE HTTP/1.1] status=[200] byte=[179] elapsed=[0.015] refer=[-] body=[search=] ua=[Java/1.8.0_74] cookie=[-] gzip=[-] log_id=[authority-api-v1-7bdbf7d9c-d97tk71675932382.9831457172.20.240.117, 172.20.240.117, 10.170.20.172,10.178.25.136302] msec=[1675932382.983] http_host=[authority-api.production.svc.renrenaiche.cn] http_accept=[application/json|-|-] upstream_response_time=[0.016] sent_http_set_cookie=[-] session_id=[-] rrc_tg=[-] x-request-id=[6128250dc00d2af915fec4682005a4ca]
remote_addr=[172.17.7.254] http_x_forward=[10.178.33.82, 10.178.25.161,10.178.25.154] time=[2023-02-09T16:46:23+08:00] request=[POST /users/search?search=id%3A57801&filter=id%3Bname%3Busername%3Buser_id%3Bmobile HTTP/1.1] status=[499] byte=[0] elapsed=[5.033] refer=[-] body=[-] ua=[okhttp/3.3.1] cookie=[-] gzip=[-] log_id=[authority-api-v1-7bdbf7d9c-d97tk71675932383.940178910.178.33.82, 10.178.25.161,10.178.25.15411] msec=[1675932383.940] http_host=[authority-api.production.svc.renrenaiche.cn] http_accept=[*/*|gzip|-] upstream_response_time=[5.034] sent_http_set_cookie=[-] session_id=[-] rrc_tg=[-] x-request-id=[17859928592f6c1374e2cfdfd081ffb4]
remote_addr=[172.17.7.254] http_x_forward=[10.178.42.106, 10.178.25.136,10.178.25.154] time=[2023-02-09T16:46:25+08:00] request=[POST /users/search?with=roles%3Bhr&trashed=true&search=id%3A67730%3Bid%3A50654%3Bid%3A74590 HTTP/1.1] status=[499] byte=[0] elapsed=[4.999] refer=[-] body=[-] ua=[okhttp/3.3.1] cookie=[-] gzip=[-] log_id=[authority-api-v1-7bdbf7d9c-d97tk71675932385.076193210.178.42.106, 10.178.25.136,10.178.25.154100] msec=[1675932385.076] http_host=[authority-api.production.svc.renrenaiche.cn] http_accept=[*/*|gzip|-] upstream_response_time=[4.998] sent_http_set_cookie=[-] session_id=[-] rrc_tg=[-] x-request-id=[fc9b822972501e8eb0832b262002f52e]
remote_addr=[172.17.3.120] http_x_forward=[10.178.18.204, 10.178.25.154,10.178.25.136] time=[2023-02-09T16:46:25+08:00] request=[POST /api/v1/partner/area/subordinate/batch HTTP/1.1] status=[200] byte=[361] elapsed=[3.204] refer=[-] body=[[56900]] ua=[Java/1.8.0_222] cookie=[-] gzip=[-] log_id=[authority-api-v1-7bdbf7d9c-d97tk71675932385.586181110.178.18.204, 10.178.25.154,10.178.25.136274] msec=[1675932385.586] http_host=[authority-api.production.svc.renrenaiche.cn] http_accept=[*/*|-|-] upstream_response_time=[3.204] sent_http_set_cookie=[-] session_id=[-] rrc_tg=[-] x-request-id=[2f1e322ec3abd614c6c9467a84b6b0d9]
remote_addr=[172.17.3.120] http_x_forward=[10.178.18.204, 10.178.25.154,10.178.25.136] time=[2023-02-09T16:46:25+08:00] request=[POST /api/v1/partner/area/subordinate/batch HTTP/1.1] status=[200] byte=[355] elapsed=[2.851] refer=[-] body=[[77462]] ua=[Java/1.8.0_222] cookie=[-] gzip=[-] log_id=[authority-api-v1-7bdbf7d9c-d97tk71675932385.683181110.178.18.204, 10.178.25.154,10.178.25.136294] msec=[1675932385.683] http_host=[authority-api.production.svc.renrenaiche.cn] http_accept=[*/*|-|-] upstream_response_time=[2.851] sent_http_set_cookie=[-] session_id=[-] rrc_tg=[-] x-request-id=[804f6723d61767b573981699a20ad5ef]
remote_addr=[172.17.3.120] http_x_forward=[10.178.18.109, 10.178.25.154,10.178.25.161] time=[2023-02-09T16:46:25+08:00] request=[GET /api/v1/partner/area/subordinate?user_id=81304 HTTP/1.1] status=[200] byte=[341] elapsed=[4.555] refer=[-] body=[-] ua=[Java/1.8.0_222] cookie=[-] gzip=[-] log_id=[authority-api-v1-7bdbf7d9c-d97tk71675932385.685140610.178.18.109, 10.178.25.154,10.178.25.161182] msec=[1675932385.685] http_host=[authority-api.production.svc.renrenaiche.cn] http_accept=[*/*|-|-] upstream_response_time=[4.555] sent_http_set_cookie=[-] session_id=[-] rrc_tg=[-] x-request-id=[72c90611a47c393063a7fbff76a0d371]
remote_addr=[172.17.7.254] http_x_forward=[10.178.33.82, 10.178.25.136,10.178.25.161] time=[2023-02-09T16:46:25+08:00] request=[POST /users/search?search=id%3A42605&filter=id%3Bname%3Busername%3Buser_id%3Bmobile HTTP/1.1] status=[200] byte=[308] elapsed=[4.586] refer=[-] body=[-] ua=[okhttp/3.3.1] cookie=[-] gzip=[-] log_id=[authority-api-v1-7bdbf7d9c-d97tk71675932385.780178910.178.33.82, 10.178.25.136,10.178.25.161186] msec=[1675932385.780] http_host=[authority-api.production.svc.renrenaiche.cn] http_accept=[*/*|gzip|-] upstream_response_time=[4.586] sent_http_set_cookie=[-] session_id=[-] rrc_tg=[-] x-request-id=[549791788fc3279b1e05f22bdff1fc11]
remote_addr=[172.17.7.254] http_x_forward=[10.178.33.82, 10.178.25.154,10.178.25.136] time=[2023-02-09T16:46:26+08:00] request=[POST /users/search?sear:
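The nginx log shows that interface responses were slow and some requests timed out.
Most of the service's interface data is held in cache, so responses should not be this slow.
The preliminary diagnosis is therefore that authority-api has a high QPS: as soon as the service starts, a large volume of requests hits it, and because the service lacks the necessary warm-up, response times are long. The high QPS also forces many database connections to be created within a short time, which makes the slowdown worse.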
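2. Load testing
2.1 Test environment setup
Start Pods of the new version in the production environment.
Run the load test at 50 concurrent requests per second.
2.2 Load test of the users/search interface
Initial version:
After the load test finished, the same thread group was run again at the same concurrency:
2.3 Multi-threaded interface warm-up
With the necessary warm-up added, the load test above was repeated:
Requests sent again after the first load test:
2.4 Single-call interface warm-up on the main thread
Requests sent again:
2.5 Multi-threaded interface warm-up + connection pool parameter tuning
Requests sent again: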
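3. Load test conclusions
According to the load test reports, multi-threaded warm-up gives better results than a single warm-up call on the main thread. Warm-up is therefore done with multiple threads, and the readiness probe is configured to call the warm-up interface.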
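The multi-threaded warm-up described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual implementation: the class name `WarmUpRunner`, its methods, and the thread and call counts are all hypothetical. The idea is that the readiness endpoint reports success only once `isReady()` returns true, so the Pod receives no traffic until the warm-up calls have finished.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: runs a warm-up task on several threads, then marks the service ready. */
public class WarmUpRunner {

    // Assumption: the readiness endpoint would return success only when this flag is true.
    private final AtomicBoolean ready = new AtomicBoolean(false);

    /**
     * Runs the task callsPerThread times on each of the given number of threads.
     * Returns the number of calls that completed without throwing.
     */
    public int warmUp(Runnable task, int threads, int callsPerThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch done = new CountDownLatch(threads);
        AtomicInteger completed = new AtomicInteger();

        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                try {
                    for (int i = 0; i < callsPerThread; i++) {
                        try {
                            task.run();
                            completed.incrementAndGet();
                        } catch (RuntimeException e) {
                            // Warm-up is best effort: one failed call must not abort the rest.
                        }
                    }
                } finally {
                    done.countDown();
                }
            });
        }

        try {
            done.await(); // Block until every warm-up thread has finished.
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
            return completed.get(); // Interrupted: do not mark the service ready.
        }

        pool.shutdown();
        ready.set(true);
        return completed.get();
    }

    public boolean isReady() {
        return ready.get();
    }
}
```

In a real service the `task` would call the hot interfaces (for example users/search) so that thread pools, caches, and database connections are exercised before traffic arrives.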
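The effect of adding service startup warm-up is as follows:
Metric | Abnormal startup | With service warm-up | Normal operation |
---|---|---|---|
Average response time of the first 200 requests | 6.65s | 0.322s | 0.003s |
Average response time of the first 500 requests | 4.74s | 0.143s | 0.003s |
499 requests | 5 | 0 | 0 |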
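4. Optimization points
4.1 Tomcat optimization
The company currently uses Spring Boot 1.4.1.
The Tomcat configuration can be inspected in org.springframework.boot.autoconfigure.web.ServerProperties.
Debugging with a breakpoint shows that the initial values are all 0.
In newer Tomcat versions:
min-spare-threads: the minimum number of spare threads, i.e. the number of threads Tomcat initializes at startup. Default: 10.
max-threads: the maximum number of threads Tomcat can create. Each thread handles one request; once the number of concurrent requests exceeds this value, further client requests have to queue until a thread is released. (The referenced article suggests setting this to between 200 and 250 times the number of server CPU cores.) Default: 200.
Reference: https://www.cnblogs.com/lys_013/p/13185940.html?ivk_sa=1024320u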
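To illustrate why the number of threads created at startup matters, the sketch below uses the JDK's own `java.util.concurrent.ThreadPoolExecutor` as an analogy. It is not Tomcat's implementation and does not model Tomcat's request queueing; the class name `PrestartDemo` and the parameter names are hypothetical, chosen only to mirror min-spare-threads and max-threads. It shows that a pool which has not prestarted its core threads begins with zero threads and must create them while serving the first requests.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical analogy: compares a cold thread pool with one whose core threads are prestarted. */
public class PrestartDemo {

    /**
     * Builds a JDK thread pool whose core size plays the role of min-spare-threads
     * and whose maximum size plays the role of max-threads.
     */
    public static ThreadPoolExecutor newPool(int minSpareThreads, int maxThreads, boolean prestart) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                minSpareThreads, maxThreads,
                60, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>());
        if (prestart) {
            // Create all core threads now instead of lazily on the first submitted tasks.
            pool.prestartAllCoreThreads();
        }
        return pool;
    }
}
```

Calling `newPool(10, 200, false).getPoolSize()` returns 0, while `newPool(10, 200, true).getPoolSize()` returns 10: the prestarted pool has already paid the thread-creation cost before any request arrives.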
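4.2 DispatcherServlet is lazily loaded
After a Spring Boot application starts, the first interface call is relatively slow. The reason is that DispatcherServlet has not been warmed up: it is not initialized when Spring Boot starts, and initialization happens only when the first request arrives.
https://blog.csdn.net/qq_39595769/article/details/120887883
DispatcherServlet is already invoked by the readiness probe, so it is not the root cause of the timeouts after startup. Optimizing it might speed up startup somewhat, but the expected gain is limited.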
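4.3 Database connections are lazily created
The only option is to run a database query ourselves after the project starts, as a warm-up.
https://blog.csdn.net/yb2020/article/details/128099065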
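A database warm-up of this kind could look roughly like the sketch below. It is a minimal illustration under assumptions: the class name `ConnectionPoolWarmer` is hypothetical, and it uses only the standard `javax.sql.DataSource` and `java.sql.Connection` interfaces rather than any specific pool library. It borrows several connections at once, because borrowing and returning them one at a time would let the pool reuse a single connection, and it checks each one with `Connection.isValid`, which asks the driver to verify the connection.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import javax.sql.DataSource;

/** Hypothetical helper: forces a connection pool to open several connections at startup. */
public class ConnectionPoolWarmer {

    /**
     * Borrows the requested number of connections at the same time, validates each one,
     * then returns them all to the pool. Returns how many connections were valid.
     */
    public static int warmUp(DataSource dataSource, int connections) {
        List<Connection> held = new ArrayList<>();
        int valid = 0;
        try {
            for (int i = 0; i < connections; i++) {
                Connection connection = dataSource.getConnection();
                held.add(connection);
                // Timeout is in seconds; asks the driver to verify the connection.
                if (connection.isValid(2)) {
                    valid++;
                }
            }
        } catch (SQLException e) {
            // Warm-up is best effort: stop early but still release what was borrowed.
        } finally {
            for (Connection connection : held) {
                try {
                    connection.close(); // Returns the connection to the pool.
                } catch (SQLException ignored) {
                    // Nothing useful to do if releasing a warm-up connection fails.
                }
            }
        }
        return valid;
    }
}
```

The connection count should stay at or below the pool's maximum size, otherwise `getConnection()` may block waiting for a free connection.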
5. Optimizations specific to auth-api
The city cache should be loaded after the project has started, to avoid concurrency problems.