
易陆发现论坛


Troubleshooting: OSDs marked down while the service status is normal (HEALTH_WARN 2 osds down; Reduced data availability: 29 pgs inactive)

Posted on 2021-10-27 10:26:02

ceph health detail
HEALTH_WARN 2 osds down; Reduced data availability: 29 pgs inactive
OSD_DOWN 2 osds down
    osd.6 (root=hdd,host=hdd-ceph1) is down
    osd.11 (root=hdd,host=hdd-ceph1) is down
PG_AVAILABILITY Reduced data availability: 29 pgs inactive
    pg 11.1 is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.5 is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.22 is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.31 is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.3b is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.5d is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.5f is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.63 is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.67 is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.79 is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.83 is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.90 is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.91 is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.93 is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.a1 is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.a4 is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.aa is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.b3 is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.b6 is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.b8 is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.ca is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.cf is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.da is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.e6 is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.e8 is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.ec is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.ef is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.fa is stuck inactive for 81874.695636, current state unknown, last acting []
    pg 11.fb is stuck inactive for 81874.695636, current state unknown, last acting []

The PGs appear to be stuck.
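
With 29 PGs listed, it can help to pull the ids out of the report mechanically rather than by eye. The following is a minimal sketch, assuming the health report has the line layout shown in this post; the function names stuck_inactive_pgs and down_osds are made up for illustration and are not part of Ceph.

```shell
# Print the ids of PGs reported as stuck inactive.
# Reads `ceph health detail` text on stdin and matches lines of the form
#   pg 11.1 is stuck inactive for ..., current state unknown, last acting []
stuck_inactive_pgs() {
    awk '$1 == "pg" && $3 == "is" && $4 == "stuck" && $5 == "inactive" { print $2 }'
}

# Print the names of OSDs reported as down, matching lines of the form
#   osd.6 (root=hdd,host=hdd-ceph1) is down
down_osds() {
    awk '$1 ~ /^osd\.[0-9]+$/ && $NF == "down" { print $1 }'
}
```

Typical use would be `ceph health detail | stuck_inactive_pgs` or `ceph health detail | down_osds`. Both functions only read text, so they can be tried on a saved copy of the report first.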

The OSD log is scrolling very quickly:

2021-10-27 10:22:59.527 7f9b2d0f5700  1 osd.6 pg_epoch: 23810 pg[10.1c( v 4595'732 (0'0,4595'732] local-lis/les=13109/131=[13109,23810)/3 crt=4595'732 lcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [8] -> [14,8], acting [8] -> [14,ting 4611087854031667199
2021-10-27 10:22:59.528 7f9b2e0f7700  1 osd.6 pg_epoch: 23739 pg[8.49d( v 4595'444 (0'0,4595'444] local-lis/les=13081/130081,23739)/5 crt=4595'444 lcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [14] -> [], acting [14] -> [], acting1087854031667199
2021-10-27 10:22:59.539 7f9b2e8f8700  1 osd.6 pg_epoch: 23770 pg[8.339( v 4595'513 (0'0,4595'513] local-lis/les=13099/13113099,23770)/9 crt=4595'513 lcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [14,8] -> [14], acting [14,8] -> [1acting 4611087854031667199
2021-10-27 10:22:59.540 7f9b2e0f7700  1 osd.6 pg_epoch: 23746 pg[8.49d( v 4595'444 (0'0,4595'444] local-lis/les=13081/13013081,23746)/5 crt=4595'444 lcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [] -> [14], acting [] -> [14], acti1087854031667199
2021-10-27 10:22:59.541 7f9b2c8f4700  1 osd.6 pg_epoch: 23808 pg[8.4aa( v 4595'411 (0'0,4595'411] local-lis/les=13114/131i=[13112,23808)/11 crt=4595'411 lcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [22,8,14] -> [22,8], acting [2254031667199 upacting 4611087854031667199
2021-10-27 10:22:59.544 7f9b2c8f4700  1 osd.6 pg_epoch: 23810 pg[8.4aa( v 4595'411 (0'0,4595'411] local-lis/les=13114/1310 pi=[13112,23810)/10 crt=4595'411 lcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [22,8] -> [22,8,14], acting 87854031667199 upacting 4611087854031667199
2021-10-27 10:22:59.545 7f9b2e8f8700  1 osd.6 pg_epoch: 23773 pg[8.339( v 4595'513 (0'0,4595'513] local-lis/les=13099/131=[13099,23773)/9 crt=4595'513 lcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [14] -> [14,8], acting [14] -> [1upacting 4611087854031667199
2021-10-27 10:22:59.546 7f9b2e0f7700  1 osd.6 pg_epoch: 23749 pg[8.49d( v 4595'444 (0'0,4595'444] local-lis/les=13081/130i=[13081,23749)/5 crt=4595'444 lcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [14] -> [23,14], acting [14] -> 99 upacting 4611087854031667199
2021-10-27 10:22:59.566 7f9b2d0f5700  1 osd.6 pg_epoch: 23838 pg[10.1c( v 4595'732 (0'0,4595'732] local-lis/les=13109/13113109,23838)/3 crt=4595'732 lcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [14,8] -> [14], acting [14,8] -> [1acting 4611087854031667199
2021-10-27 10:22:59.566 7f9b2d0f5700  1 osd.6 pg_epoch: 23838 pg[10.1c( v 4595'732 (0'0,4595'732] local-lis/les=13109/13113109,23838)/3 crt=4595'732 lcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
2021-10-27 10:22:59.597 7f9b2c8f4700  1 osd.6 pg_epoch: 23838 pg[8.4aa( v 4595'411 (0'0,4595'411] local-lis/les=13114/131pi=[13112,23838)/11 crt=4595'411 lcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [22,8,14] -> [22,14], acting [87854031667199 upacting 4611087854031667199
2021-10-27 10:22:59.597 7f9b2c8f4700  1 osd.6 pg_epoch: 23838 pg[8.4aa( v 4595'411 (0'0,4595'411] local-lis/les=13114/131pi=[13112,23838)/11 crt=4595'411 lcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
2021-10-27 10:22:59.606 7f9b2d0f5700  1 osd.6 pg_epoch: 23739 pg[8.765( v 4595'410 (0'0,4595'410] local-lis/les=13094/130094,23739)/3 crt=4595'410 lcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [14] -> [], acting [14] -> [], acting1087854031667199
2021-10-27 10:22:59.612 7f9b2e8f8700  1 osd.6 pg_epoch: 23808 pg[8.339( v 4595'513 (0'0,4595'513] local-lis/les=13099/1313099,23808)/9 crt=4595'513 lcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [14,8] -> [8], acting [14,8] -> [8],g 4611087854031667199
2021-10-27 10:22:59.617 7f9b2e8f8700  1 osd.6 pg_epoch: 23810 pg[8.339( v 4595'513 (0'0,4595'513] local-lis/les=13099/131=[13099,23810)/9 crt=4595'513 lcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [8] -> [14,8], acting [8] -> [14,ting 4611087854031667199
2021-10-27 10:22:59.620 7f9b2d8f6700  1 osd.6 pg_epoch: 23808 pg[8.38b( v 4595'408 (0'0,4595'408] local-lis/les=13099/13113099,23808)/12 crt=4595'408 lcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [23,14] -> [23], acting [23,14] -> upacting 4611087854031667199
2021-10-27 10:22:59.622 7f9b2d0f5700  1 osd.6 pg_epoch: 23746 pg[8.765( v 4595'410 (0'0,4595'410] local-lis/les=13094/13013094,23746)/3 crt=4595'410 lcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [] -> [14], acting [] -> [14], acti1087854031667199
2021-10-27 10:22:59.624 7f9b2d8f6700  1 osd.6 pg_epoch: 23810 pg[8.38b( v 4595'408 (0'0,4595'408] local-lis/les=13099/131i=[13099,23810)/12 crt=4595'408 lcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [23] -> [23,14], acting [23] ->199 upacting 4611087854031667199

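The excerpt repeats start_peering_interval for the same few PGs at increasing pg_epoch values. A rough way to see which PGs are churning is to count those events per PG. This is only a sketch, assuming the log lines keep the "pg[<id>(" token seen above; peering_events_per_pg is a hypothetical helper name.

```shell
# Count start_peering_interval events per PG in an OSD log read from stdin.
# The PG id is taken from the "pg[<pool>.<hex>(" token of each matching line.
# Output: one "<count> <pgid>" line per PG, busiest first.
peering_events_per_pg() {
    grep 'start_peering_interval' |
        sed -n 's/.* pg\[\([0-9]*\.[0-9a-f]*\)(.*/\1/p' |
        sort | uniq -c | sort -rn
}
```

It could be fed the OSD's log file, for example `peering_events_per_pg < /var/log/ceph/ceph-osd.6.log` (assuming the default log location, which the post does not state).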
Original poster | Posted on 2021-10-27 10:28:57
[root@ceph1 ~]# systemctl status ceph-osd@6.service
ceph-osd@6.service - Ceph object storage daemon osd.6
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
   Active: active (running) since Wed 2021-10-27 10:22:44 CST; 11min ago
  Process: 6169 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 6175 (ceph-osd)
   CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@6.service
           └─6175 /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph

Oct 27 10:22:44 ceph1 systemd[1]: Starting Ceph object storage daemon osd.6...
Oct 27 10:22:44 ceph1 systemd[1]: Started Ceph object storage daemon osd.6.
Oct 27 10:22:55 ceph1 ceph-osd[6175]: 2021-10-27 10:22:55.875 7f9b532a0a80 -1 osd.6 23718 log_to_monitors {default=true}
Oct 27 10:29:37 ceph1 ceph-osd[6175]: 2021-10-27 10:29:37.277 7f9b45125700 -1 osd.6 25440 set_numa_affinity unab...ectory
Hint: Some lines were ellipsized, use -l to show in full.

The service status is normal, but ceph health detail still reports a problem:
2 osds down
Reduced data availability: 29 pgs inactive

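systemd only reports that the ceph-osd process is running. Whether the OSD counts as up is decided by the cluster's OSD map, which is what ceph health reads. One way to compare the two views is sketched below. It assumes that `ceph osd dump` prints per-OSD lines beginning "osd.<id> <up|down> <in|out>", and the helper name osd_cluster_state is made up for illustration.

```shell
# Print the cluster's view ("up" or "down") of one OSD.
# Reads `ceph osd dump` text on stdin; $1 is the numeric OSD id.
# Assumes per-OSD lines of the form: osd.<id> <up|down> <in|out> ...
osd_cluster_state() {
    awk -v name="osd.$1" '$1 == name { print $2; exit }'
}
```

For osd.6 the comparison would be `systemctl is-active ceph-osd@6.service` against `ceph osd dump | osd_cluster_state 6`. If the first says active while the second says down, the daemon is running but has not (yet) been marked up by the monitors.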
I found the following procedure online:

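The output shows that these 3 PGs are stuck.
Running pg query to inspect one of them returns:
[root@controller ~]# ceph pg 3.1e query
Error ENOENT: i don't have pgid 3.1e

The PG id cannot be found.
Running pg dump_stuck unclean shows:
Analysis
These PG ids appear to be gone for good. I have three OSD pools, named l1 (1 replica), l2 (2 replicas), and l3 (3 replicas).
Most likely, data that had been written to the 1-replica pool was lost when a disk failed.
Since that pool keeps a single replica, data reliability was never a requirement for it. It also only held partially downloaded data, so the loss does not matter.
Fix
After reading the official Ceph PG troubleshooting documentation, I found a solution: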
POOL SIZE = 1
If you have the osd pool default size set to 1, you will only have one copy of the object. OSDs rely on other OSDs to tell them which objects they should have. If a first OSD has a copy of an object and there is no second copy, then no second OSD can tell the first OSD that it should have that copy. For each placement group mapped to the first OSD (see ceph pg dump), you can force the first OSD to notice the placement groups it needs by running:
ceph osd force-create-pg <pgid>
In other words, OSDs holding multiple replicas can tell each other which PGs they should have, but with a single replica that information is lost. To recover such a PG, we can force its creation.
[root@controller ~]# ceph osd force-create-pg 3.1e
Error EPERM: This command will recreate a lost (as in data lost) PG with data in it, such that the cluster will give up ever trying to recover the lost data.  Do this only if you are certain that all copies of the PG are in fact lost and you are willing to accept that the data is permanently destroyed.  Pass --yes-i-really-mean-it to proceed.

The command refuses to run and warns that it permanently discards the PG's data; --yes-i-really-mean-it must be added to proceed.
[root@controller ~]# ceph osd force-create-pg 3.1e --yes-i-really-mean-it
pg 3.1e now creating, ok
[root@controller ~]# ceph osd force-create-pg 3.b --yes-i-really-mean-it
pg 3.b now creating, ok
[root@controller ~]# ceph osd force-create-pg 3.4 --yes-i-really-mean-it
pg 3.4 now creating, ok

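Because force-create-pg gives up whatever data the PG held, it may be worth printing the exact commands before running any of them. The helper below is a sketch under that assumption and not part of the quoted procedure: force_create_pgs and the RUN variable are hypothetical names, and by default nothing is executed.

```shell
# Read PG ids (one per line) on stdin and print the matching
# force-create-pg command for each. Only when RUN=1 is set are the
# commands actually executed instead of printed.
# WARNING: force-create-pg recreates the PG empty; its data is given up.
force_create_pgs() {
    while read -r pgid; do
        # Accept only ids shaped like <pool>.<hex>, e.g. 11.5 or 3.1e.
        if ! printf '%s\n' "$pgid" | grep -Eq '^[0-9]+\.[0-9a-f]+$'; then
            echo "skipping invalid pg id: $pgid" >&2
            continue
        fi
        if [ "${RUN:-0}" = "1" ]; then
            # </dev/null keeps the command from consuming the id list.
            ceph osd force-create-pg "$pgid" --yes-i-really-mean-it </dev/null
        else
            echo "ceph osd force-create-pg $pgid --yes-i-really-mean-it"
        fi
    done
}
```

A dry run such as `printf '3.1e\n3.b\n3.4\n' | force_create_pgs` prints the three commands from the quoted transcript. Setting RUN=1 would execute them, which only makes sense once all copies of those PGs are known to be lost.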
This looked workable, so I went straight to the last step:

[root@ceph1 ~]# ceph osd force-create-pg 11.1 --yes-i-really-mean-it
pg 11.1 now creating, ok
[root@ceph1 ~]#
[root@ceph2 ~]# ceph -s
  cluster:
    id:     4d8d7309-ad9e-4566-bee5-69b8d805dd57
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph1,ceph2,ceph3,compute1,compute2 (age 28m)
    mgr: ceph2(active, since 22h), standbys: compute3, compute2, compute1, ceph3, ceph1
    osd: 36 osds: 36 up (since 47s), 36 in (since 10m)

  data:
    pools:   9 pools, 5888 pgs
    objects: 143.19k objects, 549 GiB
    usage:   5.9 TiB used, 127 TiB / 133 TiB avail
    pgs:     5888 active+clean

  io:
    client:   258 KiB/s rd, 35 KiB/s wr, 224 op/s rd, 6 op/s wr

Unexpectedly, the cluster went back to HEALTH_OK.

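The ceph -s output also reports "36 osds: 36 up (since 47s)", so the two OSDs had rejoined the cluster by the time the health turned OK. To watch both facts together, the status text can be condensed into one line. This is a minimal sketch that assumes the "health:" and "osd:" line layout shown in this post; cluster_summary is a hypothetical helper name.

```shell
# Condense `ceph -s` text read from stdin into a single line:
#   health=<state> osds=<total> up=<n> in=<n>
cluster_summary() {
    awk '
        $1 == "health:" { health = $2 }
        $1 == "osd:" {
            total = $2
            # The count precedes the words "up" and "in"; tolerate "up,".
            for (i = 3; i <= NF; i++) {
                tok = $i
                sub(/,$/, "", tok)
                if (tok == "up") up = $(i - 1)
                if (tok == "in") in_count = $(i - 1)
            }
        }
        END { printf "health=%s osds=%s up=%s in=%s\n", health, total, up, in_count }
    '
}
```

Running `ceph -s | cluster_summary` before and after a repair step makes it easier to tell whether the health change lines up with the OSDs coming back up or with the command that was just run.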
Original poster | Posted on 2021-10-27 10:33:21
[root@ceph1 ~]# ceph osd force-create-pg 11.5 --yes-i-really-mean-it
pg 11.5 now creating, ok
[root@ceph1 ~]# ceph osd force-create-pg 11.22 --yes-i-really-mean-it
pg 11.22 now creating, ok

Ceph was repaired successfully.

Original poster | Posted on 2021-10-27 10:40:56
[root@ceph1 ~]# ceph pg dump_stuck
degraded    inactive    stale       unclean     undersized