admin posted on 2022-3-1 19:01:12

HEALTH_WARN 1 daemons have recently crashed: how it was resolved

Ceph raised a health warning. Steps taken to resolve it:
# ceph health detail
HEALTH_WARN 1 daemons have recently crashed
RECENT_CRASH 1 daemons have recently crashed
    osd.29 crashed on host compute08 at 2022-03-01 10:31:17.079004Z

      
# ceph crash ls-new
ID                                                               ENTITY NEW
2022-03-01_10:31:17.079004Z_11fa7732-990f-4166-8de5-943ff6f07c10 osd.29*
# ceph crash info 2022-03-01_10:31:17.079004Z_11fa7732-990f-4166-8de5-943ff6f07c10
{
    "os_version_id": "7",
    "assert_condition": "e.version > info.last_update",
    "utsname_release": "3.10.0-1160.el7.x86_64",
    "os_name": "CentOS Linux",
    "entity_name": "osd.29",
    "assert_file": "/home/miles/rpmbuild/BUILD/ceph-14.2.8/src/osd/PG.cc",
    "timestamp": "2022-03-01 10:31:17.079004Z",
    "process_name": "ceph-osd",
    "utsname_machine": "x86_64",
    "assert_line": 3964,
    "utsname_sysname": "Linux",
    "os_version": "7 (Core)",
    "os_id": "centos",
    "assert_thread_name": "tp_osd_tp",
    "utsname_version": "#1 SMP Wed Nov 18 03:43:48 UTC 2020",
    "backtrace": [
      "(()+0xf630) ",
      "(gsignal()+0x37) ",
      "(abort()+0x148) ",
      "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) ",
      "(()+0x4cc87d) ",
      "(PG::add_log_entry(pg_log_entry_t const&, bool)+0x1f5) ",
      "(PG::append_log(std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, eversion_t, eversion_t, ObjectStore::Transaction&, bool, bool)+0x10b) ",
      "(non-virtual thunk to PrimaryLogPG::log_operation(std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, boost::optional<pg_hit_set_history_t> const&, eversion_t const&, eversion_t const&, bool, ObjectStore::Transaction&, bool)+0x95) ",
      "(ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xaa9) ",
      "(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x257) ",
      "(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) ",
      "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5b3) ",
      "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x362) ",
      "(PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) ",
      "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) ",
      "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) ",
      "(ShardedThreadPool::WorkThreadSharded::entry()+0x10) ",
      "(()+0x7ea5) ",
      "(clone()+0x6d) "
    ],
    "utsname_hostname": "compute08",
    "assert_msg": "/home/miles/rpmbuild/BUILD/ceph-14.2.8/src/osd/PG.cc: In function 'void PG::add_log_entry(const pg_log_entry_t&, bool)' thread 7fb52ad89700 time 2022-03-01 18:31:17.054438\n/home/miles/rpmbuild/BUILD/ceph-14.2.8/src/osd/PG.cc: 3964: FAILED ceph_assert(e.version > info.last_update)\n",
    "crash_id": "2022-03-01_10:31:17.079004Z_11fa7732-990f-4166-8de5-943ff6f07c10",
    "assert_func": "void PG::add_log_entry(const pg_log_entry_t&, bool)",
    "ceph_version": "14.2.8-111.el7"
}
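
Most of the report above is backtrace, while only a handful of fields identify the failure. A rough sketch for pulling those fields out, assuming the pretty-printed layout shown above with one key per line (`crash_summary` is a hypothetical helper name, not a Ceph command):

```shell
# Hypothetical helper: read `ceph crash info <id>` output on stdin and
# keep only the lines that identify the daemon, version and failed assert.
crash_summary() {
    grep -E '^[[:space:]]*"(entity_name|timestamp|ceph_version|assert_func|assert_condition)":'
}

# Intended use on a node with admin access:
#   ceph crash info <crash-id> | crash_summary
```

For the crash in this post, the lines that matter are `entity_name` (osd.29), `ceph_version` (14.2.8-111.el7) and `assert_condition` (e.version > info.last_update).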


# ceph crash archive 2022-03-01_10:31:17.079004Z_11fa7732-990f-4166-8de5-943ff6f07c10
# ceph health detail
HEALTH_OK



Resolved.
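
When several crashes are pending, the list, inspect, archive sequence above can be scripted. A minimal sketch, assuming `ceph crash ls-new` prints one header row followed by rows that start with the crash ID, as in the output above (`new_crash_ids` is a hypothetical helper name, not a Ceph command):

```shell
# Hypothetical helper: read `ceph crash ls-new` output on stdin and
# print one crash ID per line (skips the header row and blank lines).
new_crash_ids() {
    awk 'NR > 1 && NF > 0 { print $1 }'
}

# Intended use on a node with admin access (left commented out so that
# nothing is archived by accident):
#   ceph crash ls-new | new_crash_ids | while read -r id; do
#       ceph crash info "$id"      # review the report first
#       ceph crash archive "$id"   # then acknowledge it
#   done
```

Archiving only acknowledges the report so that it no longer counts toward the health warning. It does not address whatever caused the OSD to assert.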

The commands below were only for inspection and are not part of the fix:
# ceph config get mgr/crash/warn_recent_interval
Error EINVAL: unrecognized entity 'mgr/crash/warn_recent_interval'
# ceph get mgr/crash/warn_recent_interval
no valid command found; 10 closest matches:
osd pause
osd unpause
osd get-require-min-compat-client
osd set-require-min-compat-client <version> {--yes-i-really-mean-it}
osd set-backfillfull-ratio <float>
osd set-nearfull-ratio <float>
mds count-metadata <property>
mds metadata {<who>}
fs dump {<int>}
versions
Error EINVAL: invalid command
# ceph config set mgr/crash/warn_recent_interval 0
Invalid command: missing required parameter value(<string>)
config set <who> <name> <value> {--force} :Set a configuration option for one or more entities
Error EINVAL: invalid command
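
The `config get` and `config set` attempts above fail for the same reason: both subcommands take the target entity (`<who>`) as a separate first argument, so the option name on its own is parsed as the entity. For manager module options the entity is normally `mgr`. A sketch of the likely intended commands, assuming this release accepts the option at the `mgr` level (not verified against the 14.2.8 cluster in this post). The `run` wrapper is a hypothetical dry-run helper that only prints each command unless `APPLY=1` is set:

```shell
# Hypothetical dry-run wrapper: print the command by default,
# execute it only when APPLY=1 is set in the environment.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Read the current "recent" window used by the RECENT_CRASH warning.
run ceph config get mgr mgr/crash/warn_recent_interval

# Assumption: a value of 0 disables the warning entirely.
run ceph config set mgr mgr/crash/warn_recent_interval 0
```

Disabling the warning hides future crashes from `ceph health`, so archiving individual reports after reviewing them is usually the safer choice.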
# ceph crash archive-all
# ceph -s
  cluster:
    id:     29046cc0-0682-496b-98b1-912e59964282
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum hostceph1,hostceph2,hostceph3 (age 27m)
    mgr: hostceph1(active, since 53m), standbys: hostceph2, hostceph3
    osd: 34 osds: 34 up (since 27m), 34 in (since 45m)

  data:
    pools:   9 pools, 9344 pgs
    objects: 1.21M objects, 4.6 TiB
    usage:   16 TiB used, 110 TiB / 126 TiB avail
    pgs:     9344 active+clean

  io:
    client:   2.7 KiB/s rd, 13 MiB/s wr, 0 op/s rd, 97 op/s wr