Posted by admin on 2021-12-9 10:03:15

Differences in how Ceph BlueStore and FileStore store data

I. FileStore: the PG that holds an object is stored as files on an XFS file system
1 List all of the OSD disks. Like any other Linux disk, each one is mounted on a directory.


# df -h | grep -i ceph

/dev/sda1 3.7T 112G 3.6T 3% /var/lib/ceph/osd/ff279da7-446d-451f-9dcf-7d281e9b85a6

/dev/sdb1 3.7T 77G 3.6T 3% /var/lib/ceph/osd/263adb71-1902-4e21-acea-b7f53921b484

/dev/sdc1 3.7T 74G 3.6T 2% /var/lib/ceph/osd/e71a1f8e-08eb-4fa6-ba17-90ec20bfe84b

/dev/sdd1 3.7T 96G 3.6T 3% /var/lib/ceph/osd/f308c1e5-6d39-442e-883a-e3253e6ba4f0

/dev/sde1 3.7T 97G 3.6T 3% /var/lib/ceph/osd/81772621-7175-4d5d-bdb3-ccad15fa3848

/dev/sdf1 3.7T 111G 3.6T 3% /var/lib/ceph/osd/889ba3b0-09a9-43e8-b0c9-adc938337ccd

/dev/sdh1 3.7T 77G 3.6T 3% /var/lib/ceph/osd/269d36ae-7e3f-4c72-89a1-6f5841c3212f

/dev/sdi1 3.7T 91G 3.6T 3% /var/lib/ceph/osd/5476a45d-ca0e-4728-b09d-7a277394bda7

/dev/sdj1 3.7T 65G 3.6T 2% /var/lib/ceph/osd/6b1ab4bb-82b1-4fbd-8a57-3fae690df4e9

/dev/sdk1 3.7T 85G 3.6T 3% /var/lib/ceph/osd/0a44ee0c-c5f3-4769-b4bc-1dd819c3f1ad

/dev/sdl1 3.7T 127G 3.6T 4% /var/lib/ceph/osd/60536d66-31f2-4ec4-8728-e02bf11fd04f




2 The UUID ff279da7-446d-451f-9dcf-7d281e9b85a6 is simply the data held by one disk. Change into that directory.


# cd /var/lib/ceph/osd/ff279da7-446d-451f-9dcf-7d281e9b85a6

# ls -ls

total 40

4 -rw-r--r-- 1 root root 37 Dec 29 23:06 ceph_fsid

4 drwxr-xr-x 92 root root 4096 Feb 18 09:19 current

4 -rw-r--r-- 1 root root 37 Dec 29 23:06 fsid

4 -rw------- 1 root root 56 Dec 29 23:06 keyring

4 -rw-r--r-- 1 root root 21 Dec 29 23:06 magic

4 -rw-r--r-- 1 root root 6 Dec 29 23:06 ready

4 -rw-r--r-- 1 root root 4 Dec 29 23:06 store_version

4 -rw-r--r-- 1 root root 53 Dec 29 23:06 superblock

4 -rw-r--r-- 1 root root 10 Dec 29 23:06 type

4 -rw-r--r-- 1 root root 2 Dec 29 23:06 whoami




3 Then enter the current directory.


# cd current/

# ls -ls

total 920

36 drwxr-xr-x 2 root root 32768 Jan 1 18:24 1.7_head

0 drwxr-xr-x 2 root root 6 Dec 29 23:11 1.7_TEMP

20 drwxr-xr-x 2 root root 16384 Feb 22 16:22 2.15_head

0 drwxr-xr-x 2 root root 6 Dec 31 23:07 2.15_TEMP

24 drwxr-xr-x 2 root root 20480 Feb 22 16:07 2.1d_head

0 drwxr-xr-x 2 root root 66 Feb 22 02:54 2.1d_TEMP

24 drwxr-xr-x 2 root root 20480 Feb 22 16:36 2.1f_head

0 drwxr-xr-x 2 root root 66 Feb 21 20:31 2.1f_TEMP

24 drwxr-xr-x 2 root root 20480 Feb 22 15:44 2.36_head

0 drwxr-xr-x 2 root root 6 Dec 31 23:07 2.36_TEMP

24 drwxr-xr-x 2 root root 20480 Feb 22 16:38 2.41_head

0 drwxr-xr-x 2 root root 6 Dec 31 23:07 2.41_TEMP

28 drwxr-xr-x 2 root root 24576 Feb 22 16:39 2.43_head

0 drwxr-xr-x 2 root root 6 Dec 31 23:07 2.43_TEMP

28 drwxr-xr-x 2 root root 24576 Feb 22 16:39 2.51_head

0 drwxr-xr-x 2 root root 58 Feb 22 15:57 2.51_TEMP

24 drwxr-xr-x 2 root root 20480 Feb 22 16:37 2.5c_head

0 drwxr-xr-x 2 root root 6 Dec 31 23:07 2.5c_TEMP

24 drwxr-xr-x 2 root root 20480 Feb 22 15:33 2.78_head




As you can see, each PG is kept on disk as a directory of ordinary files.
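Building on this layout, here is a hedged sketch of locating the file that backs one object on a FileStore OSD; the pool name, object name, PG id and OSD directory are placeholders, and the exact on-disk file name (object name plus a hash suffix) varies between releases:

# ceph osd map <pool-name> <object-name>
# find /var/lib/ceph/osd/<osd-dir>/current/<pg-id>_head -name '*<object-name>*' -ls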
II. BlueStore writes directly to the raw device; because there is no file system, you cannot see individual object files from the operating system
1 List the OSD mounts; only a small tmpfs holding the OSD's metadata is mounted into a directory.


df -h | grep -i ceph
tmpfs                     63G   24K   63G   1% /var/lib/ceph/osd/ceph-1
tmpfs                     63G   24K   63G   1% /var/lib/ceph/osd/ceph-0





2 Enter /var/lib/ceph/osd/d531c723-2bd7-4197-9b6d-8a7fa7ac4719 and notice that there is a symlink named block pointing directly to a block device.


# ls -ls
total 24
0 lrwxrwxrwx 1 ceph ceph 93 Dec  4 09:03 block -> /dev/ceph-c7efdce2-5f1d-475f-b19e-c1673043d2e1/osd-block-d2c2a02e-db57-4ce2-a48c-4dc7bdb9c355
4 -rw------- 1 ceph ceph 37 Dec  4 09:03 ceph_fsid
4 -rw------- 1 ceph ceph 37 Dec  4 09:03 fsid
4 -rw------- 1 ceph ceph 55 Dec  4 09:03 keyring
4 -rw------- 1 ceph ceph  6 Dec  4 09:03 ready
4 -rw------- 1 ceph ceph 10 Dec  4 09:03 type
4 -rw------- 1 ceph ceph  2 Dec  4 09:03 whoami





3 Follow the link to the real location: it ultimately points to a data partition such as sdb2, while what df showed was sdb1, the small metadata partition.


# cd /dev/disk/by-partlabel/

# ls -ls

total 0

0 lrwxrwxrwx 1 root root 10 Jan 24 14:38 KOLLA_CEPH_DATA_BS_0 -> ../../sdd1

0 lrwxrwxrwx 1 root root 10 Feb 21 17:21 KOLLA_CEPH_DATA_BS_0_B -> ../../sdd2

0 lrwxrwxrwx 1 root root 10 Jan 24 14:38 KOLLA_CEPH_DATA_BS_3 -> ../../sdc1

0 lrwxrwxrwx 1 root root 10 Feb 21 17:21 KOLLA_CEPH_DATA_BS_3_B -> ../../sdc2

0 lrwxrwxrwx 1 root root 10 Jan 24 14:38 KOLLA_CEPH_DATA_BS_6 -> ../../sdb1

0 lrwxrwxrwx 1 root root 10 Feb 21 17:21 KOLLA_CEPH_DATA_BS_6_B -> ../../sdb2






# ls -ls
total 0
0 lrwxrwxrwx 1 root root 7 Dec  4 09:03 osd-block-d2c2a02e-db57-4ce2-a48c-4dc7bdb9c355 -> ../dm-2
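To confirm what actually lives on that raw device, ceph-bluestore-tool can print the BlueStore label written at the start of the device; a hedged sketch, using the block symlink target shown above (run it on the OSD host, or inside the OSD container in a Kolla deployment):

# ceph-bluestore-tool show-label --dev /dev/ceph-c7efdce2-5f1d-475f-b19e-c1673043d2e1/osd-block-d2c2a02e-db57-4ce2-a48c-4dc7bdb9c355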






III. Tracing an object from a pool to its PG and OSDs
1 First create a test file and a test pool.


#echo "Hello Ceph, You are Awesome like MJ" > /tmp/helloceph

#ceph osd pool create HPC_Pool 128 128




2 Store the test file in the pool as an object and confirm that it is there:


# rados -p HPC_Pool put object1 /tmp/helloceph

# rados -p HPC_Pool ls

object1
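As an extra sanity check you can read the object back; a small sketch using standard rados subcommands, with /tmp/object1.out as a hypothetical output path:

# rados -p HPC_Pool stat object1
# rados -p HPC_Pool get object1 /tmp/object1.out
# cat /tmp/object1.out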




3 In Ceph, all data is stored as objects; each object belongs to a PG, and each PG in turn maps to a set of OSDs. Let's look at this directly:


# ceph osd map HPC_Pool object1

osdmap e221395 pool 'HPC_Pool' (14) object 'object1' -> pg 14.bac5debc (14.3c) -> up ([6,4,7], p6) acting ([6,4,7], p6)




Let's walk through this output.
osdmap e221395: the OSD map epoch (version).
pool 'HPC_Pool' (14): the pool name and pool ID.
object 'object1': the object name.
pg 14.bac5debc (14.3c): the PG number; it shows that object1 belongs to PG 14.3c.
up ([6,4,7], p6): the up set of OSDs, here osd.6, osd.4 and osd.7. Because the replication level is 3, each PG is stored on three OSDs, and this also shows that all three OSDs holding PG 14.3c are up. It is the ordered list of OSDs responsible for the PG at a particular OSD map epoch according to the CRUSH ruleset. Under normal conditions it is identical to the acting set.
acting ([6,4,7], p6): osd.6, osd.4 and osd.7 are in the acting set, with osd.6 as the primary OSD, osd.4 as the second and osd.7 as the third. The acting set is the ordered list of OSDs that are actually responsible for the PG at a particular OSD map epoch.
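If you already know the PG id, you can also query the mapping the other way around; a small sketch using the PG id from the output above:

# ceph pg map 14.3c
# ceph pg 14.3c query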
4 Check the physical location of these OSDs. You will find that osd.6, osd.4 and osd.7 are physically separate from each other, each on a different node.


# ceph osd tree

ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF

-1 4.85994 root default

-2 0 host 10.160.20.197

-4 0 host 10.160.20.198

-3 0 host 10.160.20.199

-11 1.61998 host 99cloud3

0 hdd 0.53999 osd.0 up 1.00000 1.00000

3 hdd 0.53999 osd.3 up 1.00000 1.00000

6 hdd 0.53999 osd.6 up 1.00000 1.00000

-9 1.61998 host 99cloud4

2 hdd 0.53999 osd.2 up 1.00000 1.00000

4 hdd 0.53999 osd.4 up 1.00000 1.00000

8 hdd 0.53999 osd.8 up 1.00000 1.00000

-13 1.61998 host 99cloud5

1 hdd 0.53999 osd.1 up 1.00000 1.00000

5 hdd 0.53999 osd.5 up 1.00000 1.00000

7 hdd 0.53999 osd.7 up 1.00000 1.00000




# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME                 STATUS REWEIGHT PRI-AFF
-11             0 host ssd-compute05
 -3       6.00000 root ssd
 -4       2.00000     host ssd-compute01
  0   ssd 1.00000         osd.0                 up  1.00000 1.00000
  1   ssd 1.00000         osd.1                 up  1.00000 1.00000
 -5       2.00000     host ssd-compute02
  2   ssd 1.00000         osd.2                 up  1.00000 1.00000
  3   ssd 1.00000         osd.3                 up  1.00000 1.00000
 -6       2.00000     host ssd-compute03
  5   ssd 1.00000         osd.5                 up  1.00000 1.00000
  6   ssd 1.00000         osd.6                 up  1.00000 1.00000
 -1             0 root default








Posted by admin on 2021-12-9 10:13:05

Cache tiering

A cache tier improves I/O performance for a subset of (hot) data stored in a backing pool. Cache tiering involves creating a pool of relatively fast and expensive storage devices (e.g., SSDs) configured to act as a cache tier, and a backing pool of slower and cheaper devices (or an erasure-coded pool) configured to act as an economical storage tier. Ceph's objecter decides where to store objects, and the tiering agent decides when to flush objects from the cache tier back to the backing storage tier, so the cache tier and the backing storage tier are completely transparent to Ceph clients. (Diagram: http://docs.ceph.org.cn/_images/ditaa-2982c5ed3031cac4f9e40545139e51fdb0b33897.png)

The cache tiering agent handles the migration of data between the cache tier and the backing storage tier automatically. However, administrators can configure how this migration takes place. There are two main scenarios:
[*]Writeback mode: when the cache tier is configured in writeback mode, Ceph clients write data to the cache tier and receive an ACK from the cache tier. In time, the data written to the cache tier is migrated to the storage tier and flushed from the cache tier. Conceptually, the cache tier sits "in front of" the backing storage tier. When a Ceph client needs data that resides in the storage tier, the cache tiering agent migrates that data into the cache tier on read, and it is then sent to the client. From then on, the client performs I/O against the cache tier until the data stops being read and written. This mode is ideal for mutable data such as photo/video editing or transactional data.
[*]Read-only mode: when the cache tier is configured in readonly mode, Ceph writes data directly to the backing tier. On reads, Ceph copies the requested objects from the backing tier into the cache tier, and stale objects are evicted from the cache tier according to the defined policy. This mode is suitable for immutable data (e.g., pictures and videos presented on a social network, DNA data, X-ray images), because data read from the cache tier may be stale, i.e., consistency is weak. Do not use readonly mode for mutable data.
Because all Ceph clients can use a cache tier, it has the potential to improve I/O performance for Ceph block devices, Ceph object storage, the Ceph file system and the native bindings.

Setting up pools

To set up cache tiering, you need two pools: one acting as the backing storage and one acting as the cache.

Setting up the backing storage pool

Setting up a backing storage pool typically involves one of two scenarios:
[*]Standard storage: in this scenario, the pool stores multiple replicas of each object in the Ceph storage cluster.
[*]Erasure-coded pool: in this scenario, the pool uses erasure coding to store data much more efficiently, with a small performance penalty.
In the standard storage scenario, you can use a CRUSH ruleset to define the failure domain (e.g., osd, host, chassis, rack, row and so on). Ceph OSD daemons perform best when all of the drives in the ruleset are of the same size, speed (both RPM and throughput) and type. See the CRUSH map documentation for details on creating rulesets. Once you have created a ruleset, create the backing storage pool. In the erasure-coding scenario, the arguments passed when creating the pool automatically generate an appropriate ruleset; see the pool creation documentation. In the examples that follow, cold-storage is the backing storage pool.

Setting up the cache pool

Setting up the cache pool follows roughly the same procedure as the standard storage scenario, with one difference: the drives used for the cache tier are typically high-performance drives that live in dedicated servers and have their own ruleset. When building that ruleset, take into account the hosts that have the high-performance drives and ignore the hosts that don't; see the documentation on placing different pools on different OSDs. In the examples that follow, hot-storage is the cache pool and cold-storage is the backing storage pool. For a detailed explanation of the cache tier options and their default values, see the pool documentation (set pool values).
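Before wiring the tiers together in the next step, both pools need to exist; a minimal sketch, where the PG counts and the SSD CRUSH rule name ssd-rule are hypothetical and should be adapted to your cluster:

ceph osd pool create cold-storage 128 128
ceph osd pool create hot-storage 128 128 replicated ssd-rule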

Creating a cache tier

Setting up a cache tier involves attaching the cache pool to the backing storage pool:

ceph osd tier add {storagepool} {cachepool}

For example:

ceph osd tier add cold-storage hot-storage

Set the cache mode with:

ceph osd tier cache-mode {cachepool} {cache-mode}

For example:

ceph osd tier cache-mode hot-storage writeback

Because the cache tier sits on top of the backing storage tier, one more step is required: you must direct all client traffic from the storage pool to the cache pool. To point client traffic at the cache pool, run:

ceph osd tier set-overlay {storagepool} {cachepool}

For example:

ceph osd tier set-overlay cold-storage hot-storage
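To confirm that the tier relationship took effect, you can inspect the pool entries in the OSD map; a hedged sketch (field names such as tier_of, cache_mode, read_tier and write_tier may vary slightly by release):

ceph osd dump | grep -E 'hot-storage|cold-storage'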

Configuring a cache tier

A cache tier has several configuration options, set with the following syntax:

ceph osd pool set {cachepool} {key} {value}

See the pool documentation (set pool values) for details.

Target size and type

In production, the cache tier's hit_set_type currently only supports the Bloom filter:

ceph osd pool set {cachepool} hit_set_type bloom

For example:

ceph osd pool set hot-storage hit_set_type bloom

The hit_set_count and hit_set_period options define how much time each HitSet covers and how many such HitSets to retain:

ceph osd pool set {cachepool} hit_set_count 1
ceph osd pool set {cachepool} hit_set_period 3600
ceph osd pool set {cachepool} target_max_bytes 1000000000000

Retaining a history of accesses lets Ceph determine whether a client accessed an object once or repeatedly over a period of time (age versus temperature). min_read_recency_for_promote defines how many HitSets to check for the presence of an object when handling a read; the result is used to decide whether to promote the object asynchronously. Its value should be between 0 and hit_set_count. If it is set to 0, the object is always promoted; if it is set to 1, only the current HitSet is checked, and the object is promoted if it is present there, otherwise it is not; for any other value, that many historical HitSets are checked, and the object is promoted if it appears in any of the most recent min_read_recency_for_promote HitSets. A similar parameter, min_write_recency_for_promote, exists for write operations:

ceph osd pool set {cachepool} min_read_recency_for_promote 1
ceph osd pool set {cachepool} min_write_recency_for_promote 1

Note: the longer the period and the higher the min_read_recency_for_promote or min_write_recency_for_promote values, the more RAM the ceph-osd daemon consumes, particularly when the agent is busy flushing or evicting objects, because all hit_set_count HitSets must then be loaded into memory.
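The same keys can usually be read back to verify what a pool is using; a small sketch against the hot-storage pool from the examples:

ceph osd pool get hot-storage hit_set_type
ceph osd pool get hot-storage hit_set_count
ceph osd pool get hot-storage hit_set_period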

Cache sizing

The cache tiering agent performs two main functions:
[*]Flushing: the agent identifies modified (dirty) objects and forwards them to the storage pool for long-term storage.
[*]Evicting: the agent identifies objects that have not been modified (clean) and evicts the least recently used of them from the cache.

Relative sizing

The cache tiering agent can flush or evict objects relative to the size of the cache pool. When the cache pool contains a certain percentage of modified (dirty) objects, the agent begins flushing them to the storage pool. Set cache_target_dirty_ratio with:

ceph osd pool set {cachepool} cache_target_dirty_ratio {0.0..1.0}

For example, a value of 0.4 starts flushing dirty objects once they reach 40% of the cache pool's capacity:

ceph osd pool set hot-storage cache_target_dirty_ratio 0.4

When dirty objects reach a higher percentage of the capacity, the agent flushes them more aggressively. Set cache_target_dirty_high_ratio with:

ceph osd pool set {cachepool} cache_target_dirty_high_ratio {0.0..1.0}

For example, a value of 0.6 starts flushing dirty objects more aggressively once they reach 60% of the cache pool's capacity. The value should sit between the dirty ratio and the full ratio:

ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6

When the cache pool reaches a certain percentage of its total capacity, the agent evicts objects to keep some capacity free. Set cache_target_full_ratio with:

ceph osd pool set {cachepool} cache_target_full_ratio {0.0..1.0}

For example, a value of 0.8 starts evicting clean objects once they occupy 80% of the cache pool's capacity:

ceph osd pool set hot-storage cache_target_full_ratio 0.8

Absolute sizing

The cache tiering agent can also flush or evict objects based on the total number of bytes or objects. To specify a maximum number of bytes:

ceph osd pool set {cachepool} target_max_bytes {#bytes}

For example, to flush or evict at 1 TB:

ceph osd pool set hot-storage target_max_bytes 1000000000000

To specify the maximum number of cached objects:

ceph osd pool set {cachepool} target_max_objects {#objects}

For example, to flush or evict once the pool holds one million objects:

ceph osd pool set hot-storage target_max_objects 1000000

Note: if both limits are configured, the cache tiering agent begins flushing or evicting as soon as either threshold is reached.
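A simple way to watch how close the cache pool is to these thresholds is the per-pool space report, together with reading the limits back; a sketch (the exact columns differ between releases):

ceph df detail
ceph osd pool get hot-storage target_max_bytes
ceph osd pool get hot-storage target_max_objects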


Cache age

You can specify how long a recently modified (dirty) object must wait before the cache tiering agent flushes it to the backing storage pool:

ceph osd pool set {cachepool} cache_min_flush_age {#seconds}

For example, to flush modified (dirty) objects only after they are at least 10 minutes old:

ceph osd pool set hot-storage cache_min_flush_age 600

You can specify how long an object must sit in the cache tier before it may be evicted:

ceph osd pool set {cachepool} cache_min_evict_age {#seconds}

For example, to evict objects only after 30 minutes:

ceph osd pool set hot-storage cache_min_evict_age 1800
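Assuming your release lets you read these values back with ceph osd pool get, a quick check looks like:

ceph osd pool get hot-storage cache_min_flush_age
ceph osd pool get hot-storage cache_min_evict_age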


Removing a cache tier

The procedure for removing a writeback cache differs from the procedure for a read-only cache.

Removing a read-only cache

Since a read-only cache does not contain modified data, you can disable and remove it without losing any recent changes to objects in the cache.
[*]Disable it by changing the cache mode to none:

ceph osd tier cache-mode {cachepool} none

For example:

ceph osd tier cache-mode hot-storage none

[*]Remove the cache pool from the backing pool:

ceph osd tier remove {storagepool} {cachepool}

For example:

ceph osd tier remove cold-storage hot-storage


Removing a writeback cache

Since a writeback cache may contain modified data, you must take steps to avoid losing any recently modified objects in the cache before you disable and remove it; a consolidated example follows the individual steps.
[*]Change the cache mode to forward so that new and modified objects are flushed straight to the backing storage pool:

ceph osd tier cache-mode {cachepool} forward

For example:

ceph osd tier cache-mode hot-storage forward

[*]Ensure that the cache pool has been flushed; this may take a few minutes:

rados -p {cachepool} ls

If the cache pool still contains objects, you can flush them manually, for example:

rados -p {cachepool} cache-flush-evict-all

[*]Remove the overlay so that clients are no longer directed to the cache:

ceph osd tier remove-overlay {storagetier}

For example:

ceph osd tier remove-overlay cold-storage

[*]Finally, detach the cache tier pool from the backing storage pool:

ceph osd tier remove {storagepool} {cachepool}

For example:

ceph osd tier remove cold-storage hot-storage
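Putting the steps together for the hot-storage and cold-storage pools used throughout this post, the whole teardown is roughly the following sketch (it simply mirrors the commands above; repeat the ls until it returns nothing before removing the overlay):

ceph osd tier cache-mode hot-storage forward
rados -p hot-storage cache-flush-evict-all
rados -p hot-storage ls
ceph osd tier remove-overlay cold-storage
ceph osd tier remove cold-storage hot-storage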



Posted by admin on 2021-12-9 10:17:05

BlueStore is a new storage backend for Ceph. It boasts better performance (roughly 2x for writes), full data checksumming, and built-in compression. It is the new default storage backend for Ceph OSDs in Luminous v12.2.z and will be used by default when provisioning new OSDs with ceph-disk, ceph-deploy, and/or ceph-ansible.

How fast?

Roughly speaking, BlueStore is about twice as fast as FileStore, and performance is more consistent with a lower tail latency. The reality is, of course, much more complicated than that:
[*]For large writes, we avoid a double-write that FileStore did, so we can be up to twice as fast.
[*]...except that many FileStore deployments put the journal on a separate SSD, so that benefit may be masked
[*]For small random writes, we do significantly better, even when compared to FileStore with a journal.
[*]...except it's less clear when you're using an NVMe, which tends not to be bottlenecked on the actual storage device but on the CPU (testing is ongoing)
[*]For key/value data, though, we do significantly better, avoiding some very ugly behavior that can crop up in FileStore. For some RGW workloads, for example, we saw write performance improve by 3x!
[*]We avoid the throughput collapse seen in FileStore (due to "splitting") when filling up a cluster with data.
[*]Small sequential reads using raw librados are slower in BlueStore, but this only appears to affect microbenchmarks. This is somewhat deliberate: BlueStore doesn't implement its own readahead (since everything sitting on top of RADOS has its own readahead), and sequential reads using those higher-level interfaces (like RBD and CephFS) are generally great.
[*]Unlike FileStore, BlueStore is copy-on-write: performance with RBD volumes or CephFS files that were recently snapshotted will be much better.
Expect another blog post in the next few weeks with some real data for a deeper dive into BlueStore performance. That said, BlueStore is still a work in progress! We continue to identify issues and make improvements. This first stable release is an important milestone but it is by no means the end of our journey--only the end of the beginning!

Square peg, round hole

Ceph OSDs perform two main functions: replicating data across the network to other OSDs (along with the rebalancing, healing, and everything else that comes with that), and storing data on the locally attached device(s) (hard disk, SSD, or some combination of the two). The second local storage piece is currently handled by the existing FileStore module, which stores objects as files in an XFS file system. There is quite a bit of history as to how we ended up with the precise architecture and interfaces that we did, but the central challenge is that the OSD was built around transactional updates, and those are awkward and inefficient to implement properly on top of a standard file system.

In the end, we found there was nothing wrong with XFS; it was simply the wrong tool for the job. Looking at the FileStore design today, we find that most of its shortcomings are related to the hackery required to adapt our interface to POSIX and not problems with Ceph or XFS in isolation.

How does BlueStore work?

BlueStore is a clean implementation of our internal ObjectStore interface from first principles, motivated specifically by the workloads we expect. BlueStore is built on top of a raw underlying block device (or block devices). It embeds the RocksDB key/value database for managing its internal metadata. A small internal adapter component called BlueFS implements a file-system-like interface that provides just enough functionality to allow RocksDB to store its “files” and share the same raw device(s) with BlueStore.

The biggest difference you’ll notice between older FileStore-based OSDs (e.g., any OSDs deployed with Ceph prior to Luminous) and BlueStore ones is what the partitions and mount point look like. For a FileStore OSD, you see something like

$ lsblk
…
sdb      8:16   0 931.5G  0 disk
├─sdb1   8:17   0 930.5G  0 part /var/lib/ceph/osd/ceph-56
└─sdb2   8:18   0  1023M  0 part
…
$ df -h
…
/dev/sdb1       931G  487G  444G  53% /var/lib/ceph/osd/ceph-56
$ ls -al /var/lib/ceph/osd/ceph-56
…
drwxr-xr-x 180 root root 16384 Aug 30 21:55 current
lrwxrwxrwx   1 root root    58 Jun  4  2015 journal -> /dev/disk/by-partuuid/538da076-0136-4c78-9af4-79bb40d7cbbd
…

That is, there is a small journal partition (although often this is on a separate SSD), a journal symlink in the data directory pointing to the separate journal partition, and a current/ directory that contains all of the actual object files. A df command shows you how much of the device is used.

Since BlueStore consumes raw block devices, things are a bit different:

$ lsblk
…
sdf      8:80   0   3.7T  0 disk
├─sdf1   8:81   0   100M  0 part /var/lib/ceph/osd/ceph-75
└─sdf2   8:82   0   3.7T  0 part
…
$ df -h
…
/dev/sdf1        97M  5.4M   92M   6% /var/lib/ceph/osd/ceph-75
…
$ ls -al /var/lib/ceph/osd/ceph-75
…
lrwxrwxrwx 1 ceph ceph   58 Aug  8 18:33 block -> /dev/disk/by-partuuid/80d28eb7-a7e7-4931-866d-303693f1efc4
…

You’ll notice that the data directory is now a tiny (100MB) partition with just a handful of files in it, and the rest of the device looks like a large unused partition, with a block symlink in the data directory pointing to it. This is where BlueStore is putting all of its data, and it is performing IO directly to the raw device (using the Linux asynchronous libaio infrastructure) from the ceph-osd process. (You can still see the per-OSD utilization via the standard ceph osd df command, if that's what you're after.)
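If you just want to confirm which backend a given OSD is running (handy in a mixed FileStore/BlueStore cluster), the OSD metadata reports it; a small sketch, using osd id 75 from the example above as a placeholder:

$ ceph osd metadata 75 | grep osd_objectstore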
You can no longer see the underlying object files like you used to, but there is a new trick for looking “behind the curtain” to see what each OSD is storing that works for both BlueStore and FileStore based OSDs. If the OSD is stopped, you can “mount” the OSD data via FUSE with

$ mkdir /mnt/foo
$ ceph-objectstore-tool --op fuse --data-path /var/lib/ceph/osd/ceph-75 --mountpoint /mnt/foo

It is also possible to mount an online OSD (assuming fuse is configured correctly) by enabling the osd_objectstore_fuse config option (a fuse/ directory will appear in the osd data directory), but this is not generally recommended, just as it is not recommended that users make any changes to the files in a FileStore-based OSD directory.

Multiple devices

BlueStore can run against a combination of slow and fast devices, similar to FileStore, except that BlueStore is generally able to make much better use of the fast device. In FileStore, the journal device (often placed on a faster SSD) is only used for writes. In BlueStore, the internal journaling needed for consistency is much lighter-weight, usually behaving like a metadata journal and only journaling small writes when it is faster (or necessary) to do so. The rest of the fast device can be used to store (and retrieve) internal metadata. BlueStore can manage up to three devices, as shown in the provisioning sketch after this list:
[*]The required main device (the block symlink) stores the object data and (usually) metadata too.
[*]An optional db device (the block.db symlink) stores as much of the metadata (RocksDB) as will fit. Whatever doesn’t fit will spill back onto the main device.
[*]An optional WAL device (the block.wal symlink) stores just the internal journal (the RocksDB write-ahead log).
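As a hedged sketch of provisioning such a three-device OSD with ceph-volume (the tool mentioned below as the eventual replacement for ceph-disk), where the device paths are placeholders for a data HDD, an SSD partition for the DB and a small fast partition for the WAL:

$ ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/sdc1 --block.wal /dev/nvme0n1p1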
The general recommendation is to take as much SSD space as you have available for the OSD and use it for the block.db device. When using ceph-disk, this is accomplished with the --block.db argument:

ceph-disk prepare /dev/sdb --block.db /dev/sdc

By default a partition will be created on the sdc device that is 1% of the main device size. This can be overridden with the bluestore_block_db_size config option. A more exotic possibility would be to use three devices: a primary HDD for the main device, part of an SSD for the db device, and a smaller NVDIMM-backed device for the WAL. Note that you can expect some changes here as we add BlueStore support to the new ceph-volume tool that will eventually replace ceph-disk. (We expect to backport all new ceph-volume functionality to Luminous when it is ready.) For more information, see the BlueStore configuration guide.

Memory usage

One nice thing about FileStore was that it used a normal Linux file system, which meant the kernel was responsible for managing memory for caching data and metadata. In particular, the kernel can use all available RAM as a cache and then release it as soon as the memory is needed for something else. Because BlueStore is implemented in userspace as part of the OSD, we manage our own cache, and we have fewer memory management tools at our disposal.

The bottom line is that with BlueStore there is a bluestore_cache_size configuration option that controls how much memory each OSD will use for the BlueStore cache. By default this is 1 GB for HDD-backed OSDs and 3 GB for SSD-backed OSDs, but you can set it to whatever is appropriate for your environment. (See the BlueStore configuration guide for more information.)

One caveat is that memory accounting is currently imperfect: the allocator (tcmalloc in our case) incurs some overhead on allocations, the heap can become fragmented over time, and fragmentation prevents some freed memory from being released back to the operating system. As a result, there is usually some disparity between what BlueStore (and the OSD) thinks it is using and the actual memory consumed by the process (RSS), on the order of 1.5x. You can see this disparity for yourself by comparing the ceph-osd process RSS with the output from ceph daemon osd.<id> dump_mempools. Improving the accuracy of our memory accounting is an ongoing project.

Checksums

BlueStore calculates, stores, and verifies checksums for all data and metadata it stores. Any time data is read off of disk, a checksum is used to verify the data is correct before it is exposed to any other part of the system (or the user). By default we use the crc32c checksum. A few others are available (xxhash32, xxhash64), and it is also possible to use a truncated crc32c (i.e., only 8 or 16 bits of the 32 available bits) to reduce the metadata tracking overhead at the expense of reliability. It’s also possible to disable checksums entirely (although this is definitely not recommended). See the checksum section of the docs for more information.

Compression

BlueStore can transparently compress data using zlib, snappy, or lz4. This is disabled by default, but it can be enabled globally, for specific pools, or be selectively used when RADOS clients hint that data is compressible. See the compression docs for more information.
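To make the per-pool compression knobs concrete, here is a minimal sketch; the pool name mypool is a placeholder, and snappy with aggressive mode is just one reasonable combination of the documented settings:

$ ceph osd pool set mypool compression_algorithm snappy
$ ceph osd pool set mypool compression_mode aggressive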
Converting existing clusters to use BlueStore

The choice of backend is a per-OSD decision: a single cluster can contain some FileStore OSDs and some BlueStore OSDs. An upgraded cluster will continue to operate as it did before, with the exception that new OSDs will (by default) be deployed with BlueStore. Most users will be interested in converting their existing OSDs over to the new backend. This is essentially a process of reprovisioning each OSD device with a new backend and letting the cluster use its existing healing capabilities to copy data back. There are a couple of ways to do this safely (and a few more that are less safe). We’ve created a guide that documents the currently recommended process for the migration.

Conclusion

BlueStore provides a huge advantage in terms of performance, robustness, and functionality over our previous approach of layering over existing file systems. We have taken control of a larger portion of the storage stack--all the way down to the raw block device--providing greater control of the IO flow and data layout. This change brings with it the power to improve performance and functionality, but also the responsibility to manage that data safely. We are quite happy with the reliability and robustness we’ve seen from BlueStore over the past year of refinement and testing, and are very excited to recommend it to users in the Luminous release.


