
Guided Exercise: Managing and Customizing the CRUSH Map

In this exercise, you will view and modify the cluster CRUSH map.

Outcomes

You should be able to create data placement rules to target a specific device class, create a pool by using a specific data placement rule, and decompile and edit the CRUSH map.

As the student user on the workstation machine, use the lab command to prepare your system for this exercise.

[student@workstation ~]$ lab start map-crush

This command confirms that the hosts required for this exercise are accessible, backs up the CRUSH map, adds the ssd device class, and sets the mon_allow_pool_delete setting to true.

Procedure 5.1. Instructions

  1. Log in to clienta as the admin user and use sudo to run the cephadm shell. Verify that the cluster returns a HEALTH_OK state.

    [student@workstation ~]$ ssh admin@clienta
    [admin@clienta ~]$ sudo cephadm shell
    [ceph: root@clienta /]# ceph health
    HEALTH_OK
  2. Create a new CRUSH rule called onssd that uses only the OSDs backed by SSD storage. Create a new pool called myfast with 32 placement groups that use that rule. Confirm that the pool is using only OSDs that are backed by SSD storage.

    1. List the available device classes in your cluster.

      [ceph: root@clienta /]# ceph osd crush class ls
      [
          "hdd",
          "ssd"
      ]
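
      The lab start script assigned the ssd device class in advance. For reference only (do not run these commands in this exercise, and note that osd.1 is just an example ID), a device class can be reassigned manually by removing the existing class and then setting the new one:

      [ceph: root@clienta /]# ceph osd crush rm-device-class osd.1
      [ceph: root@clienta /]# ceph osd crush set-device-class ssd osd.1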
    2. Display the CRUSH map tree to locate the OSDs backed by SSD storage.

      [ceph: root@clienta /]# ceph osd crush tree
      ID  CLASS  WEIGHT   TYPE NAME
      -1         0.08817  root default
      -3         0.02939      host serverc
       0    hdd  0.00980          osd.0
       2    hdd  0.00980          osd.2
       1    ssd  0.00980          osd.1
      -5         0.02939      host serverd
       3    hdd  0.00980          osd.3
       7    hdd  0.00980          osd.7
       5    ssd  0.00980          osd.5
      -7         0.02939      host servere
       4    hdd  0.00980          osd.4
       8    hdd  0.00980          osd.8
       6    ssd  0.00980          osd.6
    3. Add a new CRUSH map rule called onssd to target the OSDs with SSD devices.

      [ceph: root@clienta /]# ceph osd crush rule create-replicated onssd \
      default host ssd
    4. Use the ceph osd crush rule ls command to verify the successful creation of the new rule.

      [ceph: root@clienta /]# ceph osd crush rule ls
      replicated_rule
      onssd
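
      The arguments to the ceph osd crush rule create-replicated command are the rule name, the CRUSH root bucket, the failure domain bucket type, and an optional device class. If you want to inspect the steps that Ceph generated for the new rule, you can dump it:

      [ceph: root@clienta /]# ceph osd crush rule dump onssd
      ...output omitted...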
    5. Create a new replicated pool called myfast with 32 placement groups that uses the onssd CRUSH map rule.

      [ceph: root@clienta /]# ceph osd pool create myfast 32 32 onssd
      pool 'myfast' created
    6. Verify that the placement groups for the pool called myfast are using only the OSDs backed by SSD storage. As shown in a previous step, these are osd.1, osd.5, and osd.6.

      Retrieve the ID of the pool called myfast.

      [ceph: root@clienta /]# ceph osd lspools
      ...output omitted...
      6 myfast
    7. Use the ceph pg dump pgs_brief command to list all the PGs in the cluster.

      The pool ID is the first number in a PG ID. For example, the PG 6.1b belongs to the pool whose ID is 6.

      [ceph: root@clienta /]# ceph pg dump pgs_brief
      PG_STAT  STATE         UP       UP_PRIMARY  ACTING   ACTING_PRIMARY
      6.1b     active+clean  [6,5,1]           6  [6,5,1]               6
      4.19     active+clean  [6,2,5]           6  [6,2,5]               6
      2.1f     active+clean  [0,3,8]           0  [0,3,8]               0
      3.1e     active+clean  [2,6,3]           2  [2,6,3]               2
      6.1a     active+clean  [6,1,5]           6  [6,1,5]               6
      4.18     active+clean  [3,2,6]           3  [3,2,6]               3
      2.1e     active+clean  [2,6,5]           2  [2,6,5]               2
      3.1f     active+clean  [0,3,4]           0  [0,3,4]               0
      6.19     active+clean  [1,5,6]           1  [1,5,6]               1
      4.1b     active+clean  [3,2,8]           3  [3,2,8]               3
      2.1d     active+clean  [6,7,0]           6  [6,7,0]               6
      ...output omitted...

      The pool called myfast, whose ID is 6, only uses osd.1, osd.5, and osd.6. These are the only OSDs with SSD drives.
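
      As an alternative to filtering the output of the ceph pg dump command, the ceph pg ls-by-pool command lists only the placement groups that belong to a given pool:

      [ceph: root@clienta /]# ceph pg ls-by-pool myfast
      ...output omitted...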

  3. Create a new CRUSH hierarchy under root=default-cl260 that has three rack buckets (rack1, rack2, and rack3), each of which contains one host bucket (hostc, hostd, and hoste).

    1. Create a new CRUSH map hierarchy that matches this infrastructure:

      default-cl260    (root bucket)
          rack1       (rack bucket)
              hostc      (host bucket)
                  osd.1
                  osd.5
                  osd.6
      
          rack2      (rack bucket)
              hostd      (host bucket)
                  osd.0
                  osd.3
                  osd.4
      
          rack3       (rack bucket)
              hoste      (host bucket)
                  osd.2
                  osd.7
                  osd.8

      Place the three OSDs backed by SSD storage (in this example, osd.1, osd.5, and osd.6) on hostc. Because the OSD numbers in your cluster might be different, adjust the CRUSH map hierarchy accordingly.

      First, create the buckets with the ceph osd crush add-bucket command.

      [ceph: root@clienta /]# ceph osd crush add-bucket default-cl260 root
      added bucket default-cl260 type root to crush map
      [ceph: root@clienta /]# ceph osd crush add-bucket rack1 rack
      added bucket rack1 type rack to crush map
      [ceph: root@clienta /]# ceph osd crush add-bucket hostc host
      added bucket hostc type host to crush map
      [ceph: root@clienta /]# ceph osd crush add-bucket rack2 rack
      added bucket rack2 type rack to crush map
      [ceph: root@clienta /]# ceph osd crush add-bucket hostd host
      added bucket hostd type host to crush map
      [ceph: root@clienta /]# ceph osd crush add-bucket rack3 rack
      added bucket rack3 type rack to crush map
      [ceph: root@clienta /]# ceph osd crush add-bucket hoste host
      added bucket hoste type host to crush map
    2. Use the ceph osd crush move command to build the hierarchy.

      [ceph: root@clienta /]# ceph osd crush move rack1 root=default-cl260
      moved item id -14 name 'rack1' to location {root=default-cl260} in crush map
      [ceph: root@clienta /]# ceph osd crush move hostc rack=rack1
      moved item id -15 name 'hostc' to location {rack=rack1} in crush map
      [ceph: root@clienta /]# ceph osd crush move rack2 root=default-cl260
      moved item id -16 name 'rack2' to location {root=default-cl260} in crush map
      [ceph: root@clienta /]# ceph osd crush move hostd rack=rack2
      moved item id -17 name 'hostd' to location {rack=rack2} in crush map
      [ceph: root@clienta /]# ceph osd crush move rack3 root=default-cl260
      moved item id -18 name 'rack3' to location {root=default-cl260} in crush map
      [ceph: root@clienta /]# ceph osd crush move hoste rack=rack3
      moved item id -19 name 'hoste' to location {rack=rack3} in crush map
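
      If you prefer a flat listing over the full tree, the ceph osd crush ls command shows the direct children of a bucket. At this point, listing default-cl260 should show rack1, rack2, and rack3:

      [ceph: root@clienta /]# ceph osd crush ls default-cl260
      ...output omitted...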
    3. Display the CRUSH map tree to verify the new hierarchy.

      [ceph: root@clienta /]# ceph osd crush tree
      ID   CLASS  WEIGHT   TYPE NAME
      -13               0  root default-cl260
      -14               0      rack rack1
      -15               0          host hostc
      -16               0      rack rack2
      -17               0          host hostd
      -18               0      rack rack3
      -19               0          host hoste
       -1         0.08817  root default
       -3         0.02939      host serverc
        0    hdd  0.00980          osd.0
        2    hdd  0.00980          osd.2
        1    ssd  0.00980          osd.1
       -5         0.02939      host serverd
        3    hdd  0.00980          osd.3
        7    hdd  0.00980          osd.7
        5    ssd  0.00980          osd.5
       -7         0.02939      host servere
        4    hdd  0.00980          osd.4
        8    hdd  0.00980          osd.8
        6    ssd  0.00980          osd.6
    4. Place the OSDs as leaves in the new tree.

      [ceph: root@clienta /]# ceph osd crush set osd.1 1.0 root=default-cl260 \
      rack=rack1 host=hostc
      set item id 1 name 'osd.1' weight 1 at location {host=hostc,rack=rack1,root=default-cl260} to crush map
      [ceph: root@clienta /]# ceph osd crush set osd.5 1.0 root=default-cl260 \
      rack=rack1 host=hostc
      set item id 5 name 'osd.5' weight 1 at location {host=hostc,rack=rack1,root=default-cl260} to crush map
      [ceph: root@clienta /]# ceph osd crush set osd.6 1.0 root=default-cl260 \
      rack=rack1 host=hostc
      set item id 6 name 'osd.6' weight 1 at location {host=hostc,rack=rack1,root=default-cl260} to crush map
      [ceph: root@clienta /]# ceph osd crush set osd.0 1.0 root=default-cl260 \
      rack=rack2 host=hostd
      set item id 0 name 'osd.0' weight 1 at location {host=hostd,rack=rack2,root=default-cl260} to crush map
      [ceph: root@clienta /]# ceph osd crush set osd.3 1.0 root=default-cl260 \
      rack=rack2 host=hostd
      set item id 3 name 'osd.3' weight 1 at location {host=hostd,rack=rack2,root=default-cl260} to crush map
      [ceph: root@clienta /]# ceph osd crush set osd.4 1.0 root=default-cl260 \
      rack=rack2 host=hostd
      set item id 4 name 'osd.4' weight 1 at location {host=hostd,rack=rack2,root=default-cl260} to crush map
      [ceph: root@clienta /]# ceph osd crush set osd.2 1.0 root=default-cl260 \
      rack=rack3 host=hoste
      set item id 2 name 'osd.2' weight 1 at location {host=hoste,rack=rack3,root=default-cl260} to crush map
      [ceph: root@clienta /]# ceph osd crush set osd.7 1.0 root=default-cl260 \
      rack=rack3 host=hoste
      set item id 7 name 'osd.7' weight 1 at location {host=hoste,rack=rack3,root=default-cl260} to crush map
      [ceph: root@clienta /]# ceph osd crush set osd.8 1.0 root=default-cl260 \
      rack=rack3 host=hoste
      set item id 8 name 'osd.8' weight 1 at location {host=hoste,rack=rack3,root=default-cl260} to crush map
    5. Display the CRUSH map tree to verify the new OSD locations.

      [ceph: root@clienta /]# ceph osd crush tree
      ID   CLASS  WEIGHT   TYPE NAME
      -13         9.00000  root default-cl260
      -14         3.00000      rack rack1
      -15         3.00000          host hostc
        1    ssd  1.00000              osd.1
        5    ssd  1.00000              osd.5
        6    ssd  1.00000              osd.6
      -16         3.00000      rack rack2
      -17         3.00000          host hostd
        0    hdd  1.00000              osd.0
        3    hdd  1.00000              osd.3
        4    hdd  1.00000              osd.4
      -18         3.00000      rack rack3
      -19         3.00000          host hoste
        2    hdd  1.00000              osd.2
        7    hdd  1.00000              osd.7
        8    hdd  1.00000              osd.8
       -1               0  root default
       -3               0      host serverc
       -5               0      host serverd
       -7               0      host servere

      All the OSDs with SSD devices are in the rack1 bucket and no OSDs are in the default tree.
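
      Note that the ceph osd crush set commands assigned an arbitrary weight of 1.0 to each OSD. On a production cluster you would normally keep CRUSH weights proportional to device capacity; for example, you could restore the size-based weight shown in the earlier tree output with a command such as the following (do not run it in this exercise):

      [ceph: root@clienta /]# ceph osd crush reweight osd.1 0.00980
      ...output omitted...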

  4. Add a custom CRUSH rule by decompiling the binary CRUSH map and editing the resulting text file to add a new CRUSH rule called ssd-first. This rule always selects an OSD backed by SSD storage as the primary OSD, and OSDs backed by HDD storage as the secondary OSDs for each placement group.

    When the rule is created, compile the map and load it into your cluster. Create a new replicated pool called testcrush that uses the rule, and verify that its placement groups are mapped correctly.

    Clients accessing pools that use this new rule read data from fast drives, because clients always read from and write to the primary OSD.

    1. Retrieve the current CRUSH map by using the ceph osd getcrushmap command. Store the binary map in the ~/cm-org.bin file.

      [ceph: root@clienta /]# ceph osd getcrushmap -o ~/cm-org.bin
      ...output omitted...
    2. Use the crushtool command to decompile the binary map to the ~/cm-org.txt text file. When successful, this command returns no output, so use the echo $? command immediately afterward to verify that it succeeded.

      [ceph: root@clienta /]# crushtool -d ~/cm-org.bin -o ~/cm-org.txt
      [ceph: root@clienta /]# echo $?
      0
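
      As an optional check before editing, you can confirm that the decompiled text compiles back cleanly by compiling it to /dev/null and checking the return code, which should be 0:

      [ceph: root@clienta /]# crushtool -c ~/cm-org.txt -o /dev/null
      [ceph: root@clienta /]# echo $?
      0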
    3. Save a copy of the CRUSH map as ~/cm-new.txt, and add the ssd-first rule shown below at the end of the file. The onssd rule that you created earlier already appears in the decompiled map.

      [ceph: root@clienta /]# cp ~/cm-org.txt ~/cm-new.txt
      [ceph: root@clienta /]# cat ~/cm-new.txt
      ...output omitted...
      rule onssd {
          id 3
          type replicated
          min_size 1
          max_size 10
          step take default class ssd
          step chooseleaf firstn 0 type host
          step emit
      }
      rule ssd-first {
          id 5
          type replicated
          min_size 1
          max_size 10
          step take rack1
          step chooseleaf firstn 1 type host
          step emit
          step take default-cl260 class hdd
          step chooseleaf firstn -1 type rack
          step emit
      }
      
      
      # end crush map

      With this rule, the first replica uses an OSD from rack1 (backed by SSD storage), and the remaining replicas use OSDs backed by HDD storage from different racks.

    4. Compile your new CRUSH map.

      [ceph: root@clienta /]# crushtool -c ~/cm-new.txt -o ~/cm-new.bin
    5. Before applying the new map to the running cluster, use the crushtool command with the --test and --show-mappings options to verify that the first OSD is always from rack1.

      [ceph: root@clienta /]# crushtool -i ~/cm-new.bin --test --show-mappings \
      --rule=5 --num-rep 3
      ...output omitted...
      CRUSH rule 5 x 1013 [5,4,7]
      CRUSH rule 5 x 1014 [1,3,7]
      CRUSH rule 5 x 1015 [6,2,3]
      CRUSH rule 5 x 1016 [5,0,7]
      CRUSH rule 5 x 1017 [6,0,8]
      CRUSH rule 5 x 1018 [6,4,7]
      CRUSH rule 5 x 1019 [1,8,3]
      CRUSH rule 5 x 1020 [5,7,4]
      CRUSH rule 5 x 1021 [5,7,4]
      CRUSH rule 5 x 1022 [1,4,2]
      CRUSH rule 5 x 1023 [1,7,4]

      The first OSD is always 1, 5, or 6, which corresponds to the OSDs with SSD devices from rack1.
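
      To check all the test mappings at once rather than scanning them by eye, you can filter out the lines whose first OSD is 1, 5, or 6 (the SSD-backed OSDs in this example cluster); this rough check should print no mapping lines:

      [ceph: root@clienta /]# crushtool -i ~/cm-new.bin --test --show-mappings \
      --rule=5 --num-rep 3 | grep -Ev ' \[(1|5|6),'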

    6. Apply the new CRUSH map to your cluster by using the ceph osd setcrushmap command.

      [ceph: root@clienta /]# ceph osd setcrushmap -i ~/cm-new.bin
      ...output omitted...
    7. Verify that the new ssd-first rule is now available.

      [ceph: root@clienta /]# ceph osd crush rule ls
      replicated_rule
      onssd
      ssd-first
    8. Create a new replicated pool called testcrush with 32 placement groups and use the ssd-first CRUSH map rule.

      [ceph: root@clienta /]# ceph osd pool create testcrush 32 32 ssd-first
      pool 'testcrush' created
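
      You can also confirm which CRUSH rule the new pool uses by querying the pool directly:

      [ceph: root@clienta /]# ceph osd pool get testcrush crush_rule
      crush_rule: ssd-first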
    9. Verify that the first OSDs for the placement groups in the pool called testcrush are the ones from rack1. These OSDs are osd.1, osd.5, and osd.6.

      [ceph: root@clienta /]# ceph osd lspools
      ...output omitted...
      6 myfast
      7 testcrush
      [ceph: root@clienta /]# ceph pg dump pgs_brief | grep ^7
      dumped pgs_brief
      7.b               active+clean  [1,8,3]           1  [1,8,3]               1
      7.8               active+clean  [5,3,7]           5  [5,3,7]               5
      7.9               active+clean  [5,0,7]           5  [5,0,7]               5
      7.e               active+clean  [1,2,4]           1  [1,2,4]               1
      7.f               active+clean  [1,0,8]           1  [1,0,8]               1
      7.c               active+clean  [6,0,8]           6  [6,0,8]               6
      7.d               active+clean  [1,4,8]           1  [1,4,8]               1
      7.2               active+clean  [6,8,0]           6  [6,8,0]               6
      7.3               active+clean  [5,3,7]           5  [5,3,7]               5
      7.0               active+clean  [5,0,7]           5  [5,0,7]               5
      7.5               active+clean  [5,4,2]           5  [5,4,2]               5
      ...output omitted...
  5. Use the pg-upmap feature to manually remap some secondary OSDs in one of the PGs in the testcrush pool.

    1. Use the pg-upmap optimization feature to manually map a PG to specific OSDs. Remap the second OSD of your PG from the previous step to another OSD of your choosing, except 1, 5, or 6.

      [ceph: root@clienta /]# ceph osd pg-upmap-items 7.8 3 0
      set 7.8 pg_upmap_items mapping to [3->0]
    2. Use the ceph pg map command to verify the new mapping. When done, log off from clienta.

      [ceph: root@clienta /]# ceph pg map 7.8
      osdmap e238 pg 7.8 (7.8) -> up [5,0,7] acting [5,0,7]
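
      The pg-upmap entry remains in the OSD map until you remove it. If you later want CRUSH to place this PG again, you can clear the manual mapping:

      [ceph: root@clienta /]# ceph osd rm-pg-upmap-items 7.8
      ...output omitted...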
  6. Return to workstation as the student user.

    [ceph: root@clienta /]# exit
    [admin@clienta ~]$ exit
    [student@workstation ~]$

Finish

On the workstation machine, use the lab command to complete this exercise. This is important to ensure that resources from previous exercises do not impact upcoming exercises.

[student@workstation ~]$ lab finish map-crush

This concludes the guided exercise.
