Installing a Ceph cluster with ceph-ansible is straightforward, but you may still run into errors along the way. Here are some tips that I hope will help.

Reinstall Ceph Cluster

If the installation fails, you can purge it and reinstall the Ceph cluster.

root@openstack-staging:/home/kevin/ceph-ansible# ansible-playbook infrastructure-playbooks/purge-cluster.yml
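The full cycle looks roughly like the sketch below. The inventory file name (hosts) and the site-container.yml playbook are assumptions based on a containerized ceph-ansible setup; adjust them to whatever you copied from the .sample files in your checkout.

# Purge the existing cluster (ceph-ansible also ships infrastructure-playbooks/purge-container-cluster.yml for containerized clusters)
root@openstack-staging:/home/kevin/ceph-ansible# ansible-playbook -i hosts infrastructure-playbooks/purge-cluster.yml
# Redeploy once the purge has finished
root@openstack-staging:/home/kevin/ceph-ansible# ansible-playbook -i hosts site-container.yml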

HEALTH_WARN “mons are allowing insecure global_id reclaim”

** Make sure all clients have been upgraded before running the following command, otherwise those clients will be blocked once this setting is applied. **

This warning is related to CVE-2021-20288 (insecure global_id reclaim): there is a security issue that requires clients to be upgraded to the fixed releases. Once all the clients are updated (e.g. the rook daemons and csi driver), a new setting needs to be applied to the cluster to disable the insecure mode.

If you see both of these health warnings, then either one of the rook or csi daemons has not been upgraded yet, or some other client on an older version has been detected:

health: HEALTH_WARN
client is using insecure global_id reclaim
mon is allowing insecure global_id reclaim

If you only see this one warning, then the insecure mode should be disabled:

health: HEALTH_WARN
mon is allowing insecure global_id reclaim

Please make sure all clients connected to Ceph have been upgraded before running this command; otherwise, you can leave the setting as it is.

ceph config set mon auth_allow_insecure_global_id_reclaim false
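If some clients cannot be upgraded right away, you can first identify them and temporarily mute the warning instead of disabling the insecure mode immediately. A small sketch; the one-week mute duration is just an example:

# List the clients that are still using insecure global_id reclaim
root@openstack-ceph01:/home/kevin# ceph health detail
# Temporarily silence the mon-side warning while the remaining clients are upgraded
root@openstack-ceph01:/home/kevin# ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 1w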

Add docker repository failed

TASK [ceph-container-engine : add docker repository] *****************************************************************************************************
Monday 21 June 2021 06:32:58 +0000 (0:00:01.319) 0:10:50.642 ***********
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: apt_pkg.Error: E:Conflicting values set for option Signed-By regarding source https://download.docker.com/linux/ubuntu/ focal: /usr/share/keyrings/docker-archive-keyring.gpg != , E:The list of sources could not be read.
fatal: [openstack-ceph01]: FAILED! => changed=false
module_stderr: |-
Traceback (most recent call last):
File "<stdin>", line 102, in <module>
File "<stdin>", line 94, in _ansiballz_main
File "<stdin>", line 40, in invoke_module
File "/usr/lib/python3.8/runpy.py", line 207, in run_module
return _run_module_code(code, init_globals, run_name, mod_spec)
File "/usr/lib/python3.8/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/tmp/ansible_apt_repository_payload__2_ke2d5/ansible_apt_repository_payload.zip/ansible/modules/apt_repository.py", line 604, in <module>
File "/tmp/ansible_apt_repository_payload__2_ke2d5/ansible_apt_repository_payload.zip/ansible/modules/apt_repository.py", line 581, in main
File "/usr/lib/python3/dist-packages/apt/cache.py", line 170, in __init__
self.open(progress)
File "/usr/lib/python3/dist-packages/apt/cache.py", line 232, in open
self._cache = apt_pkg.Cache(progress)
apt_pkg.Error: E:Conflicting values set for option Signed-By regarding source https://download.docker.com/linux/ubuntu/ focal: /usr/share/keyrings/docker-archive-keyring.gpg != , E:The list of sources could not be read.
module_stdout: ''
msg: |-
MODULE FAILURE
See stdout/stderr for the exact error
rc: 1

The workaround is to remove /etc/apt/sources.list.d/docker.list on the Ceph nodes and then rerun the playbook.

root@openstack-ceph01:/home/kevin# rm /etc/apt/sources.list.d/docker.list
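Before (or after) removing the file, you can confirm which apt source entries conflict and then refresh the package index. A quick sketch; file names may differ on your nodes:

# Find every apt source entry that points at the Docker repository
root@openstack-ceph01:/home/kevin# grep -r download.docker.com /etc/apt/sources.list /etc/apt/sources.list.d/
# After removing the duplicate entry, refresh the package index before rerunning the playbook
root@openstack-ceph01:/home/kevin# apt-get update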

Failed to download ceph grafana dashboards file

This happened because my company firewall blocked the traffic. The workaround is to manually download the dashboard files, put them into the folder /etc/grafana/dashboards/ceph-dashboard, comment out the task, and then rerun the ansible playbook (a download-and-copy sketch follows the commented-out task below). Make sure to copy all of these files to the other two Ceph nodes as well.

TASK [ceph-grafana : download ceph grafana dashboards] *****************************************************************************************************
Monday 21 June 2021 06:37:51 +0000 (0:00:01.068) 0:03:57.615 ***********
failed: [openstack-ceph02] (item=ceph-cluster.json) => changed=false
ansible_loop_var: item
dest: /etc/grafana/dashboards/ceph-dashboard/ceph-cluster.json
elapsed: 40
item: ceph-cluster.json
msg: 'Request failed: <urlopen error timed out>'
url: https://raw.githubusercontent.com/ceph/ceph/master/monitoring/grafana/dashboards/ceph-cluster.json
failed: [openstack-ceph01] (item=ceph-cluster.json) => changed=false
ansible_loop_var: item
dest: /etc/grafana/dashboards/ceph-dashboard/ceph-cluster.json
elapsed: 40
item: ceph-cluster.json
msg: 'Request failed: <urlopen error timed out>'
url: https://raw.githubusercontent.com/ceph/ceph/master/monitoring/grafana/dashboards/ceph-cluster.json
failed: [openstack-ceph03] (item=ceph-cluster.json) => changed=false
ansible_loop_var: item
dest: /etc/grafana/dashboards/ceph-dashboard/ceph-cluster.json
elapsed: 40
item: ceph-cluster.json
msg: 'Request failed: <urlopen error timed out>'
url: https://raw.githubusercontent.com/ceph/ceph/master/monitoring/grafana/dashboards/ceph-cluster.json
failed: [openstack-ceph02] (item=cephfs-overview.json) => changed=false
ansible_loop_var: item
dest: /etc/grafana/dashboards/ceph-dashboard/cephfs-overview.json
elapsed: 40
item: cephfs-overview.json
msg: 'Request failed: <urlopen error timed out>'

Comment out the task

root@openstack-staging:/home/kevin/ceph-ansible# vim ./roles/ceph-grafana/tasks/configure_grafana.yml
#- name: download ceph grafana dashboards
# get_url:
# url: "https://raw.githubusercontent.com/ceph/ceph/{{ grafana_dashboard_version }}/monitoring/grafana/dashboards/{{ item }}"
# dest: "/etc/grafana/dashboards/ceph-dashboard/{{ item }}"
# with_items: "{{ grafana_dashboard_files }}"
# when:
# - not containerized_deployment | bool
# - not ansible_facts['os_family'] in ['RedHat', 'Suse']
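Here is a sketch of the manual download-and-copy step. It assumes a workstation that can reach GitHub (directly or through a proxy), SSH access as root to the Ceph nodes, and the master branch shown in the log above (pick the branch that matches your Ceph release); extend the file list with the rest of the role's grafana_dashboard_files.

# On a machine with internet access, fetch the dashboard JSON files
for f in ceph-cluster.json cephfs-overview.json; do
  curl -fL -o "$f" "https://raw.githubusercontent.com/ceph/ceph/master/monitoring/grafana/dashboards/$f"
done
# Copy them to every Ceph node
for node in openstack-ceph01 openstack-ceph02 openstack-ceph03; do
  ssh root@"$node" mkdir -p /etc/grafana/dashboards/ceph-dashboard
  scp ./*.json root@"$node":/etc/grafana/dashboards/ceph-dashboard/
done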

HEALTH_WARN mons openstack-ceph01,openstack-ceph02,openstack-ceph03 are low on available space

The warning appears because the available disk capacity on the monitor nodes has dropped below the configured threshold.

Ceph stores monitor data (cluster status, dump information from each node, etc.) in files and in a small database on the monitor node, so a certain amount of free disk space is always required.
For this reason, Ceph monitors check their own disk capacity, and by default a monitor raises this warning when less than 30% of the filesystem holding its data remains available.
The option that controls this is:
mon_data_avail_warn = 30 (default value)

Check the option value

On the monitor node:
#ceph --admin-daemon /var/run/ceph/ceph-mon.<mon-hostname>.asok config show | grep 'mon_data_avail'

Example:
#ceph --admin-daemon /var/run/ceph/ceph-mon.cnode1.asok config show | grep 'mon_data_avail'
"mon_data_avail_crit": "5",
"mon_data_avail_warn": "30",
root@openstack-ceph01:/home/kevin# df -h
Filesystem Size Used Avail Use% Mounted on
udev 7.8G 0 7.8G 0% /dev
tmpfs 1.6G 2.0M 1.6G 1% /run
/dev/mapper/ubuntu--vg-ubuntu--lv 15G 10G 4.1G 72% /
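Before lowering the threshold (the solution below), it can also be worth checking whether the monitor store itself is what is filling the root filesystem. A sketch, assuming the default /var/lib/ceph layout on the host:

# See how much space the monitor store is using
root@openstack-ceph01:/home/kevin# du -sh /var/lib/ceph/mon/*
# If the store has grown large, ask the monitor to compact it
root@openstack-ceph01:/home/kevin# ceph tell mon.openstack-ceph01 compact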

Solution

ceph tell mon.* injectargs '--mon_data_avail_warn [value]'
Example:
#ceph tell mon.* injectargs '--mon_data_avail_warn 10'

Verify

root@openstack-ceph01:/home/kevin# ceph tell mon.* injectargs '--mon_data_avail_warn 10'
mon.openstack-ceph01: {}
mon.openstack-ceph01: mon_data_avail_warn = '10' (not observed, change may require restart)
mon.openstack-ceph02: {}
mon.openstack-ceph02: mon_data_avail_warn = '10' (not observed, change may require restart)
mon.openstack-ceph03: {}
mon.openstack-ceph03: mon_data_avail_warn = '10' (not observed, change may require restart)
root@openstack-ceph01:/home/kevin# ceph -s
cluster:
id: e2d9d8f9-a56e-43f7-899c-e43b31d1e205
health: HEALTH_OK
services:
mon: 3 daemons, quorum openstack-ceph01,openstack-ceph02,openstack-ceph03 (age 34h)
mgr: openstack-ceph01(active, since 34h), standbys: openstack-ceph02, openstack-ceph03
osd: 6 osds: 6 up (since 34h), 6 in (since 34h)
rgw: 3 daemons active (3 hosts, 1 zones)
data:
pools: 9 pools, 233 pgs
objects: 227 objects, 5.4 KiB
usage: 629 MiB used, 767 GiB / 768 GiB avail
pgs: 233 active+clean
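Since injectargs only affects the running daemons (the output above even notes the change may require a restart), you may also want to persist the threshold in the centralized configuration database. A sketch:

# Persist the new warning threshold across monitor restarts
root@openstack-ceph01:/home/kevin# ceph config set mon mon_data_avail_warn 10
# Confirm the stored value
root@openstack-ceph01:/home/kevin# ceph config get mon mon_data_avail_warn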
