
rebuilt a node with all the OSDs pulled out and now need to insert back

We rebuilt a node that had a failed boot disk. We had pulled all the OSD disks and the journal out of the machine to minimize I/O when we turned it back on.

Upon reinserting the disks, we see the OSD and it even associates with the journal partition automatically, but the OSD service does NOT start.

Here are some outputs below. Essentially, the issue looks like the start command is missing the cluster info: "--cluster ${CLUSTER} --"
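For what it's worth (this is an assumption based on the stock Debian/Ubuntu ceph packaging, not anything specific to your node): the ceph-osd@.service template normally gets ${CLUSTER} from an Environment= line in the unit itself and optionally from /etc/default/ceph, so you can check both:

```shell
# Sketch: where ${CLUSTER} normally comes from (assumes stock packaging)
systemctl cat ceph-osd@4.service | grep -iE 'Environment|ExecStart'
# On Debian/Ubuntu this file, if present, usually contains CLUSTER=ceph:
cat /etc/default/ceph 2>/dev/null
```

If ${CLUSTER} really were unset, the prestart script would fail for every OSD on the node, not just the reinserted one.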

We also ran the following commands:
lvm vgs -o name | grep ceph | xargs vgchange -ay
ceph-volume lvm activate --all

systemctl status ceph-osd@4.service
ceph-osd@4.service - Ceph object storage daemon osd.4
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; indirect; vendor preset: enabled)
Active: active (running) since Mon 2025-09-15 20:45:38 EDT; 59s ago
Process: 365368 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 4 (code=exited, status=0/SUCCESS)
Main PID: 365372 (ceph-osd)
Tasks: 60
CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@4.service
└─365372 /usr/bin/ceph-osd -f --cluster ceph --id 4 --setuser ceph --setgroup ceph

Sep 15 20:45:38 ceph-public9 systemd[1]: Starting Ceph object storage daemon osd.4...
Sep 15 20:45:38 ceph-public9 systemd[1]: Started Ceph object storage daemon osd.4.
Sep 15 20:45:39 ceph-public9 ceph-osd[365372]: 2025-09-15T20:45:39.822-0400 7ff3ef6b3d80 -1 Falling back to public interface
Sep 15 20:46:05 ceph-public9 ceph-osd[365372]: 2025-09-15T20:46:05.635-0400 7ff3ef6b3d80 -1 osd.4 17647254 log_to_monitors {default=true}
root@ceph-public9:~# systemctl stop ceph-osd@4.service
root@ceph-public9:~#
root@ceph-public9:~#
root@ceph-public9:~# lvm vgs -o name | grep ceph | xargs vgchange -ay
1 logical volume(s) in volume group "ceph-8b3f5003-2ce3-40d6-b32b-ee38333ab925" now active
1 logical volume(s) in volume group "ceph-a7de5efe-cf99-46b4-ae37-8af584b78fec" now active
root@ceph-public9:~# ceph-volume lvm activate --all
--> Activating OSD ID 4 FSID f29f77bc-b192-4ddf-b2c1-c6033993dc8c
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-4
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-8b3f5003-2ce3-40d6-b32b-ee38333ab925/osd-block-f29f77bc-b192-4ddf-b2c1-c6033993dc8c --path /var/lib/ceph/osd/ceph-4 --no-mon-config
Running command: /bin/ln -snf /dev/ceph-8b3f5003-2ce3-40d6-b32b-ee38333ab925/osd-block-f29f77bc-b192-4ddf-b2c1-c6033993dc8c /var/lib/ceph/osd/ceph-4/block
Running command: /bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-4/block
Running command: /bin/chown -R ceph:ceph /dev/dm-1
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-4
Running command: /bin/ln -snf /dev/nvme0n1p3 /var/lib/ceph/osd/ceph-4/block.db
Running command: /bin/chown -R ceph:ceph /dev/nvme0n1p3
Running command: /bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-4/block.db
Running command: /bin/chown -R ceph:ceph /dev/nvme0n1p3
Running command: /bin/systemctl enable ceph-volume@lvm-4-f29f77bc-b192-4ddf-b2c1-c6033993dc8c
Running command: /bin/systemctl enable --runtime ceph-osd@4
Running command: /bin/systemctl start ceph-osd@4
--> ceph-volume lvm activate successful for osd ID: 4
--> OSD ID 44 FSID bb97f8f7-8026-4975-99b7-f4c38ee755b1 process is active. Skipping activation

These commands brought up the OSD process, but it doesn't join the cluster and stays DOWN, as it is missing the cluster info mentioned above.


systemctl status ceph-osd@4.service
ceph-osd@4.service - Ceph object storage daemon osd.4
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; indirect; vendor preset: enabled)
Active: failed (Result: exit-code) since Mon 2025-09-15 20:26:18 EDT; 11min ago
Process: 329135 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 4 (code=exited, status=1/FAILURE)

Sep 15 20:26:18 ceph-public9 systemd[1]: ceph-osd@4.service: Service hold-off time over, scheduling restart.
Sep 15 20:26:18 ceph-public9 systemd[1]: ceph-osd@4.service: Scheduled restart job, restart counter is at 5.
Sep 15 20:26:18 ceph-public9 systemd[1]: Stopped Ceph object storage daemon osd.4.
Sep 15 20:26:18 ceph-public9 systemd[1]: ceph-osd@4.service: Start request repeated too quickly.
Sep 15 20:26:18 ceph-public9 systemd[1]: ceph-osd@4.service: Failed with result 'exit-code'.
Sep 15 20:26:18 ceph-public9 systemd[1]: Failed to start Ceph object storage daemon osd.4.
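To tell a running-but-not-joined OSD apart from one that never started, it helps to check the cluster's view of the OSD, not just systemd's. A sketch, assuming the default log path (/var/log/ceph) for this cluster:

```shell
# What the monitors think: up/down and in/out state of osd.4
ceph osd tree | grep 'osd.4'
# Addresses and devices the OSD registered with the cluster
ceph osd metadata 4
# Boot/heartbeat errors from the daemon itself (default log location)
tail -n 50 /var/log/ceph/ceph-osd.4.log
```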

Oddly enough, it looks like at midnight the OSD automatically went UP and IN.

What commands are run on the nodes at night that would bring up the OSD?

The OSD systemd service has a limit on how many times it can start within a certain time period.

Sep 15 20:26:18 ceph-public9 systemd[1]: ceph-osd@4.service: Service hold-off time over, scheduling restart.
Sep 15 20:26:18 ceph-public9 systemd[1]: ceph-osd@4.service: Scheduled restart job, restart counter is at 5.

So after some time the start-limit counter reset and the OSD was able to start again.
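You don't have to wait for the counter to reset. As a sketch (the override values below are examples, not recommendations), you can clear the start-limit state by hand or raise the limit with a drop-in:

```shell
# Clear the start-limit / failed state for osd.4 and retry immediately:
systemctl reset-failed ceph-osd@4.service
systemctl start ceph-osd@4.service

# Or raise the limit for all OSDs with a drop-in override
# (opens an editor; on newer systemd these keys go in [Unit],
# on older versions they are StartLimitInterval/StartLimitBurst in [Service]):
systemctl edit ceph-osd@.service
#   [Unit]
#   StartLimitIntervalSec=300
#   StartLimitBurst=10
systemctl daemon-reload
```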

As to why the very first time it started but could not join the cluster, you need to look at the OSD logs.
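For the initial failure, note that it was ExecStartPre (ceph-osd-prestart.sh) that exited with status 1, so its error message should be in the journal alongside the daemon's own log. A sketch, assuming journald and the default log path:

```shell
# Journal entries around the failed starts, including prestart output:
journalctl -u ceph-osd@4.service --since "2025-09-15 20:00" --until "2025-09-15 21:00"
# The OSD's own log, assuming the default location:
less /var/log/ceph/ceph-osd.4.log
```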