
Error building cluster, trying to reinstall v3.3 after 4.0 crash

We have a set of 12 servers running PetaSAN to serve one big iSCSI volume.

It had been running on 3.3 for the past 6 months without a problem.

We experienced an issue with our bonds failing after upgrading to 4.0, so we wanted to go back to 3.3.

I've reinstalled 3.3 and am trying to set up a cluster, but I'm getting an error while adding the third node, as it tries to build the cluster.

Network bonds are working. Looking in the JSON files, I only see the first two nodes listed in the cluster info on the third node, but I do see a proper node_info.json for that third server in the cluster state logs.
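In case it helps anyone doing the same comparison, this is a rough sketch of the check I was doing by eye: diff the node names listed in the cluster config against a node's own info file. The file paths (/opt/petasan/config/cluster_info.json and node_info.json), the "management_nodes" key, and the sample values here are assumptions from my setup, so adjust for yours.

```python
import json

# Hypothetical sample data; on a real node you would instead load these from
# /opt/petasan/config/cluster_info.json and node_info.json (paths assumed).
cluster_info = json.loads("""
{"name": "ftcstage-backup01",
 "management_nodes": [
   {"name": "ftc-petasan-004", "backend_1_ip": "10.0.16.204"},
   {"name": "ftc-petasan-005", "backend_1_ip": "10.0.16.205"}]}
""")
node_info = json.loads('{"name": "ftc-petasan-006", "backend_1_ip": "10.0.16.206"}')

# Is this node's name present in the cluster-wide node list?
listed = {n["name"] for n in cluster_info["management_nodes"]}
if node_info["name"] not in listed:
    print(f'{node_info["name"]} is missing from cluster_info '
          f'(it only lists: {sorted(listed)})')
```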

I've checked my interface naming to make sure we are using the correct interfaces for the management bond, the backend bond, and the two iSCSI interfaces. I checked the switch and I'm not seeing problems there.
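While checking the bonds I also scripted a quick pass over the kernel's bonding status to list each slave's link state. The sketch below parses a pasted sample of the typical /proc/net/bonding/<bond> format; the bond name, interface names, and sample contents are made up for illustration.

```python
import re

# Typical (abridged) /proc/net/bonding/bond0 contents, sample only;
# on a live node you would use: text = open("/proc/net/bonding/bond0").read()
text = """\
Ethernet Channel Bonding Driver: v5.4.0

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up

Slave Interface: eth2
MII Status: up

Slave Interface: eth3
MII Status: down
"""

# Pair each "Slave Interface" line with the MII Status line that follows it.
slaves = re.findall(r"Slave Interface: (\S+)\nMII Status: (\S+)", text)
for iface, status in slaves:
    print(f"{iface}: {status}")

down = [iface for iface, status in slaves if status != "up"]
```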

What should I look for in the cluster logs to help me identify why this isn't working, on the same hardware it was running on before?

 

The tail end of the PetaSAN log file from the cluster dump is:

12/09/2025 12:40:49 INFO admin.py process stopped
12/09/2025 12:40:49 INFO Stopping CIFS service
12/09/2025 12:40:49 INFO Stopping NFS service
12/09/2025 12:40:49 INFO Stopping S3 service
12/09/2025 12:40:49 INFO Starting Node Stats Service
12/09/2025 12:40:50 INFO Starting local clean_ceph.
12/09/2025 12:40:50 INFO Starting clean_ceph
12/09/2025 12:40:51 INFO Stopping ceph services
12/09/2025 12:40:51 INFO Start cleaning config files
12/09/2025 12:40:51 INFO Starting ceph services
12/09/2025 12:40:52 INFO Starting local clean_consul.
12/09/2025 12:40:52 INFO Trying to clean Consul on local node
12/09/2025 12:40:52 INFO delete /opt/petasan/config/etc/consul.d
12/09/2025 12:40:52 INFO delete /opt/petasan/config/var/consul
12/09/2025 12:40:52 INFO Trying to clean Consul on 10.0.16.204
12/09/2025 12:40:53 INFO Trying to clean Consul on 10.0.16.205
12/09/2025 12:40:54 INFO cluster_name: ftcstage-backup01
12/09/2025 12:40:54 INFO local_node_info.name: ftc-petasan-006
12/09/2025 12:43:04 ERROR Could not create Consul Configuration on node: 10.0.32.204
12/09/2025 12:43:04 ERROR Error building Consul cluster
12/09/2025 12:43:04 ERROR Could not build consul.
12/09/2025 12:43:04 ERROR ['core_consul_deploy_build_error_build_consul_cluster']
12/09/2025 12:45:19 INFO execute cluster state script on ftc-petasan-006

It is failing to create Consul on node 1.

In most cases, this is an SSH connection error from node 3 to node 1 on the backend 1 network.

Node 3, the node you are connecting to, is the one that initiates the cluster build process, and Consul is the first clustered service to be built on backend 1.
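A quick way to test that from node 3 is to check TCP reachability of the SSH port on each of the other nodes' backend 1 addresses. A minimal sketch; the two IPs are the ones shown in the log above, and the port and timeout values are assumptions:

```python
import socket

def ssh_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Backend 1 addresses of the other nodes (taken from the log above).
    for ip in ("10.0.16.204", "10.0.16.205"):
        print(ip, "ok" if ssh_reachable(ip) else "UNREACHABLE")
```

This only proves the TCP port is open, not that key-based SSH login works, so a failed build with this check passing would point at SSH keys rather than the network.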

OK, this was a really weird one. I had to look at STP debugging logs on the switch this weekend to figure it out.

First off, forgive the ticket; this was not a problem with the installer.

One of the 10G cards in the server, used by the backend network, failed in a way I've never seen before. It reported healthy, its firmware updated OK, and I could connect to the backend network on either port; I just couldn't use bonding. Even with the server powered off, the card (thanks to Wake-on-LAN) was still active and transmitting horrible STP information to the switch. I didn't mirror the port to see the actual traffic; I just saw the reconvergence messages in the STP debug output yesterday and realized something was seriously wrong with the card, since with the server powered off I shouldn't have been seeing ARP requests coming from it for random MAC addresses either. So weird!

I replaced the card this morning and everything is working as expected.

The card was also causing all the other backend ports on that VLAN on that switch to fail to form their bonds, so it looked like the whole cluster had blown up when it was really just this one offending NIC.

Anyway, I wanted to post and let you know what was up and how I solved it, in case anyone else runs into something like this.

Honestly, I've never seen a NIC fail so spectacularly; a first for me.

Too bad I had to burn a whole weekend on it, but I guess it was good that I got to use debug skillz I haven't used in a long time!  🙂

Excellent, thanks for the feedback.