ESX-Server ISCSI problems?

therm
121 Posts
August 5, 2017, 8:22 am
I've restarted the PetaSAN server providing CEPH-ESX-1 (the LUN whose path is not recovering).
But now all the paths in ESX that were moved are marked as failed. Why does ESX not recover the paths?
Last edited on August 5, 2017, 8:41 am · #11

therm
121 Posts
August 5, 2017, 8:38 am
In PetaSAN, those paths are not assigned to a node!
192.168.3.20 | (not assigned)
192.168.3.21 | (not assigned)
192.168.4.20 | ceph-node-mru-2
192.168.4.21 | ceph-node-mru-2

therm
121 Posts
August 5, 2017, 8:41 am
Errors in PetaSAN.log:
root@ceph-node-mru-1:~# tail -50 /opt/petasan/log/PetaSAN.log
AttributeError: 'NoneType' object has no attribute 'iteritems'
05/08/2017 10:39:30 ERROR Error during __proces.
05/08/2017 10:39:30 ERROR 'NoneType' object has no attribute 'iteritems'
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 90, in start
self.__process()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 116, in __process
while self.__do_process() != True:
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 175, in __do_process
self.__clean_unused_rbd_images()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 340, in __clean_unused_rbd_images
for image, mapped_count in rbd_images.iteritems():
AttributeError: 'NoneType' object has no attribute 'iteritems'
05/08/2017 10:39:49 ERROR Error during __proces.
05/08/2017 10:39:49 ERROR 'NoneType' object has no attribute 'iteritems'
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 90, in start
self.__process()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 116, in __process
while self.__do_process() != True:
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 175, in __do_process
self.__clean_unused_rbd_images()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 340, in __clean_unused_rbd_images
for image, mapped_count in rbd_images.iteritems():
AttributeError: 'NoneType' object has no attribute 'iteritems'
05/08/2017 10:40:09 ERROR Error during __proces.
05/08/2017 10:40:09 ERROR 'NoneType' object has no attribute 'iteritems'
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 90, in start
self.__process()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 116, in __process
while self.__do_process() != True:
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 175, in __do_process
self.__clean_unused_rbd_images()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 340, in __clean_unused_rbd_images
for image, mapped_count in rbd_images.iteritems():
AttributeError: 'NoneType' object has no attribute 'iteritems'
05/08/2017 10:40:20 WARNING PetaSAN Could not complete process, there are many exceptions occurred.
05/08/2017 10:40:21 ERROR Error during __proces.
05/08/2017 10:40:21 ERROR 'NoneType' object has no attribute 'iteritems'
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 90, in start
self.__process()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 116, in __process
while self.__do_process() != True:
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 175, in __do_process
self.__clean_unused_rbd_images()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 340, in __clean_unused_rbd_images
for image, mapped_count in rbd_images.iteritems():
AttributeError: 'NoneType' object has no attribute 'iteritems'

admin
2,974 Posts
August 5, 2017, 9:18 am
Comparing the different logs, there are connection failures, operations taking too long (latency increasing from 3 ms to 15 sec and more) and tasks timing out. When a task takes too long, ESX will send a task abort. When there are connection failures, ESX will attempt to reconnect/re-login.
It could be load issues or network issues:
- Have you increased the load lately, or do you see potential for network issues?
- Can you decrease the load, for example by decreasing the number of running VMs, and see if this fixes it?
- Can you run the atop command on a PetaSAN node and see if there is a resource bottleneck under load (RAM/CPU/disk busy)?
- For heavy load we recommend having dedicated iSCSI Target nodes separate from the storage OSD nodes, if you can do this with at least one iSCSI Target node. For collocated nodes, the iSCSI Target service requires 16 GB RAM on top of the other RAM requirements.
Another thing I am not clear on: when this problem happens on an ESX, does it have active paths to other PetaSAN storage nodes? If yes, do they all have connection/timeout issues at the same time? Also, if these problem nodes are serving other ESXs, do the other ESXs show any similar latency/timeout issues?
Lastly, we did look at the login exception in detail to identify the exact location. We can send you a new kernel with guards against this case, but this is not the root cause of the issue. It happens when an ESX retries a new login while a previous login timed out.

therm
121 Posts
August 5, 2017, 9:32 am
>> Lastly we did look at the login exception in detail to identify the exact location, we can send you a new kernel with guards against this case, but this is not root cause of the issue. It happens when an ESX retries a new login while a previous login timed out.
That would be great. In addition, could I change the LoginTimeout (of ESX) to 60s?
>> Have you increased the load lately ... ?
Yes, I moved more VMs into PetaSAN. And holidays are over, so people are working again 😉
>> Can you decrease the load for example decrease the number of running VMs and see if this fixes ?
I will try, but the number of paths left makes me nervous.
Any idea how to bring up PetaSAN's paths again? See the PetaSAN.log entries above...

admin
2,974 Posts
August 5, 2017, 11:30 am
There was an ARP refresh issue that was solved in 1.3.1. In v1.3.0, failover paths that were not active in an ESX can take up to 20 min for ESX to update its ARP. After 20 min it will update them.
If this does not occur in 20 min:
Grab the iputils-arping-xx.deb from our 1.3.1 ISO CD under the packages directory.
Copy it to the PetaSAN nodes and install it via dpkg -i iputils-arping-xx.deb
From each node run the command:
arping -A ip -I interface -c 5
Here ip is the failed-over path IP and interface is the NIC/bond (for example eth1) serving this IP, so the interface is different for the iSCSI 1 and iSCSI 2 subnets.
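If several paths have failed over, the arping step above could be scripted. The following is only a rough sketch: the helper names and the IP-to-interface pairs are hypothetical and must be replaced with the interfaces that actually serve each iSCSI subnet on the node. Only the arping flags come from the command above.

```python
import subprocess


def build_arping_command(ip, interface, count=5):
    """Return the argv list for one unsolicited-ARP announcement of `ip`."""
    return ["arping", "-A", ip, "-I", interface, "-c", str(count)]


def announce_paths(paths, run=subprocess.call):
    """Run arping once per (ip, interface) pair and return {ip: exit_code}.

    `run` is injectable so the loop can be dry-run without touching the network.
    """
    results = {}
    for ip, interface in paths:
        results[ip] = run(build_arping_command(ip, interface))
    return results


if __name__ == "__main__":
    # Hypothetical mapping of failed-over path IPs to their interfaces.
    example_paths = [("192.168.3.20", "eth1"), ("192.168.4.20", "eth2")]
    # Dry run: print the commands instead of executing them.
    for ip, iface in example_paths:
        print(" ".join(build_arping_command(ip, iface)))
```

Calling announce_paths(example_paths) on the node that currently holds the IPs would execute the commands; a non-zero exit code for an IP means arping itself reported a failure.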
Let me know how it goes.
I will look at the PetaSAN logs and will get back to you.
Also, can you please reply to my earlier clarifications? I am a bit confused and would like a better understanding of what you see:
Another thing I am not clear on: when this problem happens on an ESX, does it have active paths to other PetaSAN storage nodes? If yes, do they all have connection/timeout issues at the same time? Also, if these problem nodes are serving other ESXs, do the other ESXs show any similar latency/timeout issues?
Last edited on August 5, 2017, 11:31 am · #16

therm
121 Posts
August 5, 2017, 12:04 pm
>> Another thing i am not clear on, when this problem happen on an ESX, does it have active paths to other PetaSAN storage nodes ? If yes do they all have connection/timeout issues at the same time ? Also if these problem nodes are serving other ESXs, do the other ESXs show any similar latency/timeout issues ?
All ESX hosts are mapped to all PetaSAN LUNs, and as far as I can see the latencies during the night (that is when the crashes happen) are high on all ESX hosts, but only one of them crashes/times out in a given night. Backups and some IO-intensive database optimizations run at night. I will move one of those big database servers off PetaSAN to reduce the load.
Having written this, I am not really sure, because the logs contain so many timeouts on reconnects to unavailable paths...
I will try to install arping, because the problem has not gone away...

admin
2,974 Posts
August 5, 2017, 12:09 pm
The PetaSAN log entries
self.__clean_unused_rbd_images()
AttributeError: 'NoneType' object has no attribute 'iteritems'
WARNING PetaSAN Could not complete process, there are many exceptions
are due to this Ceph command returning an error:
rbd showmapped --cluster CLUSTER_NAME
Can you run it via the CLI on the different PetaSAN nodes and see if they are OK?
Also, is Ceph reporting anything in its status?
ceph status --cluster CLUSTER_NAME
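To illustrate the failure mode: the AttributeError appears when the code iterates over a result that is None because the underlying command failed. The sketch below is not PetaSAN's code. The function is a hypothetical stand-in for the loop in the traceback, written for Python 3 (items() instead of iteritems()), and treating a mapped count of zero as "unused" is an assumption, since the thread does not show what the real loop does with each entry.

```python
def clean_unused_images(rbd_images):
    """Hypothetical stand-in for the cleanup loop seen in the traceback.

    `rbd_images` maps image name -> mapped count, or is None when the
    mapping could not be read (for example because the command failed).
    """
    if rbd_images is None:
        # Skip this cycle instead of raising AttributeError on None.
        return []
    # Assumption: an image with a mapped count of 0 counts as unused.
    return [image for image, mapped_count in rbd_images.items()
            if mapped_count == 0]


if __name__ == "__main__":
    print(clean_unused_images(None))
    print(clean_unused_images({"image-00001": 2, "image-00004": 0}))
```

A guard like this only keeps the service loop running; the command failure itself still has to be found and fixed.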

therm
121 Posts
August 5, 2017, 12:11 pm
root@ceph-node-mru-1:/tmp# ip a |grep 192.168 |grep -v eth6 |grep -v eth7|wc -l
0
root@ceph-node-mru-2:/tmp# ip a |grep 192.168 |grep -v eth6 |grep -v eth7|wc -l
11
root@ceph-node-mru-3:/tmp# ip a |grep 192.168 |grep -v eth6 |grep -v eth7|wc -l
0
There are five LUNs using four IPs each, so there should be 20 IPs, but there are only 11. Am I right that sending arping is only useful if the IPs are present?
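The count above can be reproduced with a small script. This is a sketch that mirrors the grep pipeline, assuming the ip a output is passed in as text; the function names and the sample line in the usage block are made up for illustration, while the per-node counts (0, 11, 0) are the ones reported above.

```python
def count_iscsi_ips(ip_a_output, prefix="192.168", excluded=("eth6", "eth7")):
    """Count lines containing `prefix` that mention none of `excluded`."""
    count = 0
    for line in ip_a_output.splitlines():
        if prefix in line and not any(name in line for name in excluded):
            count += 1
    return count


def missing_paths(luns, ips_per_lun, per_node_counts):
    """Return how many path IPs are not present on any node."""
    return luns * ips_per_lun - sum(per_node_counts)


if __name__ == "__main__":
    # Made-up sample line in the style of `ip a` output.
    sample = "    inet 192.168.3.20/24 scope global eth1\n"
    print(count_iscsi_ips(sample))
    # Counts reported for ceph-node-mru-1, -2 and -3.
    print(missing_paths(5, 4, [0, 11, 0]))
```

With the reported counts, 9 of the 20 expected path IPs are not configured on any node.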

therm
121 Posts
August 5, 2017, 12:15 pm
for image, mapped_count in rbd_images.iteritems():
AttributeError: 'NoneType' object has no attribute 'iteritems'
05/08/2017 14:12:58 WARNING PetaSAN Could not complete process, there are many exceptions occurred.
05/08/2017 14:12:59 INFO The path 00004/2 was locking by ceph-node-mru-3.
05/08/2017 14:12:59 INFO This node will stop node ceph-node-mru-3/192.168.2.196.
The above happened after I fixed a setting in ceph.conf.
The problem was:
root@ceph-node-mru-1:/tmp# rbd showmapped --cluster ceph
warning: line 60: 'osd_crush_update_on_start' in section 'global' redefined
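A redefined option like this one can be located with a short check over ceph.conf. The sketch below is a simplified reader, not Ceph's own parser: it assumes plain "key = value" lines under "[section]" headers, and it assumes that spaces, hyphens and underscores in option names are interchangeable. The function name and the sample config are hypothetical.

```python
def find_redefined_options(conf_text):
    """Return (line_number, section, option) for options set twice in a section."""
    seen = set()
    duplicates = []
    section = None
    for number, raw in enumerate(conf_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            continue
        if "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        # Assumption: "osd crush x", "osd-crush-x" and "osd_crush_x" are one option.
        key = "_".join(key.replace("-", " ").replace("_", " ").split())
        if (section, key) in seen:
            duplicates.append((number, section, key))
        else:
            seen.add((section, key))
    return duplicates


if __name__ == "__main__":
    sample = (
        "[global]\n"
        "osd_crush_update_on_start = true\n"
        "osd crush update on start = false\n"
    )
    print(find_redefined_options(sample))
```

Running it against the text of /etc/ceph/ceph.conf (or the cluster's own config file) lists each line where an option is set a second time within the same section.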
ESX-Server ISCSI problems?
therm
121 Posts
Quote from therm on August 5, 2017, 8:22 amI`ve restarted the PetaSAN-Server providing CEPH-ESX-1 (the LUN where the Path is not recovering) .
But now all the paths in ESX which were moved are marked as failed. Why does ESX not recover paths??
I`ve restarted the PetaSAN-Server providing CEPH-ESX-1 (the LUN where the Path is not recovering) .
But now all the paths in ESX which were moved are marked as failed. Why does ESX not recover paths??
therm
121 Posts
Quote from therm on August 5, 2017, 8:38 amIn PetaSANs those paths are not assigned to an node!
192.168.3.20 192.168.3.21 192.168.4.20 ceph-node-mru-2 192.168.4.21 ceph-node-mru-2
In PetaSANs those paths are not assigned to an node!
192.168.3.20 | |
192.168.3.21 | |
192.168.4.20 | ceph-node-mru-2 |
192.168.4.21 | ceph-node-mru-2 |
therm
121 Posts
Quote from therm on August 5, 2017, 8:41 amErrors in Petasan.log
root@ceph-node-mru-1:~# tail -50 /opt/petasan/log/PetaSAN.log
AttributeError: 'NoneType' object has no attribute 'iteritems'
05/08/2017 10:39:30 ERROR Error during __proces.
05/08/2017 10:39:30 ERROR 'NoneType' object has no attribute 'iteritems'
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 90, in start
self.__process()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 116, in __process
while self.__do_process() != True:
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 175, in __do_process
self.__clean_unused_rbd_images()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 340, in __clean_unused_rbd_images
for image, mapped_count in rbd_images.iteritems():
AttributeError: 'NoneType' object has no attribute 'iteritems'
05/08/2017 10:39:49 ERROR Error during __proces.
05/08/2017 10:39:49 ERROR 'NoneType' object has no attribute 'iteritems'
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 90, in start
self.__process()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 116, in __process
while self.__do_process() != True:
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 175, in __do_process
self.__clean_unused_rbd_images()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 340, in __clean_unused_rbd_images
for image, mapped_count in rbd_images.iteritems():
AttributeError: 'NoneType' object has no attribute 'iteritems'
05/08/2017 10:40:09 ERROR Error during __proces.
05/08/2017 10:40:09 ERROR 'NoneType' object has no attribute 'iteritems'
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 90, in start
self.__process()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 116, in __process
while self.__do_process() != True:
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 175, in __do_process
self.__clean_unused_rbd_images()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 340, in __clean_unused_rbd_images
for image, mapped_count in rbd_images.iteritems():
AttributeError: 'NoneType' object has no attribute 'iteritems'
05/08/2017 10:40:20 WARNING PetaSAN Could not complete process, there are many exceptions occurred.
05/08/2017 10:40:21 ERROR Error during __proces.
05/08/2017 10:40:21 ERROR 'NoneType' object has no attribute 'iteritems'
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 90, in start
self.__process()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 116, in __process
while self.__do_process() != True:
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 175, in __do_process
self.__clean_unused_rbd_images()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 340, in __clean_unused_rbd_images
for image, mapped_count in rbd_images.iteritems():
AttributeError: 'NoneType' object has no attribute 'iteritems'
Errors in Petasan.log
root@ceph-node-mru-1:~# tail -50 /opt/petasan/log/PetaSAN.log
AttributeError: 'NoneType' object has no attribute 'iteritems'
05/08/2017 10:39:30 ERROR Error during __proces.
05/08/2017 10:39:30 ERROR 'NoneType' object has no attribute 'iteritems'
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 90, in start
self.__process()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 116, in __process
while self.__do_process() != True:
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 175, in __do_process
self.__clean_unused_rbd_images()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 340, in __clean_unused_rbd_images
for image, mapped_count in rbd_images.iteritems():
AttributeError: 'NoneType' object has no attribute 'iteritems'
05/08/2017 10:39:49 ERROR Error during __proces.
05/08/2017 10:39:49 ERROR 'NoneType' object has no attribute 'iteritems'
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 90, in start
self.__process()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 116, in __process
while self.__do_process() != True:
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 175, in __do_process
self.__clean_unused_rbd_images()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 340, in __clean_unused_rbd_images
for image, mapped_count in rbd_images.iteritems():
AttributeError: 'NoneType' object has no attribute 'iteritems'
05/08/2017 10:40:09 ERROR Error during __proces.
05/08/2017 10:40:09 ERROR 'NoneType' object has no attribute 'iteritems'
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 90, in start
self.__process()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 116, in __process
while self.__do_process() != True:
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 175, in __do_process
self.__clean_unused_rbd_images()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 340, in __clean_unused_rbd_images
for image, mapped_count in rbd_images.iteritems():
AttributeError: 'NoneType' object has no attribute 'iteritems'
05/08/2017 10:40:20 WARNING PetaSAN Could not complete process, there are many exceptions occurred.
05/08/2017 10:40:21 ERROR Error during __proces.
05/08/2017 10:40:21 ERROR 'NoneType' object has no attribute 'iteritems'
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 90, in start
self.__process()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 116, in __process
while self.__do_process() != True:
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 175, in __do_process
self.__clean_unused_rbd_images()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 340, in __clean_unused_rbd_images
for image, mapped_count in rbd_images.iteritems():
AttributeError: 'NoneType' object has no attribute 'iteritems'
admin
2,974 Posts
Quote from admin on August 5, 2017, 9:18 amComparing the different logs there are connection failures, operations taking too long (latency increasing from 3 ms to 15 sec and more ) and tasks timing out. When a task takes too long, ESX will send an task abort. when there are connection failures, ESX will attempt to reconnect/re-login.
It could be load issues or network issues:
- Have you increased the load lately or do you see potential for network issues ?
- Can you decrease the load for example decrease the number of running VMs and see if this fixes ?
- Can you run atop command on PetaSAN node and see if there is a resource bottleneck under load ( ram/cpu/disk busy) ?
- For heavy load we recommend having dedicated iSCSI Target nodes separate from storage OSD nodes, if you can do this with at least one iSCSI Target node. For collocated nodes, iSCSI Target service requires 16G RAM on top of the other RAM requirements.
Another thing i am not clear on, when this problem happen on an ESX, does it have active paths to other PetaSAN storage nodes ? If yes do they all have connection/timeout issues at the same time ? Also if these problem nodes are serving other ESXs, do the other ESXs show any similar latency/timeout issues ?
Lastly we did look at the login exception in detail to identify the exact location, we can send you a new kernel with guards against this case, but this is not root cause of the issue. It happens when an ESX retries a new login while a previous login timed out.
Comparing the different logs there are connection failures, operations taking too long (latency increasing from 3 ms to 15 sec and more ) and tasks timing out. When a task takes too long, ESX will send an task abort. when there are connection failures, ESX will attempt to reconnect/re-login.
It could be load issues or network issues:
- Have you increased the load lately or do you see potential for network issues ?
- Can you decrease the load for example decrease the number of running VMs and see if this fixes ?
- Can you run atop command on PetaSAN node and see if there is a resource bottleneck under load ( ram/cpu/disk busy) ?
- For heavy load we recommend having dedicated iSCSI Target nodes separate from storage OSD nodes, if you can do this with at least one iSCSI Target node. For collocated nodes, iSCSI Target service requires 16G RAM on top of the other RAM requirements.
Another thing i am not clear on, when this problem happen on an ESX, does it have active paths to other PetaSAN storage nodes ? If yes do they all have connection/timeout issues at the same time ? Also if these problem nodes are serving other ESXs, do the other ESXs show any similar latency/timeout issues ?
Lastly we did look at the login exception in detail to identify the exact location, we can send you a new kernel with guards against this case, but this is not root cause of the issue. It happens when an ESX retries a new login while a previous login timed out.
therm
121 Posts
Quote from therm on August 5, 2017, 9:32 am>>Lastly we did look at the login exception in detail to identify the exact location, we can send you a new kernel with guards against this case, but this is not root cause of the issue. It >>happens when an ESX retries a new login while a previous login timed out.
That would be great. In addition I might change the LoginTimeout (of ESX) to 60s ?
>>Have you increased the load lately ... ?
Yes, I moved more VMs into PetaSAN. And holidays are over, so people are working again 😉
>> Can you decrease the load for example decrease the number of running VMs and see if this fixes ?
I will try, but the number of paths left makes me nervous.
Any idea how to bring up petasans paths again? See Petasan.log entries above...
>>Lastly we did look at the login exception in detail to identify the exact location, we can send you a new kernel with guards against this case, but this is not root cause of the issue. It >>happens when an ESX retries a new login while a previous login timed out.
That would be great. In addition I might change the LoginTimeout (of ESX) to 60s ?
>>Have you increased the load lately ... ?
Yes, I moved more VMs into PetaSAN. And holidays are over, so people are working again 😉
>> Can you decrease the load for example decrease the number of running VMs and see if this fixes ?
I will try, but the number of paths left makes me nervous.
Any idea how to bring up petasans paths again? See Petasan.log entries above...
admin
2,974 Posts
Quote from admin on August 5, 2017, 11:30 amThere was a arp refresh issue that was solved in 1.3.1 in v 1.3.0 failover paths that were not active in an ESX can take up to 20 min for ESX to update its arp. After 20 min it will update them.
If this does not occur in 20 min:
grab the iputils-arping-xx.deb from our 1.3.1 iso CD under packages directory.
copy it to the PetaSAN nodes and install it via dpkg -i iputils-arping-xx.deb
from each node run the command:
arping -A ip -I interface -c 5
ip is the failed over path ip and interface is the nic/bond ( example eth1 ) serving this ip, so the interface is different for iSCSI 1 and iSCSI 2 subnets.
Let me know how it goes.
I will look at PetaSAN logs and will get back to you.
Also, can you please reply to my earlier clarifications? I am a bit confused and would like a better understanding of what you see:
Another thing I am not clear on: when this problem happens on an ESX, does it have active paths to other PetaSAN storage nodes? If yes, do they all have connection/timeout issues at the same time? Also, if these problem nodes are serving other ESXs, do the other ESXs show any similar latency/timeout issues?
therm
121 Posts
Quote from therm on August 5, 2017, 12:04 pm
>>Another thing i am not clear on, when this problem happen on an ESX, does it have active paths to other PetaSAN storage nodes ? If yes do they all have connection/timeout issues at the same time ? Also if these problem nodes are serving other ESXs, do the other ESXs show any similar latency/timeout issues ?
All ESX hosts are mapped to all PetaSAN LUNs, and as far as I can see the latencies during the night (that is when the crashes happen) are high on all ESX hosts, but only one of them crashes/times out in a given night. Backups and some IO-intensive database optimizations run during the night. I will move one of those big database servers off PetaSAN to reduce the load.
Having written this, I am not really sure, because the logs contain so many timeouts on reconnects to unavailable paths...
I will try to install arping, because the problem has not gone away...
admin
2,974 Posts
Quote from admin on August 5, 2017, 12:09 pm
The PetaSAN log entries
self.__clean_unused_rbd_images()
AttributeError: 'NoneType' object has no attribute 'iteritems'
WARNING PetaSAN Could not complete process, there are many exceptions
are due to the Ceph command returning an error:
rbd showmapped --cluster CLUSTER_NAME
Can you run it via the CLI on different PetaSAN nodes and see if they are OK?
Also, is Ceph reporting anything in its status?
ceph status --cluster CLUSTER_NAME
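The AttributeError above is what Python raises when a loop calls a dict method on None. The sketch below is not PetaSAN's code; it only illustrates, under assumptions, how a caller could skip a pass when the image list is unavailable instead of raising. The function name, the logger name, and the rule that a mapped count of zero means "unused" are all hypothetical, and the sketch uses Python 3's items() where the Python 2.7 traceback shows iteritems().

```python
import logging

logger = logging.getLogger("iscsi_sketch")  # hypothetical logger name


def find_unused_images(rbd_images):
    """Return the names of images whose mapped count is zero.

    rbd_images is expected to be a dict of image name -> mapped count,
    but may be None when the underlying command failed.
    """
    if rbd_images is None:
        # Skip this pass instead of raising AttributeError on None.
        logger.warning("rbd image list unavailable, skipping cleanup pass")
        return []
    # Assumption for illustration only: a count of zero means "unused".
    return [image for image, mapped_count in rbd_images.items()
            if mapped_count == 0]
```

A guard like this only hides the symptom; the underlying command still has to be checked on the CLI as requested above.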
therm
121 Posts
Quote from therm on August 5, 2017, 12:11 pm
root@ceph-node-mru-1:/tmp# ip a |grep 192.168 |grep -v eth6 |grep -v eth7|wc -l
0
root@ceph-node-mru-2:/tmp# ip a |grep 192.168 |grep -v eth6 |grep -v eth7|wc -l
11
root@ceph-node-mru-3:/tmp# ip a |grep 192.168 |grep -v eth6 |grep -v eth7|wc -l
0
There are five LUNs using four IPs each, so there should be 20 IPs, but there are only 11. Am I right that sending arping is only useful if the IPs are present?
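The path count in this post can be checked with a few lines of arithmetic. This is only a sketch using the numbers quoted above; the function name is made up for illustration.

```python
# Values taken from the post above.
LUNS = 5
IPS_PER_LUN = 4
ips_found_per_node = {
    "ceph-node-mru-1": 0,
    "ceph-node-mru-2": 11,
    "ceph-node-mru-3": 0,
}


def missing_paths(luns, ips_per_lun, found_per_node):
    """Return (expected, found, missing) path IP counts."""
    expected = luns * ips_per_lun
    found = sum(found_per_node.values())
    return expected, found, expected - found


expected, found, missing = missing_paths(LUNS, IPS_PER_LUN, ips_found_per_node)
print(f"expected={expected} found={found} missing={missing}")
# prints: expected=20 found=11 missing=9
```

So 9 of the 20 path IPs are not assigned to any node at this point.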
therm
121 Posts
Quote from therm on August 5, 2017, 12:15 pm
for image, mapped_count in rbd_images.iteritems():
AttributeError: 'NoneType' object has no attribute 'iteritems'
05/08/2017 14:12:58 WARNING PetaSAN Could not complete process, there are many exceptions occurred.
05/08/2017 14:12:59 INFO The path 00004/2 was locking by ceph-node-mru-3.
05/08/2017 14:12:59 INFO This node will stop node ceph-node-mru-3/192.168.2.196.
After fixing a setting in ceph.conf, the above happened.
The problem was:
root@ceph-node-mru-1:/tmp# rbd showmapped --cluster ceph
warning: line 60: 'osd_crush_update_on_start' in section 'global' redefined