Forums - PetaSAN

ESX-Server ISCSI problems?

therm
121 Posts

August 5, 2017, 8:22 am
I've restarted the PetaSAN server providing CEPH-ESX-1 (the LUN where the path is not recovering).

But now all the paths in ESX that were moved are marked as failed. Why does ESX not recover the paths?

Last edited on August 5, 2017, 8:41 am · #11

therm
121 Posts

August 5, 2017, 8:38 am
In PetaSAN, these paths are not assigned to a node!

192.168.3.20

192.168.3.21

192.168.4.20 ceph-node-mru-2

192.168.4.21 ceph-node-mru-2

#12

therm
121 Posts

August 5, 2017, 8:41 am
Errors in PetaSAN.log:

root@ceph-node-mru-1:~# tail -50 /opt/petasan/log/PetaSAN.log
AttributeError: 'NoneType' object has no attribute 'iteritems'
05/08/2017 10:39:30 ERROR    Error during __proces.
05/08/2017 10:39:30 ERROR    'NoneType' object has no attribute 'iteritems'
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 90, in start
self.__process()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 116, in process
while self.do_process() != True:
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 175, in __do_process
self.__clean_unused_rbd_images()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 340, in __clean_unused_rbd_images
for image, mapped_count in rbd_images.iteritems():
AttributeError: 'NoneType' object has no attribute 'iteritems'
05/08/2017 10:39:49 ERROR    Error during __proces.
05/08/2017 10:39:49 ERROR    'NoneType' object has no attribute 'iteritems'
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 90, in start
self.__process()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 116, in process
while self.do_process() != True:
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 175, in __do_process
self.__clean_unused_rbd_images()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 340, in __clean_unused_rbd_images
for image, mapped_count in rbd_images.iteritems():
AttributeError: 'NoneType' object has no attribute 'iteritems'
05/08/2017 10:40:09 ERROR    Error during __proces.
05/08/2017 10:40:09 ERROR    'NoneType' object has no attribute 'iteritems'
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 90, in start
self.__process()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 116, in process
while self.do_process() != True:
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 175, in __do_process
self.__clean_unused_rbd_images()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 340, in __clean_unused_rbd_images
for image, mapped_count in rbd_images.iteritems():
AttributeError: 'NoneType' object has no attribute 'iteritems'
05/08/2017 10:40:20 WARNING PetaSAN Could not complete process, there are many exceptions occurred.
05/08/2017 10:40:21 ERROR    Error during __proces.
05/08/2017 10:40:21 ERROR    'NoneType' object has no attribute 'iteritems'
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 90, in start
self.__process()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 116, in process
while self.do_process() != True:
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 175, in __do_process
self.__clean_unused_rbd_images()
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 340, in __clean_unused_rbd_images
for image, mapped_count in rbd_images.iteritems():
AttributeError: 'NoneType' object has no attribute 'iteritems'

#13

admin
2,981 Posts

August 5, 2017, 9:18 am
Comparing the different logs, there are connection failures, operations taking too long (latency increasing from 3 ms to 15 s and more), and tasks timing out. When a task takes too long, ESX sends a task abort. When there are connection failures, ESX attempts to reconnect and log in again.

It could be a load issue or a network issue:

Have you increased the load lately, or do you see potential for network issues?

Can you decrease the load, for example by reducing the number of running VMs, and see if this fixes it?

Can you run the atop command on a PetaSAN node and see if there is a resource bottleneck under load (RAM/CPU/disk busy)?

For heavy load we recommend dedicated iSCSI Target nodes, separate from the storage OSD nodes, if you can do this with at least one iSCSI Target node. For collocated nodes, the iSCSI Target service requires 16 GB of RAM on top of the other RAM requirements.

Another thing I am not clear on: when this problem happens on an ESX, does it have active paths to other PetaSAN storage nodes? If yes, do they all have connection/timeout issues at the same time? Also, if these problem nodes are serving other ESXs, do the other ESXs show any similar latency/timeout issues?

Lastly, we looked at the login exception in detail to identify the exact location. We can send you a new kernel with guards against this case, but this is not the root cause of the issue. It happens when an ESX retries a new login while a previous login timed out.

#14

therm
121 Posts

August 5, 2017, 9:32 am
>>Lastly we did look at the login exception in detail to identify the exact location, we can send you a new kernel with guards against this case, but this is not root cause of the issue. It happens when an ESX retries a new login while a previous login timed out.

That would be great. In addition I might change the LoginTimeout (of ESX) to 60s ?

>>Have you increased the load lately ... ?

Yes, I moved more VMs into PetaSAN. And holidays are over, so people are working again 😉

>> Can you decrease the load for example decrease the number of running VMs and see if this fixes ?

I will try, but the number of paths left makes me nervous.

Any idea how to bring PetaSAN's paths up again? See the PetaSAN.log entries above...

#15

admin
2,981 Posts

August 5, 2017, 11:30 am
There was an ARP refresh issue that was fixed in 1.3.1. In v1.3.0, failover paths that were not active in an ESX can take up to 20 minutes before ESX updates its ARP entries. After 20 minutes it will update them.

If this does not occur within 20 minutes:

Grab the iputils-arping-xx.deb from our 1.3.1 ISO CD under the packages directory.

Copy it to the PetaSAN nodes and install it via dpkg -i iputils-arping-xx.deb

From each node, run the command:

arping -A ip -I interface -c 5

ip is the failed-over path IP and interface is the NIC/bond (for example eth1) serving this IP, so the interface is different for the iSCSI 1 and iSCSI 2 subnets.
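
With several failed-over paths, the arping command above has to be repeated per IP and interface. A minimal Python sketch that only builds the command lines, using the flags quoted above; the helper name and the IP-to-interface mapping are assumptions for illustration (eth2 in particular is hypothetical), so substitute the values from your own nodes:

```python
def gratuitous_arp_commands(paths, count=5):
    """Build one `arping -A` argument list per (ip, interface) pair."""
    return [
        ["arping", "-A", ip, "-I", interface, "-c", str(count)]
        for ip, interface in paths
    ]


# Hypothetical mapping of failed-over path IPs to the NIC/bond serving them.
paths = [("192.168.3.20", "eth1"), ("192.168.4.20", "eth2")]

for argv in gratuitous_arp_commands(paths):
    # Printed for review only; nothing is sent on the network here.
    print(" ".join(argv))
```

Each printed line can be run as root on the node that currently holds the IP, or passed to subprocess.run(argv, check=True) once the mapping has been verified.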

Let me know how it goes.

I will look at PetaSAN logs and will get back to you.

Also, can you please reply to my earlier questions? I am a bit confused and would like a better understanding of what you see:

Another thing i am not clear on, when this problem happen on an ESX, does it have active paths to other PetaSAN storage nodes ? If yes do they all have connection/timeout issues at the same time ? Also if these problem nodes are serving other ESXs, do the other ESXs show any similar latency/timeout issues ?

Last edited on August 5, 2017, 11:31 am · #16

therm
121 Posts

August 5, 2017, 12:04 pm
Another thing i am not clear on, when this problem happen on an ESX, does it have active paths to other PetaSAN storage nodes ? If yes do they all have connection/timeout issues at the same time ? Also if these problem nodes are serving other ESXs, do the other ESXs show any similar latency/timeout issues ?

All ESX hosts are mapped to all PetaSAN LUNs. As far as I can see, latencies during the night (that is when the crashes happen) are high on all ESX hosts, but only one of them crashes or times out in any given night. Backups and some I/O-intensive database optimizations run at night. I will move one of those big database servers off PetaSAN to reduce the load.

Having written this, I am not really sure, because the logs contain so many timeouts on reconnects to unavailable paths...

I will try to install arping, because the problem has not gone away...

#17

admin
2,981 Posts

August 5, 2017, 12:09 pm
The PetaSAN logs

self.__clean_unused_rbd_images()

AttributeError: 'NoneType' object has no attribute 'iteritems'

WARNING PetaSAN Could not complete process, there are many exceptions

are due to this Ceph command returning an error:

rbd showmapped --cluster CLUSTER_NAME

Can you run it from the CLI on the different PetaSAN nodes and check whether it works?

Also, is Ceph reporting anything in its status?

ceph status --cluster CLUSTER_NAME
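
The AttributeError in the log is what Python raises when a lookup that should return a dictionary returns None instead and the caller iterates over it anyway. A minimal sketch of a defensive pattern, assuming a hypothetical helper; the function name and the sample image names are illustrative and are not PetaSAN's actual code, and dict.items() stands in for Python 2's iteritems():

```python
def images_without_mappings(rbd_images):
    """Return the names of images whose mapped count is zero.

    `rbd_images` is expected to be a dict of image name -> mapped count,
    or None when the lookup that produces it has failed.
    """
    if rbd_images is None:
        # The lookup failed: skip this pass instead of raising AttributeError.
        return []
    return [image for image, mapped_count in rbd_images.items() if mapped_count == 0]


print(images_without_mappings(None))
print(images_without_mappings({"image-00001": 2, "image-00002": 0}))
```

A guard like this only keeps the service loop running; the failing command still has to be fixed for the paths to be assigned again.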

#18

therm
121 Posts

August 5, 2017, 12:11 pm
root@ceph-node-mru-1:/tmp# ip a |grep 192.168 |grep -v eth6 |grep -v eth7|wc -l
0
root@ceph-node-mru-2:/tmp# ip a |grep 192.168 |grep -v eth6 |grep -v eth7|wc -l
11
root@ceph-node-mru-3:/tmp# ip a |grep 192.168 |grep -v eth6 |grep -v eth7|wc -l
0

There are five LUNs using four IPs each, so there should be 20 IPs, but there are only 11. Am I right that sending arping is only useful if the IPs are present?
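
The same count can be scripted so the shortfall is reported directly. A rough Python sketch of the grep pipeline above, assuming the text of `ip a` has already been captured; the sample lines are made up for illustration and are not output from these nodes:

```python
def count_path_ips(ip_a_output, prefix="192.168", excluded=("eth6", "eth7")):
    """Count lines mentioning `prefix` that do not mention an excluded interface."""
    return sum(
        1
        for line in ip_a_output.splitlines()
        if prefix in line and not any(name in line for name in excluded)
    )


expected = 5 * 4  # five LUNs with four path IPs each

# Hypothetical excerpt of `ip a` output, for illustration only.
sample = """\
    inet 192.168.2.194/24 brd 192.168.2.255 scope global eth6
    inet 192.168.3.20/24 scope global eth1
    inet 192.168.4.20/24 scope global eth2
"""

found = count_path_ips(sample)
print(f"{found} of {expected} path IPs present, {expected - found} missing")
```

Summing the result over all nodes and comparing it with the expected total shows how many path IPs are not currently assigned anywhere.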

#19

therm
121 Posts

August 5, 2017, 12:15 pm
    for image, mapped_count in rbd_images.iteritems():
AttributeError: 'NoneType' object has no attribute 'iteritems'
05/08/2017 14:12:58 WARNING PetaSAN Could not complete process, there are many exceptions occurred.
05/08/2017 14:12:59 INFO     The path 00004/2 was locking by ceph-node-mru-3.
05/08/2017 14:12:59 INFO     This node will stop node ceph-node-mru-3/192.168.2.196.

After fixing a setting in ceph.conf, the above happened.

The problem was:

root@ceph-node-mru-1:/tmp# rbd showmapped --cluster ceph
warning: line 60: 'osd_crush_update_on_start' in section 'global' redefined

#20
