Fighting with ZFS to make zpool failover/failback possible

After succeeding with single-zpool switching, I’m now trying to build a reliable, balanced configuration with two controllers and two pools.

When one of the controllers dies, its pool successfully migrates to the other controller. But when the failed controller recovers, the zpools need to be rebalanced.

And here a huge problem arises: “detaching” a pool from the running controller online. I tried using ctladm to stop the corresponding LUN and even to remove and re-create it, but, at least for now, that doesn’t work properly for me. Unbelievably, it seems the only reliable way to detach a LUN from the front end is to stop ctld!

So, for now, the BeaST will have an offline rebalancing procedure. Until offline rebalancing takes place, the BeaST will have to run with one active controller while the other only forwards data.
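To make the failover/failback behaviour described above a bit more concrete, here is a minimal Python model of zpool ownership across two controllers. It is only an illustrative sketch, not BeaST code: the class, method, pool and controller names (DualController, pool_a, ctrl_a, and so on) are hypothetical. Failover moves a dead controller’s pools to the survivor; a recovered controller stays forwarding-only until an offline rebalance returns each pool to its home controller.

```python
from dataclasses import dataclass, field


@dataclass
class DualController:
    """Hypothetical model of zpool ownership across two controllers.

    home:  the controller each pool should live on when both are healthy.
    owner: the controller that currently imports (serves) each pool.
    """
    home: dict[str, str]
    alive: set[str] = field(init=False)
    owner: dict[str, str] = field(init=False)

    def __post_init__(self) -> None:
        # Initially every controller is healthy and owns its "home" pools.
        self.alive = set(self.home.values())
        self.owner = dict(self.home)

    def fail(self, ctrl: str) -> None:
        """Failover: pools of a dead controller move to a survivor."""
        self.alive.discard(ctrl)
        if not self.alive:
            raise RuntimeError("no surviving controller")
        survivor = sorted(self.alive)[0]
        for pool, cur in self.owner.items():
            if cur == ctrl:
                self.owner[pool] = survivor

    def recover(self, ctrl: str) -> None:
        """The controller is back, but no pool is moved while online."""
        self.alive.add(ctrl)

    def forwarding_only(self) -> set[str]:
        """Healthy controllers that own no pool and only forward data."""
        return self.alive - set(self.owner.values())

    def rebalance_offline(self) -> None:
        """Failback: return each pool to its home controller, if alive.

        In this sketch it is assumed to run in a maintenance window
        (e.g. with ctld stopped), since online detach was not reliable.
        """
        for pool, ctrl in self.home.items():
            if ctrl in self.alive:
                self.owner[pool] = ctrl


if __name__ == "__main__":
    c = DualController(home={"pool_a": "ctrl_a", "pool_b": "ctrl_b"})
    c.fail("ctrl_a")
    print(c.owner)              # both pools now on ctrl_b
    c.recover("ctrl_a")
    print(c.forwarding_only())  # {'ctrl_a'}
    c.rebalance_offline()
    print(c.owner)              # back to the original layout
```

The key design point the model captures is that recover() deliberately does nothing to pool ownership: rebalancing is a separate, explicit offline step.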

The second issue is related to ZIL mirroring. Adding the corresponding definitions to ctl.conf doesn’t work properly with a CTL HA configuration for several reasons, the main one being that HA mode requires exactly the same LUNs to be defined on both nodes, and it seems impossible to exclude the ZIL LUNs from the HA configuration. I tried running two instances of ctld with separate configuration files, one for back-end ZIL mirroring and one for the front-end ports, but that didn’t help me avoid the CTL HA issue on the back end.
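Why node-local ZIL LUNs collide with the “same LUNs on both nodes” requirement can be shown with a tiny Python sketch. This is a simplified model of the rule as described above, not ctld’s actual validation logic; the LUN numbers and backing names (pool_a, zil_on_ctrl_a, and so on) are hypothetical. The shared pool LUNs match on both controllers, but each controller exports its own local ZIL device to its peer, so the LUN sets can never be identical.

```python
def ha_mismatches(node_a: dict[int, str], node_b: dict[int, str]) -> dict[int, tuple]:
    """Return LUNs whose backing definition differs between the two nodes.

    Simplified model of the rule that, in CTL HA mode, both nodes must
    define exactly the same LUNs. An empty result means the sets match.
    """
    diff = {}
    for lun in sorted(node_a.keys() | node_b.keys()):
        a, b = node_a.get(lun), node_b.get(lun)
        if a != b:
            diff[lun] = (a, b)
    return diff


if __name__ == "__main__":
    # Shared pool LUNs are identical on both controllers...
    shared = {0: "pool_a", 1: "pool_b"}
    # ...but each controller exports its own local ZIL device to its peer.
    node_a = {**shared, 2: "zil_on_ctrl_a"}
    node_b = {**shared, 2: "zil_on_ctrl_b"}
    print(ha_mismatches(shared, shared))  # {}
    print(ha_mismatches(node_a, node_b))  # {2: ('zil_on_ctrl_a', 'zil_on_ctrl_b')}
```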

Then I tried gmirror + GEOM Gate (ggate) as the mirroring transport, but something strange happens with gmirror when the ggate device loses its connection to the dead controller. Yes, gmirror detects that the remote ggate device is detached, but it refuses to drop it and keeps waiting for something!

Finally, I replaced it with the legacy iSCSI target from ports, istgt. It works quite stably but sometimes drops the connection. Fortunately, gmirror detects this well, and it’s possible to restore ZIL mirroring on the fly.

So now there are two different iSCSI stacks in the BeaST! 🙂 And I’m not even halfway to making everything work, as gmirror sometimes seems to cause a kernel panic.

UPD 2017.05.14:

[Screenshot: gmirror kernel panic]

So “gmirror + something” chains are not very stable for mirroring the ZIL. And yes, HAST doesn’t look suitable for the BeaST’s purposes, as it’s based on ggate and provides only one-way replication. Also, I’d prefer to keep the same device names on both sides of the replication.

Let’s see if a “shared drive” for the ZIL makes everything more stable.


About mezzantrop

10 years of experience in large SAN and storage environments: mainly Hitachi, HP and Brocade. Now I am a proud SAN/storage IBMer. Author of Empty, an expect-like tool. FreeBSD enthusiast.
This entry was posted in BeaST.
