First of all, both terms ‘extended’ and ‘stretched’ are used on blogs, in documentation and in logfiles; I will use ‘extended’ in this post. Also, voting files are files, not disks: Oracle stores the file(s) on disks within the specified diskgroup, preferably in different failgroups. And people talk about a ‘RAC cluster’, but an Oracle cluster can exist without RAC. RAC is the database option to run the database in a cluster on Oracle clusterware.
Sorry, no images in this blog post… only text.
When is an Oracle cluster an extended/stretched cluster?
It’s not distance. Oracle does not know whether it’s in the same or different racks or buildings, divided by roads or rivers… There is no ‘extended=Y’ setting in Oracle that tells it it’s extended.
It’s the way _you_ design the storage for Oracle clusterware and ASM.
From my point of view, a cluster is extended when blocks are mirrored across disks, and those mirrored blocks live on disks in different storage locations.
Mirroring and Failgroups
Oracle only knows that it should mirror blocks to other disks which are ‘in a different failgroup’ (and in the same diskgroup). When you do not specify a failgroup when creating a new ASM disk, the disk will get a failgroup of its own.
So in case of mirroring, Oracle just writes to another failgroup, which will automatically be on another disk.
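As a sketch of that default behaviour (disk paths and names below are hypothetical, not from the original post): when no FAILGROUP clause is given, every disk becomes its own failgroup, so mirroring automatically lands on another disk.

```sql
-- Sketch only: diskgroup name and disk paths are made up.
-- No FAILGROUP clause: each disk implicitly becomes its own failgroup.
CREATE DISKGROUP data NORMAL REDUNDANCY
  DISK '/dev/oracleasm/disks/DATA01',
       '/dev/oracleasm/disks/DATA02';
-- ASM mirrors each block to the other disk, because the two disks
-- ended up in two separate (implicit) failgroups.
```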
But in the case of an extended cluster, mirrored copies in different failgroups could still end up in the same storage location!
You should create exactly two failgroups (NORMAL redundancy) or three (HIGH redundancy): one for each storage location. If you want to use HIGH redundancy with fewer locations, you can have a maximum of two failgroups per storage location (preferably grouped by controller), so a maximum of four failgroups with two locations and six with three locations. You need to avoid that your mirrored blocks end up on disks in the same storage location. Some cases will be explained in part 2 of this blog post.
If you create more failgroups than that, there is still the possibility that Oracle mirrors blocks to failgroups on the same storage location as the ‘original’ data.
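A minimal sketch of the two-location design described above (site names, disk paths and the diskgroup name are hypothetical): one failgroup per storage location, so every mirror copy is forced to the other location.

```sql
-- Sketch only: failgroup names and disk paths are made up.
-- One failgroup per storage location for NORMAL redundancy:
CREATE DISKGROUP data NORMAL REDUNDANCY
  FAILGROUP site_a DISK '/dev/oracleasm/disks/SITEA01',
                        '/dev/oracleasm/disks/SITEA02'
  FAILGROUP site_b DISK '/dev/oracleasm/disks/SITEB01',
                        '/dev/oracleasm/disks/SITEB02'
  ATTRIBUTE 'compatible.asm' = '12.1';
-- With exactly one failgroup per location, the mirror of a block in
-- site_a can only be written to site_b, and vice versa.
```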
Extended or not
So there is no ‘extended=Y’ setting, and Oracle does not know about different locations other than through failgroups. But how does Oracle know it is extended? Does it know at all? Should it know? Mirroring alone does not tell Oracle it’s extended…
Well, there is something you must consider when creating a ‘non-local extended cluster’, and that is the ‘third location for the quorum voting files’ (or fourth and fifth in the case of HIGH redundancy and three storage locations). This quorum voting file is not actively used, but it must be available in a location different from the ‘main’ storage units! It only comes into play when there is no other communication left, because the interconnect or the other voting files have gone down. That’s why it can reside on a small storage location with NFS or iSCSI somewhere else: in the cloud or on a NAS (or, in theory, a Raspberry Pi), as long as it’s always available.
When creating disks for quorum usage, you need to add them in a quorum failgroup. This is a special failgroup which will only store these voting files and will not allow you to store data.
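Adding such a quorum failgroup could look like the sketch below (the diskgroup name, failgroup name and NFS path are hypothetical; quorum failgroups require compatible.asm of at least 11.2):

```sql
-- Sketch only: names and the NFS-backed disk path are made up.
-- A QUORUM failgroup may hold voting files, but no data.
ALTER DISKGROUP ocr ADD QUORUM FAILGROUP fg_quorum
  DISK '/voting_nfs/vote01';
```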
When a quorum failgroup is active, Oracle will know it is in an extended configuration; see this warning:
(my diskgroup was called OCR)
"WARNING: OCR has too many failure groups for a stretch cluster."
Hey, “stretch”, Oracle does know it’s extended!
So Oracle takes into account that it’s extended and seems to have a built-in mechanism to warn against too many failgroups and to prevent storing mirrors in failgroups on the same storage location. Now I have to admit that ‘+OCR’ was also the diskgroup with the quorum failgroup, and I did not test adding more failgroups to +DATA or +FRA, for instance.
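If you want to check how your disks are distributed over failgroups, a query along these lines can help (a sketch; the FAILGROUP_TYPE column, which distinguishes ‘REGULAR’ from ‘QUORUM’ failgroups, is available from 12.1 onwards):

```sql
-- Show which failgroup (and failgroup type) each ASM disk belongs to.
SELECT group_number, name, failgroup, failgroup_type
  FROM v$asm_disk
 ORDER BY group_number, failgroup;
```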
It will use this third file / location to avoid a ‘split brain’. The part of the cluster which sees the majority of voting files will survive, and the part (or nodes) which sees the minority will shut down. It needs this third file (an odd number of files) to avoid the situation in which separated parts of the cluster each see a majority, so that these separated parts survive, process work independently and cause data corruption.
Some real life experiences, from tests in labs and hard times with production issues:
– When a cluster node (or all cluster nodes) loses the majority of voting files, the node will go down, period. Even if you would think ‘hey, all nodes can still communicate with each other and talk about this loss’: no, the node goes down!
– A minority of ‘offlined’ voting files will automatically be re-mirrored to other disks/failgroups, if asked nicely.
– A minority of ‘offlined’ voting files will not automatically be re-mirrored to other disks/failgroups if they were forcedly removed, or in case of failure.
– A majority of lost voting files will automatically result in a shutdown of your node(s) or cluster.
– Detection of failure, and the automatic repair or re-mirroring of disks, can (and will) stall the database; if (database) timeouts are set shorter than these repair times, other issues can occur:
LMON (ospid: 4916) waits for event 'control file sequential read' for 80 secs.
ERROR: Some process(s) is not making progress.
LMHB (ospid: 4933) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
LMHB (ospid: 4933): terminating the instance due to error 29770
There should be fixes for this, but I have seen it go wrong even with the latest patched version of 12.1 (July 2016).
You need to be aware when you are creating an extended cluster and make sure mirrored blocks will be written to disks in different storage locations. This can be achieved by using failgroups, but use them wisely. There is no wizard helping you; you need to do the design work. Make sure you test in a lab, or draw your design on a piece of paper and virtually ‘pull cables’. Try to predict what will happen, and correct your design when the prediction proves wrong.
I’ve been on a project once where they called me: Mr. “you can pull any cable”.