|
Redundancy, load balancing and high availability are essential for true carrier grade performance. WebLogic Network Gatekeeper uses both software and hardware components to achieve these important ends:
WebLogic Network Gatekeeper's high availability mechanisms are supported by the clustering mechanisms made available by its container, WebLogic Server. For general information about WebLogic Server and clustering, see Using WebLogic Server Clusters.
For both high availability and security reasons, Network Gatekeeper is split into two tiers: the Access Tier and the Network Tier. Each tier consists of a cluster, with at least two server instances per cluster, and all server instances run in active mode, independently of each other. The servers in both clusters are, in the context of WebLogic Server, managed servers. Together the clusters make up a single WebLogic Server administration domain, controlled through an administration server.

Communication between the Access Tier and the Network Tier takes place using Java RMI. Application requests are load-balanced between the Access Tier and the Network Tier and failover mechanisms are present between the two. See Traffic Management Inside Network Gatekeeper for more information on these mechanisms in application-initiated and network-triggered traffic flows.
There is an additional tier containing the database. Within the cluster, data is made highly available using a cluster-aware storage service. Because multiple invocations can relate to the same session, this service ensures that session state data is available to all Network Tier instances.
Potential failure is possible at many stages along the path that traffic follows as it moves through Network Gatekeeper. The following sections detail, tier by tier, how Network Gatekeeper deals with problems that might arise in both application-initiated and network-triggered traffic.
Application-initiated traffic consists of all requests that travel from applications through Network Gatekeeper to underlying network nodes.
The example below follows the worst-case scenario for application-initiated traffic as it passes through Network Gatekeeper, and the failover mechanisms that attempt to keep the request alive.

| Note: | If the request fails within the Network Tier, failover does not occur. In this case, an exception is thrown to the application, which can then re-send the request. |
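The re-send behavior described in the note can be sketched from the application's side. This is a minimal illustration, not part of any Network Gatekeeper API: `NetworkTierError`, `send_with_resend`, and the `max_attempts` parameter are hypothetical names, and `send` stands for whatever client call the application uses.

```python
class NetworkTierError(Exception):
    """Hypothetical stand-in for the exception thrown to the application."""


def send_with_resend(send, request, max_attempts=3):
    """Call send(request), re-sending if the request fails in the Network Tier.

    No failover happens inside the Network Tier, so the retry decision
    belongs to the application. max_attempts is an assumed policy knob.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return send(request)
        except NetworkTierError as err:
            last_error = err  # remember the failure and try again
    raise last_error
```

An application would wrap its normal request call in `send_with_resend` and handle the final exception if every attempt fails.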
Network-triggered traffic can consist of the following:
For network-triggered traffic, Network Gatekeeper relies heavily on the telecom network node, or other external artifacts such as load-balancers with failover capabilities, to do failover.
In the case of network nodes that can handle the registration of multiple callback interfaces, such as a Parlay Gateway, Network Gatekeeper registers one primary and one secondary callback interface. If the Parlay Gateway is unable to send a request to the network plug-in registered as the primary callback interface, the Parlay Gateway is responsible for retrying the request, sending it to the plug-in that is registered as the secondary callback interface. This secondary callback interface is found in a network plug-in residing in another Network Tier instance. The plug-ins are responsible for communicating with each other and making sure that both callback interfaces are registered. See Network Node Supports Primary and Secondary Notification below for more information.
For HTTP-based protocols, such as MM7, MLP, and PAP, Network Gatekeeper relies on an HTTP load balancer with failover functionality between the telecom network node and Network Gatekeeper. See Network Node Supports Only Single Notification below for more information.
If a telecom network protocol does not support load balancing and high availability, a single point of failure is unavoidable. In this case, all traffic associated with a specific application is routed through the same Network Tier server and each plug-in has one single connection to one telecom network node.
The worst-case scenario for network-triggered traffic is described below. It covers medium life span notifications and a network node that supports primary and secondary callback interfaces.
| Note: | For more information on life spans, see Registering Notifications with Network Nodes. |

Before applications can receive network-triggered traffic, or notifications, they must register their interest in doing so with Network Gatekeeper, either by sending a request or having the operator set the notification up using OAM methods. In turn these notifications must be registered with the underlying network node that will be supplying them. The form of this registration is dependent on the capabilities of that node.
If registration for notifications is supported by the underlying network node protocol, the traffic path's network plug-in is responsible for performing it, whether the registration is the result of an application-initiated registration request or an on-line provisioning step in Network Gatekeeper. For example, all OSA/Parlay Gateway interfaces support such registration for notifications.
Some network protocols may not support all registration types. For example, in MM7 an application can register to receive notifications for delivery reports on messages sent from the application, but not to receive notifications on messages sent to the application from the network. In this case, registration for such notifications can be done as an off-line provisioning step in the MMSC.
Network Gatekeeper is responsible for correlating all network-triggered traffic with its corresponding application, whether the original registration for notification was completed using a request from the application or OAM methods.
There are three categories for such registrations, based on the expected life span of the notification. These categories determine the failover strategies used:
The figure below illustrates how Network Gatekeeper registers both primary and secondary notifications with network nodes that support this capability. The capability must be supported both by the network protocol in the abstract, and by the implementation of the protocol as it exists in both the network node and the traffic path's network plug-in.
| Note: | The scenario assumes that the network node supports registration for notifications with overlapping criteria (primary/secondary). |

| Note: | The concept of primary/secondary notification is not necessarily ordered. The most recently registered notification may, for example, be designated the primary notification. |
When a network-triggered request that matches the criteria in a previously registered notification reaches the telecom network node, the node first tries the network plug-in that registered the primary notification. If that request fails, the network node has the responsibility of retrying, using the plug-in that registered the secondary notification. The secondary plug-in will have all necessary information to propagate the request through Network Gatekeeper and on to the correct application.
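The retry behavior expected of the network node can be sketched as follows. This is an illustrative model only: `NetworkNode`, `DeliveryError`, and the rule that the first registration is the primary are assumptions made for the example, since the actual ordering is node-specific.

```python
class DeliveryError(Exception):
    """Hypothetical: raised when a plug-in callback cannot be reached."""


class NetworkNode:
    """Toy model of a node that accepts overlapping notification criteria."""

    def __init__(self):
        # criteria -> callbacks in the order tried (primary first, assumed)
        self._callbacks = {}

    def register(self, criteria, callback):
        self._callbacks.setdefault(criteria, []).append(callback)

    def notify(self, criteria, event):
        """Try the primary callback, then retry with the secondary."""
        last_error = None
        for callback in self._callbacks.get(criteria, []):
            try:
                return callback(event)
            except DeliveryError as err:
                last_error = err  # the node, not Network Gatekeeper, retries
        if last_error is None:
            raise LookupError("no notification registered for " + criteria)
        raise last_error
```

In this model each callback represents a plug-in in a different Network Tier instance, so a failure of one instance still leaves a path to the application.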
Figure 9-5 below illustrates the registration step in Network Gatekeeper if the underlying network node does not support primary/secondary notification registration.
| Note: | The scenario assumes that the network node does not support registration for notifications with overlapping criteria. Only one notification for a given set of criteria is allowed. |

As is clear from the above illustration, in this situation the underlying network node has a callback interface to only a single network plug-in. To achieve high availability and load balancing, a load balancer with failover support must be introduced between the network protocol plug-in and the network node, as in Figure 9-6 below.
| Note: | Whether or not this is possible depends on the network protocol, as the load balancer must be protocol-aware. |
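The role of the load balancer in this setup can be illustrated with a small round-robin dispatcher that fails over to the next plug-in when one is unreachable. The names `FailoverLoadBalancer` and `BackendDown` are hypothetical, and a real load balancer would also need to understand the network protocol, which this sketch does not attempt.

```python
class BackendDown(Exception):
    """Hypothetical: raised when a plug-in endpoint does not respond."""


class FailoverLoadBalancer:
    """Round-robin over plug-in endpoints, skipping ones that are down."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._next = 0  # index of the backend that gets the next request

    def forward(self, request):
        count = len(self._backends)
        start = self._next
        self._next = (self._next + 1) % count  # advance for load balancing
        last_error = None
        for offset in range(count):
            backend = self._backends[(start + offset) % count]
            try:
                return backend(request)
            except BackendDown as err:
                last_error = err  # fail over to the next backend
        raise last_error
```

The network node sees a single callback address, while requests are spread across plug-ins and redirected when one of them fails.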

In addition to the specific hardware components listed above, the general structure of a Network Gatekeeper installation is designed to support redundancy and high availability. A typical installation consists of a number of UNIX/Linux servers connected through duplicated switches. Each server has redundant network cards connected to separate switches. The servers are organized into clusters, with the number of servers in the cluster determined by the needed capacity.
As described previously, Network Gatekeeper is divided into an Access Tier, which manages connections to applications, and a Network Tier, which manages connections to the underlying telecom network. For security, the Network Tier is usually connected only to Access Tier servers, the appropriate underlying network nodes, and the WebLogic Server administration server, which manages the domain. A third tier hosts the database. This tier should be hosted on dedicated, redundant servers. For physical storage, Network Attached Storage via fibre channel controller cards is an option.
Because the different tiers perform different tasks, their servers should be optimized with different physical profiles, including amount of RAM, disk types, and CPUs. Each tier scales individually, so the number of servers in a specific tier can be increased without affecting the other tiers.
A sample configuration is shown in Figure 9-7. Smaller systems in which the Access Tier and the Network Tier are co-located in the same physical servers are possible, but only for non-production systems. Particular hardware configurations depend on the specific deployment requirements, and are worked out in the dimensioning and capacity planning stage.

In high availability mode, all hardware components are duplicated, eliminating single points of failure. This means that there are at least two servers executing the same software modules, that each server has two network cards, and that each server has a fault-tolerant disk system, for example RAID.
The administration server may have duplicate network cards, connected to each switch. The optional PRM servers should run on separate, dedicated servers.
For security reasons, the servers used for the Access Tier can be separated from the Network Tier servers using firewalls. The Access Tier servers reside in a Demilitarized Zone (DMZ) while the Network Tier servers are in a trusted environment.
All Network Gatekeeper modules in production systems are deployed in clusters to ensure high availability. This prevents single points of failure in general usage. To prevent service failure in the face of catastrophic events - natural disasters or massive system outages like power failures - Network Gatekeeper can also be deployed at two geographically distant sites as site pairs. Each site, which is a Network Gatekeeper domain, has a site peer. See Figure 9-8 for more information.

| Note: | The geographic distribution of the sites is not transparent to the applications accessing Network Gatekeeper. There is no single sign-on mechanism across sites and an application must establish a session with each site it intends to use. In case of site failure, an application must manually fail-over to a different site. Provisioning for each site must be performed individually. |
SLA enforcement is synchronized across geographic sites and SLAs are enforced across predefined pairs. Each site is configured to have a reference to its peer site. A subset of all SLAs for a given site is designated as being enforceable across sites. Exactly which parts are selected depends on particular applications and their usage patterns.
Each site maintains a designated hub node that is responsible for accounting and the enforcement of SLAs at that site. The service executing on the hub node is highly available and is migrated to another server should server failure occur. Cross-site enforcement is accomplished through hub-to-hub synchronization of global usage counts. The accuracy of enforcement across site pairs is configurable through an accuracy factor, which is translated into a synchronization interval based on, among other settings, the number of servers.
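The relationship between synchronization and enforcement accuracy can be illustrated with a toy model. Everything in it is an assumption made for illustration: the `Hub` class, the `synchronize` function, and the idea that each hub enforces a quota against its own count plus the last count it received from its peer. The way Network Gatekeeper derives the synchronization interval from the accuracy factor is not modeled.

```python
class Hub:
    """Toy hub node enforcing a shared quota across a site pair."""

    def __init__(self, quota):
        self.quota = quota
        self.local_count = 0  # requests accepted at this site
        self.peer_count = 0   # peer usage as of the last synchronization

    def try_consume(self):
        """Accept a request only if the known global usage is under quota."""
        if self.local_count + self.peer_count >= self.quota:
            return False
        self.local_count += 1
        return True


def synchronize(hub_a, hub_b):
    """Exchange usage counts between the two hubs (one sync round)."""
    hub_a.peer_count = hub_b.local_count
    hub_b.peer_count = hub_a.local_count
```

Between synchronization rounds each hub sees only stale peer usage, so the pair can briefly exceed the quota. A shorter interval narrows that window at the cost of more hub-to-hub traffic.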
Applications that normally use only a single site for their traffic can failover to their peer site while maintaining ongoing SLA enforcement. This scenario is particularly relevant for SLA aspects that have longer term impact such as quotas.

The geographic redundancy design does not maintain state for ongoing conversations. Conversations in this sense are defined in terms of the correlation identifiers that are returned to the applications by Network Gatekeeper or passed into Network Gatekeeper from the applications. Any state associated with a correlation identifier exists on only a single geographic site and is lost in the event of a site-wide disaster. Conversational state includes, but is not limited to, call state and registration for network triggered notifications. This type of state is considered volatile, or transient, and is not replicated at the site level.
By implication, therefore, conversations must be conducted and complete on their site of origin. If an application wishes to maintain conversational state cross-site - for example, a registration for network-triggered traffic - it must register with each site individually.
On the other hand, this type of affinity does not prevent load balancing between sites for different or new conversations. An example might be sending an SMS message. Because each such request constitutes a new conversation, sending SMS messages could be balanced between the sites.
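From the application's point of view, the combination of site affinity and cross-site balancing might look like the sketch below. `SiteRouter` and its method names are hypothetical, and the sketch assumes the application already holds a session with each site and tracks which site issued each correlation identifier.

```python
import itertools


class SiteRouter:
    """Balance new conversations across sites; pin ongoing ones to one site."""

    def __init__(self, sites):
        if not sites:
            raise ValueError("at least one site is required")
        self._cycle = itertools.cycle(list(sites))
        self._affinity = {}  # correlation identifier -> site of origin

    def new_conversation(self, correlation_id):
        """Pick the next site in rotation and remember it for this conversation."""
        site = next(self._cycle)
        self._affinity[correlation_id] = site
        return site

    def continue_conversation(self, correlation_id):
        """Return the site of origin; state exists nowhere else."""
        return self._affinity[correlation_id]
```

Independent requests such as individual SMS sends go through `new_conversation`, while any follow-up that carries a correlation identifier must use `continue_conversation` so that it reaches the site holding its state.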
Below is a high-level outline of the redundancy functionality:
|