Architectural Overview


Redundancy, Load Balancing, and High Availability

Redundancy, load balancing, and high availability are essential for true carrier-grade performance. WebLogic Network Gatekeeper uses both software and hardware components to achieve these ends.

WebLogic Network Gatekeeper's high availability mechanisms are supported by the clustering mechanisms made available by its container, WebLogic Server. For general information about WebLogic Server and clustering, see Using WebLogic Server Clusters.

 


Tiering

For both high availability and security reasons, Network Gatekeeper is split into two tiers: the Access Tier and the Network Tier. Each tier consists of a cluster, with at least two server instances per cluster, and all server instances run in active mode, independently of each other. The servers in both clusters are, in the context of WebLogic Server, managed servers. Together the clusters make up a single WebLogic Server administration domain, controlled through an administration server.

Figure 9-1 Example Production Domain


Communication between the Access Tier and the Network Tier takes place using Java RMI. Application requests are load-balanced between the Access Tier and the Network Tier and failover mechanisms are present between the two. See Traffic Management Inside Network Gatekeeper for more information on these mechanisms in application-initiated and network-triggered traffic flows.

An additional tier contains the database. Within the cluster, a cluster-aware storage service keeps data highly available. It makes session state data available to all Network Tier instances, because multiple invocations can relate to the same session.

 


Traffic Management Inside Network Gatekeeper

Failure is possible at many stages along the path that traffic follows as it moves through Network Gatekeeper. The following sections detail, tier by tier, how Network Gatekeeper deals with problems that might arise in both application-initiated and network-triggered traffic.

Application-initiated Traffic

Application-initiated traffic consists of all requests that travel from applications through Network Gatekeeper to underlying network nodes.

The example below follows the worst-case scenario for application-initiated traffic as it passes through Network Gatekeeper, and the failover mechanisms that attempt to keep the request alive.

Figure 9-2 Failover mechanisms in application-initiated traffic


  1. The application sends a request to Network Gatekeeper. In a production environment, this request is routed through a hardware load balancer, usually protocol-aware. If the request towards the initial Access Tier server fails (1.1), either a time-out or a failure is reported. The load-balancer, or the application itself, is responsible for retrying the request.
  2. The request is retried on a second server in the Access Tier cluster (1.2) and it succeeds. That server then attempts to send the request on to the Network Tier.
  3. The request either fails to reach the Network Tier or fails during the process of marshalling/unmarshalling the request as it travels to the Network Tier server (1.2.1).
  4. A failover mechanism in the Access Tier sends the request to a different server in the Network Tier cluster and it succeeds (1.2.2). That Network Tier server then attempts to send the request on to the network node.
  5. Note: If the request fails within the Network Tier, failover does not occur. In this case, an exception is thrown to the application, which can then re-send the request.
  6. The attempt to send the request to the telecom network node fails (1.2.2.1).
  7. If a redundant pair of network nodes exists, the request is forwarded to the redundant node (1.2.2.2). If this request fails, the failure is reported to the application.
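
The failover rule between the Access Tier and the Network Tier in the steps above can be illustrated with a minimal Python sketch. It is not Network Gatekeeper code: the exception classes, the `send_with_failover` function, and the `nt1`, `nt2`, and `nt_faulty` stand-ins are hypothetical names chosen for illustration. The sketch assumes that a failure to reach a server, or a marshalling failure, can be distinguished from a failure that occurs after processing has begun.

```python
class TransportError(Exception):
    """The request never reached the target server, or failed during marshalling."""


class ProcessingError(Exception):
    """The request failed after a Network Tier server began processing it."""


def send_with_failover(request, network_tier_servers):
    """Try each Network Tier server in turn, failing over only on transport errors.

    Returns (result, names_of_servers_tried). A ProcessingError is not caught,
    so it propagates to the caller, mirroring the note that no failover occurs
    once a request fails inside the Network Tier.
    """
    tried = []
    for server in network_tier_servers:
        tried.append(server.__name__)
        try:
            return server(request), tried
        except TransportError:
            continue  # processing never started, so another server may be tried
    raise TransportError(f"no Network Tier server reachable, tried {tried}")


# Hypothetical stand-ins for Network Tier server instances.
def nt1(request):
    raise TransportError("nt1 unreachable")


def nt2(request):
    return f"nt2 handled {request}"


def nt_faulty(request):
    raise ProcessingError("failure inside the Network Tier")


result, tried = send_with_failover("sendSms", [nt1, nt2])
print(result, tried)
```

In this sketch a request that hits `nt_faulty` raises `ProcessingError` straight back to the caller, which corresponds to the exception thrown to the application, which can then re-send the request.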

Network-triggered Traffic

Network-triggered traffic can consist of the following:

For network-triggered traffic, Network Gatekeeper relies heavily on the telecom network node, or other external artifacts such as load-balancers with failover capabilities, to do failover.

In the case of network nodes that can handle the registration of multiple callback interfaces, such as a Parlay Gateway, Network Gatekeeper registers one primary and one secondary callback interface. If the Parlay Gateway is unable to send a request to the network plug-in registered as the primary callback interface, the Parlay Gateway is responsible for retrying the request, sending it to the plug-in that is registered as the secondary callback interface. This secondary callback interface is found in a network plug-in residing in another Network Tier instance. The plug-ins are responsible for communicating with each other and making sure that both callback interfaces are registered. See Network Node Supports Primary and Secondary Notification below for more information.

For HTTP-based protocols, such as MM7, MLP, and PAP, Network Gatekeeper relies on an HTTP load balancer with failover functionality between the telecom network node and Network Gatekeeper. See Network Node Supports Only Single Notification below for more information.

If a telecom network protocol does not support load balancing and high availability, a single point of failure is unavoidable. In this case, all traffic associated with a specific application is routed through the same Network Tier server and each plug-in has one single connection to one telecom network node.

The worst-case scenario for network-triggered traffic is described below. It covers medium life span notifications and a network node that supports primary and secondary callback interfaces.

Note: For more information on life spans, see Registering Notifications with Network Nodes.
Figure 9-3 Failover mechanisms in network-triggered traffic


  1. A telecom network node sends a request to the Network Gatekeeper network plug-in that has been registered as the primary. It fails (1.1) due to either a communication or server failure.
  2. The telecom network node resends the request, this time to the plug-in that is registered as the secondary callback interface. This plug-in is in a different server instance within the Network Tier cluster.
  3. The Network Tier attempts to send the message to the callback EJB in the Access Tier. It fails (1.2.1).
  4. If the request fails to reach the Access Tier, or failure occurs during the marshalling/unmarshalling process, the Network Tier retries, targeting another server in the Access Tier. It succeeds (1.2.2). If, however, the failure occurs after processing has begun in the Access Tier, failover does not occur and an error is reported to the network node.
  5. The callback EJB in the Access Tier attempts to send the request to the application (1.2.2.1). If the application is unreachable or does not respond, the request is considered failed, and an error is reported to the network node.

 


Registering Notifications with Network Nodes

Before applications can receive network-triggered traffic, or notifications, they must register their interest in doing so with Network Gatekeeper, either by sending a request or having the operator set the notification up using OAM methods. In turn these notifications must be registered with the underlying network node that will be supplying them. The form of this registration is dependent on the capabilities of that node.

If registration for notifications is supported by the underlying network node protocol, the traffic path's network plug-in is responsible for performing it, whether the registration is the result of an application-initiated registration request or an on-line provisioning step in Network Gatekeeper. For example, all OSA/Parlay Gateway interfaces support such registration for notifications.

Some network protocols may not support all registration types. For example, in MM7 an application can register to receive notifications for delivery reports on messages sent from the application, but not to receive notifications on messages sent to the application from the network. In this case, registration for such notifications can be done as an off-line provisioning step in the MMSC.

Network Gatekeeper is responsible for correlating all network-triggered traffic with its corresponding application, whether the original registration for notification was completed using a request from the application or OAM methods.

There are three categories for such registrations, based on the expected life span of the notification. These categories determine the failover strategies used:

Short life span: These notifications are very short-lived, with an expected life span of a few seconds. Typically these are delivery acknowledgements for hand-off of the request to the network node, where the response to the request is reported asynchronously. For this category, a single plug-in, the originating one, is deemed sufficient to handle the response from the network node.
Medium life span: These notifications are neither short- nor long-lived, with an expected life span of minutes up to a few days. Typically these are delivery acknowledgements for message delivery to an end-user terminal. For this category, the delivery notification criteria that have been registered are replicated to exactly one additional instance of the network protocol plug-in. The plug-in that receives the notification is responsible for registering a secondary notification with the network node, if possible.
Long life span: These notifications are long-lived, with an expected life span of more than a few days. Typically these are registrations for notifications for network-triggered SMS and MMS messages or calls that need to be forwarded to an application. For this category, the delivery notification criteria are replicated to all instances of the network plug-in. Each plug-in that receives the notification is responsible for registering an interface with the network node.
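
As an illustration only, the replication rule for the three categories can be written as a small Python function. The names `LifeSpan` and `replication_targets` are hypothetical. The choice of which additional instance receives the copy in the medium case (here, simply the first other instance in the list) is an assumption, because the text does not say how that instance is selected.

```python
from enum import Enum


class LifeSpan(Enum):
    SHORT = "short"    # a few seconds
    MEDIUM = "medium"  # minutes up to a few days
    LONG = "long"      # more than a few days


def replication_targets(life_span, originating, all_plugins):
    """Return the plug-in instances that should hold the notification criteria."""
    others = [p for p in all_plugins if p != originating]
    if life_span is LifeSpan.SHORT:
        return [originating]               # originating plug-in only
    if life_span is LifeSpan.MEDIUM:
        return [originating] + others[:1]  # exactly one additional instance
    return [originating] + others          # all plug-in instances


plugins = ["nt1", "nt2", "nt3"]  # hypothetical plug-in instance names
for span in LifeSpan:
    print(span.name, replication_targets(span, "nt1", plugins))
```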

Network Node Supports Primary and Secondary Notification

Figure 9-4 below illustrates how Network Gatekeeper registers both primary and secondary notifications with network nodes that support it. This capability must be supported both by the network protocol in the abstract, and by the implementation of the protocol as it exists in both the network node and the traffic path's network plug-in.

Note: The scenario assumes that the network node supports registration for notifications with overlapping criteria (primary/secondary).
Figure 9-4 Registration flow with primary/secondary notifications


  1. The request to register for notifications enters the network protocol plug-in from the application.
  2. The primary notification is registered with the telecom network node.
  3. The notification information is propagated to another instance of the network protocol plug-in.
  4. The secondary notification is registered with the telecom network node.
Note: The concept of primary/secondary notification is not necessarily ordered. The most recently registered notification may, for example, be designated the primary notification.

When a network-triggered request that matches the criteria in a previously registered notification reaches the telecom network node, the node first tries the network plug-in that registered the primary notification. If that request fails, the network node has the responsibility of retrying, using the plug-in that registered the secondary notification. The secondary plug-in will have all necessary information to propagate the request through Network Gatekeeper and on to the correct application.
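
The registration and retry behavior described in this section can be sketched in Python as follows. This is a simplified model, not the Parlay or Network Gatekeeper API: `PluginInstance`, `register`, and `node_deliver` are hypothetical names, and the criteria string and endpoint URL are placeholder values. The sketch assumes the network node tries its registered callback interfaces in order.

```python
class PluginInstance:
    """Hypothetical stand-in for a network protocol plug-in in one Network Tier server."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.criteria = {}  # notification criteria -> application endpoint

    def deliver(self, criteria, message):
        if not self.available:
            raise ConnectionError(f"{self.name} unreachable")
        return f"{self.name} -> {self.criteria[criteria]}: {message}"


def register(primary, secondary, criteria, endpoint):
    """Register the criteria with the primary, then propagate them to the secondary."""
    primary.criteria[criteria] = endpoint
    secondary.criteria[criteria] = endpoint  # propagated copy of the registration
    return [primary, secondary]              # order in which the node tries them


def node_deliver(callbacks, criteria, message):
    """Model of the network node: try the primary first, then the secondary."""
    for plugin in callbacks:
        try:
            return plugin.deliver(criteria, message)
        except ConnectionError:
            continue  # the node is responsible for retrying on the next interface
    raise ConnectionError("all registered callback interfaces failed")


primary = PluginInstance("nt1", available=False)  # simulate a failed primary
secondary = PluginInstance("nt2")
callbacks = register(primary, secondary, "tel:1234", "http://app.example/notify")
print(node_deliver(callbacks, "tel:1234", "hello"))
```

Because the secondary plug-in holds a copy of the registration, it can resolve the target application without contacting the failed primary.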

Network Node Supports Only Single Notification

Figure 9-5 below illustrates the registration step in Network Gatekeeper if the underlying network node does not support primary/secondary notification registration.

Note: The scenario assumes that the network node does not support registration for notifications with overlapping criteria. Only one notification for a given set of criteria is allowed.
Figure 9-5 Registration flow with single notification node


  1. The request to register for notifications enters the network protocol plug-in from the application.
  2. The primary notification is registered with the telecom network node.
  3. The notification information (matching criteria, target URL, etc.) is propagated to another instance of the network protocol plug-in. The plug-in makes the necessary arrangements to be able to receive notifications.

As the illustration above shows, in this situation the underlying network node has a callback interface to only a single network plug-in. To achieve high availability and load balancing, a load balancer with failover support must be introduced between the network protocol plug-in and the network node, as in Figure 9-6 below.

Note: Whether or not this is possible depends on the network protocol, as the load balancer must be protocol-aware.
Figure 9-6 Traffic flow with single notification node


 


Network Configuration

In addition to the specific hardware components listed above, the general structure of a Network Gatekeeper installation is designed to support redundancy and high availability. A typical installation consists of a number of UNIX/Linux servers connected through duplicated switches. Each server has redundant network cards connected to separate switches. The servers are organized into clusters, with the number of servers in the cluster determined by the needed capacity.

As described previously, Network Gatekeeper is divided into an Access Tier, which manages connections to applications, and a Network Tier, which manages connections to the underlying telecom network. For security, the Network Tier is usually connected only to Access Tier servers, the appropriate underlying network nodes, and the WebLogic Server administration server, which manages the domain. A third tier hosts the database. This tier should be hosted on dedicated, redundant servers. For physical storage, Network Attached Storage connected through fibre channel controller cards is an option.

Because the different tiers perform different tasks, their servers should be optimized with different physical profiles, including amount of RAM, disk types, and CPUs. Each tier scales individually, so the number of servers in a specific tier can be increased without affecting the other tiers.

A sample configuration is shown in Figure 9-7. Smaller systems in which the Access Tier and the Network Tier are co-located in the same physical servers are possible, but only for non-production systems. Particular hardware configurations depend on the specific deployment requirements, and are worked out in the dimensioning and capacity planning stage.

Figure 9-7 Sample hardware configuration


In high availability mode, all hardware components are duplicated, eliminating any single point of failure. This means that there are at least two servers executing the same software modules, that each server has two network cards, and that each server has a fault-tolerant disk system, for example RAID.

The administration server may have duplicate network cards, connected to each switch. The optional PRM servers should run on separate, dedicated servers.

For security reasons, the servers used for the Access Tier can be separated from the Network Tier servers using firewalls. The Access Tier servers reside in a Demilitarized Zone (DMZ) while the Network Tier servers are in a trusted environment.

 


Geographic Redundancy

All Network Gatekeeper modules in production systems are deployed in clusters to ensure high availability. This prevents single points of failure in general usage. To prevent service failure in the face of catastrophic events - natural disasters or massive system outages like power failures - Network Gatekeeper can also be deployed at two geographically distant sites as site pairs. Each site, which is a Network Gatekeeper domain, has a site peer. See Figure 9-8 for more information.

Figure 9-8 Overview of geographically redundant site pairs


Note: The geographic distribution of the sites is not transparent to the applications accessing Network Gatekeeper. There is no single sign-on mechanism across sites and an application must establish a session with each site it intends to use. In case of site failure, an application must manually fail over to a different site. Provisioning for each site must be performed individually.

SLA enforcement is synchronized across geographic sites and SLAs are enforced across predefined pairs. Each site is configured to have a reference to its peer site. A subset of all SLAs for a given site is designated as being enforceable across sites. Exactly which parts are selected depends on particular applications and their usage patterns.

Each site maintains a designated hub node that is responsible for accounting and the enforcement of SLAs at that site. The service executing on the hub node is highly available and is migrated to another server should server failure occur. Cross-site enforcement is accomplished through hub-to-hub synchronization of global usage counts. The accuracy of enforcement across site pairs is configurable through an accuracy factor, which is translated into a synchronization interval based on, among other settings, the number of servers.
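
The following Python sketch is a deliberately simplified model of hub-to-hub synchronization, intended only to show why the synchronization interval affects enforcement accuracy. The `Hub` class and `synchronize` function are hypothetical. The sketch assumes each hub enforces a quota against its own count plus the last count received from its peer. It does not model the accuracy factor or how the interval is derived, since the text does not specify them.

```python
class Hub:
    """Hypothetical hub node that enforces a quota shared by a site pair."""

    def __init__(self, quota):
        self.quota = quota
        self.local_count = 0  # requests admitted at this site
        self.peer_count = 0   # peer's count as of the last synchronization

    def allow(self):
        """Admit a request if the known global usage is below the quota."""
        if self.local_count + self.peer_count >= self.quota:
            return False
        self.local_count += 1
        return True


def synchronize(hub_a, hub_b):
    """Exchange usage counts between the two hubs."""
    hub_a.peer_count = hub_b.local_count
    hub_b.peer_count = hub_a.local_count


site_a, site_b = Hub(quota=5), Hub(quota=5)
for _ in range(3):
    site_a.allow()
for _ in range(2):
    site_b.allow()

synchronize(site_a, site_b)
print(site_a.allow(), site_b.allow())  # both hubs now see the quota as used up
```

Between synchronizations each hub works with a stale peer count, so the pair can briefly admit more than the quota. In this model, a shorter interval narrows that window at the cost of more cross-site traffic.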

Applications that normally use only a single site for their traffic can fail over to their peer site while maintaining ongoing SLA enforcement. This scenario is particularly relevant for SLA aspects that have longer-term impact, such as quotas.

Figure 9-9 Geographically redundant site pairs and applications


The geographic redundancy design does not maintain state for ongoing conversations. Conversations in this sense are defined in terms of the correlation identifiers that are returned to the applications by Network Gatekeeper or passed into Network Gatekeeper from the applications. Any state associated with a correlation identifier exists on only a single geographic site and is lost in the event of a site-wide disaster. Conversational state includes, but is not limited to, call state and registration for network triggered notifications. This type of state is considered volatile, or transient, and is not replicated at the site level.

By implication, therefore, conversations must be conducted and completed on their site of origin. If an application wishes to maintain conversational state cross-site - for example, a registration for network-triggered traffic - it must register with each site individually.

On the other hand, this type of affinity does not prevent load balancing between sites for different or new conversations. An example might be sending an SMS message. Because each such request constitutes a new conversation, sending SMS messages could be balanced between the sites.
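
Because site selection is the application's responsibility, this affinity rule would be implemented on the application side. A minimal Python sketch of one possible approach follows. The `SiteRouter` class, the site names, and the correlation identifier are hypothetical. The sketch assumes the application has already established a session with each site and simply alternates new conversations between them.

```python
import itertools


class SiteRouter:
    """Hypothetical application-side helper that picks a site for each request."""

    def __init__(self, sites):
        self._next_site = itertools.cycle(sites)  # round-robin for new conversations
        self._origin = {}  # correlation identifier -> site that holds its state

    def remember(self, correlation_id, site):
        """Record the site of origin once a correlation identifier is known."""
        self._origin[correlation_id] = site

    def route(self, correlation_id=None):
        """Return the origin site for a known conversation, else the next site."""
        if correlation_id in self._origin:
            return self._origin[correlation_id]
        return next(self._next_site)


router = SiteRouter(["site-a", "site-b"])
site = router.route()             # new conversation, for example setting up a call
router.remember("call-42", site)  # its state now exists only on that site
print(router.route())             # an unrelated SMS may go to the other site
print(router.route("call-42"))    # follow-up requests return to the site of origin
```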

Below is a high-level outline of the redundancy functionality:

Quotas that span longer periods of time are persisted in the database to avoid losing state information during server or site failures. Replication is performed at the level of Network Gatekeeper as a whole, rather than by relying on the databases to perform it.
Request limits that span longer periods of time are persisted in a manner similar to that of quota counters.

Limitations:

Service provider, application group (including SLA) and account data are not replicated across sites. Provisioning must be performed at each individual site.
Applications are expected to either register for notifications from all the sites or to re-register for notifications upon site failure.
If application requests are to be load balanced across sites, the applications must establish sessions with each site separately.
If an application fails over to the back-up site, Network Gatekeeper does not support fail-back to the original site.
SLAs may use overrides that, for example, set traffic levels based on time-of-day. Overrides are not enforced across sites, even if Network Gatekeeper is otherwise configured to enforce SLAs across sites. If overrides are present in these SLAs, alarms are emitted.
