Network Working Group                                        M. Schwartz
Internet-Draft                                     Code On The Road, LLC
Expires: April 7, 2002                                   October 7, 2001

The ANTACID Replication Service: Rationale and Architecture

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC 2026 except that the right to produce derivative works is not granted. (If this document becomes part of an IETF working group activity, then it will be brought into full compliance with Section 10 of RFC 2026.)

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at

The list of Internet-Draft Shadow Directories can be accessed at

This Internet-Draft will expire on April 7, 2002.

Copyright Notice

Copyright (C) The Internet Society (2001). All Rights Reserved.


Abstract

This memo presents the ANTACID Replication Service, which replicates hierarchically named repositories of XML documents for business-critical, internetworked applications.

ASCII and HTML versions of this document are available at and, respectively.


1. Introduction

In this document we present the motivation, design rationale, and architecture for the ANTACID Replication Service (ARS). ARS replicates repositories of hierarchically-named, well-formed XML documents [1] in a manner that supports business-critical, internetworked applications (BCIAs). The ARS protocol and algorithmic details are defined in a companion document [2].

By "business-critical" we mean applications requiring stronger data coherence guarantees than file-by-file replication, but weaker than global ACID semantics (i.e., Atomic, Consistent, Isolated, Durable [3] semantics spanning all replicas). Our motivation is that many commercial services require coherence guarantees for replicated data, but that global ACID semantics are not appropriate across the Internet because:

The "ANTACID" part of ARS refers to data coherence semantics we define later in this document, which we believe are well suited to BCIAs.

By "internetworked" we mean applications characterized by:

To date, much of the replication technology that provides business-critical support has developed in the relational database management system (RDBMS) world. This world has produced technology that provides significant structure (well-defined data models, integrity constraints, atomic update, and serial audit trails), but that does not work well in internetworked environments. For example, RDBMSs often use protocols too data-intensive for low-bandwidth networks, lack support for distributed administration, and require all replicas to communicate with a single master server. On the other hand, much of the current internetwork-oriented replication technology (specifically, protocols standardized by the Internet Engineering Task Force (IETF)) does not provide business-critical support. For example, IETF replication protocols typically provide no means of atomically updating a set of data objects, and in some cases do not guarantee that all updates are delivered to all replicas.

With continuing growth in the Internet's size and breadth of uses, applications increasingly require distributed data management technology that provides data coherence guarantees and also works well across the Internet. Recent examples include content delivery networks, corporate data sharing partnerships, content syndication, and network service providers sharing parts of their network management data. The growing popularity of XML promises to accelerate the trend of deploying business critical applications across the Internet.

Because of the split RDBMS/IETF evolution of replication technology, and because of the large number of incompatible replication technologies that have been developed to date, building a BCIA typically requires custom implementation and integration. The author has personally worked on a number of commercial replication efforts that took significant time and effort to deploy, because of the need to integrate technologies from multiple vendors and fill out missing pieces with custom development. Market pressures forced each of these projects to launch with incomplete integration (e.g., lacking some needed network management functionality). Moreover, there were recurring integration costs as the vendors introduced new versions of their technologies. This kind of experience is common to many Management Information Systems (MIS) shops.

The current effort seeks to define a standard capable of replicating data for use by BCIAs, allowing a robust service to be deployed by configuration rather than custom development/integration. The current effort also seeks to incorporate lessons learned over the past 20 years from both the RDBMS and the IETF worlds. Both the protocol and the data units replicated by ARS are XML-based, because we are betting on XML to become a dominant means of structuring data on the Internet.

We begin this section by presenting three example applications, to motivate ARS's goal of standardized replication support for BCIAs. Next, we compare caching and replication, to clarify our meanings for these terms and to motivate our focus on replication. We then discuss why we believe existing replication technology is not well suited to business-critical internetworked environments, and finally enumerate the ARS replication requirements.

At the end of this document we summarize the strengths and limitations of ARS, to help data service architects determine whether ARS is appropriate for their needs.

1.1 Example Applications

We now present and discuss some example replication applications for each of the characteristics introduced earlier. Briefly, those characteristics were:

  1. business-critical support: stronger data coherence guarantees than file-by-file replication, but weaker than global ACID semantics;
  2. support for internetworked environments: the ability to operate in an administratively distributed environment with up to millions of servers in the face of partial outages, a wide range of network capacity, and a wide range of provisioning requirements; and
  3. standard interfaces: the ability to deploy applications through configuration rather than custom integration/development.

As an example application requiring characteristic 1 (business-critical support), consider the problem of distributing web site updates to a number of replica servers. This problem is faced by content distribution networks, as well as by MIS shops that manage geographically replicated intranets. After being editorially reviewed or otherwise approved for distribution, content is placed on a central staging server, from which it is replicated to a number of other servers. When replication occurs, the set of files holding images referenced from the HTML "assembly" pages must be moved into place atomically with the assembly pages themselves, to avoid "broken link" errors when users browse the content.
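The move-into-place step might be realized on a single replica with a staged-directory swap. The following sketch (with hypothetical paths and file contents) stages an assembly page together with its images and then publishes them via a single atomic rename on a POSIX file system, so readers see either the old site or the complete new one, never a mix.

```python
import os
import tempfile

def publish_atomically(live_link, files):
    """Stage a group of files, then swap a symlink so readers see
    either the old version or the complete new version, never a mix."""
    staging = tempfile.mkdtemp(prefix="staging-")
    for name, content in files.items():
        with open(os.path.join(staging, name), "w") as f:
            f.write(content)
    tmp_link = live_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(staging, tmp_link)
    os.replace(tmp_link, live_link)  # atomic rename: old -> new in one step

# Hypothetical content: an assembly page plus the image it references.
publish_atomically("/tmp/site-live", {
    "index.html": "<img src='logo.png'>",
    "logo.png": "binary-image-data",
})
```

Because the rename is atomic, a crash or network outage during staging leaves the previously published version fully intact.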

As an example application requiring characteristic 2 (support for internetworked environments), consider the problem of a real-time news provider, which must relay news headlines and stories to a large base of mobile clients. As mobile data services grow over the next few years, the client base could grow to several hundred million in size. To handle the huge load bursts and 99.9999% uptime such scale would require, a service provider might replicate content through a tiered arrangement, with a single staging server feeding a set of continental replica servers across wide area links. Each continental server, in turn, feeds a number of regional replica servers, which eventually populate several thousand local point-of-presence (POP) servers globally. Each POP then passes updates to the set of geographically local clients as they become intermittently reachable over the network. The entire system would need to be provisioned in a fashion that ensures updates reach the POP servers within (say) 10 minutes of when the updates are injected into the staging server, and might support multiple classes of service -- for example with professional brokers paying more for up-to-the-minute stock quote data.

As a second example requiring support for internetworked environments, consider the problem of provisioning service for a new user by populating a set of downstream services with slices of data sourced at a back office client database. For example, when a new user signs up for Internet service at a national service provider, information such as their login name, password, given name, and subscribed service list must be propagated to a set of RADIUS [4], DHCP [5], POP [6], SMTP [7], LDAP [8] and database-driven web servers. A similar operation (though perhaps applied to a smaller scale user base) is needed for supporting MIS operations managing employees and contractors in a medium-to-large corporate intranet. One particularly complex requirement (which may or may not be selected, depending on local policy) is allowing clients to update their local server's information during temporary network partitions. For example, a service provider may wish to allow clients to change their passwords and continue logging in to the local server during a network partition, and then propagate the password updates to other replica sites when the partition is repaired. This functionality requires support within the replication system for so-called "multi-master" updates, and could result in conflicting updates being injected at multiple points in the network, requiring conflict detection and resolution procedures. Multi-master update and conflict detection/resolution add significant complexity to a replication system, so it is desirable to structure the replication protocol in such a fashion that services that do not need this functionality can avoid the associated implementation costs and runtime overhead.
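The write-write conflict detection mentioned above can be illustrated minimally with per-document version counters: an update is accepted only if it was based on the current version. This is an illustration only, with hypothetical document names, not ARS's actual mechanism (which is defined in the companion document).

```python
def detect_conflict(store, name, base_version, new_content):
    """Accept an update only if it was based on the current version;
    otherwise report a write-write conflict for resolution."""
    current_version, _ = store.get(name, (0, None))
    if base_version != current_version:
        return False  # conflict: another master committed since we read
    store[name] = (current_version + 1, new_content)
    return True

store = {}
# First master read version 0 and updates successfully.
detect_conflict(store, "users/alice/password", 0, "hash-1")
# A second master also read version 0 during the partition; its
# update now conflicts and must go through conflict resolution.
detect_conflict(store, "users/alice/password", 0, "hash-2")
```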

As an example application requiring characteristic 3 (standard interfaces), consider again the service provisioning example mentioned above. At present, RADIUS, SMTP, and the other services listed above are implemented by a number of different vendors, whose technologies use incompatible systems for populating and replicating their databases. Moreover, large MIS operations often manage employee or client data using technologies developed for the corporate personnel marketplace or the telco customer data marketplace, and these technologies also use their own proprietary data population/replication technology. Developing a service provisioning system therefore requires non-trivial custom integration and implementation, particularly as provisioning requires integration with operations support systems (e.g., the ability to disable all online and physical building access with a single action when an employee is terminated).

1.2 Caching vs. Replication and Lazy vs. Eager Update Propagation

Caching and replication are both used for distributing copies of data, to lower latency and reduce bottlenecks. It is worth briefly reviewing the distinction we see between these approaches, as the industry has not converged on widely shared definitions for these terms [9][10].

Caching systems typically make copies of individual objects in response to requests for those objects, and refresh copies during a subset of future access requests. Replication systems typically make copies of defined collections of objects in advance of requests to those collections, and update collections when the origin objects (those being replicated) change.

Caching can typically be implemented more simply than replication, but lacks several useful properties of replication systems:

Caching systems are therefore most clearly appropriate for cases where simple configuration is important, full collection availability during partition is not required, object content changes slowly, and client notification is not required.

Replication is most clearly appropriate for an environment where content changes rapidly (and therefore, selectively pushing updates of changed content works better than caching/pulling content); timely information is important; and provisioned service is important (so that service providers can engineer the flow of updates to remove bottlenecks that interfere with meeting their service level agreements).

Note that replication can be implemented using software originally designed to support caching. For example, one could implement a content delivery network that replicates content by pre-loading a set of Squid [11] caches when new objects are added to a collection, flushing the caches when objects are deleted, and re-loading the caches whenever objects change.
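Such a cache-driven scheme might be driven by a small script along the following lines. The sketch builds (but does not send) the proxy requests that would purge and then pre-load an object on each cache in a set; the host names and URL are hypothetical, and Squid must be configured to accept the PURGE method.

```python
import urllib.request

def purge(cache_host, url):
    """Squid honors a PURGE method (when enabled) to evict an object."""
    req = urllib.request.Request(url, method="PURGE")
    req.set_proxy(cache_host, "http")
    return req

def preload(cache_host, url):
    """A plain GET through the cache causes it to fetch and store a copy."""
    req = urllib.request.Request(url)
    req.set_proxy(cache_host, "http")
    return req

# On each content change: purge the stale copy, then re-load, on every cache.
caches = ["cache1.example.com:3128", "cache2.example.com:3128"]
requests = [op(c, "http://origin.example.com/logo.png")
            for c in caches for op in (purge, preload)]
```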

1.3 Currently Deployed Replication Technology

A great deal of work has been done on replication over the past 20 years. In the current section we focus on deployed technology. Throughout the remainder of this document we discuss research approaches as they relate to the ARS architecture.

Many vendors have developed proprietary replication technology, for file systems, relational and object databases, version control, messaging and directory services, business-to-business integration and electronic data interchange, systems and network management, workflow, and content delivery networks. While many of these systems provide sophisticated replication functionality, an open standards-based approach is valuable because of its potential to enhance interoperability among applications that share access to replicated datastores. For example, it is unfortunate that such inherently related processes as version control and workflow are typically deployed in current intranets using different vendors' incompatible replication technologies. To achieve deep integration, companies must spend time developing "plumbing" code that moves data between these incompatible systems, encodes semantics from one system into opaque fields in the other system(s), and integrates with other operations support software (such as network and systems management systems). Often sites deploy only partially integrated solutions because of time-to-market pressures.

While there are a number of replication standards designed for internetworked environments, none adequately supports business-critical applications. For example, the Network News Transfer Protocol (NNTP) [12] replicates postings from thousands of newsgroups across millions of machines, but enforces little structure. It is common to see news postings sequenced in different orders on different machines, with confusing references to "earlier" articles that appear later in a particular repository. There also are no guarantees that all articles posted to a newsgroup will reach all replicas of that newsgroup.

The Internet Cache Protocol (ICP) [13] provides a way for cache servers to share data among a hierarchy of caching peers. This approach comes at a cost in runtime latency [14][15]. It also provides no means of synchronizing cache contents or ensuring content integrity. Sites wishing such stronger guarantees must develop additional mechanisms for those purposes.

The HTTP Distribution and Replication Protocol (DRP) [16] defines efficient means for requesting missing or updated files, primarily in support of software distribution. It lacks support for a variety of requirements we consider important for business-critical internetworked environments. For example, DRP provides no support for atomically updating a group of files, routing updates according to network structure or policy, or allowing multiple sites to update shared data.

The Distributed Authoring and Versioning protocol (WebDAV) [17] defines means of remotely editing documents, including support for collection creation and versioning. This functionality is orthogonal to replication, and in fact the WebDAV requirements state that "the WebDAV extensions must be able to operate in a distributed, replicated environment".

The Network Data Management Protocol (NDMP) [18] defines protocol support for clients to control how backup is done among various configurations of servers and networks. This problem is tangentially related to replication.

A common way to achieve some of the advantages of open standards with deeper replication capabilities is to represent business logic and content as XML stored inside a database or file system that supports replication. While such an approach can be made to work, a storage management-independent replication protocol is needed because of the fragmented state of the file system and database marketplaces:

Beyond the marketplace fragmentation issue, file systems and databases do not provide appropriate functionality for a business-critical internetworked environment:

Finally, most database and file replication systems use communication-intensive network protocols that exhibit performance problems when used across the Internet. While a number of research systems address weakly connected [23] or topologically complex [26] environments, there is a paucity of practically deployable replication solutions that work well across the range of connectivity and topology found on the Internet.

The service that comes closest to ANTACID's goals and approaches is the UDDI replication service [27]. Like ANTACID, UDDI replication supports pull-based replication of XML content with notification when changes are available, and transmits updates over a Directed Acyclic Graph (DAG). Also, since UDDI is used for advertising and discovering services offered by network-accessible businesses, it is clearly intended for a business-critical internetworked environment. Unlike ANTACID, UDDI replication defines a service-specific interface (i.e., intended to support only UDDI data). As increasing numbers of XML-based services come online we believe a general XML replication service would be valuable, for the reasons discussed earlier. Because UDDI replication is focused on the specific needs of UDDI servers, the replication design fixes a number of choices that ANTACID leaves configurable, including authentication and confidentiality services (vs. ANTACID's reliance on BEEP [28], which allows negotiation of security services); update content encodings (vs. ANTACID's support for negotiated encodings); non-support of update-anywhere functionality (vs. ANTACID's optional support for update anywhere); and the requirement that updates be propagated within a set amount of time (vs. ANTACID's treatment of convergence time guarantees as a separate service provisioning issue).

1.4 Requirements

We begin with a high-level, idealized list of requirements, which we then selectively relax when we consider the example replication applications:

  1. Computational Power and Connectivity of Participating Devices: support everything from small/weakly connected devices to large/well connected data centers.
  2. Scale: replicate arbitrarily large volumes of data, to arbitrarily many sites.
  3. Distribution of Control: allow arbitrarily fine-grained control over what content is replicated, to which sites, and under which administrative policies.
  4. Availability: allow reading and writing any available copy of data in the presence of server and network outages, subject to security policy restrictions.
  5. Consistency: ensure globally consistent transactional (ACID) data updates.
  6. Efficiency / Performance: avoid duplicate transmissions of updates across network links; compress redundancy out of data transmissions; route updates intelligently with respect to network topology; disseminate updates at full network speed; guarantee that updates commit globally within a specified, bounded period of time.
  7. Security: support authenticated, encrypted, authorized access to digitally signed data.
  8. Implementation Complexity vs. Depth of Functionality: define easily implemented primitives that can support deep, complex application scenarios.
  9. Configuration and Operational Management: support adding and removing replicas at any time. Support automatic configuration (e.g., of replication topology and update scheduling) that adapts intelligently over time. Support a serial audit capability: the ability to enumerate an authoritative list of all update operations applied, single-clock timestamps for those operations, and the principals that performed them.

Looking at the breadth of example target applications, we see that they vary in a number of dimensions, resulting, for example, in requirements to support wide ranges of computational power, connectivity, and number of replicas. Perhaps more important are the characteristics these applications have in common:

As an example of the update atomicity and serialization semantics, in a content distribution network it is important to update a group of HTML and image files atomically so that a network outage cannot leave a server in a state where the top-level ("assembly") page has been updated but the new image files it references have not yet arrived. As a second example, if multiple back office sites submit updates that result in subscriber account provisioning for an Internet Service Provider, it is important that an audit trail be generated showing the serialized sequence of updates that were applied. Such a trail can help track down data integrity problems introduced at various back office sites that cause subscriber-visible service problems.
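The serialized audit trail just described can be illustrated with a minimal append-only log, in which sequence numbers and timestamps come from a single clock/counter at one server. The principal and target names below are hypothetical.

```python
import time

class AuditLog:
    """Append-only, serially ordered record of applied updates.
    Sequence numbers come from a single counter, timestamps from a
    single clock, so the trail is an authoritative serial history."""
    def __init__(self):
        self.entries = []

    def record(self, principal, operation, target):
        entry = {
            "seq": len(self.entries) + 1,   # serial order of application
            "time": time.time(),            # single-clock timestamp
            "principal": principal,
            "op": operation,
            "target": target,
        }
        self.entries.append(entry)
        return entry

# Two back office sites update the same subscriber record; the log
# shows exactly which update was applied first, and by whom.
log = AuditLog()
log.record("backoffice-east", "update", "subscribers/jdoe")
log.record("backoffice-west", "update", "subscribers/jdoe")
```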

Below we revisit the idealized requirement list, highlight those that are particularly difficult to satisfy in an internetworked environment, and, with an eye towards the example replication applications, present our chosen compromises:

  1. Computational Power and Connectivity of Participating Devices: It is unlikely that a single engineering design could work well for Internet ranges of connectivity and capacity, and also compare favorably to technology tuned for high capacity local interconnects (e.g., a fiber channel-based storage area network used to interconnect backend databases and application servers). We therefore focus ARS on two classes of Internet computing environments:
    • Well connected, high-capacity Internet server complexes: for example, a set of failover servers with diversely routed DS3 connectivity; and,
    • Mobile nodes: modest computational and storage capacity, and frequent disconnection from the network.

  2. Scale: while we see no algorithmic reasons why ARS could not scale arbitrarily, we establish the following numeric goals:
    • Ability to replicate data sets up to 30 terabytes in size.
    • Ability to replicate individual pieces of data to hundreds of well-connected data centers, or to millions of Personal Digital Assistants (PDAs).

  3. Distribution of Control: replication systems designed to support arbitrarily distributed control typically route updates among replicas at random and use anti-entropy operations [29] to ensure that all updates circulate globally so that the replication service provides eventual consistency [30]. Anti-entropy can be a costly approach, because it requires pair-wise state comparisons to determine missed updates. Moreover, by allowing arbitrary flows of data it can be difficult for replica administrators to engineer the global service to meet needed convergence times. For these reasons, we impose the following constraints on distribution of control:
    • Updates flow according to a structured communication graph, rather than to arbitrary peers.
    • Ability to distribute control over replication across administrative boundaries defined by name tree partitions.
    • Ability to set differing numbers of replicas for different parts of the name space.
    • Support for both central source-of-record and update-anywhere models.
    • No support for master-less replication (a la LDUP [8]). While master-less replication is useful, it makes serial audit impossible and automated 100% coherence difficult.

  4. Availability: one of the classic design decisions to be made for a replication system is how to trade availability against consistency. In the Internet environment, we consider availability generally to be more important, and impose most of the constraint relaxation on consistency:
    • Ability to submit updates to any available copy.
    • Ability to read committed updates from any available copy. Optional support for reading the current set of submitted updates from any available copy.
    • Ability to propagate updates in the presence of partial node and network outages.

  5. Consistency:
    • Support eventual consistency among asynchronously circulated updates.
    • Ensure mutual consistency among documents at a datastore after each update to that datastore. We exclude from scope broader notions of referential integrity, for example guaranteeing that a bookmarked web page can later be successfully visited. This latter problem might be addressed by a combination of archival repositories and a versioning name space, which, again, are out of scope.
    • Support ability to determine when an update has committed at the submission server. Provide no support for determining when an update has circulated to all replicas.
    • Support the ability to detect write-write conflicts, and the ability to notify applications when updates arrive as a partial solution to the read-write conflict problem.
    • Not all applications need to replicate all data served by an ARS server, so it must be possible for applications to cache documents using a simple Time-To-Live (TTL) mechanism to detect when it is necessary to request an updated copy. ARS servers are not required to participate in cache coherence management.

  6. Efficiency / Performance: consistency requirements generally impose performance bottlenecks that prevent updates from flowing at full network speed. Our goal here is that ARS have performance reasonably close to that achieved by the best currently deployed replication technology, while making intelligent use of network resources. Specifically:
    • Performance of individual point-to-point replication should not be significantly worse than proprietary schemes, excluding convergence time (which depends on connectivity factors outside the control of the replication system).
    • Efficient use of network resources: don't send the same update many times over a link.
    • Ability to route update transmissions in ways that make sense with respect to network topology and/or policy.
    • Support for time-bounded convergence is outside the scope of ARS, but it should be possible to configure server operations to bound convergence times, when done in conjunction with appropriate system provisioning/capacity/uptime engineering. As a possible future extension, a protocol could be defined to support some form of conditional service guarantees for deployments where resource reservations are possible.

  7. Security: security generally involves a tradeoff against ease of configuration and use, and sometimes carries a performance impact. ARS should provide ranges of choices for authentication, confidentiality, access control, and data integrity, allowing individual sites to choose how they make the tradeoffs. The collective ARS documents should provide sufficient background to allow sites to make these choices in an educated fashion.
  8. Implementation Complexity vs. Depth of Functionality: define a minimal set of broadly useful required protocol elements, and an additional set of optional protocol elements that increase the functionality for more complex applications. Also, focus on replication, reusing existing solutions to related problems such as content filtering, load sharing, and configuration management.
  9. Configuration and Operational Management:
    • Automatic configuration is difficult to achieve for anything but the simplest of systems. Even something as outwardly simple as auto-selecting full- vs. half- duplex network switching has historically been the source of surprising numbers of operational problems. Data replication is a far more complex problem. ARS therefore defines no configuration automation. ARS may impose some restrictions about the service states when replicas may be added or removed.
    • One of ARS's key design tradeoffs is the decision that serial audit must be supported; as will be seen, this single decision forces other design choices that impact availability and consistency. Nonetheless, we feel serial audit is critical because it can help when tracking down service failures and security breaches, and it can provide a legal record of update activity.
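The pair-wise state comparison that makes anti-entropy costly (see the distribution-of-control discussion above) can be illustrated with version vectors: each pair of replicas must exchange per-source counters and compute which updates the other has missed. A sketch, with hypothetical server names:

```python
def missed_updates(local, remote):
    """Compare two replicas' version vectors ({source: highest_seq_seen})
    and return, per source, the range of updates the local replica lacks."""
    gaps = {}
    for source, remote_seq in remote.items():
        local_seq = local.get(source, 0)
        if remote_seq > local_seq:
            gaps[source] = range(local_seq + 1, remote_seq + 1)
    return gaps

# Replica A has never heard of server3, and is behind on server2's updates.
a = {"server1": 5, "server2": 2}
b = {"server1": 3, "server2": 4, "server3": 1}
gaps = missed_updates(a, b)
```

With arbitrary peer-to-peer flows, every replica pair may need such an exchange; ARS's structured DAG instead fixes who sends what to whom, avoiding the pair-wise comparisons.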


2. Terminology

We use the term "document" to refer to a well-formed XML document. We use the term "datastore" to refer to any hierarchically-named document repository. The datastore could be implemented as a file system, a database table, or some other persistent storage system. ARS makes no assumptions with respect to granularity of access (e.g., attribute-level vs. file level) of the underlying datastore. However, the ARS service itself replicates at the granularity of an individual document. In other words, modifications to individual elements or attributes will result in updates to the complete document.
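The whole-document granularity can be illustrated as follows: even a single-attribute modification yields a complete new document as the unit of replication. The datastore, document name, and attribute below are hypothetical.

```python
import xml.etree.ElementTree as ET

def update_attribute(datastore, name, child_tag, attr, value):
    """Modify one attribute of one element, but produce (and store) a
    complete new document: the full document is the unit ARS replicates."""
    doc = ET.fromstring(datastore[name])
    doc.find(child_tag).set(attr, value)
    new_doc = ET.tostring(doc, encoding="unicode")
    datastore[name] = new_doc
    return new_doc  # the entire document is what gets propagated

store = {"doc.rfc.2629": "<rfc number='2629'><title>XML</title></rfc>"}
update_attribute(store, "doc.rfc.2629", "title", "role", "main")
```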

We use the term "name space" to refer to the universe of all names within a particular datastore naming system. For example, the Blocks name space [31] consists of all names for Blocks (such as "doc.rfc.2629"). We use the term "name tree" to refer to the hierarchical collection of names within a particular name space (e.g., the names within the Blocks name space).
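A hierarchical name space of this kind makes it straightforward to locate the most specific administrative partition enclosing a given name, by enumerating the name's ancestors from most to least specific. The partition assignments below are hypothetical.

```python
def ancestors(name, sep="."):
    """Yield a hierarchical name and its enclosing names, most specific
    first: doc.rfc.2629 -> doc.rfc -> doc."""
    parts = name.split(sep)
    for i in range(len(parts), 0, -1):
        yield sep.join(parts[:i])

def governing_partition(name, partitions):
    """Find the most specific name-tree partition enclosing `name`."""
    for prefix in ancestors(name):
        if prefix in partitions:
            return prefix
    return None

partitions = {"doc.rfc": "ietf-admin", "doc": "root-admin"}
owner = governing_partition("doc.rfc.2629", partitions)
```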

We use the term "server" to mean a process that responds to requests issued by another process. "Client" means a request issuer. These terms apply to roles that may be transient. In particular, "ARS servers" act as servers when a client submits a request, and act as clients when forwarding requests to other ARS servers. Moreover, ARS servers can initiate communications when performing callback notification to a client that made an earlier request to the server. We use the term "ARS peers" when referring to either a client or server that communicates using the ARS protocol.

While the ARS client and server roles can be transient, we make a fixed distinction between processes that receive (and possibly forward) replication requests vs. those that only submit requests. We refer to the former as "replication servers" and the latter as "clients of the replication service". We refer in aggregate to replication servers as "the replication service".

This role distinction is motivated by the Internet gateway model [32], which distinguishes between the information that must be known within the routing core of the Internet and the information that must be known by hosts needing packet routing service. In the case of Internet gateways, the distinction reduces the number of machines that are affected by routing changes. In the case of ARS the distinction allows us to:

We discuss both of these issues in more depth later.

We use the term "primary" to refer to a designated server for a particular region (the structure of which will be explained later) of the name space that serializes updates for that region, and "non-primary" for all other servers for that name space region.

ARS servers for individual regions of the name space are arranged in a Directed Acyclic Graph (DAG [33]) rooted at the primary. We refer to ARS servers closer to the primary along the DAG than a given server as "upstream servers", and those further as "downstream servers".
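The upstream/downstream relationship can be computed as hop distance from the primary along the propagation DAG, as in the following sketch (the topology is hypothetical).

```python
from collections import deque

def depths_from_primary(dag, primary):
    """Breadth-first search over the propagation DAG, returning each
    server's shortest hop distance from the primary."""
    depth = {primary: 0}
    queue = deque([primary])
    while queue:
        node = queue.popleft()
        for child in dag.get(node, []):
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth

# Hypothetical topology: the primary feeds two continental servers,
# which both feed the same regional server (a DAG, not a tree).
dag = {"primary": ["eu", "na"], "eu": ["regional1"], "na": ["regional1"]}
d = depths_from_primary(dag, "primary")
```

A server is upstream of another when it lies closer to the primary along the DAG; here "eu" and "na" are both upstream of "regional1".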

We do not use the term "secondary" to denote the servers downstream of a primary, because there can be multiple hops between a primary and the set of all downstream servers in the DAG, and "secondary" does not adequately capture the subtleties of interactions along the propagation DAG. Instead we speak of primary vs. non-primary servers, and of upstream vs. downstream servers.

We use the term "ARS service" to refer to the overall replicated data service supported by ARS.

We use the term "stable storage" to denote non-volatile memory (e.g., disk) implemented in a fashion where write operations survive system crashes/reboots [34].

We use the term "commit" to indicate a computation performed on a single datastore that applies a set of updates in a fashion that respects ACID semantics at that single datastore. There is no notion of multi-server commit in ARS.

We use the term "submission server" to indicate the server to which an update was originally submitted by a client of the ARS service. We use the term "submission path" to indicate the sequence of servers, starting at the submission server and terminating at the primary server, via which an update is submitted and propagated.

We use the term "principal" to refer to any named entity in the system, including users, groups of users, and services running on behalf of particular administrative regions of the replicated name space.

ARS specifies a required subset of protocol elements and two optional subsets (the details of which will be presented shortly). We use the term "the ARS protocol" to refer to any legitimate subset of all three sets of protocol elements.

Note that some of the terms used by ARS are different from their use in the Domain Name System [35], even though ARS also forms a hierarchical distributed datastore. Specifically, both systems have a "primary" server notion, but where the DNS has "secondary" servers, ARS has non-primary servers. The DNS replicates data from primary to secondary servers, but ARS replicates data along a DAG from primary to non-primary servers. Also, the DNS and ARS both define a notion of "zone" to support distributed administration, but the DNS defines more ways to partition zones than ARS does (specifically, the DNS defines a notion of "class" that is not present in the ARS).


3. Overview of the Approach

Many aspects of the ARS architecture were influenced by the desire to balance support for business-critical functionality against the need to operate in an internetworked environment. In this section we present an overview of the ARS approach, at the end of which we summarize the design decisions made to enhance scalability and operation across a variety of network environments.

ARS partitions each name space into disjoint regions called zones, to support distributed administration. Zones can also enhance ARS scalability by allowing a primary that receives too much update traffic to delegate part of its work to another server.

Each ARS server supports one or more zones. Individual zones may or may not be replicated.

For each zone updates are propagated along a DAG rooted at a distinguished primary server that serializes all updates to that zone. This DAG provides for redundancy and load distribution while distributing updates, and can be configured to take network characteristics (topology, link bandwidth, policy, etc.) into account.

There are two distinct phases of the update process:

  1. submission of an update to the replication service; and,
  2. serialization and commit of that update, beginning at the primary.

To distinguish these phases, we talk about update submission as distinct from serialization/commit.

ARS servers can push updates down the DAG as they arrive, to reduce replication latency if needed. Servers can also pull updates, for example to "catch up" when first joining a replication DAG, or after downtime, partition repair, or mobile reconnection. For synchronization and implementation simplification reasons a push is actually just a suggestion from the upstream server that the downstream server should perform a pull request. The choice of whether to push or pull updates is a local configuration decision made at each server, though server administrators should coordinate these choices to support specific provisioning requirements.
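The "push is a suggestion to pull" model described above can be made concrete with a small sketch. The following Python fragment is illustrative only and is not part of the ARS protocol; class and method names (Upstream, Downstream, on_pull_hint) are ours. The upstream server never sends update content unsolicited: it sends only a hint carrying its latest commit sequence number (CSN), and the downstream server responds with an ordinary pull from its last-seen CSN.

```python
# Illustrative sketch of push-as-pull-suggestion. An upstream server hints
# at its latest CSN; the downstream server decides whether to pull.

class Upstream:
    def __init__(self):
        self.log = {}          # CSN -> committed update
        self.downstreams = []

    def commit(self, csn, update):
        self.log[csn] = update
        for d in self.downstreams:      # "push": just a suggestion to pull
            d.on_pull_hint(self, latest_csn=csn)

    def pull(self, since_csn):
        # Return all updates committed after since_csn, in CSN order.
        return [(c, u) for c, u in sorted(self.log.items()) if c > since_csn]

class Downstream:
    def __init__(self):
        self.last_csn = 0
        self.applied = []

    def on_pull_hint(self, upstream, latest_csn):
        if latest_csn > self.last_csn:  # only pull if we are behind
            for csn, update in upstream.pull(self.last_csn):
                self.applied.append(update)
                self.last_csn = csn
```

Because the downstream server drives the transfer in both the push and pull cases, the same pull code path serves continuously connected servers and servers that are catching up after downtime or reconnection.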

When a replication server receives an update submission, it enters into what we call a client-server promise: after accepting the update, the server must follow that update through either to success or failure. This promise extends from the submission server through each ARS server up the DAG to the primary, so that in effect the client enters into a promise with the replication service (not just an individual server). An infrequently connected client of the replication service can therefore disconnect from the network once an ARS server has accepted its update submission, relying on the replication service to carry the update through to completion or failure without the client having to handle timing out and resubmitting the update. Clients can learn of commit results through an asynchronous notification mechanism.

ARS consists of three sub-protocols, only the first of which must be implemented by all ARS servers:

  1. the Commit-and-Propagate Protocol (ars-c), which allows a client to submit an update to an ARS server, performs zone-wide serialization, commits updates, and propagates committed updates down the DAG to other ARS servers that each individually commit and propagate the updates to their downstream servers. ars-c also provides asynchronous notification of commit-or-fail results to the submitting clients.
  2. the Submission-Propagation Protocol (ars-s), which allows an update that was submitted to a non-primary server to be propagated up the DAG to the primary server, where it is serialized with respect to all submission activity at the submission server (in addition to the zone-wide serialization performed by ars-c). ars-s re-uses the protocol element defined by ars-c for asynchronous client notification to notify servers down the submission path of update commit-or-fail results. ars-s provides a store-and-forward update submission mechanism, so that updates can succeed even if all servers along a submission path are never simultaneously available. Implementations that do not support ars-s allow updates to be submitted only to the primary.
  3. the Encoding Negotiation Protocol (ars-e), which defines a protocol for negotiating a set of MIME-based [36] data representations that can be used when passing submitted and committed updates among ARS peers. These negotiated representations support a variety of transmission optimizations, including compression, delta encoding, and bulk transfer. Implementations that do not support ars-e use a single required encoding for submitted and committed updates.

This protocol decomposition allows relatively simple implementations for broadly useful functionality, and additional functionality for more demanding services. In particular, "update-anywhere" functionality (supported by ars-s) is responsible for a significant amount of the service complexity. Many services do not require this functionality and need not incur its implementation complexity and runtime overhead.

Similar to Bayou [37], ARS transmits operations rather than data when transmitting both submitted and committed updates, to (a) avoid ambiguities with respect to deleted documents and (b) make the amount of information transmitted proportional to update activity instead of overall database size. ARS also defines a collapsing update mechanism that eliminates committed update transmissions that are overwritten by subsequent updates. This approach modifies coherence semantics in a way that should not impact most applications, and enhances efficiency for infrequently synchronized datastores. It can also reduce space requirements for log-based server implementations.

ARS supports a server-by-server model of update atomicity and consistency. When a client submits an update, it contains a set of operations that are to be performed together, all-or-nothing. That group of updates commits first at the primary and then travels to all downstream servers. The group of updates is applied asynchronously at each server with ACID semantics local to that server, and updates are guaranteed eventually to circulate to all replicas. In this way, a client interacting with a single replica sees updates that are locally consistent (e.g., retaining referential integrity between cross-linked documents at that server), while the replicated ARS service avoids the scaling problems that would result from global ACID semantics.

The ARS service tracks commit sequence numbers (CSNs) for each document, which may be used to detect write-write conflicts. The local server implementation or an application may also handle a simple case of read-write conflicts by notifying clients when an update to their input documents arrives. Both of these forms of conflict detection are optional, and may be implemented outside of the ARS service (e.g., with some form of trigger or "persistent fetch").
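As one example of the optional conflict detection described above, a local server implementation might use per-document CSNs to detect write-write conflicts. The following sketch is illustrative only (the Datastore class and its method names are ours, not ARS protocol elements): an update carries the CSN its author read, and the commit step rejects the update if the document has changed since.

```python
# Hypothetical per-document CSN bookkeeping for write-write conflict
# detection, performed outside the ARS protocol by a server implementation.

class Datastore:
    def __init__(self):
        self.docs = {}       # name -> (csn, content)
        self.next_csn = 1

    def read(self, name):
        return self.docs[name]           # returns (csn, content)

    def commit(self, name, based_on_csn, content):
        current_csn = self.docs.get(name, (0, None))[0]
        if based_on_csn != current_csn:
            return None                  # write-write conflict detected
        self.docs[name] = (self.next_csn, content)
        self.next_csn += 1
        return self.next_csn - 1         # CSN assigned to this commit
```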

ARS provides no protocol support for determining when an update has circulated globally. An application could be written to traverse the ARS DAG for a given zone and check for update completion, but distributed administration of servers could prevent discovery or querying of some servers. Moreover, the cost of this computation could be prohibitive. Finally, the list of replicas may itself be in flux, making it difficult to define a meaningful notion of "all replicas".

Looking at the overall approach outlined above, a variety of design decisions contribute to ARS's applicability to internetworked environments. These decisions concern scalability and operation over a variety of network environments.

In terms of scalability, probably the most important design decision is ARS's server-by-server model of update atomicity. This approach avoids the deadlock and reconciliation problems of global ACID semantics, and induces an application architecture where clients interact with local replicas rather than "popping around the net". A second design decision in support of scalability is routing updates according to a configured graph structure. This approach avoids the overhead of pair-wise state comparisons performed by anti-entropy operations, which are needed when updates are allowed to flow between arbitrary peers. A third scalability-oriented design is the separation of conflict detection from the replication service. This approach offloads conflict detection overhead from replication servers. A fourth scalability feature is the use of transmission encodings, which can support a number of optimizations, most importantly bulk transfer (which can make it feasible to replicate very large datastores). Finally, scalability is supported by the ability to partition the name space for distributed administration.

In terms of operation across a variety of network environments, probably the most important design decision is routing updates according to a configured graph structure. This approach allows data to be routed according to bandwidth, policy, or other network characteristics. For example, the graph could be configured so that updates are sent once across an expensive trans-oceanic link, and fan out from there to other replicas. Graph-based update routing also supports a provisioned service model [38], where administrators plan flows based on measured update rates and server and network capacity. In this way, commercial providers can support service level agreements across a variety of network environments, which would be difficult if updates were routed in an unconstrained fashion. A second design decision in support of varying network characteristics is store-and-forward update submissions. This approach allows updates to succeed even when the servers along a submission path are not simultaneously available. On a related note, by allowing both pull and push models of committed update propagation, peers that are not continuously connected can request updates when they connect, yet continuously connected servers can learn of updates without polling. Finally, the store-and-forward update mechanism also supports deployment on bastion hosts that forward traffic across security boundaries, similar to how proxy servers are often used for allowing Internet HTTP access from within an intranet.

Mobile PDA's represent a broad class of network environment deserving particular mention. ARS's client-server promise and asynchronous notification allow a PDA to connect, submit an update, and disconnect, managing only the state needed to tie notification back to the original submission. Also, ARS's collapsing update mechanism can reduce committed update traffic for infrequently synchronized PDA's. ARS's update-anywhere capability supports geographically constrained network architectures, allowing a PDA to synchronize data with a nearby cell server or kiosk. Finally, ARS's distinction of replication-service vs client-of-replication-service minimizes implementation complexity for computationally modest PDA devices.


4. Name Space

ARS operations specify documents using Uniform Resource Identifiers (URI's [39]). For this purpose, the "scheme" part specifies the name space being used. For example, a URI for a document in the Blocks name space might be "blocks:doc.rfc.2629".

The following figure illustrates a number of aspects of ARS naming, drawn from the Blocks name space:

                   z     -------     z
                   z    /   \   \    z
                   z   /    ... ...  z
              zzzzzzzzzzzz           z
              z      /   z           z
              z     doc  zzzzzzzzzz  z
              z    / | \---\      z  z
              z  rfc     z edgar  z  z
              z / | \    z  / | \ z  z
              z          z        z  z
                   ||\\ \\
                   || \\ \\_________
                   ||  \\ \--------\\
                   ||    \\         \\
                   ||      \\        \----------\\
                   ||        \\                  \\
              +----------+    +----------+    +-------------+
              |          |    |          |    |             |
              |     doc  |    |     doc  |    |     doc     |
              |    / | \ |    |    / | \ |    |    / | \    |
              |  rfc     |    |  rfc     |    |  rfc        |
              | / | \    |    | / | \    |    | / | \       |
              |          |    |          |    |             |
              +----------+    +----------+    +-------------+
                   ||    \\  //    ||               ||
                   ||      \\      ||               ||
                   ||    //  \\    ||               ||
              +----------+    +----------+    +-------------+
              |          |    |          |    |             |
              |     doc  |    |     doc  |    |      doc    |
              |    / | \ |    |    / | \ |    |     / | \   |
              |  rfc     |    |  rfc     |    |   rfc       |
              | / | \    |    | / | \    |    |  / | \      |
              |          |    |          |    |             |
              +----------+    +----------+    +-------------+

In this figure boxes surrounded by z's indicate zone boundaries. Zones are named by the top-level node in each zone, with the root given the distinguished name ".". Zones are partitioned by placing cuts between nodes in the name tree, and designating a primary server for each zone. In the figure there is a cut at "blocks:doc" and another at "blocks:doc.edgar", making 3 separate zones ("blocks:.", "blocks:doc", and "blocks:doc.edgar").
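The zone cut structure above can be made concrete with a small sketch. The following Python fragment is illustrative only (the zone names come from the figure; the function and variable names are ours, not ARS protocol elements): a document belongs to the most specific zone whose dotted name is a prefix of the document's name.

```python
# Hypothetical mapping from a document URI to its enclosing zone, using the
# three zones from the figure. The root zone is the fallback match.

ZONES = ["blocks:.", "blocks:doc", "blocks:doc.edgar"]

def zone_for(uri):
    scheme, _, name = uri.partition(":")
    best = scheme + ":."                       # root zone always matches
    for zone in ZONES:
        zscheme, _, zname = zone.partition(":")
        if zscheme != scheme or zname == ".":
            continue
        # A zone matches if the document name equals it or extends it by dots.
        if name == zname or name.startswith(zname + "."):
            if len(zname) > len(best.partition(":")[2]):
                best = zone                    # keep the most specific match
    return best

# zone_for("blocks:doc.rfc.2629")  -> "blocks:doc"
# zone_for("blocks:doc.edgar.10k") -> "blocks:doc.edgar"
```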


5. Replication Topology

For each zone that an ARS server replicates, the server is configured to execute the ARS protocol with zero or more upstream servers and with zero or more downstream servers. These servers are arranged into a DAG rather than permitting updates to flow via unconstrained peer-to-peer interconnections (as done for example by Bayou [37]) both to eliminate the need for anti-entropy operations and to provide more structure for replica administrators who wish to engineer flows to meet needed convergence times.

Each upstream server may be either the zone primary or another ARS server that replicates the needed zone. In this way, committed document updates for each zone are replicated along a DAG rooted at that zone's primary. An example is illustrated in the figure in the Name Space section, with two of the "doc" zone replicas configured to replicate from either of two different upstream replicas.

In the simplest ARS service configuration (one supporting only ars-c at all servers) updates are submitted at the primary, and after committing at the primary the updates propagate down the DAG to the non-primary servers. In a service that also supports ars-s, updates can be submitted to non-primary servers and routed up to the primary along the DAG. Supporting non-primary update submissions can provide several advantages:

  1. it allows clients (such as wireless PDA's) to submit updates to topologically nearby servers, which may provide transmission cost advantages;
  2. it allows load bursts to be distributed away from the primary, with non-primaries time-shifting or otherwise shaping update traffic up to the primary;
  3. it supports a distributed access control model, where individual server administrators manage update privileges for their local users; and,
  4. it allows updates to be submitted when the primary is temporarily unreachable.

Scalability in the number of replicas is achieved by building a DAG of ARS servers with fan-out at any particular replica engineered based on measured update rates and server and network capacity.

Each ARS server knows all of its immediate upstream servers and all of its immediate downstream servers. Thus, becoming a downstream server involves configuring the downstream server and all upstream servers being used by the new downstream server, through administrative arrangement outside the ARS service. The means by which this configuration is done (configuration file, database, etc.) are a matter of local implementation. ARS does not define a protocol for joining/leaving the replication topology.

When a zone is divided (i.e., part of its name space is delegated to a new zone), the ARS servers downstream of it continue to receive content only for the zones they were configured to replicate, not for the newly delegated zones. An ARS server administrator must explicitly add a new zone to the configuration in order to replicate that newly delegated content.

The choice whether the replication topology construction is done via manual configuration (e.g., using a graph editor) vs. by some automatic process (e.g., one that looks at routing tables or that performs bandwidth computations) is outside the scope of the ARS architecture. Over time we believe that automating the construction of replication DAGs will be needed to scale the service to thousands of replicas [26]. Solutions to this problem are an area of future study (for example, using some form of Protocol Independent Multicast [40] to help discover topologically nearby nodes).


6. Update Ordering

ARS serializes submitted updates to a zone so that all servers commit updates in the same order. Additionally, for pairs of servers that support ars-s, ARS serializes submitted updates according to the order that they arrived at the submission server. Each submission server's ordering imposes a partial zone-wide ordering: submissions are ordered per server, but do not imply any ordering between servers. The partial orderings of all submission servers are respected by the zone-wide total ordering chosen by the primary.
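One simple way to realize a total order that respects each submission server's partial order is a round-robin merge of per-server FIFO queues. The sketch below is purely illustrative (the serialize function and its round-robin policy are ours; ARS does not mandate any particular merge policy at the primary): any total order is legitimate as long as updates from the same submission server retain their original relative order.

```python
# Hypothetical zone-wide serialization at the primary: merge per-submission-
# server queues while preserving each queue's internal order.

from collections import deque

def serialize(per_server_queues):
    """per_server_queues: dict of server -> list of updates in submission
    order. Returns one zone-wide total order respecting each per-server
    order (round-robin over server names, for determinism)."""
    queues = {s: deque(q) for s, q in per_server_queues.items()}
    total = []
    while any(queues.values()):
        for server in sorted(queues):
            if queues[server]:
                total.append(queues[server].popleft())
    return total
```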

This ordering model is motivated by the desire to provide behavior that meets users' intuitive expectations, for the case of applications that interact with a single replica. Clearly it is important to sequence all updates into a globally consistent order, but it is also important to maintain submission site ordering. If submission site ordering were not maintained, problems like the following example scenario could arise.

Consider an online news site that publishes updates daily at midnight. Each day, the site performs an internal editorial review and then submits the update to the replication system. The staff member in charge of content replication submits the approved update, and while waiting for notification that the update has committed begins reading a local copy of the updated content. In browsing through the content the staff member notices an error that slipped past the editorial review, and submits an update to correct the error.

At this point, there are two update submissions in the system, neither of which has committed at the primary yet. If the ordering model did not respect submission site ordering, the correction could reach the primary before the original update (for example, because of a transient network outage that caused the first update to stall while the second update routed through a different path to the primary) and be serialized before it. As a result, the original (uncorrected) update content would become the final content on the web site for the current 24 hour period.

ARS servers request updates from an upstream server by specifying the zone and the CSN of the last operation seen for that zone. The upstream server responds by sending all updates that have committed since the specified sequence number, possibly collapsing some updates per the collapsing update notion mentioned earlier.

Note that ANTACID addresses a relatively constrained update ordering problem because it assumes updates are always applied to entire documents. Supporting finer-grained updates (e.g., attribute-level updates commonly supported by relational databases) would require dealing with update dependencies, merges, etc.


7. Atomicity, Consistency, and Reliability

7.1 Eventual Consistency

ARS supports eventual consistency semantics [30]. Specifically, there are no guarantees that all replicas are ever simultaneously in a consistent state, but if update submissions cease for long enough for in-flight updates to spread, all replicas will eventually converge to a consistent state. There are no guarantees about how long updates can take to propagate, but a local implementation may schedule updates and engineer sufficient capacity to meet a given propagation schedule.

Note that some form of conditional service guarantees [41] might be provided by proprietary vendor extensions, to bound update propagation times for deployments where resource reservations are possible. Moreover, in the future a new ARS sub-protocol could be defined to support such conditional service guarantees.

7.2 ANTACID Semantics

As noted earlier, globally consistent ACID semantics do not scale and are not needed for a usefully large class of applications. Instead, ARS supports a server-by-server model of update atomicity and consistency, which we call Asynchronous Network Traversed ACID (ANTACID) semantics. Specifically, updates submitted together are guaranteed to be applied in the same order with ACID semantics at each replica as they traverse the DAG and are applied at each server. There are no guarantees about when updates reach individual replicas or about the update state of one replica with respect to another replica. This approach is motivated by the scaling problems of providing global ACID semantics and by the fact that many applications work by interacting with a single replica, for which all that is required is that updates to that replica appear to be ACID to applications interacting with that replica.

The term "ANTACID" is also intended to connote a reduction in unpleasant complications that would arise from the over-use of ACID semantics in the global environment.

Fox et al. [42] define BASE (Basically Available, Soft state, Eventual consistency) semantics to be, in essence, everything that does not support ACID semantics. We believe it is useful to characterize ANTACID semantics at two different levels of system granularity. At the granularity of the global replicated service, ANTACID semantics are BASE. At the granularity of individual datastores, ANTACID semantics are ACID. We believe this split is key to supporting business critical internetworked applications.

Mnesia [43] is an RDBMS that supports the usual set of transactional operations as well as a set of so-called "dirty operations" that trade away transactional integrity for increased performance. These operations are atomic for individual reads/writes, but do not provide any way to group logically related operations in a way that supports referential integrity. Although Mnesia provides a way to selectively relax traditional ACID semantics (as ARS does), the mechanism is targeted towards realtime applications, rather than Internet data replication.

To illustrate why we feel ANTACID semantics are appropriate for BCIA's, consider the three example applications introduced earlier.

For the web content distribution application, a limited form of referential integrity is required: all of the content updates for a site must be made "live" simultaneously to avoid broken link errors when users view the content with a web browser. It is not required that all of the updates are made "live" simultaneously at all replica servers, and in fact imposing that sort of global ACID semantics would heavily impact availability: a partition affecting a single replica would delay all other replicas from turning the needed updates "live" on schedule, for applications requiring scheduled delivery (e.g., an online publication that updates its content at regular intervals). By supporting ANTACID semantics, the impact of regional network outages can be limited to the affected region. Note that there may be additional requirements for how closely together updates should be made "live" across all of the regions, and that such requirements fall under provisioning issues outside the scope of the replication system (e.g., requiring a certain level of routing redundancy and network capacity to ensure the delivery schedule can be met).

There are similar referential integrity requirements for the real-time news provider application. The provisioning requirements are likely to be significantly more demanding than they are for the content distribution application, given the time-sensitive nature of stock quote data. Again, these provisioning requirements require systems and network engineering beyond the scope of the replication system.

The service provisioning application also requires referential integrity. For example, it would be confusing to users if their DHCP accounts were activated before their email accounts were activated. By replicating customer/client data in a single standard format to a regional server, all customer-facing services (DHCP, POP, etc.) could access the same server and achieve the desired property that "all services in the region are available as soon as one is". Attempting to populate all regional servers simultaneously in a large geographically dispersed enterprise or ISP would unnecessarily reduce availability and increase lock contention/deadlocks/rollbacks needed to achieve such global ACID semantics.

To our knowledge ANTACID semantics have not previously been defined per se. However, these semantics are motivated by trying to formalize an approach that has been applied informally/one-off in various applications over the years.

7.3 Locking and Deadlock Prevention

To support ANTACID semantics we define an update group as a grouping of update operations to be applied to documents within a single zone, all-or-nothing at each ARS replica as the update propagates from primary to all non-primary servers.

To implement ANTACID semantics, locks are acquired separately at each server at the time a set of updates are to be committed, starting at the primary. Each server acquires a zone-wide lock (to ensure that update groups are applied in a zone-wide serialized order), performs the updates, and then releases the zone-wide lock. Because all locks for an update group are acquired at one time at each server after the entire update content has arrived at that server, ARS prevents deadlock [44]. Note that deadlock is possible if systems other than ARS acquire locks in a hold-and-wait fashion, so ARS implementations should carefully consider the possible deadlock problems if they allow services other than ARS to update the datastore.
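The commit discipline described above can be sketched as follows. This is illustrative only (the ZoneReplica class, its rollback-by-snapshot strategy, and its method names are ours, not required by ARS): a single zone-wide lock is taken only after the full update group has arrived, the group is applied all-or-nothing, and the lock is released. Because each server acquires its single lock in one step, with no hold-and-wait, no deadlock cycle can form within ARS itself.

```python
# Hypothetical per-server commit of an update group under one zone-wide
# lock, with all-or-nothing semantics via snapshot rollback.

import threading

class ZoneReplica:
    def __init__(self):
        self.lock = threading.Lock()   # one lock for the entire zone
        self.docs = {}

    def commit_group(self, update_group):
        """update_group: list of (doc_name, new_content); content has fully
        arrived before the lock is acquired. Returns True on commit."""
        with self.lock:
            snapshot = dict(self.docs)         # for all-or-nothing rollback
            try:
                for name, content in update_group:
                    if content is None:
                        del self.docs[name]    # deletion operation
                    else:
                        self.docs[name] = content
            except KeyError:                   # e.g., deleting a missing doc
                self.docs = snapshot           # abort: restore prior state
                return False
            return True
```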

The ARS grouped update mechanism is the only way to relate updates to each other, with all-or-nothing failure semantics. In particular, unless updates to multiple documents are grouped together, failure of one update has no effect on subsequent updates.

ARS does not provide transactional consistency among read/written values. In fact, no read locking is supported by ARS.

7.4 Collapsing Update Semantics

For efficiency of operation among infrequently synchronized datastores and to reduce server space requirements, ARS defines a notion of collapsing update semantics. Rather than insisting that each replica apply all update operations, the semantics are relaxed as follows. One or more operations may be elided (the expression of which we will discuss shortly) from an update group if the datastore contents after successfully committing this subset of updates would be identical to the datastore contents that would result if the complete set of updates were performed. Thus, for example, a document creation followed by three updates to that document may be "replayed" at a downstream server as a single write with that document's final value.

The term "elided" above means that the given update operations and content are replaced by null operations accompanied by the CSN for each operation. This approach is used rather than transmitting nothing for elided operations so that the downstream server receives a (sometimes null) operation for each CSN. This provides a degree of fault tolerance, since it is a simple matter to detect missing operations from an up-counting set of sequence numbers.

Update collapsing may be performed over operations that originally spanned multiple update groups, as long as the collapsed updates are transmitted within a single update group. As an example, if ten updates were applied in separate update groups to a particular document over the course of a day and a PDA requests all committed updates at the end of that day, the responding server could send a single update group that collapses out all but the final update to that document.

Upstream ARS servers may collapse the set of update operations sent in response to a request for committed update content. Downstream ARS servers must be prepared to handle collapsed updates.
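The collapsing mechanism and its null-operation encoding can be sketched as follows. The fragment is illustrative only (the collapse and check_contiguous functions and the tuple representation are ours, not ARS protocol elements): all but the final write to each document are replaced by null operations that still carry their CSNs, so a downstream server sees a contiguous CSN sequence and can detect a missing operation as a gap.

```python
# Hypothetical collapsing of a committed update stream, retaining one
# (possibly null) operation per CSN for downstream fault detection.

def collapse(ops):
    """ops: list of (csn, doc_name, content) in commit order."""
    last_for_doc = {}
    for csn, name, _ in ops:
        last_for_doc[name] = csn               # remember final write per doc
    return [(csn, name, content) if last_for_doc[name] == csn
            else (csn, None, None)             # elided -> null op with CSN
            for csn, name, content in ops]

def check_contiguous(ops, since_csn):
    """Downstream check: CSNs must count up from since_csn with no gaps."""
    expected = since_csn + 1
    for csn, _, _ in ops:
        if csn != expected:
            return False
        expected += 1
    return True
```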

7.5 Submitted Update Visibility

While updates do not become part of the "official" state of the replicated ARS service until they have been serialized and committed at the primary, there are times when it can be useful for applications to be able to see updates immediately after they have been submitted. For example, in a system that manages passwords for users it would be useful for users to be able to login to the system after changing their password at a local server rather than waiting for the update to commit at the primary. This issue gets to the heart of some key tradeoffs in the design of a replication system, so before presenting the ARS approach we briefly review two other approaches we considered.

Bayou provides a mechanism that allows applications to see a view (in RDBMS parlance; essentially, a stored query) showing updates that have been submitted but not yet committed, allowing tentative updates to circulate and possibly later be undone if a conflict is detected [45]. We chose not to use this approach because it can lead to a large volume of global undo/re-ordering activity in the case of widely replicated data.

LDUP defines schema and protocol extensions to LDAPv3 [8] for replicating directory content [46]. LDUP allows updates to be visible immediately after they have been accepted by a server by having each server act as an independent master, with peer-to-peer conflict resolution procedures performed between servers from time to time. This approach derives from a fundamental requirement of LDUP, namely, not depending on any single primary server for any given update. There are good reasons for this requirement, including the fact that updates are never stalled by an unreachable primary server. However, we chose to accommodate the limitation of requiring a primary for reasons that may not be appropriate in the LDAP arena. Specifically, without a designated primary server we would lose the serial audit capability, which is required by ARS.

Because of the above design considerations, ARS uses the following approach. As a local implementation matter, the submission server may allow applications to view the state of locally submitted updates. Submitted updates must not be propagated to other ARS servers in response to pull requests for committed updates. Applications that view submitted but not-yet-committed state must be prepared to deal with the case where the update is rolled back after failing to commit at the primary. For example, a password management application might allow local users to log in with the new password after a local password change, but revert to the old password if the primary later rejects the update.

If an ARS implementation allows applications to view the state of locally submitted updates, it should provide two views from which applications can choose: one with only the committed updates, and one with the committed as well as locally submitted updates.
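As a sketch of this dual-view arrangement (the class and method names are illustrative, not defined by ARS), a local datastore might keep committed state plus an overlay of locally submitted updates; rolling back a failed submission then simply discards the overlay entry:

```python
# Illustrative dual-view datastore; names are assumptions, not from ARS.

class LocalStore:
    def __init__(self):
        self.committed = {}     # doc -> content committed at the primary
        self.submitted = {}     # doc -> tentative, locally submitted content

    def view(self, include_submitted=False):
        """Committed view by default; overlay submitted state on request."""
        v = dict(self.committed)
        if include_submitted:
            v.update(self.submitted)
        return v

    def commit(self, doc):
        # Primary accepted the update: promote it to committed state.
        self.committed[doc] = self.submitted.pop(doc)

    def rollback(self, doc):
        # Primary rejected the update: drop the tentative overlay entry.
        self.submitted.pop(doc, None)
```

In the password example above, the "committed plus submitted" view lets the user log in with the new password immediately, while a rollback reverts the visible password to the old committed value.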

7.6 Caching Partial Repository Contents

Developers of some applications (such as browsers) may wish to cache copies of individual documents rather than undertaking the more involved implementation and administration requirements of replicating an entire zone. ARS servers themselves do not participate in any caching protocol; they manage copies of document data using only the ARS protocol. Caching, if it is performed at all, is performed by clients of the ARS service. In effect, the ARS protocol maintains a given level of data coherence corresponding to the local provisioning requirements of the replication server administrator(s), while caching may be used by clients to hold on to replicated data once it leaves the replication service "cloud", according to coherence requirements defined outside of the replication service.

While ARS servers do not participate in a caching protocol, ARS specifies one aspect of how caching must be performed by clients that wish to cache documents. Specifically, each document may contain a 32-bit Time-To-Live (TTL) field that specifies how many seconds a cached document may be held by a client before it expires. To avoid clock synchronization requirements, this TTL must be represented as a "delta" from the current time rather than as an absolute clock time at which expiry is to occur. Before passing a document to other processes, a process must decrement the TTL field by the number of seconds it has held the document in its cache.
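The relative-TTL rule can be sketched as follows (the class and field names are illustrative, not mandated by ARS): a caching process records when it received the document and decrements the TTL by its holding time before forwarding.

```python
# Illustrative client-side handling of ARS's relative (delta) TTL field.
# Class and method names are assumptions; only the decrement rule is from ARS.
import time

class CachedDocument:
    def __init__(self, content, ttl, now=None):
        self.content = content
        self.ttl = ttl                      # seconds remaining (32-bit delta)
        self.received_at = now if now is not None else time.monotonic()

    def expired(self, now=None):
        now = now if now is not None else time.monotonic()
        return (now - self.received_at) >= self.ttl

    def forward(self, now=None):
        """Return a copy with the TTL decremented by our holding time,
        as required before passing the document to another process."""
        now = now if now is not None else time.monotonic()
        held = now - self.received_at
        return CachedDocument(self.content, max(0, int(self.ttl - held)), now)
```

Because the TTL is always a delta, the two processes never need synchronized clocks; each measures only its own holding time.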

7.7 Asynchronous Submission and the Client-Server Promise

When an ARS server receives an update submission from a client, it enters into a promise to follow that update through either to success or failure. An infrequently connected client can thus disconnect from the network after an ARS server has successfully received the update submission, and rely on the server to perform all needed store-and-forward delivery attempts until the update succeeds or fails.

For pairs of ARS servers that support ars-s, each server that accepts an update submission along the submission path implicitly enters into the promise made at the submission server. As a consequence, servers must not be shut down permanently while they are processing update submissions. Before being permanently decommissioned, a server must be operated in a mode that refuses new requests while completing old ones.

To handle the case of a server not shut down cleanly (which can happen for a variety of reasons in operational settings, even though it violates the above requirement), the administrator of the immediately downstream server along the submission path can manually run a tool to fail submissions that have been routed to the shutdown server.


8. Conflict Detection

Conflict detection is outside the scope of ARS, because some applications do not need it and would prefer to avoid the additional runtime overhead and implementation complexity. However, ARS provides "hooks" to allow the local server implementation or an application to perform conflict detection as follows.

8.1 Write-Write Conflicts

A server or application running at the ARS primary may detect write-write conflicts by checking whether the CSN value associated with the document submitted for update matches the CSN value associated with the primary's copy of that document. If it does not match a write-write conflict has occurred, in which case the primary may abort the update and respond to the submitter with an appropriate failure response message. It is up to the submitting client to resolve the conflict, for example by re-reading the updated document and re-running the computation that updates it.

Non-primary servers must not check for write-write conflicts in this fashion, else downstream servers will erroneously conclude that a conflict has occurred when they attempt to apply committed updates propagated down from upstream servers.
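In outline, the check might look like the following Python sketch. All names here are illustrative assumptions; the only ARS-derived elements are the CSN comparison itself and the rule that only the primary performs it.

```python
# Illustrative write-write conflict check via CSN comparison.  Names are
# assumptions.  For brevity this sketch mints CSNs locally; a real
# non-primary server would instead record the CSN assigned by the primary.

class WriteWriteConflict(Exception):
    pass

class Server:
    def __init__(self, is_primary):
        self.is_primary = is_primary
        self.store = {}          # doc_name -> (content, csn)
        self.next_csn = 0

def apply_update(server, doc_name, new_content, basis_csn):
    """basis_csn: CSN of the copy the submitter read before updating."""
    current_csn = server.store.get(doc_name, (None, 0))[1]
    if server.is_primary and basis_csn != current_csn:
        # Another update committed since the submitter read the document;
        # the primary aborts and the submitter must re-read and retry.
        raise WriteWriteConflict("%s: basis CSN %d != current CSN %d"
                                 % (doc_name, basis_csn, current_csn))
    server.next_csn += 1
    server.store[doc_name] = (new_content, server.next_csn)
    return server.next_csn
```

Note that the guard is conditioned on `is_primary`: a non-primary applying committed updates propagated from upstream skips the comparison entirely, per the requirement above.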

8.2 Read-Write Conflicts

A local implementation may handle a basic case of read-write conflicts by allowing downstream clients to be notified when an update to their input documents arrives. This approach allows updates to trickle through downstream servers (by re-running computations when a read-write conflict is detected so that eventually correct versions of the computed documents will be replicated), but does not support global transactional consistency among read/written values. For example, consider this scenario: A server reads from replicated document X and uses it to compute a value for document Y. Concurrently, someone updates X at another replica server after the above server has read X but before Y has been generated. Thus, the generated value of Y does not reflect the most recent update to X. With the notification mechanism, the server will find out when the new value of X arrives, run again, and regenerate Y based on the latest value of X.

Note that ARS does not provide an update arrival notification primitive itself. Rather, the use of the above-described arrival notification is outside the scope of ARS. A local implementation may use database triggers or other mechanisms to provide notification.
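A minimal sketch of this recompute-on-arrival pattern follows, assuming a hypothetical notification hook (ARS itself provides no such primitive, so a local implementation would wire this to a database trigger or similar mechanism):

```python
# Hypothetical recompute-on-arrival hook; ARS defines no notification
# primitive, so the wiring of on_arrival is a local implementation matter.

class DerivedDoc:
    """A derived document Y, recomputed whenever input document X arrives."""
    def __init__(self, compute):
        self.compute = compute      # function mapping X's content to Y's content
        self.value = None

    def on_arrival(self, x_content):
        # Re-running the computation repairs any read-write conflict:
        # Y eventually reflects the latest committed value of X.
        self.value = self.compute(x_content)
```

In the scenario above, the late-arriving update to X simply triggers another call to the compute function, so the stale Y is overwritten rather than detected at commit time.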

We chose this relatively simple approach to read-write conflict detection because we feel it is a reasonable compromise of scalability and implementation simplicity. For example, we have scaling concerns about precedence graph-based approaches [47], and we believe that requiring application-specific conflict detection [48] would add too much complexity to the protocol.


9. Transmission Optimization: Encodings

ARS defines a single data update representation that can be used for transmitting all submitted and committed updates between ARS peers. However, there are a variety of cases where a different representation can be significantly more efficient. The ARS Encoding Negotiation Protocol (ars-e) provides a mechanism that can be used between pairs of communicating ARS peers for selecting a common set of MIME-based [36] data representations that support various types of transmission optimizations. Its existence was inspired in part by features in HTTP/1.1 [49] as well as by the author's experience with the proprietary replication implementations in a variety of commercial technologies and services.

A basic type of optimization we wish to support with ars-e is transmission of compressed data, for example using MIME type "gzip-compressed". ars-e can also be used to support delta encoded data [50], which can be useful for textual repositories where small changes are made to individual documents.
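The selection step of such a negotiation can be sketched as a preference-ordered intersection. The MIME type names, the "requester's preference order wins" rule, and the default type below are assumptions for illustration, not taken from the ars-e definition:

```python
# Hypothetical ars-e style selection; type names and rules are assumptions.

DEFAULT_ENCODING = "application/ars-update"   # assumed mandatory baseline

def negotiate(offered, supported):
    """Pick the requester's most-preferred encoding the peer also supports.

    offered:   requester's encodings in preference order.
    supported: set of encodings the responding peer supports.
    """
    for enc in offered:
        if enc in supported:
            return enc
    return DEFAULT_ENCODING   # interoperable fallback all peers must accept
```

Falling back to the single mandatory representation when the intersection is empty is what guarantees that any two ARS peers can always interoperate.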

Given our data scale and performance requirements, probably the most important optimization is bulk transmission of raw data files between a pair of ARS servers that are both implemented atop the same proprietary data storage system (e.g., a particular version of a particular vendor's relational database). For example, the Oracle 8i exportable tablespace feature [25] provides a means of exposing the underlying file system representation of database tables, which can be replicated across a network orders of magnitude faster than would be possible by transmitting the corresponding individual record-level SQL data manipulation operations. A bulk transfer encoding might also be used to take advantage of proprietary backend hardware, for example using EMC's TimeFinder [51] product to transfer huge datastores across a storage area network.

This ability to perform bulk synchronization of entire zones is how we address the requirement, "Performance of individual point-to-point replication should not be significantly worse than proprietary schemes, excluding convergence time." Specifically, in the author's experience the performance difference between different incremental update protocols is minor compared to the difference between a system that supports bulk transmission and one that does not.

There is one case where the use of an encoding can allow strictly greater functionality (as opposed to better performance) to the peers using that encoding. As mentioned earlier, each ARS server maintains a CSN-indexed list of operations applied to each zone. When a downstream server requests all updates since a given CSN, the upstream server consults the operation list to generate the needed response. As a local implementation matter, servers may choose to represent this list in a log, and they may choose to truncate a prefix of this log periodically to conserve space. If they do, it is possible that a downstream server could request updates that precede the upstream server's log truncation point. In this case, the upstream server sends an error to the downstream server, from which the downstream server may recover by requesting that the complete zone content be transmitted (similar to the approach taken by Bayou [37]). This content must be transmitted with an encoding designed for full zone transfers. For example, an encoding could be defined that transmits the complete zone as a sequence of one create operation for each existing document. Or, as noted above, the entire zone could be transferred as a compressed archive file if both servers are built on the same type of data storage system. In either case, the downstream server would need to delete its existing zone content before inserting the transmitted content.
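The pull-with-fallback exchange described above can be sketched as follows; all class, method, and error names here are illustrative assumptions:

```python
# Illustrative pull-with-fallback: incremental updates from a possibly
# truncated log, falling back to a full zone transfer.  Names are assumptions.

class LogTruncated(Exception):
    pass

class Upstream:
    def __init__(self, log_start_csn, log):
        self.log_start = log_start_csn   # earliest CSN still in the log
        self.log = log                   # csn -> operation

    def updates_since(self, csn):
        if csn + 1 < self.log_start:
            raise LogTruncated()         # requested CSN precedes truncation point
        return [self.log[c] for c in sorted(self.log) if c > csn]

    def full_zone(self):
        # e.g. one create operation per existing document, or a bulk archive
        return ["create:" + doc for doc in ("a", "b")]

def synchronize(up, downstream_csn):
    try:
        return ("incremental", up.updates_since(downstream_csn))
    except LogTruncated:
        # Downstream must delete its existing zone content before applying this.
        return ("full", up.full_zone())
```

The full-zone branch corresponds to the dedicated full-transfer encoding discussed above: the downstream server discards its copy of the zone and rebuilds it from the transmitted content.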

We place one key restriction on the use of ars-e: an implementation must not use any proprietary formats in its encodings except for the special case of transmitting an entire zone as a bulk transmission. In all other cases the encodings must be published as open specifications. Implementations that do not meet this requirement are deemed not to be in compliance with the ARS specification. This restriction is intended to allow an important efficiency enhancement for bulk data transfer, yet to ensure that there is always an interoperable way to transfer data between ARS peers. Moreover, we seek to prevent vendors from claiming "trivial" ARS compliance by misusing the bulk encoding "escape" to drop into their proprietary system for all replication, and in so doing claim a more efficient (but non-interoperable) "ARS implementation" than their competitors.


10. Security

ARS is defined in terms of a BEEP [28] profile. A number of ARS's security characteristics derive from BEEP's built-in security models.

10.1 Authentication

As a BEEP profile, ARS uses the Simple Authentication and Security Layer (SASL [52]) to negotiate the authentication mechanism used between ARS peers.

10.2 Confidentiality

Encryption may be used during transmission, per usual practice with BEEP profiles.

10.3 Integrity

Documents may contain primary-generated digital signatures to enforce integrity of contents and prevent man-in-the-middle or sequence number collision attacks. As is the case with caching, digital signing (if implemented) is performed outside of the ARS service, by clients of the ARS service. ARS does not specify how documents are to be signed. The XML DSIG representation [53] is a possible approach.

10.4 Access Control

Access control is not yet worked out for ARS. We intend to define a two-level access control architecture for ARS, wherein a Document Type Definition (DTD) specifies an Access Control List (ACL) structure used between pairs of ARS servers, and user-level access control is left as a local implementation matter. (Note that users interact with ARS through clients, which authenticate the given user's identity.) The ACL DTD, requirements governing the distribution and uniform application of ACLs, and other protocol and architecture issues are all areas for future work.


11. Datastore Interoperation

Other services may need to share access to a datastore being replicated by ARS. For example, ARS provides no search support, instead assuming that such functionality could be built into a system that shares access to the replicated datastore.

There are three requirements for sharing access to a datastore replicated by ARS:

  1. The underlying datastore must consist entirely of hierarchically-named, well-formed XML documents.
  2. All datastore read requests must perform zone-wide locking, as described earlier.
  3. All datastore updates must "funnel" through ARS. For example, a non-ARS service may accept update requests for replicated documents, forward these requests to ARS, and wait for ARS to indicate the update has completed before responding to the initiating client.
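The "funnel" requirement in item 3 can be sketched as follows; the stub interface is an assumption for illustration, not an ARS API:

```python
# Sketch of the "funnel" pattern: a non-ARS front-end forwards every update
# through ARS and answers its client only after ARS reports the outcome.
# The ArsStub interface is an assumption for illustration.

class ArsStub:
    """Stand-in for submitting an update to ARS and waiting for the result."""
    def submit_and_wait(self, doc_name, content):
        return "committed"        # a real server could also report "aborted"

def handle_client_update(ars, doc_name, content):
    outcome = ars.submit_and_wait(doc_name, content)  # block until ARS decides
    return "ok" if outcome == "committed" else "error: " + outcome
```

Because every write passes through ARS before the front-end acknowledges it, the shared datastore is only ever modified under ARS's serialization and zone-locking rules.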


12. Summary: ARS's Realm of Applicability

In this section we briefly summarize the strengths and limitations of ARS, to help data service architects determine whether ARS is appropriate for their needs.

12.1 Strengths

ARS replicates XML content in a manner that we believe is appropriate for a broad class of business-critical internetworked applications. Specifically, ANTACID semantics guarantee that updates submitted together are applied in the same order and with ACID semantics at each replica, but that individual replicas apply updates asynchronously from one another. This approach is motivated by the scaling problems of providing global ACID semantics and by the fact that many applications work by interacting with a single replica, for which all that is required is that updates to that replica appear to be ACID to applications interacting with that replica.

ARS addresses scalability in data volume and replica count through a combination of ANTACID semantics, update routing according to a configured graph structure, the separation of conflict detection from the replication service, transmission encodings, and the ability to partition the name space for distributed administration.

ARS addresses operation across a variety of network environments through a combination of update routing according to a configured graph structure, store-and-forward update submissions, and support for both pull and push models for committed update propagation.

ARS addresses mobile/computationally modest client operation through a combination of the client-server promise, asynchronous notification, collapsing updates, update-anywhere capability, and the distinction of replication-service vs client-of-replication-service.

12.2 Limitations

Replication is a broad problem, for which no single standard could work well for "all" applications. ARS has a number of limitations, which we discuss in two categories below: architectural limitations, and limitations concerned with the current state of completeness of the specification. The former are probably fundamental to the design tradeoffs we have chosen (unless the specification evolves in ways we have not predicted), while the latter are likely to change over time.

In some cases the limitations discussed below can be handled by suitable modifications to how data is organized in a given application. In other cases ARS may simply be inappropriate for the application.

12.2.1 Architectural Limitations

Probably ARS's biggest limitation is that it does not support global ACID updates. Applications requiring global ACID semantics should use a different replication solution, such as one of the commercially available RDBMS's.

On a related note, ARS's ANTACID semantics provide no support for a notion of referential integrity that spans beyond a single update. For example, ARS provides no means of guaranteeing that a web page that was once visible at a website remains available to be bookmarked and later revisited (a guarantee that breaks if sites regularly take old content off their servers).

Also related to ANTACID semantics, ARS's read-write conflict detection mechanism is limited to detecting when an update arrives for a document or document subtree. This approach is oriented towards applications where updates trigger computations that can re-generate derived data, in support of computed content pipelines. ARS specifically does not support any notion of transactional integrity in the face of read-write conflicts. In other words, an ARS server cannot guarantee that it will detect that a read-write conflict has occurred at the time an update is being committed.

Another limitation is ARS's reliance on a primary server, the unavailability of which will stall update submissions from committing. Because of this, an ARS primary will typically need to reside on a highly available, well connected machine. Mobile devices using ARS can only commit updates when they connect to a network capable of routing to the primary. For example, a pair of mobile users on an airplane could not use ARS to commit updates and synchronize between each other while disconnected from the rest of the Internet unless one of the mobile users' machines is itself the primary server. Updates can be submitted while disconnected, and ARS provides a way for clients to view the local submitted update state, but they cannot commit and propagate the updates to other ARS peers until the updates have committed at the primary server.

More generally, processes are not free to propagate updates to arbitrary other ARS peers. All updates must flow according to a configured graph topology. This graph topology provides for redundant paths through the network, but for example an ARS server cannot simply locate a nearby server (e.g., by LAN broadcast) and exchange update state.

On a related note, the ARS replication topology must conform to a DAG structure. This restriction can in some cases mean that achieving redundant paths requires more servers than would be needed if the graph weren't required to be acyclic. For example, consider the following cyclic replication topology:

                 s1
                 | \
                 |  \
                 s2--s3

In this figure updates can propagate to s2 if s3 is down and vice versa. With ARS's DAG-based topology an additional server would be required to achieve the same level of redundancy:

                s1 ------- s4
                | \       / |
                |  \     /  |
                |   \   /   |
                |    \ /    |
                |     X     |
                |    / \    |
                |   /   \   |
               \|/\|/   \|/\|/
                s2       s3

Related to the DAG limitations, if a server accepts a submission and all of its upstream servers are unreachable, there is currently no way to "back the submission out" of the accepting server and send it to another server whose upstream servers are currently reachable. Because of the client-server promise, once a server accepts the submission, that part of the submission path is fixed. As a consequence it is important that each server be provisioned with upstream servers that together provide acceptable availability.

Related to ARS's primary server design is the fact that replicated collections are defined by the accumulation of changes visible at the single primary server. Thus, for example, ARS would not be a very natural fit for applications that need to allow a user to select pieces of data to download as a one-time operation from multiple sources -- e.g., as done by peer-to-peer file sharing applications like Napster and Gnutella.

Another limitation of ARS is that it only replicates XML content. While arbitrary data could be encoded into XML, doing so would not be an efficient means of transmitting large volumes of binary data, such as audio or video content.

Another limitation is that ARS does not support attribute-level replication. Updates are performed at the granularity of an entire XML document (although ARS's encoding mechanism may cause significantly less data than the full document to be transmitted during replication). Thus, for example, an application that modifies a few cells in a large table would not do well to represent the table as one large document with elements or attributes representing the individual cells. Instead, each individual document should be "reasonably small" (the exact definition of which depends on the speed of network links and servers, required update commit rates, etc.).

Another granularity limitation of ARS concerns locking: because ARS serializes updates at the granularity of a zone, zone-wide locking is needed while applying committed updates. In turn this rough-grained locking can present performance bottlenecks for some application workloads compared with, for example, row-level locking in RDBMS's.

Another limitation is that ARS provides no protocol support for determining when an update has circulated globally. An application could be written to traverse the ARS DAG for a given zone and check for update completion, but distributed administration of servers could prevent discovery or querying of some servers. Moreover, the cost of this computation could be prohibitive. Finally, the list of replicas may itself be in flux, making it difficult to define a meaningful notion of "all replicas".

12.2.2 Completeness Limitations

The digital signing and access control models are not worked out. Among other things, an Access Control List format DTD needs to be defined, along with protocol/architectural requirements on the distribution and uniform application of access control in ARS.

A MIME subtype for registering new ARS encodings needs to be formally defined and registered with the Internet Assigned Numbers Authority (IANA).

Procedures need to be worked out for how to split and delegate zones.

SNMP MIBs have not been defined for ARS service monitoring and management.

There is no support for automated layout of the replication topology, which may in turn limit the number of servers to which it is operationally manageable to deploy replication service.

ARS does not support guarantees about how long it can take for updates to propagate, and instead considers this to be a local engineering/capacity planning matter. However, in the future a new ARS sub-protocol might also be defined to support some form of conditional service guarantees for deployments where resource reservations are possible.


13. Acknowledgements

The author would like to thank the following people for their many helpful suggestions about ARS: David Clark, Dave Crocker, Marco Gazzetta, Carl Malamud, Paul Mockapetris, Darren New, Calton Pu, and Marshall Rose.



[1] World Wide Web Consortium, "Extensible Markup Language (XML) 1.0", W3C XML, February 1998.
[2] Schwartz, M., "The ANTACID Replication Service: Protocol and Algorithms", draft-schwartz-antacid-protocol-00 (work in progress), October 2001.
[3] Gray, J. and A. Reuter, "Transaction Processing: Concepts and Techniques", Morgan-Kaufmann Publishers, Inc., 1993.
[4] Rigney, C., Rubens, A., Simpson, W. and S. Willens, "Remote Authentication Dial In User Service (RADIUS)", RFC 2058, January 1997.
[5] Droms, R., "Dynamic Host Configuration Protocol", RFC 1531, October 1993.
[6] Reynolds, J., "Post Office Protocol", RFC 918, October 1984.
[7] Postel, J., "Simple Mail Transfer Protocol", RFC 788, November 1981.
[8] Yeong, W., Howes, T. and S. Kille, "X.500 Lightweight Directory Access Protocol", RFC 1487, July 1993.
[9] Rabinovich, M., "Issues in Web Content Replication", Data Engineering Bulletin Vol. 21 No. 4, December 1998.
[10] Cooper, I., Melve, I. and G. Tomlinson, "Internet Web Replication and Caching Taxonomy", RFC 3040, January 2001.
[11] Wessels, D., "Squid Web Proxy Cache", August 2000.
[12] Kantor, B. and P. Lapsley, "Network News Transfer Protocol", RFC 977, February 1986.
[13] Wessels, D. and K. Claffy, "Internet Cache Protocol (ICP), version 2", RFC 2186, September 1997.
[14] Schwartz, M., "Formal Service Agreements for Scaling Internet Caching", NLANR Web Cache Workshop , June 1997.
[15] Microsoft Corporation, "Cache Array Routing Protocol and Microsoft Proxy Server 2.0", August 1998.
[16] Hoff, A., Giannandrea, J., Hapner, M., Carter, S. and M. Medin, "The HTTP Distribution and Replication Protocol", August 1997.
[17] Goland, Y., Whitehead, E., Faizi, A., Carter, S. and D. Jensen, "HTTP Extensions for Distributed Authoring -- WEBDAV", RFC 2518, February 1999.
[18] Skardal, H., Bunnell, J., Chellam, S., Gardner, T., Hendrie, C., Komatsu, K., Linn, G., Manley, D., Waidhofer, G. and J. Ward, "NDMP Version 4 Protocol", draft-skardal-ndmpv4-02 (work in progress), April 2001.
[19] Leach, P. and D. Naik, "A Common Internet File System (CIFS/1.0) Protocol", December 1997.
[20] Sun Microsystems, Inc., "NFS: Network File System Protocol specification", RFC 1094, March 1989.
[21] Microsoft Corporation, "ODBC Section of the Microsoft Universal Data Access Web Site", March 1999.
[22] International Organization for Standardization, "Information Technology - Database Languages - SQL", ISO/IEC 9075:1992, 1992.
[23] Satyanarayanan, M., Kistler, J., Kumar, P., Okasaki, M., Siegel, E. and D. Steere, "Coda: A Highly Available File System for a Distributed Workstation Environment", IEEE Transactions on Computers Vol. 39, No. 4, April 1990.
[24] Gray, J., Helland, P., O'Neil, P. and D. Shasha, "The Dangers of Replication and a Solution", Proceedings of SIGMOD International Conference on Management of Data , 1996.
[25] Oracle Corporation, "Oracle8i Replication Release 8.1.5 A67791-01", February 1999.
[26] Danzig, P., DeLucia, D. and K. Obraczka, "Massively Replicating Services in Wide-Area Internetworks", University of Southern California Technical Report 94-595, 1994.
[27] Atkinson, R. and J. Munter, "UDDI Version 2.0 Replication Specification", June 2001.
[28] Rose, M., "The Blocks Extensible Exchange Protocol Core", RFC 3080, March 2001.
[29] Demers, A., Greene, D., Hauser, C., Irish, W., Larson, J., Shenker, S., Sturgis, H., Swinehart, D. and D. Terry, "Epidemic Algorithms for Replicated Database Maintenance", Proceedings of the Sixth Conference on Principles of Distributed Computing , August 1987.
[30] Schroeder, M., Birrell, A. and R. Needham, "Experience with Grapevine: The Growth of a Distributed System", ACM Transactions on Computer Systems Vol. 2 No. 1, February 1984.
[31] Rose, M., Gazzetta, M. and M. Schwartz, "The Blocks Datastore Model", Draft Technical Memo, January 2001.
[32] Braden, R. and J. Postel, "Requirements for Internet gateways", RFC 1009, June 1987.
[33] Even, S., "Graph Algorithms", Computer Science Press , 1979.
[34] Lampson, B. and H. Sturgis, "Crash Recovery in a Distributed Data Storage System", Xerox Palo Alto Research Center Internal Report , April 1979.
[35] Mockapetris, P., "Domain names - concepts and facilities", STD 13, RFC 1034, November 1987.
[36] Borenstein, N. and N. Freed, "MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies", RFC 1521, September 1993.
[37] Petersen, K., Spreitzer, M., Terry, D., Theimer, M. and A. Demers, "Flexible Update Propagation for Weakly Consistent Replication", Proceedings of the 16th ACM Symposium on Operating System Principles , October 1997.
[38] Claffy, K., Braun, H. and M. Schwartz, "A Distributed Testbed for National Information Provisioning", A proposal submitted to the National Science Foundation , April 1995.
[39] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform Resource Identifiers (URI): Generic Syntax", RFC 2396, August 1998.
[40] Fenner, B., Handley, M., Holbrook, H. and I. Kouvelas, "Protocol Independent Multicast - Sparse Mode (PIM-SM): Protocol Specification (Revised)", draft-ietf-pim-sm-v2-new-03 (work in progress), July 2001.
[41] Steere, D., Baptista, A., McNamee, D., Pu, C. and J. Walpole, "Research Challenges in Environmental Observation and Forecasting Systems", Proceedings of The Sixth Annual International Conference on Mobile Computing and Networking , August 2000.
[42] Fox, A., Gribble, S., Chawathe, Y., Brewer, E. and P. Gauthier, "Cluster-Based Scalable Network Services", Proceedings of the 16th Symposium on Operating Systems Principles , October 1997.
[43] Ericsson Utvecklings AB, "Mnesia 3.9.2", 2000.
[44] Hansen, P., "Operating System Principles", Prentice Hall , 1973.
[45] Terry, D., Petersen, K., Spreitzer, M. and M. Theimer, "The Case for Non-transparent Replication: Examples from Bayou", IEEE Data Engineering , December 1998.
[46] Merrells, J., Reed, E. and U. Srinivasan, "LDAP Replication Architecture", draft-ietf-ldup-model-06 (work in progress), March 2000.
[47] Davidson, S., "Optimism and Consistency in Partitioned Distributed Database Systems", ACM Transactions on Database Systems Vol. 9 No. 3, September 1984.
[48] Terry, D., Theimer, M., Petersen, K., Demers, A., Spreitzer, M. and C. Hauser, "Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System", Proceedings of the 15th ACM Symposium on Operating System Principles , December 1995.
[49] Fielding, R., Gettys, J., Mogul, J., Nielsen, H., Masinter, L., Leach, P. and T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.
[50] Mogul, J., Krishnamurthy, B., Douglis, F., Feldmann, A., Goland, Y., van Hoff, A. and D. Hellerstein, "Delta Encoding in HTTP", draft-mogul-http-delta-10 (work in progress), October 2001.
[51] Kodumudi, S. and P. Manning, "Split Mirror Backup and Database Replication Using EMC TimeFinder", February 1999.
[52] Myers, J., "Simple Authentication and Security Layer (SASL)", RFC 2222, October 1997.
[53] Eastlake, D., Reagle, J. and D. Solo, "XML-Signature Syntax and Processing", RFC 3075, March 2001.


Author's Address

  Michael F. Schwartz
  Code On The Road, LLC


Appendix A. Extended List of Potential ARS Applications

To help illustrate ARS's potential applicability, the following table expands on the three example replication applications provided in the Example Applications section. We present these examples without further discussion, and instead refer readers wanting more depth to the aforementioned section.

|      Application     |                 Description                 |
| content distribution | central content provider publishes content  |
| network              | updates that propagate to servers at many   |
|                      | network points of presence                  |
| real-time news       | many small docs flowing from central site   |
| provider             | out to sites with which PDA's synchronize   |
| service              | integr. of back office subscriber db's with |
| provisioning         | RADIUS, DHCP, POP, SMTP, LDAP, WWW, other   |
|                      | subscriber- or client-facing service DBs    |
| computed content     | data owner injects data to be read by       |
| pipeline             | competing downstream services that compute  |
|                      | derived content                             |
| nearby cell data     | PDAs submit updates and receive committed   |
| synchronization      | updates from nearest cell server or kiosk   |
| network management   | multiple agents observing each net element  |
| data "plumbing"      | make conflicting updates.  Regional cor-    |
|                      | relation/aggregation reduces NOC traffic    |
|                      | (cf. NetExpert [55])                        |
| private annotations  | publicly updateable replication space for   |
| to centrally owned   | one data set, more restrictive space for    |
| part of name space   | others. Multiple annotation authors/owners  |
| integrated workflow  | open standard replication for workflow,     |
|                      | version control, web content publishing     |
| XML-driven dynamic   | distributed dynamic content servers cont-   |
| content              | rolled by XML-encoded business logic+data   |
| multi-PDA            | users perform updates via more than two     |
| synchronization      | devices (PDA, cell phone, desktop, etc.)    |
| intranet replication | mobile/field office/partner data exchange   |
| structured netnews   | advantages over NNTP: (a) eventually        |
|                      | consistent, ordered updates, (b) fielded    |
|                      | search, (c) authenticated                   |


Full Copyright Statement