Network Working Group                                        M. Schwartz
Internet-Draft                                     Code On The Road, LLC
Expires: April 7, 2002                                   October 7, 2001

The ANTACID Replication Service: Protocol and Algorithms
draft-schwartz-antacid-protocol-00

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC 2026 except that the right to produce derivative works is not granted. (If this document becomes part of an IETF working group activity, then it will be brought into full compliance with Section 10 of RFC 2026.)

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on April 7, 2002.

Copyright Notice

Copyright (C) The Internet Society (2001). All Rights Reserved.

Abstract

This memo specifies the protocol and algorithms of the ANTACID Replication Service, designed to replicate hierarchically named repositories of XML documents for business-critical, internetworked applications.

ASCII and HTML versions of this document are available at http://www.codeontheroad.com/papers/draft-schwartz-antacid-protocol.txt and http://www.codeontheroad.com/papers/draft-schwartz-antacid-protocol.html, respectively.




Table of Contents





1. Introduction

This document specifies the protocol and algorithms used to implement the ANTACID Replication Service (ARS). Readers are referred to [1] for a motivation of the problem addressed, the replication architecture, and terminology used in the current document. The current document assumes the reader has already read that document, and that the reader is familiar with XML [2]. Moreover, since the ARS protocol is defined in terms of a BEEP [3] profile, readers are referred to that document for background.

We begin by walking through example ARS interactions, to give the reader a concrete flavor for how the protocol works. We then present the ARS syntax and semantics, and then provide algorithms and implementation details.




2. Walk-Through of Example ARS Interactions

ARS updates follow a simple pattern, with Submit Sequence Numbers (SSN's) assigned by each submission server flowing up the DAG and Commit Sequence Numbers (CSN's) assigned by the primary flowing back down after a submission has committed at the primary. As an example, consider the DAG illustrated below:


                             svr3
                            |    | 
                           \|/  \|/
                          svr2<-svr4
                            |    | 
                           \|/  \|/
                          svr1  svr5

In this diagram, an arc from one server to another indicates that the second server is downstream from the first. Thus, svr3 is the zone primary; svr1, svr2, svr4, and svr5 are non-primaries; svr2 and svr4 are downstream from svr3; svr2 is downstream from svr4; svr1 is downstream from svr2; and svr5 is downstream from svr4.

Given this DAG, an update submitted at svr1 might be assigned SSN 1 by svr1, and then be propagated by svr1 to svr2, and then from svr2 to svr4, and then from svr4 to svr3. svr3 serializes the update submission, commits the update, and assigns it a CSN of, say, 2. At this point the committed update propagates back down the DAG, for example first to svr4 and svr2 (from svr3), and then in parallel from svr4 to svr5 and from svr2 to svr1. As this example illustrates, the path by which committed updates propagate down the DAG may differ from the path by which submissions are propagated up the DAG.
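
As a non-normative illustration of this flow, the following Python sketch simulates the SSN-up/CSN-down pattern over the example DAG. The class and method names are invented for illustration; ARS implementations need not be structured this way.

    # Non-normative sketch of SSN-up / CSN-down propagation over the
    # example DAG.  Upstream pointers carry submissions toward the primary;
    # downstream lists carry committed updates back down.
    class Server:
        def __init__(self, name, upstream=None):
            self.name = name
            self.upstream = upstream   # next hop toward the primary (None at primary)
            self.downstream = []       # servers that receive committed updates from us
            self.next_ssn = 1          # per-server submit sequence counter
            self.next_csn = 2          # primary only; first assigned CSN is 2

        def submit(self, update):
            ssn, self.next_ssn = self.next_ssn, self.next_ssn + 1
            print("%s assigned SSN %d" % (self.name, ssn))
            self.propagate_up(update)

        def propagate_up(self, update):
            if self.upstream is None:  # we are the zone primary: serialize and commit
                csn, self.next_csn = self.next_csn, self.next_csn + 1
                print("%s committed with CSN %d" % (self.name, csn))
                self.propagate_down(update, csn)
            else:
                self.upstream.propagate_up(update)

        def propagate_down(self, update, csn):
            for server in self.downstream:
                print("%s applied committed update CSN %d" % (server.name, csn))
                server.propagate_down(update, csn)

    svr3 = Server("svr3")                      # zone primary
    svr4 = Server("svr4", upstream=svr3)
    svr2 = Server("svr2", upstream=svr4)
    svr1 = Server("svr1", upstream=svr2)
    svr5 = Server("svr5", upstream=svr4)
    svr3.downstream = [svr2, svr4]             # the commit path may differ
    svr4.downstream = [svr5]                   # from the submission path
    svr2.downstream = [svr1]
    svr1.submit({"blocks:test.schwartz.blk1": "..."})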

This DAG represents a set of ARS servers that implement ars-c as well as ars-s, which supports updates being submitted to non-primary servers and propagated up to the primary. In an ARS service that implements only ars-c, all updates must be submitted to the primary. For that case, the only propagation that occurs is when committed updates propagate from the primary to all downstream servers.

Given this basic understanding of how submitted and committed updates propagate across the DAG, we now walk through examples of the protocol content exchanged between a set of ARS peers. We start with a server that implements the minimal required ARS protocol elements (ars-c). We then show the additional functionality of ars-s and ars-e, each in turn.

The examples in this section are based on a pair of servers configured as follows:


                   /->svr1 (primary)
                  /    |
            client     |
                  \   \|/
                   \->svr2 (non-primary)

In some of the examples the client makes requests of the primary. In other examples the client makes requests of the non-primary. Here the DAG between the servers is just a single edge, but in general there could be many servers upstream and downstream from each server (except for the primary, which never has upstream servers unless it is also a non-primary for other zones).

In the examples we list the communication endpoints flush left on the page, with the transmitted content indented, like so:


client->svr1:
    <ARSRequest ReqNum='1'>
        ...
    </ARSRequest>

The "client->svr1:" above is only for labeling the flow as going from the client to svr1, and is not part of the transmitted ARS content. The indented text is the transmitted ARS content.

The examples here all use the Blocks [4] name space.

2.1 ARS Commit-and-Propagate Protocol (ars-c)

For the simplest case, the interaction begins when the client performs a "SubmitUpdate" request to the zone primary:


client->svr1:
    <ARSRequest ReqNum='1'>
        <SubmitUpdate NotifyHost='client.example.com'
        NotifyPort='10201'>
            <UpdateGroup>
                <DataWithOps>
                    <DatumAndOp Name='blocks:test.schwartz.blk1'
                    CSN='0' Action='create'>
                        <block name='test.schwartz.blk1' csn='0'>
                            ...
                        </block>
                    </DatumAndOp>
                    <DatumAndOp Name='blocks:test.schwartz.blk2'
                    CSN='0' Action='create'>
                        <block name='test.schwartz.blk2' csn='0'>
                            ...
                        </block>
                    </DatumAndOp>
                </DataWithOps>
            </UpdateGroup>
        </SubmitUpdate>
    </ARSRequest>

Here the client generates and passes a request number that can be used to correlate the response with the request, to support concurrent requests. The client makes a SubmitUpdate request, passing the host name and port number to which notification should be sent when the update completes or fails, as well as a single UpdateGroup containing all updates to be performed. The request uses the DataWithOps encoding, since in this basic example the ARS client-server pair does not support any other encodings. The DataWithOps encoding contains a set of (in this case 2) documents, each of which has an associated operation (in this case "create") to be performed and attributes containing the document's name and CSN.

Because the data in this example come from the Blocks name space, the name and CSN information are also contained as attributes within the Blocks. This redundancy happens because Blocks require these attributes to be present in the root XML element, as additional structure beyond that imposed by ARS. Since ARS only assumes documents, and not the more constrained structure of Blocks, the name and CSN need to be included in the ARS encoding. Finally, note that because the documents have not yet been created in the datastore, the CSN is not meaningful; the CSN value becomes meaningful only once the content has been created in the datastore.
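
As a non-normative illustration, the following Python sketch assembles a SubmitUpdate request of the shape shown above, using the standard xml.etree.ElementTree module. The element and attribute names follow the example; the function name and payload are invented for illustration.

    # Non-normative construction of a SubmitUpdate request.
    import xml.etree.ElementTree as ET

    def build_submit_update(req_num, notify_host, notify_port, docs):
        """docs: list of (name, action, payload_element) tuples."""
        req = ET.Element("ARSRequest", ReqNum=str(req_num))
        sub = ET.SubElement(req, "SubmitUpdate", NotifyHost=notify_host,
                            NotifyPort=str(notify_port))
        dwo = ET.SubElement(ET.SubElement(sub, "UpdateGroup"), "DataWithOps")
        for name, action, payload in docs:
            # CSN='0' because the documents do not yet exist in the datastore.
            op = ET.SubElement(dwo, "DatumAndOp", Name=name, CSN="0",
                               Action=action)
            op.append(payload)
        return ET.tostring(req, encoding="unicode")

    blk = ET.Element("block", name="test.schwartz.blk1", csn="0")
    print(build_submit_update(1, "client.example.com", 10201,
                              [("blocks:test.schwartz.blk1", "create", blk)]))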

The server responds as follows:


svr1->client:
    <ARSResponse ReqNum='1'>
        <ARSAnswer>
            <GlobalSubmitID SubmisSvrHost='svr1.example.com'
            SubmisSvrPort='10201' SubmisSvrIncarn='979428854'
            ssn='1' />
        </ARSAnswer>
    </ARSResponse>

The ARSAnswer element contains the server's host name, port, incarnation stamp, and a 64 bit Submit Sequence Number assigned by the submission server. Together, these four pieces of data constitute an identification of the update submission that is globally unique for all time, called the GlobalSubmitID.

This ARSAnswer indicates that the submission was successfully received, and that the server has entered into the client-server promise described in [1]. If an error had occurred, the response would have contained an ARSError instead of an ARSAnswer.
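
An implementation will typically treat the GlobalSubmitID as an opaque, immutable four-tuple; a non-normative sketch:

    # Non-normative GlobalSubmitID representation: unique for all time.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class GlobalSubmitID:
        submis_svr_host: str     # DNS name of the submission server
        submis_svr_port: int
        submis_svr_incarn: int   # 64 bit incarnation stamp
        ssn: int                 # 64 bit Submit Sequence Number

    gsid = GlobalSubmitID("svr1.example.com", 10201, 979428854, 1)
    pending_updates = {gsid: "awaiting commit"}   # usable as a state-table key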

At some later time, the server performs a SubmittedUpdateResultNotification request to notify the client that the update has been successfully committed, and the client acknowledges receipt of this notification:


svr1->client:
    <ARSRequest ReqNum='5'>
        <SubmittedUpdateResultNotification
        SubmisSvrHost='svr1.example.com' SubmisSvrPort='10201'
        SubmisSvrIncarn='979428854' ssn='1' csn='2'
        ZoneTopNodeName='blocks:test.schwartz' />
    </ARSRequest>

client->svr1:
    <ARSResponse ReqNum='5'>
        <ARSAnswer />
    </ARSResponse>

The ReqNum here is 5 because the server happens to have performed 4 other requests before this one. The SubmittedUpdateResultNotification element contains the four attributes that constitute the GlobalSubmitID, as well as two other attributes: the 64 bit CSN that was assigned by the primary when it committed this update, and the URI [5] of the top node in the zone within which this update occurred. The URI in effect names the zone. It is needed because servers can handle multiple zones, and CSN's are allocated per zone. Together, the URI and CSN constitute an identification of the update commit event that is globally unique for all time.

This ARSAnswer indicates that the submission was successfully committed. If it had failed the SubmittedUpdateResultNotification would have contained an ARSError element describing the error, and the CSN would have been 0.

At a time determined by the local implementation's configuration settings, the primary performs a PushCommittedUpdates request to suggest to the non-primary that new committed updates are available to be pulled. The non-primary acknowledges this PushCommittedUpdates with an ARSResponse:


svr1->svr2:
    <ARSRequest ReqNum='6'>
        <PushCommittedUpdates UpstreamHost='svr1.example.com'
        UpstreamPort='10201' />
    </ARSRequest>

svr2->svr1:
    <ARSResponse ReqNum='6'>
        <ARSAnswer />
    </ARSResponse>

The PushCommittedUpdates request specifies the host and port from which the request was initiated. This is done rather than relying on looking up this information from the underlying transport service (BEEP) because the transmission could arrive on a different port than the advertised port on which the server accepts requests. In fact, a local implementation may choose to split receiving and sending onto separate machines to distribute load and failure modes, similar to how some commercial email services split processing for POP [6] and SMTP [7].

At this point, the non-primary performs a PullCommittedUpdates request to request newly available updates:


svr2->svr1:
    <ARSRequest ReqNum='4'>
        <PullCommittedUpdates DownstreamHost='svr2.example.com'
        DownstreamPort='10201'>
            <ReplState>
                <TopNodeOfZoneToReplicate>
                    blocks:test.schwartz
                </TopNodeOfZoneToReplicate>
                <LastSeenCSN>
                    0
                </LastSeenCSN>
            </ReplState>
        </PullCommittedUpdates>
    </ARSRequest>

Similar to the PushCommittedUpdates request, the PullCommittedUpdates request specifies the host and port from which the request was initiated. The PullCommittedUpdates request names the URI of the zone for which it wants updates, and the last CSN it has seen for that zone. By specifying a LastSeenCSN of 0, the non-primary is requesting the entire zone content (the first valid CSN is defined to be 1).

The primary responds with the requested updates:


svr1->svr2:
    <ARSResponse ReqNum='4'>
        <ARSAnswer>
            <UpdateGroup>
                <DataWithOps>
                    <DatumAndOp Name='blocks:test.schwartz.blk1'
                    CSN='2' Action='write'>
                        <block name='test.schwartz.blk1' csn='2'>
                            ...
                        </block>
                    </DatumAndOp>
                    <DatumAndOp Name='blocks:test.schwartz.blk2'
                    CSN='2' Action='write'>
                        <block name='test.schwartz.blk2' csn='2'>
                            ...
                        </block>
                    </DatumAndOp>
                </DataWithOps>
            </UpdateGroup>
        </ARSAnswer>
    </ARSResponse>

Note that the documents have their CSN's set, per the value assigned by the primary at commit time. Also, the operations sent are "write" (rather than the "create" specified when the update was submitted) in order to ensure that the operation succeeds in the case where update collapsing (see [1]) is performed. Collapsing will be discussed in more detail later.

2.2 ARS Submission-Propagation Protocol (ars-s)

We begin with the client submitting an update request to the non-primary server:


client->svr2:
    <ARSRequest ReqNum='1'>
        <SubmitUpdate NotifyHost='client.example.com'
        NotifyPort='10201'>
            <UpdateGroup>
                <DataWithOps>
                    <DatumAndOp Name='blocks:test.schwartz.blk1'
                    CSN='0' Action='create'>
                        <block name='test.schwartz.blk1' csn='0'>
                            ...
                        </block>
                    </DatumAndOp>
                    <DatumAndOp Name='blocks:test.schwartz.blk2'
                    CSN='0' Action='create'>
                        <block name='test.schwartz.blk2' csn='0'>
                            ...
                        </block>
                    </DatumAndOp>
                </DataWithOps>
            </UpdateGroup>
        </SubmitUpdate>
    </ARSRequest>

The content of this request is identical to that discussed in the earlier example. Only the destination of the request has changed.

The server responds, noting that it has successfully received the request:


svr2->client:
    <ARSResponse ReqNum='1'>
        <ARSAnswer>
            <GlobalSubmitID SubmisSvrHost='svr2.example.com'
            SubmisSvrPort='10201' SubmisSvrIncarn='979428854'
            ssn='1' />
        </ARSAnswer>
    </ARSResponse>

Again the content is identical to that shown in the earlier example, but with a different source (and SubmisSvrHost) for the response.

At this point, the non-primary server relays the request by making a PropagateSubmittedUpdate request to the primary server:


svr2->svr1:
    <ARSRequest ReqNum='3'>
        <PropagateSubmittedUpdate SubmisSvrHost='svr2.example.com'
        SubmisSvrPort='10201' SubmisSvrIncarn='979428854'
        ssn='1' NotifyHost='svr2.example.com'
        NotifyPort='10201'>
            <UpdateGroup>
                <DataWithOps>
                    <DatumAndOp Name='blocks:test.schwartz.blk1'
                    CSN='0' Action='create'>
                        <block name='test.schwartz.blk1' csn='0'>
                            ...
                        </block>
                    </DatumAndOp>
                    <DatumAndOp Name='blocks:test.schwartz.blk2'
                    CSN='0' Action='create'>
                        <block name='test.schwartz.blk2' csn='0'>
                            ...
                        </block>
                    </DatumAndOp>
                </DataWithOps>
            </UpdateGroup>
        </PropagateSubmittedUpdate>
    </ARSRequest>

The PropagateSubmittedUpdate request contains the GlobalSubmitID of the request (indicating that the request was submitted at svr2), but has re-written the NotifyHost and NotifyPort to refer to svr2, so that it will find out when the request completes or fails. Otherwise, the content of the request is identical to what svr2 received from the client.

The primary then responds, acknowledging that it has successfully received the PropagateSubmittedUpdate request and has entered into the client-server promise, providing a chain of responsibility from client to svr2 to svr1:


svr1->svr2:
    <ARSResponse ReqNum='3'>
        <ARSAnswer />
    </ARSResponse>

At some later time, the primary commits the update and performs a SubmittedUpdateResultNotification to inform the non-primary that the request has completed successfully. The non-primary acknowledges this SubmittedUpdateResultNotification with an ARSResponse:


svr1->svr2:
    <ARSRequest ReqNum='1'>
        <SubmittedUpdateResultNotification
        SubmisSvrHost='svr2.example.com' SubmisSvrPort='10201'
        SubmisSvrIncarn='979428854' ssn='1' csn='2'
        ZoneTopNodeName='blocks:test.schwartz' />
    </ARSRequest>

svr2->svr1:
    <ARSResponse ReqNum='1'>
        <ARSAnswer />
    </ARSResponse>

Note that ars-s is re-using the SubmittedUpdateResultNotification element defined by ars-c, for informing a downstream server about the completion status of a pending update.

At some later time, the primary performs a PushCommittedUpdates, the non-primary follows with a PullCommittedUpdates, and the primary responds with the requested updates:


svr1->svr2:
    <ARSRequest ReqNum='2'>
        <PushCommittedUpdates UpstreamHost='svr1.example.com'
        UpstreamPort='10201' />
    </ARSRequest>

svr2->svr1:
    <ARSResponse ReqNum='2'>
        <ARSAnswer />
    </ARSResponse>

svr2->svr1:
    <ARSRequest ReqNum='4'>
        <PullCommittedUpdates DownstreamHost='svr2.example.com'
        DownstreamPort='10201'>
            <ReplState>
                <TopNodeOfZoneToReplicate>
                    blocks:test.schwartz
                </TopNodeOfZoneToReplicate>
                <LastSeenCSN>
                    0
                </LastSeenCSN>
            </ReplState>
        </PullCommittedUpdates>
    </ARSRequest>

svr1->svr2:
    <ARSResponse ReqNum='4'>
        <ARSAnswer>
            <UpdateGroup>
                <DataWithOps>
                    <DatumAndOp Name='blocks:test.schwartz.blk1'
                    CSN='2' Action='write'>
                        <block name='test.schwartz.blk1' csn='2'>
                            ...
                        </block>
                    </DatumAndOp>
                    <DatumAndOp Name='blocks:test.schwartz.blk2'
                    CSN='2' Action='write'>
                        <block name='test.schwartz.blk2' csn='2'>
                            ...
                        </block>
                    </DatumAndOp>
                </DataWithOps>
            </UpdateGroup>
        </ARSAnswer>
    </ARSResponse>

At this point, the non-primary performs a SubmittedUpdateResultNotification, to notify the client that its update submission has successfully committed, and the client acknowledges receipt of this notification:


svr2->client:
    <ARSRequest ReqNum='5'>
        <SubmittedUpdateResultNotification
        SubmisSvrHost='svr2.example.com' SubmisSvrPort='10201'
        SubmisSvrIncarn='979428854' ssn='1' csn='2'
        ZoneTopNodeName='blocks:test.schwartz' />
    </ARSRequest>

client->svr2:
    <ARSResponse ReqNum='5'>
        <ARSAnswer />
    </ARSResponse>

Note that this SubmittedUpdateResultNotification indicates that the update has now committed at the non-primary. This is important because it means the client can now interact with the non-primary copy and expect to see the committed update. The client can correlate this response to the submission it had made based on the GlobalSubmitID information (host, port, incarnation stamp, and SSN) contained in the SubmittedUpdateResultNotification attributes.

2.3 ARS Encoding Negotiation Protocol (ars-e)

ContentEncodingNegotiation can be performed between any pair of ARS peers, to determine if an expanded set of encodings is available beyond the default DataWithOps encoding. As an example, the non-primary server might perform a ContentEncodingNegotiation with the primary as follows:


svr2->svr1:
    <ARSRequest ReqNum='2'>
        <ContentEncodingNegotiation
        ZoneTopNodeName='blocks:test.schwartz'>
            <ContentEncodingsSupported>
                <ContentEncodingName>
                    DataWithOps
                </ContentEncodingName>
                <ContentEncodingName>
                    AllZoneData
                </ContentEncodingName>
                <ContentEncodingName>
                    EllipsisNotation
                </ContentEncodingName>
            </ContentEncodingsSupported>
        </ContentEncodingNegotiation>
    </ARSRequest>

svr1->svr2:
    <ARSResponse ReqNum='2'>
        <ARSAnswer>
            <ContentEncodingsSupported>
                <ContentEncodingName>
                    DataWithOps
                </ContentEncodingName>
                <ContentEncodingName>
                    AllZoneData
                </ContentEncodingName>
            </ContentEncodingsSupported>
        </ARSAnswer>
    </ARSResponse>

The ContentEncodingNegotiation element contains a ZoneTopNodeName attribute specifying the URI of the top node in the zone to which this encoding is to apply, because the set of encodings supported may vary by zone. The ContentEncodingNegotiation also contains one or more ContentEncodingName elements corresponding to content encodings the initiator supports. The responder sends back the subset of the requested encodings that it supports.
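
The responder's side of the negotiation is an ordered set intersection; a non-normative sketch (the function name is invented):

    # Return the subset of the initiator's encodings that the responder also
    # supports, preserving the initiator's order.
    def negotiate_encodings(requested, locally_supported):
        supported = set(locally_supported)
        return [name for name in requested if name in supported]

    print(negotiate_encodings(
        ["DataWithOps", "AllZoneData", "EllipsisNotation"],
        ["DataWithOps", "AllZoneData"]))
    # -> ['DataWithOps', 'AllZoneData']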

2.4 ARS Service Implementing All Three Sub-Protocols

Below we put together all of the protocol pieces discussed in the last three sub-sections, showing how a system supporting all three ARS sub-protocols might function:


client->svr2:
    <ARSRequest ReqNum='1'>
        <SubmitUpdate NotifyHost='client.example.com'
        NotifyPort='10201'>
            <UpdateGroup>
                <DataWithOps>
                    <DatumAndOp Name='blocks:test.schwartz.blk1'
                    CSN='0' Action='create'>
                        <block name='test.schwartz.blk1' csn='0'>
                            ...
                        </block>
                    </DatumAndOp>
                    <DatumAndOp Name='blocks:test.schwartz.blk2'
                    CSN='0' Action='create'>
                        <block name='test.schwartz.blk2' csn='0'>
                            ...
                        </block>
                    </DatumAndOp>
                </DataWithOps>
            </UpdateGroup>
        </SubmitUpdate>
    </ARSRequest>

svr2->client:
    <ARSResponse ReqNum='1'>
        <ARSAnswer>
            <GlobalSubmitID SubmisSvrHost='svr2.example.com'
            SubmisSvrPort='10201' SubmisSvrIncarn='979428854'
            ssn='1' />
        </ARSAnswer>
    </ARSResponse>

svr2->svr1:
    <ARSRequest ReqNum='1'>
        <ContentEncodingNegotiation
        ZoneTopNodeName='blocks:test.schwartz'>
            <ContentEncodingsSupported>
                <ContentEncodingName>
                    DataWithOps
                </ContentEncodingName>
                <ContentEncodingName>
                    AllZoneData
                </ContentEncodingName>
                <ContentEncodingName>
                    EllipsisNotation
                </ContentEncodingName>
            </ContentEncodingsSupported>
        </ContentEncodingNegotiation>
    </ARSRequest>

svr1->svr2:
    <ARSResponse ReqNum='1'>
        <ARSAnswer>
            <ContentEncodingsSupported>
                <ContentEncodingName>
                    DataWithOps
                </ContentEncodingName>
                <ContentEncodingName>
                    AllZoneData
                </ContentEncodingName>
            </ContentEncodingsSupported>
        </ARSAnswer>
    </ARSResponse>

svr2->svr1:
    <ARSRequest ReqNum='2'>
        <PropagateSubmittedUpdate SubmisSvrHost='svr2.example.com'
        SubmisSvrPort='10201' SubmisSvrIncarn='979428854'
        ssn='1' NotifyHost='svr2.example.com'
        NotifyPort='10201'>
            <UpdateGroup>
                <DataWithOps>
                    <DatumAndOp Name='blocks:test.schwartz.blk1'
                    CSN='0' Action='create'>
                        <block name='test.schwartz.blk1' csn='0'>
                            ...
                        </block>
                    </DatumAndOp>
                    <DatumAndOp Name='blocks:test.schwartz.blk2'
                    CSN='0' Action='create'>
                        <block name='test.schwartz.blk2' csn='0'>
                            ...
                        </block>
                    </DatumAndOp>
                </DataWithOps>
            </UpdateGroup>
        </PropagateSubmittedUpdate>
    </ARSRequest>

svr1->svr2:
    <ARSResponse ReqNum='2'>
        <ARSAnswer />
    </ARSResponse>

svr1->svr2:
    <ARSRequest ReqNum='1'>
        <PushCommittedUpdates UpstreamHost='svr1.example.com'
        UpstreamPort='10201' />
    </ARSRequest>

svr2->svr1:
    <ARSResponse ReqNum='1'>
        <ARSAnswer />
    </ARSResponse>

svr1->svr2:
    <ARSRequest ReqNum='2'>
        <SubmittedUpdateResultNotification
        SubmisSvrHost='svr2.example.com' SubmisSvrPort='10201'
        SubmisSvrIncarn='979428854' ssn='1' csn='2'
        ZoneTopNodeName='blocks:test.schwartz' />
    </ARSRequest>

svr2->svr1:
    <ARSRequest ReqNum='3'>
        <PullCommittedUpdates DownstreamHost='svr2.example.com'
        DownstreamPort='10201'>
            <ReplState>
                <TopNodeOfZoneToReplicate>
                    blocks:test.schwartz
                </TopNodeOfZoneToReplicate>
                <LastSeenCSN>
                    0
                </LastSeenCSN>
            </ReplState>
        </PullCommittedUpdates>
    </ARSRequest>

svr1->svr2:
    <ARSResponse ReqNum='3'>
        <ARSAnswer>
            <UpdateGroup>
                <AllZoneData>
                    <DatumAndOp Name='blocks:test.schwartz.blk1'
                    CSN='2' Action='write'>
                        <block name='test.schwartz.blk1' csn='2'>
                            ...
                        </block>
                    </DatumAndOp>
                    <DatumAndOp Name='blocks:test.schwartz.blk2'
                    CSN='2' Action='write'>
                        <block name='test.schwartz.blk2' csn='2'>
                            ...
                        </block>
                    </DatumAndOp>
                </AllZoneData>
            </UpdateGroup>
        </ARSAnswer>
    </ARSResponse>

svr2->svr1:
    <ARSResponse ReqNum='2'>
        <ARSAnswer />
    </ARSResponse>

svr2->client:
    <ARSRequest ReqNum='4'>
        <SubmittedUpdateResultNotification
        SubmisSvrHost='svr2.example.com' SubmisSvrPort='10201'
        SubmisSvrIncarn='979428854' ssn='1' csn='2'
        ZoneTopNodeName='blocks:test.schwartz' />
    </ARSRequest>

client->svr2:
    <ARSResponse ReqNum='4'>
        <ARSAnswer />
    </ARSResponse>

Several subtleties of the protocol can be observed from this example. Each peer allocates ReqNum's independently in each direction, so the same ReqNum value can appear in unrelated exchanges (e.g., both the client and svr2 issue requests numbered 1). Responses need not be returned in the order requests were received: svr2 completes the PullCommittedUpdates exchange (ReqNum='3') before acknowledging the SubmittedUpdateResultNotification that svr1 sent as ReqNum='2'. Finally, the AllZoneData encoding negotiated earlier is used in place of the default DataWithOps encoding for the full zone transfer.




3. ARS Syntax and Semantics

In this section we present the ARS syntax and semantics. We begin with how ARS identifies and encodes information within its messages: server identification, submitted and committed update sequence numbers, default data encodings, and error signaling between ARS peers. We then describe the structure and meaning of messages exchanged between ARS peers.

3.1 Identifiers, Data Representation, and Error Signaling

3.1.1 ARS Server Identification

Each ARS server has a global server identifier (GlobalServerID), which consists of a Domain Name System (DNS [8]) name, server incarnation stamp, and port number. The GlobalServerID must be unique for all time. If the server moves to a machine with a different DNS name, its GlobalServerID changes. A level of naming indirection can be used to minimize operational problems from this (e.g., a DNS CNAME called ars.example.com that points to host3.example.com).

A GlobalServerID-identified server must never use the same SSN for two different update submissions. The incarnation stamp provides a way for a server that loses track of its last assigned SSN (e.g., due to a disk crash) to assign a new incarnation stamp and restart its SSN allocation sequence. If not for the incarnation stamp, a server losing its SSN state would be forced to move to a different host name or port number, which would be an ARS peer-visible change. Note that ARS peers contact each other using only the host and port information. The incarnation stamp is only used as part of GlobalServerID's, which in turn provide a key for looking up replication state (such as the last seen SSN from a particular server).

Note that if a new server incarnation is established, no ordering constraints are defined with respect to the previous server incarnation. For example, an update submitted to the newly incarnated server might be serialized before an update that had been submitted chronologically earlier at the previous server incarnation.

At present there is no recovery mechanism if a primary server loses track of its last assigned CSN. Primary servers must therefore be run with more failure-resilient technology than non-primary servers -- for example using RAID-5 plus hot backups. Note that an incarnation stamp approach would be problematic for primary servers because it would mean that updates committed after server re-incarnation would have no defined serialization relationship with those committed before re-incarnation, which in turn violates convergent consistency requirements.

The incarnation stamp is a 64 bit number generated from the time-of-day clock on the server for which the incarnation stamp is being generated. There is no clock synchronization requirement, since the stamp for any particular server is always generated by a single machine. Nor is there a requirement that the time stamp be formed according to any particular clock format (e.g., the UNIX seconds-since-midnight-1970 epoch -- although the examples in this document use that format). The only requirement is that a newly generated incarnation stamp must be at least one greater than the previously assigned incarnation stamp for that server.

The reason for using a timestamp rather than a simple counter is that using a timestamp reduces the chances for an administrative error that would assign an incarnation number that had already been assigned. In particular, the only state needed to generate a new incarnation stamp is the current time-of-day clock, which is readily available without access to any previous replication server state (which may have been completely destroyed by a disk crash).

Incarnation stamp 0 is defined to be invalid, and thus can be used by the server implementation as a pre-initialized value to ensure a valid incarnation stamp has been received during later processing.
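
A non-normative sketch of incarnation stamp generation under these rules, assuming the UNIX epoch format used in this document's examples:

    import time

    def new_incarnation_stamp(previous_stamp=0):
        """Time-of-day clock, forced to be at least one greater than the
        previous stamp for this server; stamp 0 is reserved as invalid."""
        return max(int(time.time()), previous_stamp + 1)

    stamp = new_incarnation_stamp()   # e.g., 979428854 in this document's examples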

3.1.2 Sequence Numbers

ARS uses 64 bit unsigned integer sequence numbers to provide unique-for-all-time identification of submitted and committed updates being processed by individual servers. For example, this counter size would allow one million updates per second to a particular zone for 585,000 years without wrapping. There are two types of sequence numbers:

  1. SSN: used by ars-s, the Submit Sequence Number (SSN) is allocated per submission server per zone to serialize all update submissions to a zone/server pair. The SSN plus GlobalServerID constitutes a GlobalSubmitID that uniquely identifies a submission for all time. The SSN imposes a total ordering over all updates submitted at that server, and a partial ordering over all updates globally.
  2. CSN: used by ars-c, the Commit Sequence Number (CSN) is allocated per update per zone by the zone primary server after an entire update submission has been received and checked for various problems (discussed below). Each successfully committed UpdateGroup is assigned a CSN (the value for which is subsequently associated with all documents in that UpdateGroup), which in turn serializes all update submissions to a zone so that updates are committed in the same order globally. More formally, the CSN imposes a global ordering on all updates that respects the partial orderings imposed by the SSN's from all submission servers for the zone.

Note that there is no need for logical clocks [9] for sequence numbers because updates are not applied at database replicas until they have been serialized at the primary. In fact, logical clocks must not be used because that would cause gaps in the SSN sequence, which would appear to the primary as missing update submissions.

In the case of an UpdateGroup a single CSN must be assigned to the entire update (rather than one CSN per document within the update submission).
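
A non-normative sketch of per-zone sequence allocation consistent with these rules (durable storage of the counters is a local implementation matter):

    # SSN's are allocated per submission server per zone; CSN's are allocated
    # per zone by the primary, one per committed UpdateGroup.
    class ZoneSequences:
        def __init__(self):
            self.last_ssn = 0    # SSN 0 is invalid; first assigned SSN is 1
            self.last_csn = 1    # CSN 1 is the implicit starting state

        def next_ssn(self):
            self.last_ssn += 1
            return self.last_ssn

        def next_csn(self):      # primary only; first assigned CSN is 2
            self.last_csn += 1
            return self.last_csn

    zones = {}
    def sequences_for(zone_top_node):
        return zones.setdefault(zone_top_node, ZoneSequences())

    assert sequences_for("blocks:test.schwartz").next_ssn() == 1
    assert sequences_for("blocks:test.schwartz").next_csn() == 2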

Sequence number 0 (for both SSN's and CSN's) is defined to be invalid. It is used in three cases:

  1. as the value of the CSN field in the SubmittedUpdateResultNotification response for a failed update;
  2. for documents that are not replicated (e.g., if local state information about the replication system is stored in a Blocks datastore, the CSN values for each of those documents should be 0); and,
  3. as the value of LastSeenCSN when requesting an entire zone in the PullCommittedUpdates request.

CSN 1 is defined to be the first valid Commit Sequence Number, and is used only for the case of a data item that lacks a current CSN attribute (i.e., CSN value 1 is the value used as the default for this IMPLIED attribute). The first CSN assigned by an ARS server in response to a successfully committed update is 2. This definition is specifically used to allow a datastore not previously replicated by ARS to be replicated without requiring a special tool to add CSN's (see the SubmitUpdate Processing section). Instead, ARS interprets a missing CSN attribute as '1', in effect treating all previous updates applied to a non-replicated datastore as being rolled up into a starting state with CSN=1. From then on, ARS assigns CSN's for successfully committed updates starting at CSN value 2.
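
The default-CSN rule amounts to one line in an implementation; a non-normative sketch:

    # A missing CSN attribute on a document is interpreted as 1, the implicit
    # "rolled-up" starting state for a previously non-replicated datastore.
    def effective_csn(attributes):
        return int(attributes.get("CSN", "1"))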

When a zone is divided/delegated, the newly created zone initializes its CSN to be the highest CSN value assigned in the zone from which it was delegated. Doing this (rather than restarting the counting sequence) preserves the monotonicity of CSN's and avoids the need for renumbering sequence numbers assigned to documents within the new zone. The original zone also continues allocating CSN's from this high-water mark CSN. Note that once a zone is delegated, the fact that the original and new zone have the same CSN implies nothing about the relative orderings of updates applied in each. ARS defines no ordering of updates across zones.

3.1.3 DataWithOps Encoding

ARS requires all clients and servers to support the DataWithOps encoding. DataWithOps is used by ARS servers that do not support the ars-e sub-protocol. It is also used in cases where ars-e is supported but has not been performed between a pair of ARS peers.

Each DataWithOps element contains zero or more DatumAndOp elements describing a set of update operations to be performed, such that either all operations succeed or all operations fail (per the ANTACID semantics defined in [1]). Each DatumAndOp element contains a set of attributes concerning the update to be performed and the content of the document being updated. The attributes are:

Name:
the URI of the document to be updated;
CSN:
the CSN of the document to be updated;
Action:
one of:
create:
verifies that the document does not exist in the datastore before creating it;
write:
creates or overwrites the document in the datastore (the default);
update:
verifies that the document exists in the datastore before overwriting it; or,
delete:
removes the document from the datastore.

3.1.4 ARSError

The ARSError element provides an error-signaling structure for exchanging ARS profile-specific errors, providing specific detail beyond BEEP error handling. The ARSError element contains three attributes:

  1. OccurredAtSvrHost specifies the DNS name or IP address of the server that flagged the error;
  2. OccurredAtSvrPort specifies the port number of the server that flagged the error; and,
  3. OccurredAtSvrIncarn specifies the incarnation stamp of the server that flagged the error.

This can provide useful information when an update propagates up several hops in a DAG, with multiple choices at each hop.

The ARSError element contains three elements:

  1. ARSErrorCode, which must be filled in with values as enumerated below. ARSErrorCode 0 is defined to be invalid. It can be used by a server implementation as a pre-initialized value to ensure a valid code was received during later processing.
  2. ARSErrorText, which must be filled in.
  3. ARSErrorSpecificsText, which may be filled in to provide additional detail. The error code enumeration below provides recommendations of what additional information should be filled in for the ARSErrorSpecificsText in cases where additional detail is warranted.

Non-zero ARSErrorCode's use positional structure encoded into unsigned 32 bit numbers, as follows:


            First digit:                                                  
                 1: client problem                                        
                 2: server problem                                        
            Second digit:                                                 
                 1: service failure                                       
                 2: service refusal                                       
            Third digit:                                                  
                 1: security                                              
                 2: timeout                                               
                 3: mis-configuration                                     
                 4: too expensive                                         
                 5: implementation-specific failure                       
                 6: data conflict                                         
                 7: protocol/format error                                 
                 8: request for unimplemented feature                     
                 9: resource overload                                     
                 0: other                                                 
            Fourth-Sixth digits: three-digit enumeration of errors.  For  
            example, error code 114001 is a client problem that caused a  
            service failure because the request was too expensive.
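
A non-normative sketch that decodes this positional structure (the tables mirror the digit assignments above):

    # Decode an ARSErrorCode per the positional structure above.
    WHO   = {1: "client problem", 2: "server problem"}
    WHAT  = {1: "service failure", 2: "service refusal"}
    CAUSE = {1: "security", 2: "timeout", 3: "mis-configuration",
             4: "too expensive", 5: "implementation-specific failure",
             6: "data conflict", 7: "protocol/format error",
             8: "request for unimplemented feature",
             9: "resource overload", 0: "other"}

    def decode_ars_error_code(code):
        digits = "%06d" % code            # six digits, e.g. "114001"
        return (WHO[int(digits[0])], WHAT[int(digits[1])],
                CAUSE[int(digits[2])], int(digits[3:]))

    print(decode_ars_error_code(114001))
    # -> ('client problem', 'service failure', 'too expensive', 1)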

The error codes listed below are referenced throughout this document. These error codes cover more failure conditions than those specifically mentioned in the protocol and algorithm discussions in this document, such as disk space exhaustion. Moreover, a variety of local implementation failures are possible (such as data validity assertion failures built into the code), which also are represented in the ARSError list below.

The currently defined ARSErrorCode's are:

116001:
Attempt to delete non-existent document. [ErrorSpecificsText should specify the non-existent document URI.]
116002:
Attempt to update non-existent document. [ErrorSpecificsText should specify the non-existent document URI.]
117001:
Missing URI in document store request.
121001:
Authentication failure.
121002:
Access denied.
123001:
Request was made to submission server that does not hold zone being requested.
123002:
Request was made to upstream server that does not hold zone being requested.
123003:
Attempt to update documents spanning zone boundaries within a single UpdateGroup.
123004:
Request to update data in unknown name space.
126001:
Write-write conflict detected. [ErrorSpecificsText should show ID of server that last updated document before this conflict was detected.]
126002:
Request violates datastore operation semantics. [ErrorSpecificsText should specify more details.]
126003:
General datastore error. [ErrorSpecificsText should specify more details.]
127001:
Malformed client-server ARS protocol transmission. [ErrorSpecificsText should show XML parser error output.]
210001:
Unable to propagate update submission to any upstream servers. [ErrorSpecificsText should provide some details about how many attempts were made, over how long of a duration.]
212001:
Timeout at zone primary waiting for submitted update re-ordering.
212002:
Timeout while waiting for zone lock.
212003:
Timeout while waiting for upstream server to propagate update.
212004:
Timeout while trying to respond to request.
213001:
Content encoding from upstream server not understood. [ErrorSpecificsText should name the encoding.]
213002:
No appropriate content encoding was available for the requested operation. [ErrorSpecificsText should name the server where the problem occurred, and the operation for which no content encoding could be found.]
213003:
Malformed ARS protocol transmission (client or server). [ErrorSpecificsText should describe parse error (note: used for cases where the underlying service can't tell whether it's a client-to-server or server-to-server ARS parsing error).]
213004:
General parsing error. [ErrorSpecificsText should describe parse error (note: used for cases where the underlying service can't determine whether it's server-to-server parsing or config file parsing).]
213005:
BEEP connection attempt to remote ARS end point relayed on behalf of current request failed. [ErrorSpecificsText should describe more detail about the nature of the failure.]
219001:
Server resource overload. [ErrorSpecificsText should contain detail about what resource(s) overloaded.]
223001:
No content encodings available for full zone transfer.
223002:
PropagateSubmittedUpdate request received from server not configured as a downstream server.
223003:
PushCommittedUpdates request received from server not configured as an upstream server.
223004:
PullCommittedUpdates request received from server not configured as a downstream server.
223005:
Requested ARS sub-protocol not supported.
223006:
Update submission received at non-primary that does not support ars-s.
225001:
Implementation-specific failure. [ErrorSpecificsText should provide detail.]
226001:
Duplicate update submission detected -- could be a server retransmitting after update submission has already been successfully received, or a server configuration loop.
226002:
Request for CSN before log truncation point. Full zone transfer should be requested. [ErrorSpecificsText should show CSN & truncation point.]
227001:
Malformed server-server ARS protocol transmission. [ErrorSpecificsText should show XML parser error output.]

3.2 ARS Message Semantics

ARS consists of three sub-protocols, only the first of which must be implemented by all ARS servers: the Commit-and-Propagate Protocol (ars-c), the Submission-Propagation Protocol (ars-s), and the Encoding Negotiation Protocol (ars-e). The protocol syntax for a server supporting any subset of these protocols is defined by a DTD whose contents are constructed based on the top-level definition and inclusion content. Here the operations to be supported are defined in the "ARSREQUESTS" ENTITY, the DTD for each supported sub-protocol is included, and, if ars-e is not supported, the "UpdateGroup" ELEMENT is set to define the single required default encoding for all ARS servers (DataWithOps).

The "ARSRequest" element contains a "ReqNum" attribute and one of a subset of the following elements, the subset being defined by the "ARSREQUESTS" ENTITY: a "SubmitUpdate" element, a "SubmittedUpdateResultNotification" element, a "PushCommittedUpdates" element, a "PullCommittedUpdates" element, a "ContentEncodingNegotiation" element, and a "PropagateSubmittedUpdate" element.

The "ReqNum" attribute (an integer in the range 1..4294967295) is used to correlate "ARSRequest" elements sent by a BEEP peer acting in the client role with the "ARSResponse" elements sent by a BEEP peer acting in the server role. Request number 0 is defined to be invalid, and thus can be used by the server implementation as a pre-initialized value to ensure a request was received during later processing.

The semantics of each of the elements within the ARSRequest are defined in the following subsections.

3.2.1 ARS Commit-and-Propagate Protocol (ars-c)

ars-c defines four request elements: SubmitUpdate, SubmittedUpdateResultNotification, PushCommittedUpdates, and PullCommittedUpdates. For the time being we assume clients submit to the primary server; submissions to non-primary servers are discussed later. For the time being we also assume that all submitted and committed updates are transmitted between all ARS peers using the DataWithOps encoding. The ability to support other encodings is discussed later.

3.2.1.1 SubmitUpdate

Clients submit groups of documents and their associated operation names to be performed in an ANTACID (see [1]) fashion using the SubmitUpdate request.

The SubmitUpdate element contains three optional attributes:

NotifyHost
specifies the DNS name or IP address to which asynchronous notification is to be sent after the commit fails or succeeds;
NotifyPort
specifies the port number for asynchronous notification; and,
NotifyOkOnCurrentChannel
specifies whether it is acceptable for the server to send notification on the same channel that was used for submitting the update, if that channel is still open at the time the notification is ready to be sent. This flag allows the server to avoid the overhead of opening a new BEEP channel for updates that commit relatively quickly. The flag is needed because it is possible that the submission arrives on a different host and port than that specified by NotifyHost and NotifyPort, and different applications may or may not want to allow notification to arrive on the original submission channel. The default (IMPLIED) value is "no", meaning that the server must open a new channel for notification.

If a NotifyHost is specified then a NotifyPort must also be included. If only one of these attributes is included the update must be rejected with an ARSError containing ARSErrorCode=127001. The peer that receives notification may differ from the original submitting client, for example allowing a mobile client to perform update submissions and an always-connected server to receive the SubmittedUpdateResultNotification and convert it to an email message for the user to pick up later.

If NotifyOkOnCurrentChannel='yes', then NotifyHost and NotifyPort must also be specified. If NotifyOkOnCurrentChannel='yes' and NotifyHost or NotifyPort is not specified, the update must be rejected with an ARSError containing ARSErrorCode=127001. The semantics when NotifyOkOnCurrentChannel='yes' are that the server may send the SubmittedUpdateResultNotification on the channel over which the update was submitted, provided that channel is still open when the notification is ready to be sent; otherwise the server must open a new channel to NotifyHost/NotifyPort, as in the default case.

The SubmitUpdate element also contains an UpdateGroup element. The UpdateGroup contains one or more DataWithOps elements, structured as noted in the DataWithOps Encoding section. Although the DTD allows for zero or more DataWithOps, if zero elements are included in a SubmitUpdate request the update must be rejected with an ARSError containing ARSErrorCode=127001. (The case of zero elements is used elsewhere in the protocol.)

The response to a failed SubmitUpdate request contains an ARSError describing the failure. For example, an update submission requesting deletion of a non-existent document might receive a response as follows.


    <ARSRequest ReqNum='1'>
        <SubmittedUpdateResultNotification
        SubmisSvrHost='svr2.example.com' SubmisSvrPort='10201'
        SubmisSvrIncarn='979428854' ssn='1' csn='0'
        ZoneTopNodeName='blocks:test.schwartz'>
            <ARSError OccurredAtSvrHost='svr3.example.com'
            OccurredAtSvrPort='10201' OccurredAtSvrIncarn='979428854'>
                <ARSErrorCode>
                    126002
                </ARSErrorCode>
                <ARSErrorText>
                    Request violates datastore operation semantics
                </ARSErrorText>
                <ARSErrorSpecificsText>
                    Request #1 [BlockNameAndStoreOp:
                    name=test.schwartz.blk01, StoreOp=delete] failed
                </ARSErrorSpecificsText>
            </ARSError>
        </SubmittedUpdateResultNotification>
    </ARSRequest>

The response to a successful SubmitUpdate request contains an ARSAnswer element, which in turn contains a GlobalSubmitID element. The GlobalSubmitID contains four attributes:

SubmisSvrHost
specifies the DNS name of the submission server (note that unlike some other parts of ARS, the GlobalSubmitID allows only DNS names, not IP addresses, in the host component);
SubmisSvrPort
specifies the port number of the submission server;
SubmisSvrIncarn
specifies the incarnation stamp of the submission server; and,
SSN
specifies the Submit Sequence Number assigned to this update submission.

A success response to a SubmitUpdate request means that the server has accepted the update and will begin processing it at some time in the future. If a client wishes to be informed of success/failure of the update commit operation it may request asynchronous notification, as noted earlier.

3.2.1.2 SubmittedUpdateResultNotification

The SubmittedUpdateResultNotification element is used to notify the client of success/failure of its submitted update. A SubmittedUpdateResultNotification is sent to the client when this status becomes known at the submission server (as opposed to when the update has committed at the primary).

A SubmittedUpdateResultNotification for a successfully committed update contains six attributes:

SubmisSvrHost
specifies the DNS name of the submission server;
SubmisSvrPort
specifies the port number of the submission server;
SubmisSvrIncarn
specifies the incarnation stamp of the submission server;
SSN
specifies the Submit Sequence Number assigned to this update submission;
CSN
specifies the Commit Sequence Number that was assigned by the primary for this update; and,
ZoneTopNodeName
specifies the URI [5] of the top node in the zone within which this update occurred.

A SubmittedUpdateResultNotification for an update that failed contains the same six attributes above, except that the CSN number is set to 0. In addition, the SubmittedUpdateResultNotification for a failed update contains a single ARSError element describing the error that occurred.

3.2.1.3 PushCommittedUpdates

A PushCommittedUpdates request is made from an upstream server to a downstream server to suggest that the downstream server perform a PullCommittedUpdates request from the upstream server. It provides a means of propagating updates quickly without the downstream servers' needing to poll the upstream server.

The PushCommittedUpdates element contains two attributes:

UpstreamHost
specifies the DNS name or IP address of the upstream server making the request; and,
UpstreamPort
specifies the port number of the upstream server making the request.

3.2.1.4 PullCommittedUpdates

The PullCommittedUpdates request specifies one or more ReplState elements, corresponding to the one or more zones that the downstream server replicates from the upstream server and for which it is making the PullCommittedUpdates request. Each ReplState element contains two elements:

TopNodeOfZoneToReplicate
specifies the top node in the name tree for the current zone being replicated; and,
LastSeenCSN
specifies the last CSN the downstream server has seen. The semantics are that the upstream server is to send committed update content (discussed shortly) for each operation that has occurred since that CSN (i.e., not including that CSN), optionally using the collapsing notion defined in [1]. A request specifying LastSeenCSN='0' indicates that the entire zone is to be transferred.

The response to a failed PullCommittedUpdates request contains an ARSError describing the failure.

The response to a successful PullCommittedUpdates request contains zero or more UpdateGroup's. Each UpdateGroup contains a set of committed updates, encoded in the default DataWithOps encoding.

3.3 ARS Submission-Propagation Protocol (ars-s)

If the primary and a non-primary server both support ars-s, updates may also be submitted to the non-primary server.

ars-s adds two new protocol requests to those defined by ars-c:

  1. PropagateSubmittedUpdate, which is used by a non-primary to forward an update submission up the replication Directed Acyclic Graph (DAG) towards the primary; and,
  2. SubmittedUpdateResultNotification (which ars-c uses for client notification) is used in an additional way, namely, to provide asynchronous success/failure notification to a downstream server of a request it had earlier submitted.

3.3.1 PropagateSubmittedUpdate

The PropagateSubmittedUpdate element contains six attributes:

SubmisSvrHost
specifies the DNS name of the submission server;
SubmisSvrPort
specifies the port number of the submission server;
SubmisSvrIncarn
specifies the incarnation stamp of the submission server;
SSN
specifies the Submit Sequence Number assigned by the submission server for this update submission;
NotifyHost
specifies the DNS name or IP address to which asynchronous notification is to be sent after the commit fails or succeeds; and,
NotifyPort
specifies the port number for asynchronous notification.

The PropagateSubmittedUpdate element also contains one of two possible elements:

  1. UpdateGroup containing the submitted update content, which contains one or more DataWithOps elements.
  2. FailedUpdateSubmission, which is used to indicate that all attempts to perform a PropagateSubmittedUpdate request to upstream servers (after timing out and retrying a configurable number of times) have failed. A FailedUpdateSubmission can also be generated by an administrative tool run to fail updates submitted to a server that is not brought down cleanly, in violation of the Client-Server Promise (see [1]).

3.3.2 SubmittedUpdateResultNotification Extended Semantics

Unlike its use in ars-c, with ars-s the notification destination (host/IP and port) is required, so that downstream servers always receive notification of update results. Note also that the GlobalSubmitID contained in a SubmittedUpdateResultNotification always specifies the globally unique identifier for the submission server (including the unique SSN it generated), which should be used by each server along the submission path as a key into a local state table of in-progress update submissions (e.g., to find where to propagate the response back down to the previous server on the submission path).

As in the ars-c case, the SubmittedUpdateResultNotification includes the ARSError if an error occurred, or the CSN that was assigned by the primary for the given SSN if no error occurred.

3.4 ARS Encoding Negotiation Protocol (ars-e)

The ContentEncodingNegotiation element is optionally initiated by an ARS peer that wishes to determine if an expanded set of encodings is available beyond the default DataWithOps encoding.

The currently defined encodings and procedures for registering new encodings are provided in an appendix.

The ContentEncodingNegotiation element contains a ZoneTopNodeName attribute specifying the URI of the top node in the zone to which this encoding is to apply, and one or more ContentEncodingName elements corresponding to content encodings the initiator supports. Each ContentEncodingName element contains an NMTOKEN specifying the name of a defined encoding (such as "DataWithOps").

The responder sends back the subset of the requested encodings that it supports.

After the ContentEncodingNegotiation has completed, each ARS peer may cache the list of ContentEncodingName's supported by the given peer and for the given zone, for the duration of the ARS channel's lifetime. Given a list of supported ContentEncodingName's, each ARS peer may select an appropriate encoding in future message exchanges.

If no ContentEncodingNegotiation has taken place before an operation, the DataWithOps encoding must be used. See the Current Encodings section for cases where DataWithOps may fail to meet the needs of the current transmission.




4. Algorithms and Implementation Details

Below we discuss the basic state management needed to implement an ARS server. We then discuss algorithm and implementation details for each of the three sub-protocols.

4.1 ARS Meta-Data Management

A variety of meta-data must be managed to implement an ARS service. This section discusses possible implementation approaches for managing this meta-data.

4.1.1 Document State

ARS requires two pieces of meta-data to be associated with each document: the name of the document and its current CSN. It is a local implementation matter how these meta-data are stored. One approach would be to store these meta-data in the datastore itself, as attributes in the root element of each document. Another approach would be to maintain a separate repository mapping document name to the pair (physical address for document, CSN), where the physical address might be a disk block address or a database row ID. This approach is similar to how a UNIX file system uses a directory file to map from hierarchical name to flat (inode) name plus protection attributes.
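
A minimal non-normative sketch of the second approach, with the physical address left abstract (a disk block address, a database row ID, etc.):

# Separate repository mapping document name to (physical address, CSN),
# analogous to a UNIX directory file; held in memory here for illustration.
doc_state = {}     # document name (URI) -> (physical_address, csn)

def record_document(name, physical_address, csn):
    doc_state[name] = (physical_address, csn)

def current_csn(name):
    return doc_state[name][1]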

4.1.2 Committed Update State Management

To be able to respond to PullCommittedUpdates requests, an ARS server needs to track the set of operations that have committed on each document, and the corresponding CSN's. Some type of index is needed to locate all operations and their associated meta-data for which the CSN is larger than a given CSN. It is a local implementation matter how these meta-data are to be managed. We discuss two possibilities here.

Similar to the case noted in the previous section, one approach would be to store the meta-data as attributes in the datastore itself, in each document. An additional complication with doing this for managing committed update state is that there must be a way to track deleted documents (so that "delete" operations can be returned in response to a PullCommittedUpdates request after a delete has committed at the ARS server). To do this, at "delete" time the datastore could use another root element attribute to mark documents as deleted, rather than physically removing them from the datastore. Additionally, the datastore will need to provide a way for each profile (SEP, ARS, etc.) that uses the datastore to choose whether to retrieve deleted documents. For example, it must be possible to service SEP queries such that deleted documents are not returned, but it must be possible for ARS to retrieve the document name and "delete" operations that have committed since a given CSN.

In addition to the above complication, there is a potential performance problem with tracking deleted documents in the datastore. Specifically, the "greater than CSN" lookup needs to be very efficient, potentially retrieving millions of results. If the underlying datastore supports only a text-based index (e.g., designed primarily to support SEP textual queries), "greater than" queries will probably be slow. In this case, it would be preferable to implement a more specialized indexing structure to track committed updates. That leads to the second approach, namely, tracking committed updates in some type of log. The log could be implemented as a flat file with a corresponding numeric index, or perhaps in a relational database table.
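
The following non-normative sketch shows such a log with a numeric index, held in memory for illustration (a real implementation would use a flat file or database table as described above):

import bisect

# Append-only log of committed operations plus a parallel, sorted list of
# CSN's, so that "all operations with CSN greater than N" is a binary
# search followed by a contiguous scan.
log = []     # [(csn, name, op), ...] appended in increasing CSN order
csns = []    # parallel list of CSN's

def append_committed(csn, name, op):
    log.append((csn, name, op))
    csns.append(csn)

def committed_since(last_seen_csn):
    i = bisect.bisect_right(csns, last_seen_csn)
    return log[i:]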

If a log implementation is chosen, a local implementation decision needs to be made about how far back in history to keep update logs. Generally speaking, the larger the content held by an ARS server and the more expensive the network links, the longer back in history the server should retain logs. Note also that systems supporting mobile clients should provision for more log data to be kept, to accommodate more clients, longer-running transactions, etc.

4.1.3 Committed Update Collapsing

Regardless of whether committed update state is tracked inside the datastore or in an auxiliary log, ARS servers may choose to implement "collapsing" updates as defined in [1]. Doing so could yield significant savings in network transmissions as well as space required for committed update state. For the sake of simplicity below we describe only how to implement update collapsing assuming a log-based implementation of committed update state management.
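
As a non-normative illustration, the sketch below collapses a log of committed (CSN, name, operation) records by keeping only the final operation on each document; the precise collapsing rules are defined in [1], so this is only one plausible rendering:

# Collapse committed update records, retaining for each document only the
# last operation (with its CSN), so that downstream servers replay fewer
# operations.  Records are (csn, name, op) tuples as in the log sketch above.
def collapse(records):
    last = {}                      # document name -> index of final operation
    for i, (csn, name, op) in enumerate(records):
        last[name] = i
    return [rec for i, rec in enumerate(records) if last[rec[1]] == i]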

To implement update collapsing, the ARS server does as follows:

In this fashion, for example, the following update sequence run at the primary:

will be "played back" for the downstream server that requests all operations since csn=1 as:

ARS servers are not required to perform update collapsing when responding to a PullCommittedUpdates request. However, ARS servers must be prepared to process PullCommittedUpdates responses that have been collapsed. Specifically:

4.1.4 Per Server Sequence Number State

Sequence numbers are tracked as follows. For each zone it handles, an ARS server tracks the last assigned SSN for that server for that zone. In addition, the primary tracks the last assigned CSN for each zone for which it is primary, each server tracks the last CSN it has seen for each zone it replicates (for use in PullCommittedUpdates requests), and each server tracks the SSN of the last successfully committed update for each (zone, submission server) pair (for use in the duplicate-submission check discussed under Non-Primary PropagateSubmittedUpdate Processing).

If a site's ARS service is implemented by multiple physical servers (all identified by a single DNS name at the site), those servers must coordinate assignment of sequence numbers among each other to meet the uniqueness requirement, for example by retrieving the SSN from a shared backend database.

Note that per-server sequence number state need not be saved in the datastore, and in fact for the sake of efficiency should be saved to a lighter weight storage system such as flat files. (The datastore implements ACID semantics, which is overkill for managing individual data items.)
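
A minimal non-normative sketch of flat-file sequence number state; the one-line-per-zone file format is purely illustrative, and the multi-server coordination mentioned above is not attempted:

import os

STATE_FILE = 'ssn_state.txt'    # lines of the form "<zone> <last_ssn>"

def _load():
    state = {}
    if os.path.exists(STATE_FILE):
        for line in open(STATE_FILE):
            zone, ssn = line.rsplit(None, 1)
            state[zone] = int(ssn)
    return state

def next_ssn(zone):
    state = _load()
    state[zone] = state.get(zone, 0) + 1
    with open(STATE_FILE, 'w') as f:
        for z, s in state.items():
            f.write('%s %d\n' % (z, s))
    return state[zone]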

4.1.5 Locking

A zone-wide lock is obtained in the process of committing an update. This is accomplished using 'lock' and 'release' primitives, invoked on the top-level node for the zone before and after (respectively) performing the individual document writes. These primitives implement the following semantics:

lock:
specifies the URI of the document defining a subset of a zone to which the requesting user instance is requesting exclusive write access. The zone subset to be locked consists of the named document and all documents beneath that document in the subtree, down to but not including any zone delegation cut points in the subtree. (If there are no zone delegation points, the zone subset consists of the entire subtree under the specified node, down to and including the leaves.) A lock must be performed successfully before any document writes may be performed. While a zone subset is locked, no other user instance may lock or write documents successfully within the zone subset, and any document write operations are journaled until a subsequent release operation.
release:
specifies whether to commit or rollback any journaled document write operations. All document write operations performed while a zone subset is locked have atomic update semantics -- either they all succeed or they all fail. If they all succeed, they must all become visible to other clients of the local datastore atomically.

Note: for performance reasons it may be preferable to implement a more optimistic concurrency control technique so that write operations from multiple updates can be overlapped and conflicts cause rollback/replay. For simplicity we talk about zone-wide locking in the current document.

If the ARS implementation is threaded, additional synchronization is required, because datastore lock semantics disallow a single process from locking nested subtrees (e.g., locking "a.b.c" when "a.b" is already locked). Threaded implementations therefore need to maintain a table of threads currently holding or waiting for a lock, listing the thread identifier and the tree node locked or to be locked. When a new lock request is to be performed, this table must be checked to see if any other thread currently holds a lock on a tree node above or below the node in the current request; if so, the request is queued. When a lock is released, the table must be checked again to see if any waiting thread may now issue its datastore lock request.

Finally, appropriate synchronization is needed around accesses to the above table.
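
The following non-normative sketch shows such a thread-level lock table, assuming dotted names where one name conflicts with another exactly when it equals it or is an ancestor or descendant of it:

import threading

# Two zone-subset names conflict when one equals or is an ancestor of the
# other (e.g., "a.b" conflicts with "a.b.c").
def conflicts(n1, n2):
    return n1 == n2 or n1.startswith(n2 + '.') or n2.startswith(n1 + '.')

_table = threading.Condition()
_held = {}                       # locked node name -> holding thread id

def acquire_subtree(node):
    with _table:
        while any(conflicts(node, h) for h in _held):
            _table.wait()        # queue behind conflicting holders
        _held[node] = threading.get_ident()
    # ... now safe to issue the actual datastore lock request ...

def release_subtree(node):
    with _table:
        del _held[node]
        _table.notify_all()      # let waiting threads re-check the table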

4.1.6 Server Configuration Data

4.1.6.1 Replication Topology (Normative)

The ARS Topology Configuration DTD provides the DTD for configuring the replication topology of an ARS server. While the storage management mechanism for this configuration data (local file, database table, etc.) is a local implementation matter, the document structure is defined here for two reasons:

Each server specifies the set of zones it handles, whether it is primary for each zone, and the immediate upstream and downstream servers for each zone it serves. The configuration data also specifies the frequency of PushCommittedUpdates and PullCommittedUpdates requests, as well as preferences for the order that servers are to be contacted when propagating submitted updates.

As an example, the following is the configuration data for a primary server running on host s1.example.com and port 5682, which replicates content in the Blocks name space:


<?xml version='1.0'?>
<!DOCTYPE ARSExportedConfig SYSTEM 'ARSExportedConfig.dtd'>
<ARSExportedConfig>
    <GlobalServerID SvrHost='s1.example.com' SvrPort='5682'
     SvrIncarn='979428854'/>
    <ZonePrimaryConfig>
        <ZoneTopNode Name='blocks:.'/>
        <ZoneCutPoint Name='blocks:doc.rfc'/>
        <ZoneCutPoint Name='blocks:doc.edgar'/>
        <DownstreamServer>
            <ServerLocation SvrHost='s2.example.com'
             SvrPort='5682'/>
            <PushProperties Period='600'/>
        </DownstreamServer>
        <DownstreamServer>
            <ServerLocation SvrHost='s3.example.com'
             SvrPort='5682'/>
            <PushProperties Period='600'/>
        </DownstreamServer>
    </ZonePrimaryConfig>
</ARSExportedConfig>

This is the primary server for the global name tree root, delegating at cut points "doc.rfc" and "doc.edgar". It is replicated by two downstream servers, running on s2.example.com and s3.example.com. It pushes updates to those servers every 10 minutes.

Here is a configuration file for a non-primary server running on host s2.example.com and port 5682, which also replicates content in the Blocks name space:


<?xml version='1.0'?>
<!DOCTYPE ARSExportedConfig SYSTEM 'ARSExportedConfig.dtd'>
<ARSExportedConfig>
    <GlobalServerID SvrHost='s2.example.com' SvrPort='5682'
     SvrIncarn='979428854'/>
    <NonZonePrimaryConfig>
        <ZoneTopNode Name='blocks:.'/>
        <ZoneCutPoint Name='blocks:doc.rfc'/>
        <ZoneCutPoint Name='blocks:doc.edgar'/>
        <UpstreamServer>
            <Preference Weight='10'/>
            <ServerLocation SvrHost='s1.example.com'
             SvrPort='5682'/>
            <TopNodeOfZoneToReplicate Name='blocks:.'/>
            <PullProperties Period='-1'/>
        </UpstreamServer>
        <UpstreamServer>
            <Preference Weight='20'/>
            <ServerLocation SvrHost='s6.example.com'
             SvrPort='5682'/>
            <TopNodeOfZoneToReplicate Name='blocks:.'/>
            <PullProperties Period='-1'/>
        </UpstreamServer>
        <DownstreamServer>
            <ServerLocation SvrHost='s4.example.com'
             SvrPort='5682'/>
            <PushProperties Period='600'/>
        </DownstreamServer>
        <DownstreamServer>
            <ServerLocation SvrHost='s5.example.com'
             SvrPort='5682'/>
            <PushProperties Period='600'/>
        </DownstreamServer>
    </NonZonePrimaryConfig>
</ARSExportedConfig>

This server replicates the "." zone from two upstream servers (s1.example.com and s6.example.com). It does not schedule any periodic update pull requests from the upstream servers, because in this set of servers only pushes are scheduled. The server specifies preference weights for each upstream server, used to determine the order that the upstream servers are tried when attempting to propagate update submissions. Finally, this server is replicated by two downstream servers, running on s4.example.com and s5.example.com, respectively.

4.1.6.2 Local Implementation Settings (Non-Normative)

In addition to replica topology information, ARS servers will also need various local configuration data. What follows is not part of the normative specification for ARS, but is included to provide a concrete example to implementors. The author's ARS implementation has the following local configuration data:

HomeDirectory:
Root directory under which data, logs, and configuration information are stored.
ValidateARSMessages:
Whether to validate ARS protocol messages against the DTD. Note that enabling this setting can adversely affect server performance.
DetectWriteWriteConflicts:
Whether to detect write-write conflicts. This matters only at the zone primary. The specification requires this check to be enabled; the author's implementation includes an option to disable it, to allow experimentation (since the check adds overhead) and easier testing (since otherwise an update must carry the correct CSN before it is sent).
OutOfOrderTimeoutInSecs:
How long to wait (in seconds) for out-of-order update submissions while earlier submissions are propagated before timing out the update for the current attempt period.
LockWaitTimeoutInSecs:
Number of seconds to allow update submissions to wait for the subtree lock while trying to apply a committed update before timing out.
SingleARSRequestAttemptTimeoutInSecs:
How long to wait (in seconds) for PropagateSubmittedUpdate requests to complete before timing out the update for the current attempt period.
ServiceFailedTransmitRetryPeriodInSecs:
How long to wait (in seconds) after a retryable request fails due to service failure before retrying.
ServiceFailedTransmitMaxAttempts:
Number of times to retry a request that failed due to service failure before giving up and reporting the failure to the client.
LogicallyIndentBlocks:
If true, logical indentation is added to XML document start and end elements (not the character content) as they are written out; otherwise output is left-margin aligned. Note that this setting is only meaningful for XML documents that are parsed.
CacheSeqNumBlocks:
Enable/disable SeqNumBlock caching.
ARSDTDFileName:
Location of ARS DTD.
ARSContentEncodingsFileName:
Location of the DTD for ARS content encodings. This file can be edited locally to add new (non-standardized) content encodings, and is included in the configuration so that the ARS runtime can validate content encodings when ValidateARSMessages is enabled.

4.2 Protocol Processing

An ARS server must implement ars-c, and may implement one or both of ars-s and ars-e. As part of the required protocol handling support, ARS servers must reject requests for a non-supported sub-protocol with an ARSError containing ARSErrorCode=223005.

4.2.1 ARS Commit-and-Propagate Protocol (ars-c)

4.2.1.1 SubmitUpdate Processing

Upon receipt of a SubmitUpdate request, an ARS server performs the following steps:

  1. If a non-primary ARS server that does not support ars-s receives an update submission (either via a SubmitUpdate or PropagateSubmittedUpdate request), it must reject the request by responding with an ARSError containing ARSErrorCode=223006.
  2. If an ARS server receives an update submission specifying an unsupported name space it must reject the request by responding with an ARSError containing ARSErrorCode=123004.
  3. The server parses the DataWithOps encoding, saves the enclosed documents and their associated CSN's and update operations to temporary stable storage (using temporary identifiers guaranteed not to clash with other concurrently arriving updates), and checks for the following error conditions:
    • access control denial;
    • update request to a document in a zone not served by the current ARS server; and,
    • update request that spans more than one zone.

    Note that the temporary document copies need not be saved in the datastore, and in fact for the sake of efficiency should be saved to a lighter weight storage system such as a journaling file system. (The datastore implements ACID semantics, which is overkill for managing temporary data.)

  4. If no failures occurred during the above checks, the server allocates a new GlobalSubmitID for the UpdateGroup, for the zone within which the submission falls.
  5. At this point the server responds to the client either with an ARSError describing the error that occurred or an ARSAnswer containing the GlobalSubmitID to indicate that the submission has been successfully received. It also saves the OptionalNotificationDest information provided (if any), for use in asynchronous notification once the update has completed.
  6. Now that the update has been completely received, the server enqueues it for commit processing. The server processes elements in this queue one at a time, as follows (a non-normative sketch of this commit loop appears after this list):
    • If the local implementation uses log-based committed update state management, create a temporary list into which document names, operation names and CSN's can be stored.
    • Acquire a zone-wide lock, setting an implementation-specified timeout period that will result in an ARSError containing ARSErrorCode=212002 being sent to the client if the lock is not acquired before the timeout expires. Note that this rough-grain locking is required to implement zone-wide serialization, and can become a source of contention if the operations performed while locking are not implemented efficiently.
    • Allocate a new CSN for this zone.
    • Loop on all DatumAndOp elements within the UpdateGroup and perform the following steps in the order the operations occur in the UpdateGroup:
      • load data and operation from saved temporary state.
      • If no 'csn' attribute is currently set in the document, treat that document as having CSN=1. Doing this allows a datastore not previously replicated by ARS to be replicated without running a special tool to add CSN's.
      • At this point the local implementation may perform write-write conflict detection by comparing the value of the 'CSN' attribute contained in the DatumAndOp against the corresponding value stored in the primary's local datastore. If any of these values differ, the implementation may reject the update by responding with an ARSError containing ARSErrorCode=126001.
      • Update the 'csn' attribute in each document per the assigned CSN, and then perform the needed datastore operation, trapping any errors/exceptions that arise. (Note that datastore lock/release semantics do not make the operation visible until the corresponding release occurs).
      • If the local implementation uses log-based committed update state management, save the document name, datastore operation name, and CSN in the temporary list.
      • Continue to the next operation.

    • If an error/exception arises during the above loop:
      • Release the zone-wide lock, requesting that all contained updates be aborted.
      • If the local implementation uses log-based committed update state management, discard the temporary document name/operation list.
      • Reset the CSN counter so that this CSN will be used for the next commit attempt. (Each CSN must represent a successful update.)
      • Generate an ARSError to be transmitted to the client (if notification was requested).

    • Else, if the local implementation uses log-based committed update state management, append the temporary document name/operation list to the log of all operations performed on the datastore (which is used by the PullCommittedUpdates request; details about this log are discussed later). This log is only written when the zone lock is held, and therefore the log will be serialized in the same update order as applied to the local datastore.
    • Finally, release the zone-wide lock, requesting that all contained updates be committed.

  7. Upon completion (either successful or not), if notification was requested the server performs a SubmittedUpdateResultNotification operation (discussed below).
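
The following non-normative sketch condenses the commit loop of step 6, modeling the datastore, zone-wide lock, CSN counter, and committed update log as simple in-memory structures; all names and the timeout value are illustrative:

import threading

datastore = {}                 # document name -> (content, csn)
committed_log = []             # [(csn, name, action), ...]
zone_lock = threading.Lock()   # stands in for the zone-wide lock/release
next_csn = 1
DETECT_WRITE_WRITE_CONFLICTS = True

def commit_update_group(update_group):
    """update_group: list of (name, content, submitted_csn, action) tuples."""
    global next_csn
    if not zone_lock.acquire(timeout=30):      # LockWaitTimeoutInSecs
        return 'ARSError 212002'               # lock not acquired in time
    csn, temp_list, staged = next_csn, [], dict(datastore)
    for name, content, submitted_csn, action in update_group:
        current_csn = staged.get(name, (None, 1))[1]   # unreplicated doc: CSN=1
        if DETECT_WRITE_WRITE_CONFLICTS and submitted_csn != current_csn:
            zone_lock.release()                # abort; this CSN is reused later
            return 'ARSError 126001'           # write-write conflict
        staged[name] = (None if action == 'delete' else content, csn)
        temp_list.append((csn, name, action))  # for log-based state management
    datastore.clear()
    datastore.update(staged)                   # release(commit): all-or-nothing
    committed_log.extend(temp_list)            # appended only while lock held
    next_csn = csn + 1
    zone_lock.release()
    return csn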

Note: as an optimization for step 3 above, incoming documents can be written directly to the datastore (rather than saving first to temporary storage), and the update simply aborted if an error is detected. However, we recommend against this approach because:

4.2.1.2 SubmittedUpdateResultNotification Processing

SubmittedUpdateResultNotification must be implemented as a timeout-and-retry style of operation, so that if the client is temporarily unreachable the server will retry over a period of time. The number of retries and the timeout period are determined by the local implementation.

For failed updates, the SubmittedUpdateResultNotification contains the ARSError that occurred and the GlobalServerID where the error occurred.

For successful updates, the SubmittedUpdateResultNotification contains an empty ARSError element, as well as the CSN that was assigned by the primary for the given SSN.

4.2.1.3 PushCommittedUpdates Processing

The processing of PushCommittedUpdates requests is implementation-dependent. The downstream server may ignore the request, or may use it to schedule a PullCommittedUpdates request.

4.2.1.4 PullCommittedUpdates Processing

Downstream servers must synchronize PullCommittedUpdates requests so that at most one request/response is in progress for a given zone at any time.

If a server maintains committed update state in a log and a request is received for updates further back in history than are stored in that log, the upstream server responds with an ARSError containing ARSErrorCode=226002. In response the downstream server may either re-issue the same request at a different ARS server, or (if both servers support an appropriate ars-e encoding) request a full zone transfer. Note in particular that if a server performs committed update log truncation it will be unable to support new ARS replicas' requests to join the replication network (since they will need to perform a request for all updates since CSN 0) unless both servers also support an appropriate ars-e encoding. As a consequence, an implementation that does not support ars-e and that wishes to allow new replicas to join over time must not perform committed update log truncation.

The upstream server must lock the requested zone while processing a PullCommittedUpdates request, so that the underlying datastore contents are not changed while the content is being sent (which could result in inconsistent content being transmitted to the downstream server). Since this may cause the zone to be locked for a long time, an alternative implementation would be to lock the zone, make a copy of the documents to be sent, and unlock, before transmitting those documents. Copy-on-write implementations are also possible.

The upstream server must send the UpdateGroup's in increasing order of CSN for that zone.
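
A non-normative sketch of the upstream side of this processing, reusing the (csn, name, op) log records of the earlier sketches and tracking the truncation point as the smallest CSN still held in the log:

# Reject requests that reach back past the start of the (possibly
# truncated) log with ARSError 226002; otherwise return the committed
# operations in increasing CSN order, as required above.
def serve_pull(committed_log, log_start_csn, last_seen_csn):
    if last_seen_csn + 1 < log_start_csn:
        return 'ARSError 226002'       # requested history no longer held
    return sorted((rec for rec in committed_log if rec[0] > last_seen_csn),
                  key=lambda rec: rec[0])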

When a downstream server receives a committed set of UpdateGroup's from an upstream server (in response to the PullCommittedUpdates request) the downstream ARS server:

The downstream server must ignore datastore delete failures in order to function correctly with upstream servers that implement update collapsing.

4.2.1.5 Submitted Update Collapsing for Infrequently Synchronized Peers

If an ARS server performs write-write conflict detection, clients cannot submit two updates in a row to a document without getting a commit response after each submission. That can be an annoying limitation for infrequently synchronized nodes, such as mobile PDAs. To mitigate this problem ARS peers may collapse updates as follows.

If a pending update submission has not yet been propagated up the DAG, the ARS server may choose to replace the pending submission with another update to the same document, reusing the SSN. To maintain the correct submitted update ordering, the SSN's for all updates between the previous and recent submission must be reordered whenever this algorithm is applied, by dropping the original submission, shifting each of the following SSN's back by one, and decrementing the current not-yet-assigned SSN at that ARS server. For example, consider the update submission sequence: a.b.c (SSN 4), a.b.d (SSN 5), a.b.e (SSN 6), a.b.c (SSN 7). If none of these updates has yet been propagated up the DAG, this update sequence can be replaced with the sequence a.b.d (SSN 4), a.b.e (SSN 5), a.b.c (SSN 6), and then reusing SSN 7 for the next update that is submitted.
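
A non-normative sketch of this collapsing step, with pending submissions held as (SSN, document name, update) tuples; running it on the sequence above (with SSN 7 about to be assigned) yields exactly the renumbered sequence described:

# Replace a pending, not-yet-propagated submission to the same document,
# shifting the SSN's of later submissions back by one and decrementing the
# next SSN to be assigned.
def collapse_pending(pending, next_ssn, name, new_update):
    for i, (ssn, n, _) in enumerate(pending):
        if n == name:
            pending = (pending[:i] +
                       [(s - 1, n2, u) for (s, n2, u) in pending[i + 1:]])
            next_ssn -= 1
            break
    return pending + [(next_ssn, name, new_update)], next_ssn + 1

# collapse_pending([(4, 'a.b.c', u1), (5, 'a.b.d', u2), (6, 'a.b.e', u3)],
#                  7, 'a.b.c', u4) yields
# ([(4, 'a.b.d', u2), (5, 'a.b.e', u3), (6, 'a.b.c', u4)], 7).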

4.2.2 ARS Submission-Propagation Protocol (ars-s) Processing

ars-s requires more complex synchronization when performing the ars-c SubmittedUpdateResultNotification operation. Each of its operations is discussed below.

4.2.2.1 Non-Primary SubmitUpdate Processing

If a non-primary ARS server that supports ars-s receives a SubmitUpdate request, it performs the following steps:

  1. Steps 1-5 listed under SubmitUpdate Processing. Note: an SSN should not be used in place of a temporary identifier in step 3 because if a failure occurs during these steps a FailedUpdateSubmission request will have to be propagated upstream (discussed below), adding additional load to all upstream servers and delaying other update submissions until this FailedUpdateSubmission has completed at the primary.
  2. The server generates a PropagateSubmittedUpdate request, consisting of the same content as the received submission, but filling in the GlobalSubmitID attribute with the server's SubmisSvrHost, SubmisSvrPort, SubmisSvrIncarn, and the SSN.
  3. The server attempts to send this PropagateSubmittedUpdate request to each upstream server in turn, until one successfully receives it (a sketch of this retry sequence appears after this list).
  4. If the PropagateSubmittedUpdate request cannot be successfully forwarded to any upstream server, a timer must be set to retry the sequence of upstream servers again later (because of the client-server promise discussed in [1]). The timeout duration and number of attempts is determined by the local implementation.
  5. Once the PropagateSubmittedUpdate transmission has completed, the server saves stable state to indicate that the update has been propagated, so that it can look up this state when the update later completes (successfully or not) and notify the client if notification was requested. The PropagateSubmittedUpdate request must not be transmitted again once it has been successfully received by an upstream server.
  6. If all attempts to send/timeout/re-send the PropagateSubmittedUpdate request upstream fail and notification was requested, the server sends an ARSError containing ARSErrorCode=210001. If all attempts fail the server always generates a PropagateSubmittedUpdate request containing a FailedUpdateSubmission element, which it attempts to send upstream using the same timeout-and-retry logic as noted in step (4), with the exception that it never stops trying until it succeeds. The reason is that the primary must learn of the failed update submission, else all future submissions from the submission server will fail because of the requirement to serialize updates by SSN (see below).
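
A non-normative sketch of the propagation loop of steps 3-4; each upstream entry is a (preference weight, send function) pair, where the send function stands in for the real BEEP exchange and returns True once the server has accepted the request:

import time

def propagate_upstream(request, upstream, max_attempts, retry_period_secs):
    ordered = [send for _, send in sorted(upstream, key=lambda t: t[0])]
    for attempt in range(max_attempts):
        for send in ordered:           # try upstream servers in turn
            if send(request):
                return True            # responsibility now rests upstream
        time.sleep(retry_period_secs)  # client-server promise: retry later
    return False                       # caller reports ARSError 210001 and
                                       # must still propagate a
                                       # FailedUpdateSubmission (step 6)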

4.2.2.2 Non-Primary PropagateSubmittedUpdate Processing

If a non-primary ARS server that supports ars-s receives a PropagateSubmittedUpdate request (which came either from a non-primary that received a SubmitUpdate request and generated a corresponding PropagateSubmittedUpdate request, or from a server propagating a PropagateSubmittedUpdate request it received), it does the following:

  1. If the request contains a "FailedUpdateSubmission" element, it responds to the downstream server with an ARSAnswer containing the GlobalSubmitID to indicate that it has successfully received the request. It then attempts to send this PropagateSubmittedUpdate request to each upstream server in turn, until one successfully receives it. Note that at this point the responsibility for completing the FailedUpdateSubmission transmission has passed from the previous server to the current server, so the current server must retry transmitting the request indefinitely until an upstream server has accepted it.
  2. Otherwise, the server performs steps 1-6 listed under Non-Primary SubmitUpdate Processing, with four changes:
    • In addition to the other checks performed during step 1 (more specifically, during step 3 of SubmitUpdate Processing), it checks whether the GlobalSubmitID duplicates one already seen (a sketch of this duplicate check appears below). This check is done in two places:
      1. During step 3 of SubmitUpdate Processing a check is made that the SSN contained within the given GlobalSubmitID is greater than the SSN of the last successfully committed update for the given zone & submission GlobalServerID; and,
      2. After step 3 of SubmitUpdate Processing a check is made that the given GlobalSubmitID is not currently being processed (which could happen if duplicate submissions arrive so close together that one has started processing and not yet completed).

      This check ensures that:

      • DAG cycles (caused by configuration errors) cannot result in infinite loops or deadlocks; and,
      • PropagateSubmittedUpdate operations are idempotent, which provides greater resilience in dealing with partitions.

    • It rewrites the notification destination (NotifyHost and NotifyPort) with its own host and port, so that the SubmittedUpdateResultNotification from the upstream server to which the submission was propagated will be sent to the current server.
    • Upon completion (successful or not) the server sends a SubmittedUpdateResultNotification to the server from which the PropagateSubmittedUpdate request was received. See also the discussion of SubmittedUpdateResultNotification Synchronization.
    • If all attempts to send the PropagateSubmittedUpdate request to upstream servers fail (step 6 of Non-Primary SubmitUpdate Processing) the server sends the appropriate ARSError to the downstream server from which the PropagateSubmittedUpdate request was received, but does not generate the PropagateSubmittedUpdate request containing a FailedUpdateSubmission element. That request must be generated by the submission server.

Note that each server in the submission path assumes responsibility for the client-server promise (see [1]) as the update submission is passed up the DAG. This promise allows an ARS server never to retransmit a submission once it has been accepted by an upstream server (step 5 under "Non-Primary SubmitUpdate Processing").
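
A non-normative sketch of the duplicate-submission check described above, with the two state tables held in memory for illustration:

last_committed_ssn = {}   # (zone, submitter GlobalServerID) -> SSN
in_progress_ids = set()   # GlobalSubmitID's currently being processed

def accept_propagated(zone, submitter_id, global_submit_id, ssn):
    if ssn <= last_committed_ssn.get((zone, submitter_id), 0):
        return False              # duplicate of an already committed update
    if global_submit_id in in_progress_ids:
        return False              # duplicate arriving while first in flight
    in_progress_ids.add(global_submit_id)
    return True

def submission_completed(zone, submitter_id, global_submit_id, ssn):
    in_progress_ids.discard(global_submit_id)
    last_committed_ssn[(zone, submitter_id)] = ssn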

4.2.2.3 Primary PropagateSubmittedUpdate Processing

If a primary ARS server that supports ars-s receives a PropagateSubmittedUpdate request, it performs the following steps:

4.2.2.4 PushCommittedUpdates and PullCommittedUpdates Scheduling

ARS does not specify how PushCommittedUpdates and PullCommittedUpdates operations are to be scheduled. As a local implementation matter, ARS servers may schedule PushCommittedUpdates and PullCommittedUpdates operations a variety of different ways, perhaps offering configuration options that can support any/all of the following:

  1. Periodic PullCommittedUpdates requests.
  2. PushCommittedUpdates requests down the submission path immediately following any update that was propagated up that path, to minimize committed update propagation latency back down to the submitting client.
  3. PushCommittedUpdates requests down to other servers immediately following an update, to minimize committed update propagation latency for servers that need to keep in close synchronization.
  4. PullCommittedUpdates requests only upon new replica join, server re-boot, mobile device reconnection, or partition repair, to "catch up".

If a server implements (2) and/or (3) above, care should be taken to prevent backlogging the downstream server with many PushCommittedUpdates requests. For example, if the primary is experiencing high update rates and performs a PushCommittedUpdates each time it completes an update, it may not be possible to process the ensuing PullCommittedUpdates requests that the downstream server(s) make as fast as new PushCommittedUpdates requests are being made. This can create excess network traffic and lock contention at the primary, at precisely the worst time. To avoid this problem, the following algorithm (reminiscent of delayed acknowledgements and Nagle's algorithm used by TCP [10]) should be used:

With this approach, updates can be propagated when they complete, but during times of high update submission load many PushCommittedUpdates operations will be batched together.
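
A non-normative sketch of one plausible batching scheme, assuming a simple fixed delay so that bursts of commits collapse into a single PushCommittedUpdates (the delay value and threading scheme are illustrative):

import threading
import time

BATCH_DELAY_SECS = 2.0          # illustrative batching delay
_push_pending = False
_push_lock = threading.Lock()

def on_update_committed(send_push):
    """Call whenever an update commits; send_push performs one
    PushCommittedUpdates request."""
    global _push_pending
    with _push_lock:
        if _push_pending:
            return              # a push is already scheduled; batch with it
        _push_pending = True

    def fire():
        global _push_pending
        time.sleep(BATCH_DELAY_SECS)
        with _push_lock:
            _push_pending = False
        send_push()             # one push covers the whole burst of commits
    threading.Thread(target=fire, daemon=True).start()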

4.2.2.5 PullCommittedUpdates Synchronization

The downstream server must perform synchronization to ensure that at most one PullCommittedUpdates request can be running at a time for a given zone. For example, a server configured with two different upstream servers for a zone must not run concurrent PullCommittedUpdates requests from the two upstream servers. (This synchronization requirement is one reason why PushCommittedUpdates is simply a suggestion for a PullCommittedUpdates request to be performed. If PushCommittedUpdates actually transmitted data, it would be difficult to synchronize because the PushCommittedUpdates and PullCommittedUpdates data transfers would be initiated by different servers. Instead, the downstream server controls the scheduling of committed update transmissions.) This is important not only because concurrent PullCommittedUpdates data transfers for the same zone would waste network traffic and server load, but also because this concurrency could result in incorrect committed state. For example, consider the sequence:

4.2.2.6 SubmittedUpdateResultNotification Synchronization

The scheduling of SubmittedUpdateResultNotification requests to a downstream server is complicated by two factors:

  1. Because the response to a PullCommittedUpdates request can contain more than one UpdateGroup, receipt of a PullCommittedUpdates request in a server that supports ars-s may trigger multiple SubmittedUpdateResultNotification's to be generated to downstream servers and/or clients.
  2. It is possible that the committed update content for an update reaches the downstream server before the SubmittedUpdateResultNotification from its upstream server reaches that downstream server.

To illustrate the second case above, consider the following replication topology:


                              svr3
                            (primary)
                             |    |
                            \|/  \|/
                           svr2  svr4
                             |    |
                            \|/  \|/
                              svr1

Given this topology, consider the following event ordering sequence:


              1. client->svr1: SubmitUpdate
              2. svr1->svr2: PropagateSubmittedUpdate
              3. link between svr1 and svr2 goes down
              4. svr2->svr3: PropagateSubmittedUpdate
              5. svr3->svr2: SubmittedUpdateResultNotification
              6. svr3->svr2: PushCommittedUpdates
              7. svr2->svr3: PullCommittedUpdates
              8. svr3->svr4: PushCommittedUpdates
              9. svr4->svr3: PullCommittedUpdates
             10. svr4->svr1: PushCommittedUpdates
             11. svr1->svr4: PullCommittedUpdates
             12. link between svr1 and svr2 comes back up
             13. svr2->svr1: SubmittedUpdateResultNotification
             14. svr2->svr1: PushCommittedUpdates
             15. svr1->svr2: PullCommittedUpdates
             16. svr1->client: SubmittedUpdateResultNotification

Because the link between svr1 and svr2 goes down after the submitted update has been propagated, the committed update content reaches svr1 via an alternate path through the DAG (svr3->svr4->svr1, completing in event 11) before the SubmittedUpdateResultNotification reaches it (event 13).

Because of these complications, SubmittedUpdateResultNotification (as well as scheduling of PushCommittedUpdates operations to propagate the newly arrived committed content downstream) should be triggered as follows:

Implementations should not attempt to simplify the synchronization requirements here by forcing the SubmittedUpdateResultNotification to complete before the committed content propagates, because doing so could mean that a single unavailable downstream server would hold up transmissions of committed updates to all servers in the network.

4.2.2.7 Submitted Update Reordering Details

Non-primary servers must not hold-and-re-order update submissions. They simply forward all updates up the DAG, and the primary performs any needed re-ordering. Non-primary servers need not hold-and-re-order committed updates coming back down the DAG, because all ARS servers are required to send committed updates in order and without gaps in the numbering sequence since the requested CSN.

The following figure provides an example of the dynamics that can result from the update submission re-ordering/time-out mechanism.


                       primary:E2 (E4, E5 queued)
                            / \
                           /   \
                         \|/   \|/
                        repA   repB
                        /  \
                       ~    \
                     \|/    \|/
                  repC:E3   repD:E4,E5
                      ~      /
                      |     /
                     \|/  \|/
                       repE

In this figure, the last update to be serialized by the primary from replica server E is E's SSN number 2 (denoted by "primary:E2"). After E2 committed, E propagated three more update submissions: it propagated SSN number 3 to replica server C, and SSN's 4 and 5 to replica server D. Replica server C became partitioned from the network after it accepted submission E3 (indicated by the tildes in the figure), but SSN's 4 and 5 made it to the primary via repD->repA->primary. Because it has not yet seen E3, the primary queues E4 and E5, waiting for E's SSN 3 to arrive. If replica server C stays partitioned for a long time, the primary will time out SSN's 4 and 5 (sending an ARSError for each containing ARSErrorCode=127001). Replica server C might then repair its partition and propagate E3 upstream, at which point the primary will serialize and pass the corresponding committed update back down. Replica server E could therefore see E4 and E5 fail and then see E3 succeed. It is up to E to decide whether and when to resubmit the failed submissions. Possibilities include:

Submission re-ordering is not performed in the downstream direction. Instead, updates are only propagated in the downstream direction in CSN order. Note that this happens naturally because the primary generates updates and sends them in CSN order, and its downstream servers likewise send updates only up through the last CSN they have seen, so all ARS servers will always see updates in complete CSN order.
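
A non-normative sketch of the primary's hold-and-re-order logic, with per-submitter state held in memory and the timeout handling (ARSError 127001) elided:

# Queue update submissions whose SSN is ahead of the next expected SSN for
# a given submission server, and commit in SSN order as gaps fill in.
expected_ssn = {}   # submitter GlobalServerID -> next SSN to serialize
queued = {}         # submitter GlobalServerID -> {ssn: update}

def on_submission(submitter, ssn, update, commit):
    expected = expected_ssn.setdefault(submitter, 1)
    pending = queued.setdefault(submitter, {})
    pending[ssn] = update
    while expected in pending:          # serialize all in-order submissions
        commit(pending.pop(expected))
        expected += 1
    expected_ssn[submitter] = expected  # later SSN's wait for the gap to fill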

4.2.3 ARS Encoding Negotiation Protocol (ars-e) Processing

In response to a ContentEncodingNegotiation request, the responder makes a zone-specific decision (e.g., different zones can have different underlying databases, supporting correspondingly different proprietary encoding formats). The local implementation may also consider other issues (e.g., using the source IP address to decide whether encryption is allowed under a country's export control restrictions).

Encoding negotiation results may be cached as long as a BEEP channel is open to the remote server. Thus, to change the set of encodings it supports a server must first close any open channels.

If a server that does not support the ars-e protocol receives a ContentEncodingNegotiation request, it responds with an ARSError containing ARSErrorCode=223005.

After receiving a response to the ContentEncodingNegotiation request, the initiator should check that the returned set is indeed a subset of the encodings it offered.

A request specifying LastSeenCSN='0' indicates that the entire zone is to be transferred. This case may be used by the upstream server to trigger a special full-zone encoding, if ars-e is supported by both servers.

The basic algorithm used for 'plumbing' into a content encoding is to define an API which the encoding can upcall to save documents to their stable (temporary) storage, passing the document name, content, CSN, and operation to be performed. On the reverse side (sending a set of documents from stable storage out through an encoding), the encoding upcalls to get a list of document names needing transmission, and then upcalls to get the document data content for each. The encoding can then perform whatever transformations are needed on the way to/from stable storage. Importantly, the whole process must be implemented as a pipeline so as not to assume an entire update will fit in memory -- as documents arrive they should be saved to stable storage, and they should be read back as they are sent.
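
A non-normative sketch of this upcall API, using Python callables for the upcalls so that both directions stream one document at a time:

# Receiving side: the encoding layer upcalls save() for each document as it
# is decoded, so an entire update never needs to fit in memory.
def receive_documents(decoded_stream, save):
    for name, content, csn, op in decoded_stream:   # documents as they arrive
        save(name, content, csn, op)                # straight to stable storage

# Sending side: the encoding upcalls to enumerate names and then fetches
# each document only as it is transmitted.
def send_documents(list_names, get_document, transmit):
    for name in list_names():
        transmit(get_document(name))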

4.3 Example State Transition Diagrams

The state transitions needed to implement ARS will depend on which subset of ARS sub-protocols is implemented, and what scheduling and synchronization mechanisms are implemented. The following three figures provide a set of state transition diagrams that could be used to implement all three ARS sub-protocols (ars-c, ars-s, and ars-e), with support for PushCommittedUpdates requests down the submission path immediately following an update that was propagated up that path. In these state diagrams the paths running straight down represent the transitions taken when the current state completes successfully, while the paths to the left represent the transitions when a failure occurs.

The first state transition diagram can be used for handling submitted updates arriving at a non-primary from a client (via SubmitUpdate) or from a downstream ARS server (via PropagateSubmittedUpdate):


                        |
                       \|/
              /<---Incomplete
             /          |
            /          \|/
           /    /<-CompletelyReceived
          /    /        |
         /    /        \|/
        /    /<-PropagatingUpstream<------\ timeout + retrans
       |     |          |          \______/ N times, then infinite
       | f   |         \|/                 FailedSubmissionPropagation
       | a   |<--AwaitingCommitNotif
       | i   |          |
       | l   |         \|/
       | u   |<--AwaitingLocalCommit
       | r   |          |
       | e   |         \|/
       | s   |<-KickingDownstreamScheds
       |     |          |
       |      \        \|/
       \       \-->NotifyingSubmitter<------\ timeout + retrans
        \               |            \______/ N times
         \             \|/
          \------->CleaningUp
                        |
                       \|/
                      Done

The second state transition diagram can be used for handling submitted updates arriving at the primary from a client or from a downstream ARS server:


                        |
                       \|/
              /<---Incomplete
             /          |
            /          \|/
           /    /<-CompletelyReceived
          /    /        |
         /    /        \|/
        /    /<-QueuedForReordering
       |     |          |
       | f   |         \|/
       | a   |<---WaitingForLock
       | i   |          |
       | l   |         \|/
       | u   |<--ApplyingToLocalDatastore
       | r   |          |
       | e   |         \|/
       | s   |<-KickingDownstreamScheds
       |     |          |
       |      \        \|/
       \       \-->NotifyingSubmitter<------\ timeout + retrans
        \               |            \______/ N times
         \             \|/
          \------->CleaningUp
                        |
                       \|/
                      Done

The third state transition diagram can be used for handling committed updates arriving at a non-primary from an upstream ARS server:


                       |
                      \|/
                  /Incomplete
                 /     |
                /     \|/
               /<-CompletelyReceived
          f   /        |
          a  /        \|/
          i /<---WaitingForLock
          l |          |
          u |         \|/
          r |<--ApplyingToLocalDatastore
          e |          |
          s  \        \|/
              \--->CleaningUp
                       |
                      \|/
                     Done

The AwaitingCommitNotif state is used to represent the case where a submitted update has been propagated upstream and the local server has not yet received notification that the update has committed (along with the CSN). The AwaitingLocalCommit state is used to represent the case where the commit notification has been received but the committed update content has not yet been propagated back down the DAG to the local server. (See the discussion of SubmittedUpdateResultNotification Processing.)

The KickingDownstreamScheds state is used to represent the case where PushCommittedUpdates operations are scheduled to run periodically at the upstream server, and when a new update arrives the schedules need to be changed such that a PushCommittedUpdates runs immediately and then the normal schedule period is re-started. Again, this is needed for the case of an implementation that performs PushCommittedUpdates requests down the submission path immediately following an update that was propagated up that path.




5. Security Considerations

See [1]'s Section 10 for a discussion of ARS security issues.




References

[1] Schwartz, M., "The ANTACID Replication Service: Rationale and Architecture", draft-schwartz-antacid-service-00 (work in progress), October 2001.
[2] World Wide Web Consortium, "Extensible Markup Language (XML) 1.0", W3C XML, February 1998.
[3] Rose, M., "The Blocks Extensible Exchange Protocol Framework", draft-mrose-blocks-protocol-04 (work in progress), May 2000.
[4] Rose, M., Gazzetta, M. and M. Schwartz, "The Blocks Datastore Model", Draft Technical Memo, January 2001.
[5] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform Resource Identifiers (URI): Generic Syntax", RFC 2396, August 1998.
[6] Reynolds, J., "Post Office Protocol", RFC 918, October 1984.
[7] Postel, J., "Simple Mail Transfer Protocol", RFC 788, November 1981.
[8] Mockapetris, P., "Domain names - concepts and facilities", RFC 1034, STD 13, November 1987.
[9] Lamport, L., "Time, Clocks, and the Ordering of Events in a Distributed System", Communications of the ACM Vol. 21, No. 7, July 1978.
[10] Stevens, W., "TCP/IP Illustrated, Volume 1 - The Protocols", Addison-Wesley Professional Computing Series , 1994.
[11] Borenstein, N. and N. Freed, "MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies", RFC 1521, September 1993.



Author's Address

  Michael F. Schwartz
  Code On The Road, LLC
  EMail:  schwartz@CodeOnTheRoad.com
  URI:  http://www.CodeOnTheRoad.com



Appendix A. Acknowledgements

The author would like to thank the following people for their reviews of this specification: Marco Gazzetta, Carl Malamud, Darren New, and Marshall Rose.




Appendix B. Future Enhancements and Investigations

A possible future enhancement to the protocol and implementation would be to use an attribute that specifies payload length for update content. By doing this, an implementation could copy the payload directly to stable storage instead of first parsing it. This could provide a significant performance improvement, and would also allow the update content to be saved in exactly the format it was sent in (as opposed to the rewriting/reindenting/etc. that happen when XML content is parsed and then output).

Another possible future enhancement to the protocol and implementation would be to allow serialization-only primary servers, whose only job is to serialize update submissions and distribute the work for applying and propagating serialized updates among the first tier of zone replica servers. That would offload query and update processing from the inherently centralized serialization server.

A possible future enhancement to the protocol would be to allow a replication topology containing cycles, rather than requiring DAGs. This generalization would allow more resilience to network partitions with fewer servers than DAGs. For example, consider the following cyclic replication topology:


                 s1
                 | \
                 |  \ 
                s2---s3

In this figure updates can flow to s2 if s3 is down and vice versa. With ARS's DAG-based topology an additional server would be required to achieve the same level of redundancy:


               s1------>s4
               | \      /
               |  \    /
               |    \ /
               |    / \
               |   /   \
              \|/\|/   \|/
                s2----->s3

Another possible future enhancement to the protocol would be to allow batching of submitted updates before propagating up the DAG.

An area for further work is defining SNMP-based monitoring/management interfaces.

An area for further work is automating the approach to laying out replication topology.




Appendix C. ANTACID Replication Service Registration

Profile Identification:
http://xml.resource.org/profiles/ARS
Messages exchanged during Channel Creation:
none
Messages in "REQ" frames:
"ARSRequest"
Messages in positive "RSP" frames:
"ARSResponse"
Messages in negative "RSP" frames:
"ARSError"
Message Syntax:
c.f., ARS Top-Level DTD, ars-c DTD, ars-s DTD, and ars-e DTD.
Message Semantics:
c.f., ARS Message Semantics




Appendix D. ARS Top-Level DTD



<!--
  Top-level (implementation choices) for ANTACID Replication
  Service, as of 2001-10-07.

  Copyright 2001 Code On The Road, LLC.
  -->




<!-- Entity declarations for ARS sub-protocols -->
    <!ENTITY % ARSC PUBLIC "-//IETF//DTD ARSC//EN" "">
    <!ENTITY % ARSE PUBLIC "-//IETF//DTD ARSE//EN" "">
    <!ENTITY % ARSS PUBLIC "-//IETF//DTD ARSS//EN" "">





<!-- Implementations supporting only the Commit-and-Propagate Protocol
     (ars-c) use the following -->
<!ENTITY % ARSREQUESTS '
	   SubmitUpdate                       |
	   SubmittedUpdateResultNotification  |
	   PushCommittedUpdates               |
	   PullCommittedUpdates
'>
<!ELEMENT UpdateGroup                (DataWithOps)>
%ARSC;





<!-- Implementations supporting Commit-and-Propagate Protocol
     (ars-c) and Submission-Propagation Protocol (ars-s) use the
     following -->
<!ENTITY % ARSREQUESTS '
	   SubmitUpdate                       |
	   SubmittedUpdateResultNotification  |
	   PropagateSubmittedUpdate           |
	   PushCommittedUpdates               |
	   PullCommittedUpdates
'>
<!ELEMENT UpdateGroup                (DataWithOps)>
%ARSC;
%ARSS;





<!-- Implementations supporting Commit-and-Propagate Protocol
     (ars-c) and Encoding Negotiation Protocol (ars-e) use the
     following -->
<!ENTITY % ARSREQUESTS '
	   SubmitUpdate                       |
	   SubmittedUpdateResultNotification  |
	   PushCommittedUpdates               |
	   PullCommittedUpdates               |
	   ContentEncodingNegotiation
'>
%ARSC;
%ARSE;





<!-- Implementations supporting Commit-and-Propagate Protocol
     (ars-c), Submission-Propagation Protocol (ars-s), and Encoding
     Negotiation Protocol (ars-e) use the following -->
<!ENTITY % ARSREQUESTS '
	   SubmitUpdate                       |
	   SubmittedUpdateResultNotification  |
	   PropagateSubmittedUpdate           |
	   PushCommittedUpdates               |
	   PullCommittedUpdates               |
	   ContentEncodingNegotiation
'>
%ARSC;
%ARSS;
%ARSE;





Appendix E. ars-c DTD



<!--
  DTD for ANTACID Replication Service Commit-and-Propagate Protocol
  (ars-c), as of 2001-10-07.

  Copyright 2001 Code On The Road, LLC.

  This document is a DTD and is in full conformance with all
  provisions of Section 10 of RFC2026 except that the right to
  produce derivative works is not granted.


  Refer to this DTD as:

    <!ENTITY % ARSC PUBLIC "-//IETF//DTD ARSC//EN" "">
    %ARSC;
  -->





<!--
  Contents

    DTD inclusion

    Data types

    Entities

    ARS messages
          The SubmitUpdate operation
          The SubmittedUpdateResultNotification operation
          The PushCommittedUpdates operation
          The PullCommittedUpdates operation

  -->







<!--
  DTD inclusion

  Caller should already have included the BEEP Channel Management DTD.  
  -->




<!--
  Data types:

        entity   syntax/reference                     example
        ======   ================                     =======
    names
        DNSNAME    ([A-Za-z][-A-Za-z0-9]*)              a3.example.com
                       ("." ([A-Za-z][-A-Za-z0-9]*))*
                   c.f. [RFC-1034]
        DNIP       ([A-Za-z][-A-Za-z0-9]*)              a3.example.com
                       ("." ([A-Za-z][-A-Za-z0-9]*))*      - or -
                   *OR* [0-9]+.[0-9]+.[0-9]+.[0-9]+     204.62.247.64
                   c.f. [RFC-1034]
        DOCURI     ([A-Za-z][-+.A-Za-z0-9]*):           blocks:doc.rfc.2629
                       ([A-Za-z0-9][-_A-Za-z0-9]*)
                           ("." ([A-Za-z0-9][-_A-Za-z0-9]*))*
                   c.f., [RFC-2396]

    integers
        UINT16     0..65535                             42
        UINT32     0..4294967295                        17
        UINT64     0..1.8447x10^19                      329412431233

    multiline character data
        TEXT
  -->


<!ENTITY % DNSNAME           "NMTOKEN">
<!ENTITY % DNIP              "NMTOKEN">
<!ENTITY % DOCURI            "NMTOKEN">
<!ENTITY % UINT16            "CDATA">
<!ENTITY % UINT32            "CDATA">
<!ENTITY % UINT64            "CDATA">
<!ENTITY % TEXT              "#PCDATA">




<!--
  Entities

    entity                   use
    ======                   ===
    GlobalSubmitID           globally unique-for-all-time identifier
                             of an update submission.
    OptionalNotificationDest hostname/IP address + port number where
                             notifications are to be sent (for use
                             in contexts where optional/implied)
    RequiredNotificationDest hostname/IP address + port number where
                             notifications are to be sent (for use
                             in contexts where required)
    UpdateOpNames            string names of update operations
                             supported

  -->

<!ENTITY % GlobalSubmitID '
           SubmisSvrHost              (%DNSNAME;)            #REQUIRED
           SubmisSvrPortNum           (%UINT16;)             #REQUIRED
           SubmisSvrIncarn            (%UINT64;)             #REQUIRED
           SSN                        (%UINT64;)             #REQUIRED
'>

<!ENTITY % OptionalNotificationDest '
           NotifyHost                 (%DNIP;)               #IMPLIED
           NotifyPort                 (%UINT16;)             #IMPLIED
'>

<!ENTITY % RequiredNotificationDest '
           NotifyHost                 (%DNIP;)               #REQUIRED
           NotifyPort                 (%UINT16;)             #REQUIRED
'>

<!ENTITY % UpdateOpNames              "(create|write|update|delete|noop)">




<!--
  ARS messages

     role           REQ               RSP   
     ====           ===               ===
      I             ARSRequest        ARSResponse
                                          +: ARSAnswer
                                          -: ARSError

  The following defines the ELEMENT's that are returned within the
  ARSAnswer part of successful ARSResponse messages, to show which
  type of ARSResponse MUST be paired with each type of ARSRequest:

     ARSRequest                          ELEMENT within ARSAnswer
     ==========                          ========================
     SubmitUpdate                        GlobalSubmitID
     SubmittedUpdateResultNotification   (none)
     PushCommittedUpdates                (none)
     PullCommittedUpdates                (UpdateGroup)*
 
  -->


<!ELEMENT ARSRequest                 (%ARSREQUESTS;)>
<!ATTLIST ARSRequest
          ReqNum                     (%UINT32;)              #REQUIRED
>

<!ELEMENT ARSResponse                (ARSAnswer|ARSError)>
<!ATTLIST ARSResponse 
          ReqNum                     (%UINT32;)              #REQUIRED
>

<!ELEMENT ARSAnswer                  (
                                      (GlobalSubmitID |
                                      (UpdateGroup)*)?
                                     )
>

<!ELEMENT GlobalSubmitID             EMPTY>
<!ATTLIST GlobalSubmitID
          %GlobalSubmitID;
>

<!ELEMENT ARSError                   (
                                      ARSErrorCode,
                                      ARSErrorText,
                                      ARSErrorSpecificsText?
                                     )
>
<!ATTLIST ARSError
          OccurredAtSvrHost          (%DNIP;)               #REQUIRED
          OccurredAtSvrPortNum       (%UINT16;)             #REQUIRED
          OccurredAtSvrIncarn        (%UINT64;)             #REQUIRED
>
<!ELEMENT ARSErrorCode               (%UINT32;)>
<!ELEMENT ARSErrorText               (%TEXT;)>
<!ELEMENT ARSErrorSpecificsText      (%TEXT;)>

<!ELEMENT SubmitUpdate               (UpdateGroup)>
<!ATTLIST SubmitUpdate
          %OptionalNotificationDest;
          NotifyOkOnCurrentChannel   (yes|no)                #IMPLIED
>

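<!--
  Non-normative example (hypothetical host names, port numbers, and
  incarnation number): a SubmitUpdate exchange.  The ARSAnswer of the
  successful response carries the GlobalSubmitID assigned by the
  submission server:

    <ARSRequest ReqNum="1">
      <SubmitUpdate NotifyHost="client.example.com" NotifyPort="10289">
        <UpdateGroup> ... </UpdateGroup>
      </SubmitUpdate>
    </ARSRequest>

    <ARSResponse ReqNum="1">
      <ARSAnswer>
        <GlobalSubmitID SubmisSvrHost="svr1.example.com"
                        SubmisSvrPortNum="10288"
                        SubmisSvrIncarn="1002536871"
                        SSN="1"/>
      </ARSAnswer>
    </ARSResponse>
  -->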
<!ELEMENT SubmittedUpdateResultNotification (ARSError?)>
<!ATTLIST SubmittedUpdateResultNotification
          %GlobalSubmitID;
          CSN                        (%UINT64;)              #REQUIRED
          ZoneTopNodeName            (%DOCURI;)              #REQUIRED
>
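
<!--
  Non-normative example (hypothetical values): a notification reporting
  that the submission identified by the GlobalSubmitID attributes has
  committed at the primary with CSN 2.  The element content is empty
  because no ARSError occurred:

    <ARSRequest ReqNum="2">
      <SubmittedUpdateResultNotification
          SubmisSvrHost="svr1.example.com"
          SubmisSvrPortNum="10288"
          SubmisSvrIncarn="1002536871"
          SSN="1"
          CSN="2"
          ZoneTopNodeName="com.example.finance"/>
    </ARSRequest>
  -->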


<!ELEMENT PushCommittedUpdates       EMPTY>
<!ATTLIST PushCommittedUpdates
          UpstreamHost               (%DNIP;)                #REQUIRED
          UpstreamPortNum            (%UINT16;)              #REQUIRED
>

<!ELEMENT PullCommittedUpdates       (ReplState)>

<!ELEMENT ReplState
          (TopNodeOfZoneToReplicate,LastSeenCSN)+
>

<!ELEMENT TopNodeOfZoneToReplicate   (%DOCURI;)>

<!ELEMENT LastSeenCSN                (%UINT64;)>
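
<!--
  Non-normative example (hypothetical zone name): a downstream server
  pulls all committed updates it has not yet seen for one zone, having
  last seen CSN 17:

    <ARSRequest ReqNum="3">
      <PullCommittedUpdates>
        <ReplState>
          <TopNodeOfZoneToReplicate>com.example.finance</TopNodeOfZoneToReplicate>
          <LastSeenCSN>17</LastSeenCSN>
        </ReplState>
      </PullCommittedUpdates>
    </ARSRequest>
  -->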

<!-- 
     ContentEncodingName: DataWithOps
     This encoding can be used for submitted and committed updates.
     Associated with each data unit is a store (update, delete, etc.)
     operation and attributes for the data unit's name and CSN.  This
     encoding MUST be supported by all ARS clients and servers (and is
     the encoding used for servers that do not support the ars-e
     protocol).
  -->
<!ELEMENT DataWithOps                (DatumAndOp*)>

<!-- ANY contains a single data unit's content, meeting the requirements
     defined in datastore.dtd
  -->
<!ELEMENT DatumAndOp                 ANY>
<!ATTLIST DatumAndOp
          Name                       (%DOCURI;)              #REQUIRED
          CSN                        (%UINT64;)              #REQUIRED
          Action                     (%UpdateOpNames;)       #REQUIRED
>
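
<!--
  Non-normative example (hypothetical document name and content): a
  DataWithOps group carrying a single committed write with CSN 18:

    <DataWithOps>
      <DatumAndOp Name="com.example.finance.payroll"
                  CSN="18"
                  Action="write">
        <payroll> ... </payroll>
      </DatumAndOp>
    </DataWithOps>
  -->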




 TOC 

Appendix F. ars-s DTD



<!--
  DTD for ANTACID Replication Service ARS Submission-Propagation
  Protocol (ars-s), as of 2001-10-07.

  Copyright 2001 Code On The Road, LLC.

  This document is a DTD and is in full conformance with all
  provisions of Section 10 of RFC2026 except that the right to
  produce derivative works is not granted.


  Refer to this DTD as:

    <!ENTITY % ARSS PUBLIC "-//IETF//DTD ARSS//EN" "">
    %ARSS;
  -->





<!--
  Contents

    ARS messages
	  The PropagateSubmittedUpdate operation

  -->



<!--
  ARS messages

  The following defines the ELEMENT's that are returned within the
  ARSAnswer part of successful ARSResponse messages, to show which
  type of ARSResponse MUST be paired with each type of ARSRequest:

     ARSRequest                          ELEMENT within ARSAnswer
     ==========                          ========================
     PropagateSubmittedUpdate            (none)
 
  -->


<!ELEMENT PropagateSubmittedUpdate   (
				      FailedUpdateSubmission |
				      UpdateGroup
				     )
>
<!ATTLIST PropagateSubmittedUpdate
          %GlobalSubmitID;
          %RequiredNotificationDest;
>

<!ELEMENT FailedUpdateSubmission     EMPTY>
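
<!--
  Non-normative example (hypothetical values): a server propagates a
  locally submitted update upstream, identifying the submission it
  assigned SSN 1 and asking that the commit result be reported back to
  it:

    <ARSRequest ReqNum="4">
      <PropagateSubmittedUpdate
          SubmisSvrHost="svr1.example.com"
          SubmisSvrPortNum="10288"
          SubmisSvrIncarn="1002536871"
          SSN="1"
          NotifyHost="svr1.example.com"
          NotifyPort="10288">
        <UpdateGroup> ... </UpdateGroup>
      </PropagateSubmittedUpdate>
    </ARSRequest>
  -->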




 TOC 

Appendix G. ars-e DTD



<!--
  DTD for ANTACID Replication Service Encoding Negotiation Protocol
  (ars-e), as of 2001-10-07.

  Copyright 2001 Code On The Road, LLC.

  This document is a DTD and is in full conformance with all
  provisions of Section 10 of RFC2026 except that the right to
  produce derivative works is not granted.


  Refer to this DTD as:

    <!ENTITY % ARSE PUBLIC "-//IETF//DTD ARSE//EN" "">
    %ARSE;
  -->





<!--
  Contents

    DTD inclusion

    Data types

    ARS messages
          The ContentEncodingNegotiation operation

  -->





<!--
  DTD inclusion

  Caller should already have included the defined encodings.
  -->




<!--
  Data types:

        entity         syntax/reference           example
        ======         ================           =======
    names
        ENCODINGNAME   NMTOKEN                    unixFileStore.tar.gz
  -->

<!ENTITY % ENCODINGNAME "NMTOKEN">



<!--
  ARS messages

  The following defines the ELEMENT's that are returned within the
  ARSAnswer part of successful ARSResponse messages, to show which
  type of ARSResponse MUST be paired with each type of ARSRequest:

     ARSRequest                          ELEMENT within ARSAnswer
     ==========                          ========================
     ContentEncodingNegotiation          ContentEncodingsSupported
 
  -->



<!ELEMENT ContentEncodingsSupported  (ContentEncodingName+)>

<!ELEMENT ContentEncodingName        (%ENCODINGNAME;)>

<!ELEMENT ContentEncodingNegotiation (ContentEncodingsSupported)>
<!ATTLIST ContentEncodingNegotiation
          ZoneTopNodeName            (%DOCURI;)              #REQUIRED
>
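
<!--
  Non-normative example (hypothetical zone name): the requester offers
  the encodings it supports for a zone, and the peer answers with the
  encodings it supports in turn:

    <ARSRequest ReqNum="5">
      <ContentEncodingNegotiation ZoneTopNodeName="com.example.finance">
        <ContentEncodingsSupported>
          <ContentEncodingName>DataWithOps</ContentEncodingName>
          <ContentEncodingName>EllipsisNotation</ContentEncodingName>
        </ContentEncodingsSupported>
      </ContentEncodingNegotiation>
    </ARSRequest>

    <ARSResponse ReqNum="5">
      <ARSAnswer>
        <ContentEncodingsSupported>
          <ContentEncodingName>DataWithOps</ContentEncodingName>
        </ContentEncodingsSupported>
      </ARSAnswer>
    </ARSResponse>
  -->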




 TOC 

Appendix H. ARS Topology Configuration DTD


<!--
  DTD for a configuration file specifying characteristics of an
  ANTACID Replication Service (ARS) server that are relevant to other
  clients/servers with which it must interact, especially topology
  configuration.  As of 2001-10-07.

  Copyright 2001 Code On The Road, LLC.

  This document is a DTD and is in full conformance with all
  provisions of Section 10 of RFC2026 except that the right to
  produce derivative works is not granted.

  Refer to this DTD as:

    <!ENTITY % ARSEXPCONFIG PUBLIC "-//IETF//DTD ARS CONFIG//EN" "">
    %ARSEXPCONFIG;

  -->


<!--
  DTD inclusions

  Caller should already have included the BEEP Channel Management DTD.  

  Caller should already have included the ars-c DTD.
  -->



<!ELEMENT ARSExportedConfig     (
                                 AdminContactInfo,
                                 GlobalServerID,
                                 (
                                  (ZonePrimaryConfig |
                                   NonZonePrimaryConfig
                                  )+
                                 )
                                )
>

<!ELEMENT AdminContactInfo            (name?, organization?, address?)>

<!ELEMENT name                        (%TEXT;)>
<!ELEMENT organization                (%TEXT;)>
<!ELEMENT address                     (%TEXT;)>

<!-- The GlobalServerID of this ARS server.  Note: the hostname
     specified in the downstream and upstream servers' configurations
     must be an identical string, or else the immediate-schedule "kick"
     will not be triggered by the upstream server in response to the
     downstream server's successful PropagateSubmittedUpdate. -->
<!ELEMENT GlobalServerID              EMPTY>
<!ATTLIST GlobalServerID
          %GlobalServerID;
>

<!ELEMENT ZonePrimaryConfig           (
                                       ZoneTopNode,
                                       (ZoneCutPoint*),
                                       (DownstreamServer*)
                                      )
>

<!ELEMENT NonZonePrimaryConfig        (
                                       ZoneTopNode,
                                       (ZoneCutPoint*),
                                       (UpstreamServer*),
                                       (DownstreamServer*)
                                      )
>

<!ELEMENT ZoneTopNode                 EMPTY>
<!ATTLIST ZoneTopNode
          Name                        (%DOCURI;)        #REQUIRED
>

<!ELEMENT ZoneCutPoint                EMPTY>
<!ATTLIST ZoneCutPoint
          Name                        (%DOCURI;)        #REQUIRED
>

<!ELEMENT ServerLocation              EMPTY>
<!ATTLIST ServerLocation
          SvrHost                     (%DNSNAME;)       #REQUIRED
          SvrPort                     (%UINT16;)        #REQUIRED
>

<!-- All that would really be needed for proper server operation is a
     ServerLocation for each DownstreamServer and each UpstreamServer.  We
     use a GlobalServerID here (which adds the SvrIncarnNum) because that
     makes it possible to do better state dumping of the ARS server.  Note
     that if you don't update the SvrIncarnNum in the ARSExportedConfig
     file, the server will still work correctly in all respects, except
     that it will not be able to dump the last seen SubmitSeqNum for any
     downstream servers that are using a SvrIncarnNum different from what
     is listed in the config file.  -->

<!ELEMENT DownstreamServer            (GlobalServerID, PushProperties)>

<!ELEMENT UpstreamServer              (
                                       Preference,
                                       GlobalServerID,
                                       (TopNodeOfZoneToReplicate|ZoneFilter),
                                       PullProperties
                                      )
>

<!ELEMENT Preference                  EMPTY>
<!ATTLIST Preference
          Weight                      (%UINT32;)        #REQUIRED
>

<!ELEMENT TopNodeOfZoneToReplicate    EMPTY>
<!ATTLIST TopNodeOfZoneToReplicate
          Name                        (%DOCURI;)        #REQUIRED
>

<!-- Push and pull periods are # of seconds between attempts to
     send/retrieve updates.  0 for a DownstreamServer means no delay
     should be imposed between receipt of an update and attempting to
     propagate that update to that DownstreamServer.  (0 is not
     meaningful/allowed for an UpstreamServer).  -1 means no attempt
     should be made to push or pull updates.
  -->
<!ELEMENT PushProperties              EMPTY>
<!ATTLIST PushProperties
          Period                      (%INT32;)         #REQUIRED
>
<!ELEMENT PullProperties              EMPTY>
<!ATTLIST PullProperties
          Period                      (%INT32;)         #REQUIRED
>
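
<!--
  Non-normative example (hypothetical names; the GlobalServerID
  attribute names shown are illustrative, since that entity is defined
  in the ars-c DTD): a NonZonePrimaryConfig that pulls from one
  upstream server every 30 seconds and pushes to one downstream server
  with no imposed delay:

    <NonZonePrimaryConfig>
      <ZoneTopNode Name="com.example.finance"/>
      <UpstreamServer>
        <Preference Weight="1"/>
        <GlobalServerID SvrHost="svr4.example.com" SvrPort="10288"
                        SvrIncarnNum="1002536871"/>
        <TopNodeOfZoneToReplicate Name="com.example.finance"/>
        <PullProperties Period="30"/>
      </UpstreamServer>
      <DownstreamServer>
        <GlobalServerID SvrHost="svr1.example.com" SvrPort="10288"
                        SvrIncarnNum="1002536993"/>
        <PushProperties Period="0"/>
      </DownstreamServer>
    </NonZonePrimaryConfig>
  -->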




 TOC 

Appendix I. Current Encodings and Registration Procedures

ARS encodings are defined under the MIME [11] Content-Type "application/ars", with the single parameter "encoding_name" naming which encoding is being used (e.g., DataWithOps). An ARS encoding is NOT a MIME Content-Transfer-Encoding, since it is not application-independent.

As is the case with MIME primary types, encodings being used privately (that is, between peers that understand the encoding by mutual prior arrangement) must be given names that begin with "X-" to indicate the encodings' non-standard status and to avoid a potential conflict with a future official name. Following the "X-" must be a URI [5] that identifies the encoding uniquely (for example, X-http://xml.resource.org/encodings/mysqlRaw.html). This URI should refer to a document that describes the encoding (whether formally or informally), but the existence of a document is not required. The only requirement is that the URI must provide a globally unique identification of the encoding, to prevent clashes in the name space of privately defined encodings.

ARS Encodings are afforded official status when they have been registered with the Internet Assigned Numbers Authority (IANA), using the template provided below. The currently defined ARS encodings are also listed below, for convenience.

Note that ARS references the encoding_name within the ContentEncodingsSupported and UpdateGroup elements, without the MIME "Content-Type:" syntax.
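
For example, a payload carried in the DataWithOps encoding would be labeled with a MIME header along the following lines (a non-normative sketch):

     Content-Type: application/ars; encoding_name=DataWithOps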

I.1 Currently Defined Encodings



<!--
  DTD for ANTACID Replication Service Registered Encodings
  (ars-encs), as of 2001-10-07.


  Copyright 2001 Code On The Road, LLC.

  This document is a DTD and is in full conformance with all
  provisions of Section 10 of RFC2026 except that the right to
  produce derivative works is not granted.


  Refer to this DTD as:

    <!ENTITY % ARSENC PUBLIC "-//IETF//DTD ARSENC//EN" "">
    %ARSENC;
  -->




<!--
  DTD inclusion

  Caller should already have included the ars-c DTD, which defines the
  DataWithOps encoding.

  -->





<!-- Each UpdateGroup contains some encoded data.  The UpdateGroup
     content model enumerates the currently registered encodings
  -->
<!ELEMENT UpdateGroup                (
                                      DataWithOps |
                                      AllZoneData |
                                      EllipsisNotation
                                     )
>

<!ELEMENT AllZoneData               (DatumAndOp*)>
<!ATTLIST AllZoneData
          TopNodeOfZoneToReplicate    (%DOCURI;)             #REQUIRED
>

<!ELEMENT EllipsisNotation           (GlobalUpdateSubmitID+)>
<!ELEMENT GlobalUpdateSubmitID       EMPTY>
<!ATTLIST GlobalUpdateSubmitID
          %GlobalSubmitID;
          CSN                        (%UINT64;)              #REQUIRED
>


The AllZoneData encoding is used to send (and receive) all documents within a datastore, in two cases: (a) starting up a new replica; (b) updating a downstream replica from an upstream server that uses log-based committed update state management, when the downstream server's last seen CSN is earlier than the upstream replica's log truncation point. The encoding is similar to DataWithOps, except that the enclosing AllZoneData element carries a TopNodeOfZoneToReplicate attribute naming the zone whose complete contents are being transmitted.
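
As a non-normative sketch (with hypothetical names), an UpdateGroup carrying a zone's complete contents in the AllZoneData encoding might look like:

     <UpdateGroup>
       <AllZoneData TopNodeOfZoneToReplicate="com.example.finance">
         <DatumAndOp Name="com.example.finance.payroll" CSN="18"
                     Action="write"> ... </DatumAndOp>
         <DatumAndOp Name="com.example.finance.budget" CSN="23"
                     Action="write"> ... </DatumAndOp>
       </AllZoneData>
     </UpdateGroup>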

The EllipsisNotation encoding may be used during committed update propagation when transmitting to a downstream server on the DAG path along which an update was originally submitted. Instead of sending the documents to be updated inside the UpdateGroup, the upstream server sends the GlobalUpdateSubmitID that was assigned when the update was originally submitted. The downstream server then commits the content that it had saved in temporary stable storage. This encoding avoids retransmitting update content down the same link(s) along which it was originally submitted. When using this encoding, it is the upstream server's responsibility to track the link(s) over which it received each update submission, in order to determine when the ellipsis notation may be applied. It is a local implementation matter whether the state needed to track this information is kept on stable storage or as in-memory, current-server-incarnation state.
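
As a non-normative sketch (with hypothetical values), an upstream server would send the following in place of the full update content, identifying the original submission and the CSN it was assigned at commit:

     <UpdateGroup>
       <EllipsisNotation>
         <GlobalUpdateSubmitID SubmisSvrHost="svr1.example.com"
                               SubmisSvrPortNum="10288"
                               SubmisSvrIncarn="1002536871"
                               SSN="1"
                               CSN="2"/>
       </EllipsisNotation>
     </UpdateGroup>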

I.2 Encoding Registration Procedures

In a manner similar to the MIME IANA registration procedures, this appendix provides an email template for registering new ARS encodings. Note that this template has not yet been registered with the IANA.


     To:  IANA@isi.edu
     Subject:  Registration of new ARS Encoding (MIME Content-Type:
     application/ars)

     Encoding name:

     Dependence on proprietary formats:

     Security considerations:

     Published specification:

     (The published specification must be an Internet RFC or
     RFC-to-be if a new top-level type is being defined, and must be
     a publicly available specification in any case.)

     Person & email address to contact for further information:



 TOC 

Full Copyright Statement

Acknowledgement