Document (#27074)

Editor
Chaudhri, A.B. et al.
Title
XML data management : native XML and XML-enabled database systems
Imprint
Boston, MA : Addison-Wesley
Year
2003
Pages
641 p.
Isbn
0-201-84452-4
Footnote
Review in: JASIST 55(2004) no.1, pp. 90-91 (N. Rhodes): "The recent near-exponential increase in XML-based technologies has exposed a gap between these technologies and those that are concerned with more fundamental data management issues. This very comprehensive and well-organized book has quite neatly filled the gap, thus achieving most of its stated intentions. The target audiences are database and XML professionals wishing to combine XML with modern database technologies and such is the breadth of scope of this book that few would not find it useful in some way. The editors have assembled a collection of chapters from a wide selection of industry heavyweights and as with most books of this type, it exhibits many disparate styles but thanks to careful editing it reads well as a cohesive whole. Certain sections have already appeared in print elsewhere and there is a deal of corporate flag-waving but nowhere does it become over-intrusive. The preface provides only the very briefest of introductions to XML but instead sets the tone for the remainder of the book. The twin terms of data- and document-centric XML (Bourret, 2003) that have achieved so much recent currency are re-iterated before XML data management issues are considered. It is here that the book's aims are stated, mostly concerned with the approaches and features of the various available XML data management solutions. Not surprisingly, in a specialized book such as this one an introduction to XML consists of a single chapter. For issues such as syntax, DTDs and XML Schemas the reader is referred elsewhere; here, Chris Brandin provides a practical guide to achieving good grammar and style and argues convincingly for the use of XML as an information-modeling tool. Using a well-chosen and simple example, a practical guide to modeling information is developed, replete with examples of the pitfalls. 
This brief but illuminating chapter (incidentally available as a "taster" from the publisher's web site) notes that one of the most promising aspects of XML is that applications can be built to use a single mutable information model, obviating the need to change the application code but that good XML design is the basis of such mutability.
There is some debate over what exactly constitutes a native XML database. Bourret (2003) favors the wider definition; other authors such as the Butler Group (2002) restrict the use of the term to database systems designed and built solely for storage and manipulation of XML. Two examples of the latter (Tamino and eXist) are covered in detailed chapters here but also included in this section is the embedded XML database system, Berkeley DB XML, considered by makers Sleepycat Software to be "native" in that it is capable of storing XML natively but built on top of the Berkeley DB engine. To the uninitiated, the revelation that schemas and DTDs are not required by either Tamino or eXist might seem a little strange. Tamino implements "loose coupling" where the validation behavior can be set to "strict," "lax" (i.e., apply only to parts of a document) or "skip" (no checking); in eXist, schemas are simply optional. Many DTDs and schemas evolve as the XML documents are acquired and so these may adhere to slightly different schemas, thus the database should support queries on similar documents that do not share the same structure. In fact, because of the difficulties in mappings between XML and database (especially relational) schemas native XML databases are very useful for storage of semi-structured data, a point not made in either chapter. The chapter on embedded databases represents a "third way," being neither native nor of the XML-enabled relational type. These databases run inside purpose-written applications and are accessed via an API or similar, meaning that the application developer does not need to access database files at the operating system level but can rely on supplied routines to, for example, fetch and update database records. Thus, end-users do not use the databases directly; the applications do not usually include ad hoc end-user query tools. 
This property renders embedded databases unsuitable for a large number of situations and they have become very much a niche market but this market is growing rapidly. Embedded databases share an address space with the application so the overhead of calls to the server is reduced; they also confer advantages in that they are easier to deploy, manage and administer compared to a conventional client-server solution. This chapter is a very good introduction to the subject; primers on generic embedded databases and embedded XML databases are helpfully provided before the author moves to an overview of the Open Source Berkeley system. Building an embedded database application makes far greater demands on the software developer and the remainder of the chapter is devoted to consideration of these programming issues.
Relational database management systems have been one of the great success stories of recent times and sensitive to the market, most major vendors have responded by extending their products to handle XML data while still exploiting the range of facilities that a modern RDBMS affords. No book of this type would be complete without consideration of the "big three" (Oracle 9i, DB2, and SQL Server 2000 which each get a dedicated chapter) and though occasionally overtly piece-meal and descriptive the authors all note the shortcomings as well as the strengths of the respective systems. This part of the book is somewhat dichotomous, these chapters being followed by two that propose detailed solutions to somewhat theoretical problems, a generic architecture for storing XML in a RDBMS and using an object-relational approach to building an XML repository. The biography of the author of the latter (Paul Brown) contains the curious but strangely reassuring admission that "he remains puzzled by XML." The first five components are in-depth case studies of XML database applications. Necessarily diverse, few will be interested in all the topics presented but I was particularly interested in the first case study on bioinformatics. One of the twentieth century's greatest scientific undertakings was the Human Genome Project, the quest to list the information encoded by the sequence of DNA that makes up our genes and which has been referred to as "a paradigm for information management in the life sciences" (Pearson & Soll, 1991). After a brief introduction to molecular biology to give the background to the information management problems, the authors turn to the use of XML in bioinformatics. Some of the data are hierarchical (e.g., the Linnaean classification of a human as a primate, primates as mammals, mammals are all vertebrates, etc.) but others are far more difficult to model. 
The Human Genome Project is virtually complete as far as the data acquisition phase is concerned and the immense volume of genome sequence data is no longer a very significant information management issue per se. However bioinformaticians now need to interpret this information. Some data are relatively straightforward, e.g., the positioning of genes and sequence elements (e.g., promoters) within the sequences, but there is often little or no knowledge available on the direct and indirect interactions between them. There are vast numbers of such interrelationships; many complex data types and novel ones are constantly emerging, necessitating an extensible approach and the ability to manage semi-structured data. In the past, object databases such as AceDB (Durbin & Mieg, 1991) have gone some way to meeting these aims but it is the combination of XML and databases that more completely addresses knowledge management requirements of bioinformatics. XML is being enthusiastically adopted with a plethora of XML markup standards being developed, as authors Direen and Jones note "The unprecedented degree and flexibility of XML in terms of its ability to capture information is what makes it ideal for knowledge management and for use in bioinformatics."
After several detailed examples of XML, Direen and Jones discuss sequence comparisons. The ability to create scored comparisons by such techniques as sequence alignment is fundamental to bioinformatics. For example, the function of a gene product may be inferred from similarity with a gene of known function but originating from a different organism and any information modeling method must facilitate such comparisons. One such comparison tool, BLAST utilizes a heuristic method has become the tool of choice for many years and is integrated into the NeoCore XMS (XML Management System) described herein. Any set of sequences that can be identified using an XPath query may thus become the targets of an embedded search. Again examples are given, though a BLASTp (protein) search is labeled as being BLASTn (nucleotide sequence) in one of them. Some variants of BLAST are computationally intensive, e.g., tBLASTx where a nucleotide sequence is dynamically translated in all six reading frames and compared against similarly translated database sequences. Though these variants are implemented in NeoCore XMS, it would be interesting to see runtimes for such comparisons. Obviously the utility of this and the other four quite specific examples will depend on your interest in the application area but two that are more research-oriented and general follow them. These chapters (on using XML with inductive databases and on XML warehouses) are both readable critical reviews of their respective subject areas. For those involved in the implementation of performance-critical applications an examination of benchmark results is mandatory, however very few would examine the benchmark tests themselves. The picture that emerges from this section is that no single set is comprehensive and that some functionalities are not addressed by any available benchmark. As always, there is no substitute for an intimate knowledge of your data and how it is used. 
In a direct comparison of an XML-enabled and a native XML database system (unfortunately neither is named), the authors conclude that though the native system has the edge in handling large documents this comes at the expense of increasing index and data file size. The need to use legacy data and software will certainly favor the all-pervasive XML-enabled RDBMS such as Oracle 9i and IBM's DB2. Of more general utility is the chapter by Schmauch and Fellhauer comparing the approaches used by database systems for the storing of XML documents. Many of the limitations of current XML-handling systems may be traced to problems caused by the semi-structured nature of the documents and while the authors have no panacea, the chapter forms a useful discussion of the issues and even raises the ugly prospect that a return to the drawing board may be unavoidable. The book concludes with an appraisal of the current status of XML by the editors that perhaps focuses a little too little on the database side but overall I believe this book to be very useful indeed. Some of the indexing is a little idiosyncratic, for example some tags used in the examples are indexed (perhaps a separate examples index would be better) and Ron Bourret's excellent web site might be better placed under "Bourret" rather than under "Ron" but this doesn't really detract from the book's qualities. The broad spectrum and careful balance of theory and practice is a combination that both database and XML professionals will find valuable."
Theme
Internet
Object
XML

Similar documents (content)

  1. Gillman, P.: Data handling and text compression (1992) 0.45
    0.453982 = sum of:
      0.453982 = product of:
        0.5447784 = sum of:
          0.056926385 = weight(abstract_txt:data in 5306) [ClassicSimilarity], result of:
            0.056926385 = score(doc=5306,freq=4.0), product of:
              0.13649988 = queryWeight, product of:
                3.3363478 = idf(docFreq=4274, maxDocs=44218)
                0.040912967 = queryNorm
              0.41704348 = fieldWeight in 5306, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.3363478 = idf(docFreq=4274, maxDocs=44218)
                0.0625 = fieldNorm(doc=5306)
          0.04304922 = weight(abstract_txt:systems in 5306) [ClassicSimilarity], result of:
            0.04304922 = score(doc=5306,freq=2.0), product of:
              0.1427502 = queryWeight, product of:
                1.0226387 = boost
                3.4118783 = idf(docFreq=3963, maxDocs=44218)
                0.040912967 = queryNorm
              0.3015703 = fieldWeight in 5306, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4118783 = idf(docFreq=3963, maxDocs=44218)
                0.0625 = fieldNorm(doc=5306)
          0.058462556 = weight(abstract_txt:management in 5306) [ClassicSimilarity], result of:
            0.058462556 = score(doc=5306,freq=1.0), product of:
              0.22056085 = queryWeight, product of:
                1.2711537 = boost
                4.2410107 = idf(docFreq=1729, maxDocs=44218)
                0.040912967 = queryNorm
              0.26506317 = fieldWeight in 5306, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2410107 = idf(docFreq=1729, maxDocs=44218)
                0.0625 = fieldNorm(doc=5306)
          0.060085945 = weight(abstract_txt:database in 5306) [ClassicSimilarity], result of:
            0.060085945 = score(doc=5306,freq=1.0), product of:
              0.2246252 = queryWeight, product of:
                1.2828122 = boost
                4.2799077 = idf(docFreq=1663, maxDocs=44218)
                0.040912967 = queryNorm
              0.26749423 = fieldWeight in 5306, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2799077 = idf(docFreq=1663, maxDocs=44218)
                0.0625 = fieldNorm(doc=5306)
          0.3262543 = weight(abstract_txt:native in 5306) [ClassicSimilarity], result of:
            0.3262543 = score(doc=5306,freq=1.0), product of:
              0.6939274 = queryWeight, product of:
                2.254711 = boost
                7.5225 = idf(docFreq=64, maxDocs=44218)
                0.040912967 = queryNorm
              0.47015625 = fieldWeight in 5306, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5225 = idf(docFreq=64, maxDocs=44218)
                0.0625 = fieldNorm(doc=5306)
        0.8333333 = coord(5/6)
    
  2. Stein, R.M.: Object databases (1994) 0.37
    0.37437403 = sum of:
      0.37437403 = product of:
        0.56156105 = sum of:
          0.071157984 = weight(abstract_txt:data in 1101) [ClassicSimilarity], result of:
            0.071157984 = score(doc=1101,freq=1.0), product of:
              0.13649988 = queryWeight, product of:
                3.3363478 = idf(docFreq=4274, maxDocs=44218)
                0.040912967 = queryNorm
              0.52130437 = fieldWeight in 1101, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3363478 = idf(docFreq=4274, maxDocs=44218)
                0.15625 = fieldNorm(doc=1101)
          0.13181077 = weight(abstract_txt:systems in 1101) [ClassicSimilarity], result of:
            0.13181077 = score(doc=1101,freq=3.0), product of:
              0.1427502 = queryWeight, product of:
                1.0226387 = boost
                3.4118783 = idf(docFreq=3963, maxDocs=44218)
                0.040912967 = queryNorm
              0.9233666 = fieldWeight in 1101, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.4118783 = idf(docFreq=3963, maxDocs=44218)
                0.15625 = fieldNorm(doc=1101)
          0.14615639 = weight(abstract_txt:management in 1101) [ClassicSimilarity], result of:
            0.14615639 = score(doc=1101,freq=1.0), product of:
              0.22056085 = queryWeight, product of:
                1.2711537 = boost
                4.2410107 = idf(docFreq=1729, maxDocs=44218)
                0.040912967 = queryNorm
              0.6626579 = fieldWeight in 1101, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2410107 = idf(docFreq=1729, maxDocs=44218)
                0.15625 = fieldNorm(doc=1101)
          0.2124359 = weight(abstract_txt:database in 1101) [ClassicSimilarity], result of:
            0.2124359 = score(doc=1101,freq=2.0), product of:
              0.2246252 = queryWeight, product of:
                1.2828122 = boost
                4.2799077 = idf(docFreq=1663, maxDocs=44218)
                0.040912967 = queryNorm
              0.9457349 = fieldWeight in 1101, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.2799077 = idf(docFreq=1663, maxDocs=44218)
                0.15625 = fieldNorm(doc=1101)
        0.6666667 = coord(4/6)
    
  3. Kind, J.: Database and document management systems : current usage and future trends (1992) 0.37
    0.36743793 = sum of:
      0.36743793 = product of:
        0.5511569 = sum of:
          0.056926385 = weight(abstract_txt:data in 3759) [ClassicSimilarity], result of:
            0.056926385 = score(doc=3759,freq=1.0), product of:
              0.13649988 = queryWeight, product of:
                3.3363478 = idf(docFreq=4274, maxDocs=44218)
                0.040912967 = queryNorm
              0.41704348 = fieldWeight in 3759, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3363478 = idf(docFreq=4274, maxDocs=44218)
                0.125 = fieldNorm(doc=3759)
          0.12176158 = weight(abstract_txt:systems in 3759) [ClassicSimilarity], result of:
            0.12176158 = score(doc=3759,freq=4.0), product of:
              0.1427502 = queryWeight, product of:
                1.0226387 = boost
                3.4118783 = idf(docFreq=3963, maxDocs=44218)
                0.040912967 = queryNorm
              0.8529696 = fieldWeight in 3759, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.4118783 = idf(docFreq=3963, maxDocs=44218)
                0.125 = fieldNorm(doc=3759)
          0.20252024 = weight(abstract_txt:management in 3759) [ClassicSimilarity], result of:
            0.20252024 = score(doc=3759,freq=3.0), product of:
              0.22056085 = queryWeight, product of:
                1.2711537 = boost
                4.2410107 = idf(docFreq=1729, maxDocs=44218)
                0.040912967 = queryNorm
              0.91820574 = fieldWeight in 3759, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.2410107 = idf(docFreq=1729, maxDocs=44218)
                0.125 = fieldNorm(doc=3759)
          0.16994871 = weight(abstract_txt:database in 3759) [ClassicSimilarity], result of:
            0.16994871 = score(doc=3759,freq=2.0), product of:
              0.2246252 = queryWeight, product of:
                1.2828122 = boost
                4.2799077 = idf(docFreq=1663, maxDocs=44218)
                0.040912967 = queryNorm
              0.7565879 = fieldWeight in 3759, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.2799077 = idf(docFreq=1663, maxDocs=44218)
                0.125 = fieldNorm(doc=3759)
        0.6666667 = coord(4/6)
    
  4. Temmerman, P.: ISAD(G): de definitieve standaard? (1994) 0.32
    0.31524396 = sum of:
      0.31524396 = product of:
        0.6304879 = sum of:
          0.04981059 = weight(abstract_txt:data in 7797) [ClassicSimilarity], result of:
            0.04981059 = score(doc=7797,freq=1.0), product of:
              0.13649988 = queryWeight, product of:
                3.3363478 = idf(docFreq=4274, maxDocs=44218)
                0.040912967 = queryNorm
              0.36491305 = fieldWeight in 7797, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3363478 = idf(docFreq=4274, maxDocs=44218)
                0.109375 = fieldNorm(doc=7797)
          0.10230947 = weight(abstract_txt:management in 7797) [ClassicSimilarity], result of:
            0.10230947 = score(doc=7797,freq=1.0), product of:
              0.22056085 = queryWeight, product of:
                1.2711537 = boost
                4.2410107 = idf(docFreq=1729, maxDocs=44218)
                0.040912967 = queryNorm
              0.46386054 = fieldWeight in 7797, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2410107 = idf(docFreq=1729, maxDocs=44218)
                0.109375 = fieldNorm(doc=7797)
          0.47836784 = weight(abstract_txt:enabled in 7797) [ClassicSimilarity], result of:
            0.47836784 = score(doc=7797,freq=1.0), product of:
              0.61672634 = queryWeight, product of:
                2.125593 = boost
                7.0917172 = idf(docFreq=99, maxDocs=44218)
                0.040912967 = queryNorm
              0.7756566 = fieldWeight in 7797, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.0917172 = idf(docFreq=99, maxDocs=44218)
                0.109375 = fieldNorm(doc=7797)
        0.5 = coord(3/6)
    
  5. Vries, A.P. de: Content independence in multimedia databases (2001) 0.30
    0.2960006 = sum of:
      0.2960006 = product of:
        0.4440009 = sum of:
          0.04269479 = weight(abstract_txt:data in 6534) [ClassicSimilarity], result of:
            0.04269479 = score(doc=6534,freq=1.0), product of:
              0.13649988 = queryWeight, product of:
                3.3363478 = idf(docFreq=4274, maxDocs=44218)
                0.040912967 = queryNorm
              0.31278262 = fieldWeight in 6534, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3363478 = idf(docFreq=4274, maxDocs=44218)
                0.09375 = fieldNorm(doc=6534)
          0.045660593 = weight(abstract_txt:systems in 6534) [ClassicSimilarity], result of:
            0.045660593 = score(doc=6534,freq=1.0), product of:
              0.1427502 = queryWeight, product of:
                1.0226387 = boost
                3.4118783 = idf(docFreq=3963, maxDocs=44218)
                0.040912967 = queryNorm
              0.3198636 = fieldWeight in 6534, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4118783 = idf(docFreq=3963, maxDocs=44218)
                0.09375 = fieldNorm(doc=6534)
          0.17538767 = weight(abstract_txt:management in 6534) [ClassicSimilarity], result of:
            0.17538767 = score(doc=6534,freq=4.0), product of:
              0.22056085 = queryWeight, product of:
                1.2711537 = boost
                4.2410107 = idf(docFreq=1729, maxDocs=44218)
                0.040912967 = queryNorm
              0.7951895 = fieldWeight in 6534, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.2410107 = idf(docFreq=1729, maxDocs=44218)
                0.09375 = fieldNorm(doc=6534)
          0.18025784 = weight(abstract_txt:database in 6534) [ClassicSimilarity], result of:
            0.18025784 = score(doc=6534,freq=4.0), product of:
              0.2246252 = queryWeight, product of:
                1.2828122 = boost
                4.2799077 = idf(docFreq=1663, maxDocs=44218)
                0.040912967 = queryNorm
              0.8024827 = fieldWeight in 6534, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.2799077 = idf(docFreq=1663, maxDocs=44218)
                0.09375 = fieldNorm(doc=6534)
        0.6666667 = coord(4/6)
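
The score trees above follow the classic TF-IDF arithmetic that the labels in the trees spell out: tf is the square root of the term frequency, queryWeight is boost x idf x queryNorm, fieldWeight is tf x idf x fieldNorm, and the summed term scores are scaled by coord. The Python sketch below re-derives the 0.453982 reported for document 5306 from the values printed in its tree. It is only an illustration: the function and variable names are hypothetical, the idf formula 1 + ln(maxDocs / (docFreq + 1)) is assumed because it matches the printed idf values, and the results agree with the listing only up to single-precision rounding.

```python
import math

# Constants copied from the explanation trees above.
MAX_DOCS = 44218
QUERY_NORM = 0.040912967


def classic_idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    """Assumed idf formula; it reproduces the idf values printed in the trees."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq: float, doc_freq: int, field_norm: float,
               boost: float = 1.0) -> float:
    """score = queryWeight * fieldWeight for one query term in one document."""
    idf = classic_idf(doc_freq)
    query_weight = boost * idf * QUERY_NORM           # boost * idf * queryNorm
    field_weight = math.sqrt(freq) * idf * field_norm  # tf * idf * fieldNorm
    return query_weight * field_weight


# (termFreq, docFreq, boost) for each matching term in doc 5306, fieldNorm 0.0625.
TERMS_5306 = {
    "data": (4.0, 4274, 1.0),
    "systems": (2.0, 3963, 1.0226387),
    "management": (1.0, 1729, 1.2711537),
    "database": (1.0, 1663, 1.2828122),
    "native": (1.0, 64, 2.254711),
}

term_sum = sum(term_score(freq, doc_freq, 0.0625, boost)
               for freq, doc_freq, boost in TERMS_5306.values())
coord = 5 / 6                      # 5 of the 6 query terms matched
final_score = term_sum * coord

print(f"sum of term scores: {term_sum:.6f}")
print(f"final score:        {final_score:.6f}")
```

The same two functions can be applied to the other four entries by substituting each tree's termFreq, docFreq, boost, fieldNorm and coord values.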