Search (5 results, page 1 of 1)

  • author_ss:"Moffat, A."
  1. Bell, T.C.; Moffat, A.; Nevill-Manning, C.G.; Witten, I.H.; Zobel, J.: Data compression in full-text retrieval systems (1993) 0.10
    0.10202718 = product of:
      0.20405436 = sum of:
        0.20405436 = product of:
          0.4081087 = sum of:
            0.4081087 = weight(_text_:compression in 5643) [ClassicSimilarity], result of:
              0.4081087 = score(doc=5643,freq=8.0), product of:
                0.36069217 = queryWeight, product of:
                  7.314861 = idf(docFreq=79, maxDocs=44218)
                  0.049309507 = queryNorm
                1.1314598 = fieldWeight in 5643, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  7.314861 = idf(docFreq=79, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=5643)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
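    The score breakdown above follows Lucene's ClassicSimilarity (TF-IDF) weighting. The following minimal Python sketch, using only the constants reported in the tree (docFreq, maxDocs, queryNorm, fieldNorm and the term frequency), approximately reproduces the figure; the function names are illustrative, not part of any search API.

      import math

      # Constants reported in the explanation tree for result 1 (doc 5643).
      DOC_FREQ, MAX_DOCS = 79, 44218      # collection statistics for "compression"
      QUERY_NORM = 0.049309507            # query normalisation factor
      FIELD_NORM = 0.0546875              # length norm stored for the _text_ field
      FREQ = 8.0                          # occurrences of "compression" in the document

      def classic_idf(doc_freq: int, max_docs: int) -> float:
          # ClassicSimilarity idf: 1 + ln(maxDocs / (docFreq + 1))
          return 1.0 + math.log(max_docs / (doc_freq + 1))

      def classic_score() -> float:
          idf = classic_idf(DOC_FREQ, MAX_DOCS)      # ~7.314861
          query_weight = idf * QUERY_NORM            # ~0.36069217
          tf = math.sqrt(FREQ)                       # ~2.828427
          field_weight = tf * idf * FIELD_NORM       # ~1.1314598
          term_score = query_weight * field_weight   # ~0.4081087
          # The two coord(1/2) factors halve the score twice on the way up the tree.
          return term_score * 0.5 * 0.5              # ~0.10202718

      if __name__ == "__main__":
          print(round(classic_score(), 8))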
    
    Abstract
    When data compression is applied to full-text retrieval systems, intricate relationships emerge between the amount of compression, access speed, and computing resources required. We propose compression methods, and explore corresponding tradeoffs, for all components of static full-text systems such as text databases on CD-ROM. These components include the lexical indexes and the main text itself. Results are reported on the application of the methods to several substantial full-text databases, and show that a large, unindexed text can be stored, along with indexes that facilitate fast searching, in less than half its original size, at some appreciable cost in primary memory requirements.
  2. Moffat, A.; Isal, R.Y.K.: Word-based text compression using the Burrows-Wheeler transform (2005) 0.09
    0.08745186 = product of:
      0.17490372 = sum of:
        0.17490372 = product of:
          0.34980744 = sum of:
            0.34980744 = weight(_text_:compression in 1044) [ClassicSimilarity], result of:
              0.34980744 = score(doc=1044,freq=8.0), product of:
                0.36069217 = queryWeight, product of:
                  7.314861 = idf(docFreq=79, maxDocs=44218)
                  0.049309507 = queryNorm
                0.96982265 = fieldWeight in 1044, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  7.314861 = idf(docFreq=79, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1044)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Block-sorting is an innovative compression mechanism introduced in 1994 by Burrows and Wheeler. It involves three steps: permuting the input one block at a time through the use of the Burrows-Wheeler transform (bwt); applying a move-to-front (mtf) transform to each of the permuted blocks; and then entropy coding the output with a Huffman or arithmetic coder. Until now, block-sorting implementations have assumed that the input message is a sequence of characters. In this paper we extend the block-sorting mechanism to word-based models. We also consider other recency transformations, and are able to show improved compression results compared to mtf and uniform arithmetic coding. For large files of text, the combination of word-based modeling, bwt, and mtf-like transformations allows excellent compression effectiveness to be attained within reasonable resource costs.
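    To make the three-stage pipeline concrete, here is a small, purely illustrative Python sketch of the character-level case: a naive Burrows-Wheeler transform followed by a move-to-front transform (the entropy-coding stage is omitted). The word-based variant studied in the paper would tokenise the input into words before these steps; the helper names here are assumptions, not the authors' code.

      def bwt(text: str, sentinel: str = "\x00") -> str:
          # Naive Burrows-Wheeler transform: sort all rotations of the block
          # and keep the last column.  Real implementations use suffix arrays.
          s = text + sentinel
          rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
          return "".join(rot[-1] for rot in rotations)

      def move_to_front(block: str) -> list[int]:
          # Move-to-front: emit each symbol's current position in a recency list,
          # then move that symbol to the front.  Runs of equal symbols become
          # runs of small integers, which the entropy coder can exploit.
          table = sorted(set(block))
          out = []
          for ch in block:
              idx = table.index(ch)
              out.append(idx)
              table.pop(idx)
              table.insert(0, ch)
          return out

      permuted = bwt("banana bandana")
      print(move_to_front(permuted))   # mostly small integers
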
  3. Wan, R.; Moffat, A.: Block merging for off-line compression (2007) 0.07
    0.072144106 = product of:
      0.14428821 = sum of:
        0.14428821 = product of:
          0.28857642 = sum of:
            0.28857642 = weight(_text_:compression in 81) [ClassicSimilarity], result of:
              0.28857642 = score(doc=81,freq=4.0), product of:
                0.36069217 = queryWeight, product of:
                  7.314861 = idf(docFreq=79, maxDocs=44218)
                  0.049309507 = queryNorm
                0.8000629 = fieldWeight in 81, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  7.314861 = idf(docFreq=79, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=81)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    To bound memory consumption, most compression systems provide a facility that controls the amount of data that may be processed at once - usually as a block size, but sometimes as a direct megabyte limit. In this work we consider the Re-Pair mechanism of Larsson and Moffat (2000), which processes large messages as disjoint blocks to limit memory consumption. We show that the blocks emitted by Re-Pair can be postprocessed to yield further savings, and describe techniques that allow files of 500 MB or more to be compressed in a holistic manner using less than that much main memory. The block merging process we describe has the additional advantage of allowing new text to be appended to the end of the compressed file.
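    For orientation, a toy Python sketch of the Re-Pair idea the paper builds on: repeatedly replace the most frequent adjacent pair of symbols with a fresh symbol and record the rule. This shows only the core substitution loop, under simplifying assumptions; the block splitting and the block-merging postprocessing described in the paper are not shown.

      from collections import Counter

      def repair(data: bytes):
          # Toy Re-Pair: repeatedly replace the most frequent adjacent symbol
          # pair with a new symbol until no pair occurs at least twice.
          seq = list(data)
          rules = {}                 # new symbol -> (left, right)
          next_symbol = 256          # byte values occupy 0..255
          while True:
              pairs = Counter(zip(seq, seq[1:]))
              if not pairs:
                  break
              pair, count = pairs.most_common(1)[0]
              if count < 2:
                  break
              rules[next_symbol] = pair
              rewritten, i = [], 0
              while i < len(seq):
                  if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                      rewritten.append(next_symbol)
                      i += 2
                  else:
                      rewritten.append(seq[i])
                      i += 1
              seq = rewritten
              next_symbol += 1
          return seq, rules

      compressed, grammar = repair(b"abracadabra abracadabra")
      print(len(compressed), len(grammar))
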
  4. Moffat, A.; Bell, T.A.H.: In situ generation of compressed inverted files (1995) 0.04
    0.04372593 = product of:
      0.08745186 = sum of:
        0.08745186 = product of:
          0.17490372 = sum of:
            0.17490372 = weight(_text_:compression in 2648) [ClassicSimilarity], result of:
              0.17490372 = score(doc=2648,freq=2.0), product of:
                0.36069217 = queryWeight, product of:
                  7.314861 = idf(docFreq=79, maxDocs=44218)
                  0.049309507 = queryNorm
                0.48491132 = fieldWeight in 2648, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  7.314861 = idf(docFreq=79, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2648)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    An inverted index stores, for each term that appears in a collection of documents, a list of the numbers of the documents containing that term. Such an index is indispensable when Boolean or informal ranked queries are to be answered. Construction of the index is, however, a non-trivial task. Simple methods using in-memory data structures cannot be used for large collections because they require too much random-access storage, and traditional disc-based methods require large amounts of temporary file space. Describes a new indexing algorithm designed to create large compressed inverted indexes in situ. It makes use of simple compression codes for the positive integers and an in-place external multiway merge sort. The new technique has been used to invert a 2-gigabyte text collection in under 4 hours, using less than 40 megabytes of temporary disc space and less than 20 megabytes of main memory.
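    As a rough illustration of the "simple compression codes for the positive integers" mentioned above, the following Python sketch gap-encodes a posting list with Elias gamma codes. It is a sketch under assumed names and does not show the in-place external multiway merge sort that makes the construction work in situ.

      def gamma_encode(n: int) -> str:
          # Elias gamma code for a positive integer:
          # a unary prefix giving the length, then the binary offset.
          assert n >= 1
          bits = bin(n)[2:]
          return "0" * (len(bits) - 1) + bits

      def encode_postings(doc_numbers: list[int]) -> str:
          # Store the gaps between sorted document numbers rather than the
          # numbers themselves; gaps are small, so their gamma codes are short.
          out, previous = [], 0
          for doc in sorted(doc_numbers):
              out.append(gamma_encode(doc - previous))
              previous = doc
          return "".join(out)

      # Hypothetical posting list for one term.
      print(encode_postings([3, 7, 11, 23, 29]))
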
  5. Moffat, A.; Zobel, J.: Self-indexing inverted files for fast text retrieval (1996) 0.04
    0.04372593 = product of:
      0.08745186 = sum of:
        0.08745186 = product of:
          0.17490372 = sum of:
            0.17490372 = weight(_text_:compression in 9) [ClassicSimilarity], result of:
              0.17490372 = score(doc=9,freq=2.0), product of:
                0.36069217 = queryWeight, product of:
                  7.314861 = idf(docFreq=79, maxDocs=44218)
                  0.049309507 = queryNorm
                0.48491132 = fieldWeight in 9, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  7.314861 = idf(docFreq=79, maxDocs=44218)
                  0.046875 = fieldNorm(doc=9)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Retrieval time for inverted lists can be greatly reduced by the use of compression, but this adds to the CPU time required. Shows that the CPU component of query response time for conjunctive Boolean queries and for informal ranked queries can be similarly reduced, at little cost in terms of storage, by the inclusion of an internal index in each compressed inverted list. This method has been applied in a retrieval system for a collection of nearly 2 million short documents. The self-indexing strategy adds less than 20% to the size of the compressed inverted file, which itself occupies less than 10% of the indexed text, yet can reduce processing time for Boolean queries of 5-10 terms to under one fifth of the previous cost. Similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness.
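    A sketch of the self-indexing idea, under the simplifying assumption that each inverted list is split into fixed-size blocks and a small in-list index records the first document number of every block; a candidate document can then be checked by decoding only one block rather than the whole list. The block size and names are illustrative, and real lists would be gap-compressed with byte offsets stored alongside the first document numbers.

      from bisect import bisect_right

      BLOCK = 128   # postings per block; illustrative value

      def build_self_index(doc_numbers):
          # Split the sorted posting list into blocks and keep, for each block,
          # its first document number as the in-list "internal index".
          docs = sorted(doc_numbers)
          blocks = [docs[i:i + BLOCK] for i in range(0, len(docs), BLOCK)]
          first_docs = [b[0] for b in blocks]
          return first_docs, blocks

      def contains(first_docs, blocks, candidate: int) -> bool:
          # Binary search the in-list index, then scan (decode) a single block.
          i = bisect_right(first_docs, candidate) - 1
          return i >= 0 and candidate in blocks[i]

      first_docs, blocks = build_self_index(range(1, 10_000, 7))
      print(contains(first_docs, blocks, 50), contains(first_docs, blocks, 51))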