Document (#33337)

Author
Simeoni, F.
Yakici, M.
Neely, S.
Crestani, F.
Title
Metadata harvesting for content-based distributed information retrieval
Source
Journal of the American Society for Information Science and Technology. 59(2008) no.1, p.12-24
Year
2008
Abstract
We propose an approach to content-based Distributed Information Retrieval based on the periodic and incremental centralization of full-content indices of widely dispersed and autonomously managed document sources. Inspired by the success of the Open Archives Initiative's (OAI) Protocol for Metadata Harvesting, the approach occupies middle ground between content crawling and distributed retrieval. As in crawling, some data move toward the retrieval process, but it is statistics about the content rather than content itself; this grants more efficient use of network resources and wider scope of application. As in distributed retrieval, some processing is distributed along with the data, but it is indexing rather than retrieval; this reduces the costs of content provision while promoting the simplicity, effectiveness, and responsiveness of retrieval. Overall, we argue that the approach retains the good properties of centralized retrieval without giving up cost-effective, large-scale resource pooling. We discuss the requirements associated with the approach and identify two strategies to deploy it on top of the OAI infrastructure. In particular, we define a minimal extension of the OAI protocol which supports the coordinated harvesting of full-content indices and descriptive metadata for content resources. Finally, we report on the implementation of a proof-of-concept prototype service for multimodel content-based retrieval of distributed file collections.

Similar documents (author)

  1. Crestani, F.: Combination of similarity measures for effective spoken document retrieval (2003) 5.44
    5.438222 = sum of:
      5.438222 = weight(author_txt:crestani in 4690) [ClassicSimilarity], result of:
        5.438222 = fieldWeight in 4690, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.701155 = idf(docFreq=19, maxDocs=44218)
          0.625 = fieldNorm(doc=4690)
    
  2. Crestani, F.; Lee, P.L.: Searching the web by constraining spreading activities (2000) 4.35
    4.3505774 = sum of:
      4.3505774 = weight(author_txt:crestani in 1326) [ClassicSimilarity], result of:
        4.3505774 = fieldWeight in 1326, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.701155 = idf(docFreq=19, maxDocs=44218)
          0.5 = fieldNorm(doc=1326)
    
  3. Tombros, T.; Crestani, F.: Users' perception of relevance of spoken documents (2000) 4.35
    4.3505774 = sum of:
      4.3505774 = weight(author_txt:crestani in 4996) [ClassicSimilarity], result of:
        4.3505774 = fieldWeight in 4996, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.701155 = idf(docFreq=19, maxDocs=44218)
          0.5 = fieldNorm(doc=4996)
    
  4. Crestani, F.; Du, H.: Written versus spoken queries : a qualitative and quantitative comparative analysis (2006) 4.35
    4.3505774 = sum of:
      4.3505774 = weight(author_txt:crestani in 5047) [ClassicSimilarity], result of:
        4.3505774 = fieldWeight in 5047, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.701155 = idf(docFreq=19, maxDocs=44218)
          0.5 = fieldNorm(doc=5047)
    
  5. Crestani, F.; Wu, S.: Testing the cluster hypothesis in distributed information retrieval (2006) 4.35
    4.3505774 = sum of:
      4.3505774 = weight(author_txt:crestani in 984) [ClassicSimilarity], result of:
        4.3505774 = fieldWeight in 984, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.701155 = idf(docFreq=19, maxDocs=44218)
          0.5 = fieldNorm(doc=984)
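
The author scores above can be reproduced by hand. The sketch below assumes the usual ClassicSimilarity TF-IDF formulas, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which are consistent with the figures shown. The function names are hypothetical and are not part of any library.

```python
import math

def classic_tf(freq: float) -> float:
    # Term-frequency factor: square root of the raw term frequency.
    return math.sqrt(freq)

def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Inverse document frequency, in the form that matches the listing above.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    # fieldWeight = tf * idf * fieldNorm, as in the explanation trees.
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm

# Values taken from the listing: docFreq=19, maxDocs=44218, freq=1.0.
score_doc_4690 = field_weight(1.0, 19, 44218, 0.625)  # entry 1, about 5.438
score_doc_1326 = field_weight(1.0, 19, 44218, 0.5)    # entry 2, about 4.351

print(round(score_doc_4690, 3), round(score_doc_1326, 3))
```

Because every entry matches the single term "crestani" exactly once, only fieldNorm separates the scores. A fieldNorm of 0.625 rather than 0.5 indicates a shorter author field, which is why the single-author record ranks first.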
    

Similar documents (content)

  1. Van de Sompel, H.; Young, J.A.; Hickey, T.B.: Using the OAI-PMH ... differently (2003) 0.25
    0.25444007 = sum of:
      0.25444007 = product of:
        0.90871453 = sum of:
          0.01890033 = weight(abstract_txt:resources in 1191) [ClassicSimilarity], result of:
            0.01890033 = score(doc=1191,freq=1.0), product of:
              0.0716145 = queryWeight, product of:
                1.057877 = boost
                4.2226825 = idf(docFreq=1761, maxDocs=44218)
                0.01603162 = queryNorm
              0.26391765 = fieldWeight in 1191, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2226825 = idf(docFreq=1761, maxDocs=44218)
                0.0625 = fieldNorm(doc=1191)
          0.12197038 = weight(abstract_txt:initiative's in 1191) [ClassicSimilarity], result of:
            0.12197038 = score(doc=1191,freq=1.0), product of:
              0.1970218 = queryWeight, product of:
                1.2407286 = boost
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.01603162 = queryNorm
              0.6190705 = fieldWeight in 1191, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.0625 = fieldNorm(doc=1191)
          0.12805831 = weight(abstract_txt:protocol in 1191) [ClassicSimilarity], result of:
            0.12805831 = score(doc=1191,freq=3.0), product of:
              0.17779496 = queryWeight, product of:
                1.6668419 = boost
                6.653462 = idf(docFreq=154, maxDocs=44218)
                0.01603162 = queryNorm
              0.72025836 = fieldWeight in 1191, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.653462 = idf(docFreq=154, maxDocs=44218)
                0.0625 = fieldNorm(doc=1191)
          0.10726633 = weight(abstract_txt:metadata in 1191) [ClassicSimilarity], result of:
            0.10726633 = score(doc=1191,freq=6.0), product of:
              0.14354134 = queryWeight, product of:
                1.8342934 = boost
                4.881247 = idf(docFreq=911, maxDocs=44218)
                0.01603162 = queryNorm
              0.7472853 = fieldWeight in 1191, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.881247 = idf(docFreq=911, maxDocs=44218)
                0.0625 = fieldNorm(doc=1191)
          0.23712857 = weight(abstract_txt:harvesting in 1191) [ClassicSimilarity], result of:
            0.23712857 = score(doc=1191,freq=2.0), product of:
              0.35131583 = queryWeight, product of:
                2.8696516 = boost
                7.636444 = idf(docFreq=57, maxDocs=44218)
                0.01603162 = queryNorm
              0.67497265 = fieldWeight in 1191, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.636444 = idf(docFreq=57, maxDocs=44218)
                0.0625 = fieldNorm(doc=1191)
          0.20373166 = weight(abstract_txt:distributed in 1191) [ClassicSimilarity], result of:
            0.20373166 = score(doc=1191,freq=2.0), product of:
              0.40002838 = queryWeight, product of:
                4.3305264 = boost
                5.761993 = idf(docFreq=377, maxDocs=44218)
                0.01603162 = queryNorm
              0.509293 = fieldWeight in 1191, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.761993 = idf(docFreq=377, maxDocs=44218)
                0.0625 = fieldNorm(doc=1191)
          0.09165897 = weight(abstract_txt:content in 1191) [ClassicSimilarity], result of:
            0.09165897 = score(doc=1191,freq=1.0), product of:
              0.3508553 = queryWeight, product of:
                5.235808 = boost
                4.17991 = idf(docFreq=1838, maxDocs=44218)
                0.01603162 = queryNorm
              0.2612444 = fieldWeight in 1191, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.17991 = idf(docFreq=1838, maxDocs=44218)
                0.0625 = fieldNorm(doc=1191)
        0.28 = coord(7/25)
    
  2. Nelson, M.L.; Harrison, T.L.; Rocker, J.A.: OAI and NASA's scientific and technical information (2003) 0.15
    0.15179 = sum of:
      0.15179 = product of:
        0.9486875 = sum of:
          0.13069896 = weight(abstract_txt:protocol in 3340) [ClassicSimilarity], result of:
            0.13069896 = score(doc=3340,freq=2.0), product of:
              0.17779496 = queryWeight, product of:
                1.6668419 = boost
                6.653462 = idf(docFreq=154, maxDocs=44218)
                0.01603162 = queryNorm
              0.73511064 = fieldWeight in 3340, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.653462 = idf(docFreq=154, maxDocs=44218)
                0.078125 = fieldNorm(doc=3340)
          0.09481093 = weight(abstract_txt:metadata in 3340) [ClassicSimilarity], result of:
            0.09481093 = score(doc=3340,freq=3.0), product of:
              0.14354134 = queryWeight, product of:
                1.8342934 = boost
                4.881247 = idf(docFreq=911, maxDocs=44218)
                0.01603162 = queryNorm
              0.6605131 = fieldWeight in 3340, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.881247 = idf(docFreq=911, maxDocs=44218)
                0.078125 = fieldNorm(doc=3340)
          0.3630275 = weight(abstract_txt:harvesting in 3340) [ClassicSimilarity], result of:
            0.3630275 = score(doc=3340,freq=3.0), product of:
              0.35131583 = queryWeight, product of:
                2.8696516 = boost
                7.636444 = idf(docFreq=57, maxDocs=44218)
                0.01603162 = queryNorm
              1.0333366 = fieldWeight in 3340, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.636444 = idf(docFreq=57, maxDocs=44218)
                0.078125 = fieldNorm(doc=3340)
          0.3601501 = weight(abstract_txt:distributed in 3340) [ClassicSimilarity], result of:
            0.3601501 = score(doc=3340,freq=4.0), product of:
              0.40002838 = queryWeight, product of:
                4.3305264 = boost
                5.761993 = idf(docFreq=377, maxDocs=44218)
                0.01603162 = queryNorm
              0.9003114 = fieldWeight in 3340, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.761993 = idf(docFreq=377, maxDocs=44218)
                0.078125 = fieldNorm(doc=3340)
        0.16 = coord(4/25)
    
  3. Fu, T.; Abbasi, A.; Chen, H.: A focused crawler for Dark Web forums (2010) 0.15
    0.14792496 = sum of:
      0.14792496 = product of:
        0.616354 = sum of:
          0.08572086 = weight(abstract_txt:periodic in 3471) [ClassicSimilarity], result of:
            0.08572086 = score(doc=3471,freq=1.0), product of:
              0.1557408 = queryWeight, product of:
                1.1031152 = boost
                8.806516 = idf(docFreq=17, maxDocs=44218)
                0.01603162 = queryNorm
              0.55040723 = fieldWeight in 3471, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.806516 = idf(docFreq=17, maxDocs=44218)
                0.0625 = fieldNorm(doc=3471)
          0.016265204 = weight(abstract_txt:based in 3471) [ClassicSimilarity], result of:
            0.016265204 = score(doc=3471,freq=1.0), product of:
              0.08163399 = queryWeight, product of:
                1.5972952 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.01603162 = queryNorm
              0.19924548 = fieldWeight in 3471, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.0625 = fieldNorm(doc=3471)
          0.045684017 = weight(abstract_txt:approach in 3471) [ClassicSimilarity], result of:
            0.045684017 = score(doc=3471,freq=3.0), product of:
              0.11267662 = queryWeight, product of:
                1.8765779 = boost
                3.745328 = idf(docFreq=2839, maxDocs=44218)
                0.01603162 = queryNorm
              0.40544364 = fieldWeight in 3471, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.745328 = idf(docFreq=2839, maxDocs=44218)
                0.0625 = fieldNorm(doc=3471)
          0.21632218 = weight(abstract_txt:crawling in 3471) [ClassicSimilarity], result of:
            0.21632218 = score(doc=3471,freq=2.0), product of:
              0.28867692 = queryWeight, product of:
                2.1239326 = boost
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.01603162 = queryNorm
              0.7493574 = fieldWeight in 3471, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.0625 = fieldNorm(doc=3471)
          0.047406003 = weight(abstract_txt:retrieval in 3471) [ClassicSimilarity], result of:
            0.047406003 = score(doc=3471,freq=1.0), product of:
              0.21826349 = queryWeight, product of:
                3.9177027 = boost
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.01603162 = queryNorm
              0.21719621 = fieldWeight in 3471, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.0625 = fieldNorm(doc=3471)
          0.2049557 = weight(abstract_txt:content in 3471) [ClassicSimilarity], result of:
            0.2049557 = score(doc=3471,freq=5.0), product of:
              0.3508553 = queryWeight, product of:
                5.235808 = boost
                4.17991 = idf(docFreq=1838, maxDocs=44218)
                0.01603162 = queryNorm
              0.5841602 = fieldWeight in 3471, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.17991 = idf(docFreq=1838, maxDocs=44218)
                0.0625 = fieldNorm(doc=3471)
        0.24 = coord(6/25)
    
  4. Van de Sompel, H.; Nelson, M.L.; Lagoze, C.; Warner, S.: Resource harvesting within the OAI-PMH framework (2004) 0.14
    0.14018515 = sum of:
      0.14018515 = product of:
        0.7009257 = sum of:
          0.06339364 = weight(abstract_txt:resources in 4110) [ClassicSimilarity], result of:
            0.06339364 = score(doc=4110,freq=5.0), product of:
              0.0716145 = queryWeight, product of:
                1.057877 = boost
                4.2226825 = idf(docFreq=1761, maxDocs=44218)
                0.01603162 = queryNorm
              0.88520676 = fieldWeight in 4110, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.2226825 = idf(docFreq=1761, maxDocs=44218)
                0.09375 = fieldNorm(doc=4110)
          0.11090176 = weight(abstract_txt:protocol in 4110) [ClassicSimilarity], result of:
            0.11090176 = score(doc=4110,freq=1.0), product of:
              0.17779496 = queryWeight, product of:
                1.6668419 = boost
                6.653462 = idf(docFreq=154, maxDocs=44218)
                0.01603162 = queryNorm
              0.6237621 = fieldWeight in 4110, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.653462 = idf(docFreq=154, maxDocs=44218)
                0.09375 = fieldNorm(doc=4110)
          0.13137388 = weight(abstract_txt:metadata in 4110) [ClassicSimilarity], result of:
            0.13137388 = score(doc=4110,freq=4.0), product of:
              0.14354134 = queryWeight, product of:
                1.8342934 = boost
                4.881247 = idf(docFreq=911, maxDocs=44218)
                0.01603162 = queryNorm
              0.91523385 = fieldWeight in 4110, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.881247 = idf(docFreq=911, maxDocs=44218)
                0.09375 = fieldNorm(doc=4110)
          0.03956352 = weight(abstract_txt:approach in 4110) [ClassicSimilarity], result of:
            0.03956352 = score(doc=4110,freq=1.0), product of:
              0.11267662 = queryWeight, product of:
                1.8765779 = boost
                3.745328 = idf(docFreq=2839, maxDocs=44218)
                0.01603162 = queryNorm
              0.3511245 = fieldWeight in 4110, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.745328 = idf(docFreq=2839, maxDocs=44218)
                0.09375 = fieldNorm(doc=4110)
          0.3556929 = weight(abstract_txt:harvesting in 4110) [ClassicSimilarity], result of:
            0.3556929 = score(doc=4110,freq=2.0), product of:
              0.35131583 = queryWeight, product of:
                2.8696516 = boost
                7.636444 = idf(docFreq=57, maxDocs=44218)
                0.01603162 = queryNorm
              1.012459 = fieldWeight in 4110, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.636444 = idf(docFreq=57, maxDocs=44218)
                0.09375 = fieldNorm(doc=4110)
        0.2 = coord(5/25)
    
  5. Suleman, H.; Fox, E.A.: Leveraging OAI harvesting to disseminate theses (2003) 0.14
    0.13974273 = sum of:
      0.13974273 = product of:
        0.69871366 = sum of:
          0.020331504 = weight(abstract_txt:based in 4779) [ClassicSimilarity], result of:
            0.020331504 = score(doc=4779,freq=1.0), product of:
              0.08163399 = queryWeight, product of:
                1.5972952 = boost
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.01603162 = queryNorm
              0.24905685 = fieldWeight in 4779, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1879277 = idf(docFreq=4958, maxDocs=44218)
                0.078125 = fieldNorm(doc=4779)
          0.09241813 = weight(abstract_txt:protocol in 4779) [ClassicSimilarity], result of:
            0.09241813 = score(doc=4779,freq=1.0), product of:
              0.17779496 = queryWeight, product of:
                1.6668419 = boost
                6.653462 = idf(docFreq=154, maxDocs=44218)
                0.01603162 = queryNorm
              0.51980174 = fieldWeight in 4779, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.653462 = idf(docFreq=154, maxDocs=44218)
                0.078125 = fieldNorm(doc=4779)
          0.109478235 = weight(abstract_txt:metadata in 4779) [ClassicSimilarity], result of:
            0.109478235 = score(doc=4779,freq=4.0), product of:
              0.14354134 = queryWeight, product of:
                1.8342934 = boost
                4.881247 = idf(docFreq=911, maxDocs=44218)
                0.01603162 = queryNorm
              0.76269484 = fieldWeight in 4779, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.881247 = idf(docFreq=911, maxDocs=44218)
                0.078125 = fieldNorm(doc=4779)
          0.2964107 = weight(abstract_txt:harvesting in 4779) [ClassicSimilarity], result of:
            0.2964107 = score(doc=4779,freq=2.0), product of:
              0.35131583 = queryWeight, product of:
                2.8696516 = boost
                7.636444 = idf(docFreq=57, maxDocs=44218)
                0.01603162 = queryNorm
              0.8437158 = fieldWeight in 4779, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.636444 = idf(docFreq=57, maxDocs=44218)
                0.078125 = fieldNorm(doc=4779)
          0.18007505 = weight(abstract_txt:distributed in 4779) [ClassicSimilarity], result of:
            0.18007505 = score(doc=4779,freq=1.0), product of:
              0.40002838 = queryWeight, product of:
                4.3305264 = boost
                5.761993 = idf(docFreq=377, maxDocs=44218)
                0.01603162 = queryNorm
              0.4501557 = fieldWeight in 4779, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.761993 = idf(docFreq=377, maxDocs=44218)
                0.078125 = fieldNorm(doc=4779)
        0.2 = coord(5/25)
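
The content-based scores combine several term weights. As a minimal sketch, assuming the same ClassicSimilarity formulas that the figures above are consistent with, the score of the first content match (doc 1191) can be recomputed as the sum of queryWeight * fieldWeight over the matching terms, multiplied by the coordination factor. The boosts, frequencies, and document frequencies are copied from the listing. The helper names are hypothetical.

```python
import math

MAX_DOCS = 44218          # maxDocs from the listing
QUERY_NORM = 0.01603162   # queryNorm from the listing
QUERY_TERMS = 25          # denominator of coord(7/25)
FIELD_NORM = 0.0625       # fieldNorm(doc=1191)

def idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    # Inverse document frequency, in the form that matches the listing.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(boost: float, freq: float, doc_freq: int, field_norm: float) -> float:
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM              # queryWeight
    field_weight = math.sqrt(freq) * term_idf * field_norm    # fieldWeight
    return query_weight * field_weight

# (boost, freq, docFreq) for the seven query terms found in doc 1191.
matches = [
    (1.057877, 1.0, 1761),   # resources
    (1.2407286, 1.0, 5),     # initiative's
    (1.6668419, 3.0, 154),   # protocol
    (1.8342934, 6.0, 911),   # metadata
    (2.8696516, 2.0, 57),    # harvesting
    (4.3305264, 2.0, 377),   # distributed
    (5.235808, 1.0, 1838),   # content
]

raw_sum = sum(term_score(b, f, df, FIELD_NORM) for b, f, df in matches)
coord = len(matches) / QUERY_TERMS   # 7 of 25 query terms matched
score = raw_sum * coord

print(round(raw_sum, 4), round(score, 4))  # about 0.9087 and 0.2544
```

The coordination factor explains why a record with a larger raw sum can still rank lower: entry 2 has a raw sum of about 0.95 but matches only 4 of 25 terms, so its final score falls below that of entry 1.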