Search (9 results, page 1 of 1)

Rasmussen, E.M.: Indexing and retrieval for the Web (2002) 0.02
```
0.022696622 = product of:
  0.09078649 = sum of:
    0.045056276 = weight(_text_:wide in 4285) [ClassicSimilarity], result of:
      0.045056276 = score(doc=4285,freq=8.0), product of:
        0.13148437 = queryWeight, product of:
          4.4307585 = idf(docFreq=1430, maxDocs=44218)
          0.029675366 = queryNorm
        0.342674 = fieldWeight in 4285, product of:
          2.828427 = tf(freq=8.0), with freq of:
            8.0 = termFreq=8.0
          4.4307585 = idf(docFreq=1430, maxDocs=44218)
          0.02734375 = fieldNorm(doc=4285)
    0.045730207 = weight(_text_:web in 4285) [ClassicSimilarity], result of:
      0.045730207 = score(doc=4285,freq=28.0), product of:
        0.096845865 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029675366 = queryNorm
        0.47219574 = fieldWeight in 4285, product of:
          5.2915025 = tf(freq=28.0), with freq of:
            28.0 = termFreq=28.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.02734375 = fieldNorm(doc=4285)
  0.25 = coord(2/8)
```
Abstract

The introduction and growth of the World Wide Web (WWW, or Web) have resulted in a profound change in the way individuals and organizations access information. In terms of volume, nature, and accessibility, the characteristics of electronic information are significantly different from those of even five or six years ago. Control of, and access to, this flood of information rely heavily an automated techniques for indexing and retrieval. According to Gudivada, Raghavan, Grosky, and Kasanagottu (1997, p. 58), "The ability to search and retrieve information from the Web efficiently and effectively is an enabling technology for realizing its full potential." Almost 93 percent of those surveyed consider the Web an "indispensable" Internet technology, second only to e-mail (Graphie, Visualization & Usability Center, 1998). Although there are other ways of locating information an the Web (browsing or following directory structures), 85 percent of users identify Web pages by means of a search engine (Graphie, Visualization & Usability Center, 1998). A more recent study conducted by the Stanford Institute for the Quantitative Study of Society confirms the finding that searching for information is second only to e-mail as an Internet activity (Nie & Ebring, 2000, online). In fact, Nie and Ebring conclude, "... the Internet today is a giant public library with a decidedly commercial tilt. The most widespread use of the Internet today is as an information search utility for products, travel, hobbies, and general information. Virtually all users interviewed responded that they engaged in one or more of these information gathering activities."
Techniques for automated indexing and information retrieval (IR) have been developed, tested, and refined over the past 40 years, and are well documented (see, for example, Agosti & Smeaton, 1996; BaezaYates & Ribeiro-Neto, 1999a; Frakes & Baeza-Yates, 1992; Korfhage, 1997; Salton, 1989; Witten, Moffat, & Bell, 1999). With the introduction of the Web, and the capability to index and retrieve via search engines, these techniques have been extended to a new environment. They have been adopted, altered, and in some Gases extended to include new methods. "In short, search engines are indispensable for searching the Web, they employ a variety of relatively advanced IR techniques, and there are some peculiar aspects of search engines that make searching the Web different than more conventional information retrieval" (Gordon & Pathak, 1999, p. 145). The environment for information retrieval an the World Wide Web differs from that of "conventional" information retrieval in a number of fundamental ways. The collection is very large and changes continuously, with pages being added, deleted, and altered. Wide variability between the size, structure, focus, quality, and usefulness of documents makes Web documents much more heterogeneous than a typical electronic document collection. The wide variety of document types includes images, video, audio, and scripts, as well as many different document languages. Duplication of documents and sites is common. Documents are interconnected through networks of hyperlinks. Because of the size and dynamic nature of the Web, preprocessing all documents requires considerable resources and is often not feasible, certainly not an the frequent basis required to ensure currency. Query length is usually much shorter than in other environments-only a few words-and user behavior differs from that in other environments. These differences make the Web a novel environment for information retrieval (Baeza-Yates & Ribeiro-Neto, 1999b; Bharat & Henzinger, 1998; Huang, 2000).
Goller, C.; Löning, J.; Will, T.; Wolff, W.: Automatic document classification : a thourough evaluation of various methods (2000) 0.00
```
0.004827458 = product of:
  0.038619664 = sum of:
    0.038619664 = weight(_text_:wide in 5480) [ClassicSimilarity], result of:
      0.038619664 = score(doc=5480,freq=2.0), product of:
        0.13148437 = queryWeight, product of:
          4.4307585 = idf(docFreq=1430, maxDocs=44218)
          0.029675366 = queryNorm
        0.29372054 = fieldWeight in 5480, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          4.4307585 = idf(docFreq=1430, maxDocs=44218)
          0.046875 = fieldNorm(doc=5480)
  0.125 = coord(1/8)
```
Abstract

(Automatic) document classification is generally defined as content-based assignment of one or more predefined categories to documents. Usually, machine learning, statistical pattern recognition, or neural network approaches are used to construct classifiers automatically. In this paper we thoroughly evaluate a wide variety of these methods on a document classification task for German text. We evaluate different feature construction and selection methods and various classifiers. Our main results are: (1) feature selection is necessary not only to reduce learning and classification time, but also to avoid overfitting (even for Support Vector Machines); (2) surprisingly, our morphological analysis does not improve classification quality compared to a letter 5-gram approach; (3) Support Vector Machines are significantly better than all other classification methods
Snajder, J.; Dalbelo Basic, B.D.; Tadic, M.: Automatic acquisition of inflectional lexica for morphological normalisation (2008) 0.00
```
0.003914421 = product of:
  0.031315368 = sum of:
    0.031315368 = product of:
      0.062630735 = sum of:
        0.062630735 = weight(_text_:mining in 2910) [ClassicSimilarity], result of:
          0.062630735 = score(doc=2910,freq=2.0), product of:
            0.16744171 = queryWeight, product of:
              5.642448 = idf(docFreq=425, maxDocs=44218)
              0.029675366 = queryNorm
            0.37404498 = fieldWeight in 2910, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              5.642448 = idf(docFreq=425, maxDocs=44218)
              0.046875 = fieldNorm(doc=2910)
      0.5 = coord(1/2)
  0.125 = coord(1/8)
```
Abstract

Due to natural language morphology, words can take on various morphological forms. Morphological normalisation - often used in information retrieval and text mining systems - conflates morphological variants of a word to a single representative form. In this paper, we describe an approach to lexicon-based inflectional normalisation. This approach is in between stemming and lemmatisation, and is suitable for morphological normalisation of inflectionally complex languages. To eliminate the immense effort required to compile the lexicon by hand, we focus on the problem of acquiring automatically an inflectional morphological lexicon from raw corpora. We propose a convenient and highly expressive morphology representation formalism on which the acquisition procedure is based. Our approach is applied to the morphologically complex Croatian language, but it should be equally applicable to other languages of similar morphological complexity. Experimental results show that our approach can be used to acquire a lexicon whose linguistic quality allows for rather good normalisation performance.

Hlava, M.M.K.: Automatic indexing : comparing rule-based and statistics-based indexing systems (2005) 0.00

0.003518027 = product of:
  0.028144216 = sum of:
    0.028144216 = product of:
      0.056288432 = sum of:
        0.056288432 = weight(_text_:22 in 6265) [ClassicSimilarity], result of:
          0.056288432 = score(doc=6265,freq=2.0), product of:
            0.103918076 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.029675366 = queryNorm
            0.5416616 = fieldWeight in 6265, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.109375 = fieldNorm(doc=6265)
      0.5 = coord(1/2)
  0.125 = coord(1/8)

Source: Information outlook. 9(2005) no.8, S.22-23

Souza, R.R.; Raghavan, K.S.: ¬A methodology for noun phrase-based automatic indexing (2006) 0.00
```
0.002618981 = product of:
  0.020951848 = sum of:
    0.020951848 = weight(_text_:web in 173) [ClassicSimilarity], result of:
      0.020951848 = score(doc=173,freq=2.0), product of:
        0.096845865 = queryWeight, product of:
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.029675366 = queryNorm
        0.21634221 = fieldWeight in 173, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.2635105 = idf(docFreq=4597, maxDocs=44218)
          0.046875 = fieldNorm(doc=173)
  0.125 = coord(1/8)
```
Abstract

The scholarly community is increasingly employing the Web both for publication of scholarly output and for locating and accessing relevant scholarly literature. Organization of this vast body of digital information assumes significance in this context. The sheer volume of digital information to be handled makes traditional indexing and knowledge representation strategies ineffective and impractical. It is, therefore, worth exploring new approaches. An approach being discussed considers the intrinsic semantics of texts of documents. Based on the hypothesis that noun phrases in a text are semantically rich in terms of their ability to represent the subject content of the document, this approach seeks to identify and extract noun phrases instead of single keywords, and use them as descriptors. This paper presents a methodology that has been developed for extracting noun phrases from Portuguese texts. The results of an experiment carried out to test the adequacy of the methodology are also presented.
Roberts, D.; Souter, C.: ¬The automation of controlled vocabulary subject indexing of medical journal articles (2000) 0.00
```
0.0024586683 = product of:
  0.019669347 = sum of:
    0.019669347 = weight(_text_:data in 711) [ClassicSimilarity], result of:
      0.019669347 = score(doc=711,freq=2.0), product of:
        0.093835 = queryWeight, product of:
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.029675366 = queryNorm
        0.2096163 = fieldWeight in 711, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.046875 = fieldNorm(doc=711)
  0.125 = coord(1/8)
```
Abstract

This article discusses the possibility of the automation of sophisticated subject indexing of medical journal articles. Approaches to subject descriptor assignment in information retrieval research are usually either based upon the manual descriptors in the database or generation of search parameters from the text of the article. The principles of the Medline indexing system are described, followed by a summary of a pilot project, based upon the Amed database. The results suggest that a more extended study, based upon Medline, should encompass various components: Extraction of 'concept strings' from titles and abstracts of records, based upon linguistic features characteristic of medical literature. Use of the Unified Medical Language System (UMLS) for identification of controlled vocabulary descriptors. Coordination of descriptors, utilising features of the Medline indexing system. The emphasis should be on system manipulation of data, based upon input, available resources and specifically designed rules.

Mongin, L.; Fu, Y.Y.; Mostafa, J.: Open Archives data Service prototype and automated subject indexing using D-Lib archive content as a testbed (2003) 0.00

0.0024586683 = product of:
  0.019669347 = sum of:
    0.019669347 = weight(_text_:data in 1167) [ClassicSimilarity], result of:
      0.019669347 = score(doc=1167,freq=2.0), product of:
        0.093835 = queryWeight, product of:
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.029675366 = queryNorm
        0.2096163 = fieldWeight in 1167, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.046875 = fieldNorm(doc=1167)
  0.125 = coord(1/8)

Medelyan, O.; Witten, I.H.: Domain-independent automatic keyphrase indexing with small training sets (2008) 0.00
```
0.0024586683 = product of:
  0.019669347 = sum of:
    0.019669347 = weight(_text_:data in 1871) [ClassicSimilarity], result of:
      0.019669347 = score(doc=1871,freq=2.0), product of:
        0.093835 = queryWeight, product of:
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.029675366 = queryNorm
        0.2096163 = fieldWeight in 1871, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.1620505 = idf(docFreq=5088, maxDocs=44218)
          0.046875 = fieldNorm(doc=1871)
  0.125 = coord(1/8)
```
Abstract

Keyphrases are widely used in both physical and digital libraries as a brief, but precise, summary of documents. They help organize material based on content, provide thematic access, represent search results, and assist with navigation. Manual assignment is expensive because trained human indexers must reach an understanding of the document and select appropriate descriptors according to defined cataloging rules. We propose a new method that enhances automatic keyphrase extraction by using semantic information about terms and phrases gleaned from a domain-specific thesaurus. The key advantage of the new approach is that it performs well with very little training data. We evaluate it on a large set of manually indexed documents in the domain of agriculture, compare its consistency with a group of six professional indexers, and explore its performance on smaller collections of documents in other domains and of French and Spanish documents.

Newman, D.J.; Block, S.: Probabilistic topic decomposition of an eighteenth-century American newspaper (2006) 0.00

0.0017590135 = product of:
  0.014072108 = sum of:
    0.014072108 = product of:
      0.028144216 = sum of:
        0.028144216 = weight(_text_:22 in 5291) [ClassicSimilarity], result of:
          0.028144216 = score(doc=5291,freq=2.0), product of:
            0.103918076 = queryWeight, product of:
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.029675366 = queryNorm
            0.2708308 = fieldWeight in 5291, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.5018296 = idf(docFreq=3622, maxDocs=44218)
              0.0546875 = fieldNorm(doc=5291)
      0.5 = coord(1/2)
  0.125 = coord(1/8)

Date: 22. 7.2006 17:32:00

Search (9 results, page 1 of 1)

Authors

Types

Themes