Search (52 results, page 1 of 3)

  • theme_ss:"Automatisches Indexieren"
  1. Riloff, E.: An empirical study of automated dictionary construction for information extraction in three domains (1996) 0.11
    0.108065836 = product of:
      0.27016458 = sum of:
        0.24772175 = weight(_text_:dictionaries in 6752) [ClassicSimilarity], result of:
          0.24772175 = score(doc=6752,freq=4.0), product of:
            0.2864761 = queryWeight, product of:
              6.9177637 = idf(docFreq=118, maxDocs=44218)
              0.041411664 = queryNorm
            0.86472046 = fieldWeight in 6752, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              6.9177637 = idf(docFreq=118, maxDocs=44218)
              0.0625 = fieldNorm(doc=6752)
        0.022442836 = product of:
          0.044885673 = sum of:
            0.044885673 = weight(_text_:22 in 6752) [ClassicSimilarity], result of:
              0.044885673 = score(doc=6752,freq=2.0), product of:
                0.1450166 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.041411664 = queryNorm
                0.30952093 = fieldWeight in 6752, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.0625 = fieldNorm(doc=6752)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Abstract
    AutoSlog is a system that addresses the knowledge-engineering bottleneck for information extraction. AutoSlog automatically creates domain-specific dictionaries for information extraction, given an appropriate training corpus. Describes experiments with AutoSlog in the terrorism, joint ventures and microelectronics domains. Compares the performance of AutoSlog across the 3 domains, discusses the lessons learned and presents results from 2 experiments which demonstrate that novice users can generate effective dictionaries using AutoSlog.
    Date
    6. 3.1997 16:22:15
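
The score breakdown shown with each result is Lucene explain output for the ClassicSimilarity model: each leaf weight is tf(freq) × idf × fieldNorm, each query weight is idf × queryNorm, and coord() scales by the fraction of query clauses matched. A minimal Python sketch (no Lucene dependency) that reproduces the numbers for result 1 from the values printed above:

```python
import math

# Values copied from the explain tree for doc 6752 above.
QUERY_NORM = 0.041411664
IDF_DICTIONARIES = 6.9177637   # idf(docFreq=118, maxDocs=44218)
IDF_22 = 3.5018296             # idf(docFreq=3622, maxDocs=44218)
FIELD_NORM = 0.0625

def term_score(freq, idf):
    """ClassicSimilarity per-term score: queryWeight * fieldWeight,
    with tf(freq) = sqrt(freq)."""
    query_weight = idf * QUERY_NORM
    field_weight = math.sqrt(freq) * idf * FIELD_NORM
    return query_weight * field_weight

s_dict = term_score(4.0, IDF_DICTIONARIES)   # -> 0.24772175
s_22 = term_score(2.0, IDF_22) * 0.5         # inner coord(1/2)
total = (s_dict + s_22) * 0.4                # outer coord(2/5)
print(round(total, 6))                       # -> 0.108066, the 0.11 shown
```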
  2. Damerau, F.J.: Generating and evaluating domain-oriented multi-word terms from texts (1993) 0.04
    0.035033148 = product of:
      0.17516573 = sum of:
        0.17516573 = weight(_text_:dictionaries in 5814) [ClassicSimilarity], result of:
          0.17516573 = score(doc=5814,freq=2.0), product of:
            0.2864761 = queryWeight, product of:
              6.9177637 = idf(docFreq=118, maxDocs=44218)
              0.041411664 = queryNorm
            0.6114497 = fieldWeight in 5814, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.9177637 = idf(docFreq=118, maxDocs=44218)
              0.0625 = fieldNorm(doc=5814)
      0.2 = coord(1/5)
    
    Abstract
    Examines techniques for automatically generating domain vocabularies from large text collections, focusing on the problem of generating multi-word vocabulary terms (specifically pairs). Discusses statistical issues associated with word co-occurrences likely to be of use in a natural language interface. Since substantial experimentation with subjects using a working query system is absent, all evaluation is necessarily subjective; to provide a more objective evaluation of the selection procedures, uses pre-existing dictionaries as a surrogate for experimentation, treating them as indicators of domain relevance.
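
For illustration, a small sketch of pair-oriented co-occurrence scoring in the spirit of the abstract above; the PMI statistic and the helper names are my own choices for the sketch, not necessarily the association measures Damerau evaluates:

```python
import math
from collections import Counter
from itertools import tee

def word_pairs(tokens):
    """Yield adjacent word pairs (bigrams) from a token stream."""
    a, b = tee(tokens)
    next(b, None)
    return zip(a, b)

def rank_pairs(tokens, min_freq=2):
    """Rank adjacent word pairs by pointwise mutual information."""
    n = len(tokens)
    uni = Counter(tokens)
    bi = Counter(word_pairs(tokens))
    scored = {}
    for (w1, w2), f in bi.items():
        if f < min_freq:
            continue
        pmi = math.log((f / (n - 1)) / ((uni[w1] / n) * (uni[w2] / n)))
        scored[(w1, w2)] = pmi
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```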
  3. Gödert, W.: Detecting multiword phrases in mathematical text corpora (2012) 0.04
    0.035033148 = product of:
      0.17516573 = sum of:
        0.17516573 = weight(_text_:dictionaries in 466) [ClassicSimilarity], result of:
          0.17516573 = score(doc=466,freq=2.0), product of:
            0.2864761 = queryWeight, product of:
              6.9177637 = idf(docFreq=118, maxDocs=44218)
              0.041411664 = queryNorm
            0.6114497 = fieldWeight in 466, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.9177637 = idf(docFreq=118, maxDocs=44218)
              0.0625 = fieldNorm(doc=466)
      0.2 = coord(1/5)
    
    Abstract
    We present an approach for detecting multiword phrases in mathematical text corpora. The method is based on characteristic features of mathematical terminology. It makes use of a software tool named Lingo, which identifies words by means of previously defined dictionaries for specific word classes such as adjectives, personal names or nouns. The detection of multiword groups is done algorithmically. Possible advantages of the method for indexing and information retrieval, and conclusions for applying dictionary-based methods of automatic indexing instead of stemming procedures, are discussed.
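
As a toy illustration of dictionary-based multiword detection (the mini-dictionaries below are hypothetical; Lingo itself works from curated word-class dictionaries and its own configuration):

```python
# Hypothetical mini-dictionaries standing in for curated word-class lists.
ADJECTIVES = {"linear", "partial", "differential"}
NOUNS = {"equation", "operator", "space"}

def detect_multiwords(tokens):
    """Emit adjective+...+noun runs as candidate multiword phrases."""
    phrases, run = [], []
    for tok in tokens:
        w = tok.lower()
        if w in ADJECTIVES:
            run.append(tok)
        elif w in NOUNS and run:
            phrases.append(" ".join(run + [tok]))
            run = []
        else:
            run = []
    return phrases

print(detect_multiwords("a linear partial differential equation".split()))
# -> ['linear partial differential equation']
```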
  4. Galvez, C.; Moya-Anegón, F. de: An evaluation of conflation accuracy using finite-state transducers (2006) 0.03
    0.02627486 = product of:
      0.1313743 = sum of:
        0.1313743 = weight(_text_:dictionaries in 5599) [ClassicSimilarity], result of:
          0.1313743 = score(doc=5599,freq=2.0), product of:
            0.2864761 = queryWeight, product of:
              6.9177637 = idf(docFreq=118, maxDocs=44218)
              0.041411664 = queryNorm
            0.4585873 = fieldWeight in 5599, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.9177637 = idf(docFreq=118, maxDocs=44218)
              0.046875 = fieldNorm(doc=5599)
      0.2 = coord(1/5)
    
    Abstract
    Purpose - To evaluate the accuracy of conflation methods based on finite-state transducers (FSTs). Design/methodology/approach - Incorrectly lemmatized and stemmed forms may lead to the retrieval of inappropriate documents. Experimental studies to date have focused on retrieval performance, but very few on conflation performance. The process of normalization we used involved a linguistic toolbox that allowed us to construct, through graphic interfaces, electronic dictionaries represented internally by FSTs. The lexical resources developed were applied to a Spanish test corpus for merging term variants in canonical lemmatized forms. Conflation performance was evaluated in terms of an adaptation of recall and precision measures, based on accuracy and coverage, not actual retrieval. The results were compared with those obtained using a Spanish version of the Porter algorithm. Findings - The conclusion is that the main strength of lemmatization is its accuracy, whereas its main limitation is the underanalysis of variant forms. Originality/value - The report outlines the potential of transducers in their application to normalization processes.
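
One common pairwise adaptation of precision and recall for conflation evaluation, sketched below; the paper defines its own accuracy- and coverage-based measures, so treat this only as an analogue:

```python
from itertools import combinations

def conflation_pr(predicted, gold):
    """Pairwise precision/recall of a conflation mapping.
    predicted/gold map each word variant to a class label (stem or lemma)."""
    words = sorted(gold)
    pred_pairs = {(a, b) for a, b in combinations(words, 2)
                  if predicted[a] == predicted[b]}
    gold_pairs = {(a, b) for a, b in combinations(words, 2)
                  if gold[a] == gold[b]}
    tp = len(pred_pairs & gold_pairs)
    precision = tp / len(pred_pairs) if pred_pairs else 1.0
    recall = tp / len(gold_pairs) if gold_pairs else 1.0
    return precision, recall
```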
  5. Cui, H.; Boufford, D.; Selden, P.: Semantic annotation of biosystematics literature without training examples (2010) 0.03
    0.02627486 = product of:
      0.1313743 = sum of:
        0.1313743 = weight(_text_:dictionaries in 3422) [ClassicSimilarity], result of:
          0.1313743 = score(doc=3422,freq=2.0), product of:
            0.2864761 = queryWeight, product of:
              6.9177637 = idf(docFreq=118, maxDocs=44218)
              0.041411664 = queryNorm
            0.4585873 = fieldWeight in 3422, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.9177637 = idf(docFreq=118, maxDocs=44218)
              0.046875 = fieldNorm(doc=3422)
      0.2 = coord(1/5)
    
    Abstract
    This article presents an unsupervised algorithm for semantic annotation of morphological descriptions of whole organisms. The algorithm is able to annotate plain text descriptions with high accuracy at the clause level by exploiting the corpus itself. In other words, the algorithm does not need lexicons, syntactic parsers, training examples, or annotation templates. The evaluation on two real-life description collections in botany and paleontology shows that the algorithm has the following desirable features: (a) reduces/eliminates manual labor required to compile dictionaries and prepare source documents; (b) improves annotation coverage: the algorithm annotates what appears in documents and is not limited by predefined and often incomplete templates; (c) learns clean and reusable concepts: the algorithm learns organ names and character states that can be used to construct reusable domain lexicons, as opposed to collection-dependent patterns whose applicability is often limited to a particular collection; (d) insensitive to collection size; and (e) runs in linear time with respect to the number of clauses to be annotated.
  6. Witschel, H.F.: Terminology extraction and automatic indexing : comparison and qualitative evaluation of methods (2005) 0.02
    0.021895718 = product of:
      0.109478585 = sum of:
        0.109478585 = weight(_text_:dictionaries in 1842) [ClassicSimilarity], result of:
          0.109478585 = score(doc=1842,freq=2.0), product of:
            0.2864761 = queryWeight, product of:
              6.9177637 = idf(docFreq=118, maxDocs=44218)
              0.041411664 = queryNorm
            0.38215607 = fieldWeight in 1842, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              6.9177637 = idf(docFreq=118, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1842)
      0.2 = coord(1/5)
    
    Abstract
    Many terminology engineering processes involve the task of automatic terminology extraction: before the terminology of a given domain can be modelled, organised or standardised, important concepts (or terms) of this domain have to be identified and fed into terminological databases. These serve in further steps as a starting point for compiling dictionaries, thesauri or maybe even terminological ontologies for the domain. For the extraction of the initial concepts, extraction methods are needed that operate on specialised language texts. On the other hand, many machine learning or information retrieval applications require automatic indexing techniques. In machine learning applications concerned with the automatic clustering or classification of texts, feature vectors are often needed that describe the contents of a given text briefly but meaningfully. These feature vectors typically consist of a fairly small set of index terms together with weights indicating their importance. Short but meaningful descriptions of document contents as provided by good index terms are also useful to humans: some knowledge management applications (e.g. topic maps) use them as a set of basic concepts (topics). The author believes that the tasks of terminology extraction and automatic indexing have much in common and can thus benefit from the same set of basic algorithms. It is the goal of this paper to outline some methods that may be used in both contexts, but also to identify the discriminating factors between the two tasks that call for varying parameters or applying different techniques. The discussion of these methods is based on statistical, syntactical and especially morphological properties of (index) terms. The paper concludes with the presentation of some qualitative and quantitative results comparing statistical and morphological methods.
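
As a sketch of one statistical measure such comparisons typically include, candidate terms can be scored by their relative-frequency ratio between a domain corpus and a general reference corpus (my choice of measure for illustration, not necessarily the one the paper uses):

```python
from collections import Counter

def termhood(domain_tokens, reference_tokens, min_freq=3):
    """Score candidate terms by the ratio of their relative frequency
    in a domain corpus to that in a general reference corpus."""
    d, r = Counter(domain_tokens), Counter(reference_tokens)
    nd, nr = len(domain_tokens), len(reference_tokens)
    scores = {}
    for w, f in d.items():
        if f < min_freq:
            continue
        rel_domain = f / nd
        rel_reference = (r[w] + 1) / (nr + 1)   # add-one smoothing
        scores[w] = rel_domain / rel_reference
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```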
  7. Greiner-Petter, A.; Schubotz, M.; Cohl, H.S.; Gipp, B.: Semantic preserving bijective mappings for expressions involving special functions between computer algebra systems and document preparation systems (2019) 0.02
    0.016835874 = product of:
      0.08417937 = sum of:
        0.08417937 = sum of:
          0.06173654 = weight(_text_:german in 5499) [ClassicSimilarity], result of:
            0.06173654 = score(doc=5499,freq=2.0), product of:
              0.24051933 = queryWeight, product of:
                5.808009 = idf(docFreq=360, maxDocs=44218)
                0.041411664 = queryNorm
              0.25668016 = fieldWeight in 5499, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.808009 = idf(docFreq=360, maxDocs=44218)
                0.03125 = fieldNorm(doc=5499)
          0.022442836 = weight(_text_:22 in 5499) [ClassicSimilarity], result of:
            0.022442836 = score(doc=5499,freq=2.0), product of:
              0.1450166 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.041411664 = queryNorm
              0.15476047 = fieldWeight in 5499, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.03125 = fieldNorm(doc=5499)
      0.2 = coord(1/5)
    
    Date
    20. 1.2015 18:30:22
    Footnote
    Contribution in a special issue: Information Science in the German-speaking Countries.
  8. Junger, U.: Can indexing be automated? : the example of the Deutsche Nationalbibliothek (2012) 0.02
    0.015279015 = product of:
      0.07639507 = sum of:
        0.07639507 = product of:
          0.15279014 = sum of:
            0.15279014 = weight(_text_:german in 1717) [ClassicSimilarity], result of:
              0.15279014 = score(doc=1717,freq=4.0), product of:
                0.24051933 = queryWeight, product of:
                  5.808009 = idf(docFreq=360, maxDocs=44218)
                  0.041411664 = queryNorm
                0.635251 = fieldWeight in 1717, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  5.808009 = idf(docFreq=360, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=1717)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Abstract
    The German subject headings authority file (Schlagwortnormdatei, SWD) provides a broad controlled vocabulary for indexing documents of all subjects. Traditionally it has been used for intellectual subject cataloguing, primarily of books. The Deutsche Nationalbibliothek (DNB, German National Library) has been working on developing and implementing procedures for the automated assignment of subject headings to online publications. This project, its results and problems are sketched in the paper.
  9. Junger, U.: Can indexing be automated? : the example of the Deutsche Nationalbibliothek (2014) 0.02
    0.015279015 = product of:
      0.07639507 = sum of:
        0.07639507 = product of:
          0.15279014 = sum of:
            0.15279014 = weight(_text_:german in 1969) [ClassicSimilarity], result of:
              0.15279014 = score(doc=1969,freq=4.0), product of:
                0.24051933 = queryWeight, product of:
                  5.808009 = idf(docFreq=360, maxDocs=44218)
                  0.041411664 = queryNorm
                0.635251 = fieldWeight in 1969, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  5.808009 = idf(docFreq=360, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=1969)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Abstract
    The German Integrated Authority File (Gemeinsame Normdatei, GND) provides a broad controlled vocabulary for indexing documents on all subjects. Traditionally it has been used for intellectual subject cataloging, primarily of books. The Deutsche Nationalbibliothek (DNB, German National Library) has been working on developing and implementing procedures for the automated assignment of subject headings to online publications. This project, its results, and problems are outlined in this article.
  10. Siebenkäs, A.; Markscheffel, B.: Conception of a workflow for the semi-automatic construction of a thesaurus for the German printing industry (2015) 0.02
    0.015279015 = product of:
      0.07639507 = sum of:
        0.07639507 = product of:
          0.15279014 = sum of:
            0.15279014 = weight(_text_:german in 2091) [ClassicSimilarity], result of:
              0.15279014 = score(doc=2091,freq=4.0), product of:
                0.24051933 = queryWeight, product of:
                  5.808009 = idf(docFreq=360, maxDocs=44218)
                  0.041411664 = queryNorm
                0.635251 = fieldWeight in 2091, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  5.808009 = idf(docFreq=360, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=2091)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Abstract
    During the BMWi-funded project "Print-IT", the need for a thesaurus-based, uniform and consistent vocabulary for the German printing industry became evident. In this paper we introduce a semi-automatic construction approach for such a thesaurus and present a workflow which supports users in generating thesaurus-typical information structures from relevant digitized resources with the help of common IT tools.
  11. Keller, A.: Attitudes among German- and English-speaking librarians toward (automatic) subject indexing (2015) 0.02
    0.015279015 = product of:
      0.07639507 = sum of:
        0.07639507 = product of:
          0.15279014 = sum of:
            0.15279014 = weight(_text_:german in 2629) [ClassicSimilarity], result of:
              0.15279014 = score(doc=2629,freq=4.0), product of:
                0.24051933 = queryWeight, product of:
                  5.808009 = idf(docFreq=360, maxDocs=44218)
                  0.041411664 = queryNorm
                0.635251 = fieldWeight in 2629, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  5.808009 = idf(docFreq=360, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=2629)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Abstract
    The survey described in this article investigates the attitudes of librarians in German- and English-speaking countries toward subject indexing in general, and automatic subject indexing in particular. The results show great similarity between attitudes in both language areas. Respondents agree that the current quality standards should be upheld and dismiss critical voices claiming that subject indexing has lost relevance. With regard to automatic subject indexing, respondents demonstrate considerable skepticism, both about the likely timeframe and about the expected quality of such systems. The author considers how this low acceptance poses a difficulty for those involved in change management.
  12. Munkelt, J.; Schaer, P.; Lepsky, K.: Towards an IR test collection for the German National Library (2018) 0.01
    0.013096297 = product of:
      0.065481484 = sum of:
        0.065481484 = product of:
          0.13096297 = sum of:
            0.13096297 = weight(_text_:german in 4311) [ClassicSimilarity], result of:
              0.13096297 = score(doc=4311,freq=4.0), product of:
                0.24051933 = queryWeight, product of:
                  5.808009 = idf(docFreq=360, maxDocs=44218)
                  0.041411664 = queryNorm
                0.5445008 = fieldWeight in 4311, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  5.808009 = idf(docFreq=360, maxDocs=44218)
                  0.046875 = fieldNorm(doc=4311)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Abstract
    Automatic content indexing is one of the innovations that are increasingly changing the way libraries work. In theory, it promises a cataloguing service that would hardly be possible with humans in terms of speed, quantity and maybe quality. The German National Library (DNB) has also recognised this potential and is increasingly relying on the automatic indexing of its catalogue content. The DNB took a major step in this direction in 2017, which was announced in two papers. The announcement was rather restrained, but the content of the papers is all the more explosive for the library community: since September 2017, the DNB has discontinued the intellectual indexing of series B and H and has switched to an automatic process for these series. The subject indexing of online publications (series O) has been purely automatic since 2010; from September 2017, monographs and periodicals published outside the publishing industry and university publications are no longer indexed by people. This raises the question: what is the quality of the automatic indexing compared to the manual work, or, in other words, to what degree can automatic indexing replace people without a significant drop in quality?
  13. Stegentritt, E.: Evaluationsresultate des mehrsprachigen Suchsystems CANAL/LS (1998) 0.01
    0.012347308 = product of:
      0.06173654 = sum of:
        0.06173654 = product of:
          0.12347308 = sum of:
            0.12347308 = weight(_text_:german in 7216) [ClassicSimilarity], result of:
              0.12347308 = score(doc=7216,freq=2.0), product of:
                0.24051933 = queryWeight, product of:
                  5.808009 = idf(docFreq=360, maxDocs=44218)
                  0.041411664 = queryNorm
                0.5133603 = fieldWeight in 7216, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.808009 = idf(docFreq=360, maxDocs=44218)
                  0.0625 = fieldNorm(doc=7216)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Abstract
    The search system CANAL/LS simplifies the searching of library catalogues by analyzing search questions linguistically and translating them if required. The linguistic analysis reduces the search question words to their basic forms so that they can be compared with basic title forms. Consequently, all variants of words and parts of German compounds can be found. Presents the results of an analysis of search questions in a catalogue of 45,000 titles in the field of psychology.
  14. Krüger, C.: Evaluation des WWW-Suchdienstes GERHARD unter besonderer Beachtung automatischer Indexierung (1999) 0.01
    0.0109135825 = product of:
      0.05456791 = sum of:
        0.05456791 = product of:
          0.10913582 = sum of:
            0.10913582 = weight(_text_:german in 1777) [ClassicSimilarity], result of:
              0.10913582 = score(doc=1777,freq=4.0), product of:
                0.24051933 = queryWeight, product of:
                  5.808009 = idf(docFreq=360, maxDocs=44218)
                  0.041411664 = queryNorm
                0.45375073 = fieldWeight in 1777, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  5.808009 = idf(docFreq=360, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1777)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Abstract
    This thesis contains a description and evaluation of the WWW search service GERHARD (German Harvest Automated Retrieval and Directory). GERHARD is a search and navigation system for the German World Wide Web which collects only scientifically relevant documents and classifies them automatically, on the basis of computational-linguistic and statistical methods, with the help of a library classification system. The DFG project GERHARD was an attempt to develop an alternative to conventional methods of indexing Internet resources, using a World Wide Web service based on an automatic classification procedure. GERHARD is the only directory of Internet resources in the German-speaking area whose creation and updating take place fully automatically (i.e. by machine). GERHARD restricts itself to documents on scientific WWW servers. The basic idea was to replace cost-intensive intellectual indexing and classification of Internet pages with computational-linguistic and statistical methods, in order to map the recorded Internet resources automatically onto the vocabulary of a library classification system. GERHARD stands for German Harvest Automated Retrieval and Directory. The WWW address (URL) of GERHARD is: http://www.gerhard.de. This thesis describes the service, with particular emphasis on the underlying indexing and classification system, and then uses a small retrieval test to assess the effectiveness of GERHARD.
  15. Cohen, J.D.: Highlights: language- and domain-independent automatic indexing terms for abstracting (1995) 0.01
    0.010803895 = product of:
      0.054019473 = sum of:
        0.054019473 = product of:
          0.10803895 = sum of:
            0.10803895 = weight(_text_:german in 1793) [ClassicSimilarity], result of:
              0.10803895 = score(doc=1793,freq=2.0), product of:
                0.24051933 = queryWeight, product of:
                  5.808009 = idf(docFreq=360, maxDocs=44218)
                  0.041411664 = queryNorm
                0.4491903 = fieldWeight in 1793, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.808009 = idf(docFreq=360, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=1793)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Abstract
    Presents a model of drawing index terms from text. The approach uses no stop list, stemmer, or other language- and domain-specific component, allowing operation in any language or domain with only trivial modification. The method uses n-gram counts, achieving a function similar to, but more general than, a stemmer. The generated index terms, called 'highlights', are suitable for identifying the topic for perusal and selection. An extension is also described and demonstrated which selects index terms to represent a subset of documents, distinguishing them from the corpus. Presents some experimental results, showing operation in English, Spanish, German, Georgian, Russian and Japanese.
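
A rough sketch of the idea of language-independent, n-gram-based term selection, with no stop list or stemmer; the scoring function below is a plausible stand-in, not Cohen's exact formulation:

```python
from collections import Counter

def char_ngrams(word, n=4):
    """Character n-grams of a word, padded with word-boundary spaces."""
    padded = f" {word} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def highlights(doc_tokens, corpus_counts, corpus_total, top_k=5, n=4):
    """Rank words by how over-represented their character n-grams are
    in this document relative to the whole corpus."""
    doc_counts = Counter(g for w in doc_tokens for g in char_ngrams(w, n))
    doc_total = sum(doc_counts.values())

    def score(word):
        grams = char_ngrams(word, n)
        return sum(doc_counts[g] / doc_total
                   - corpus_counts.get(g, 0) / corpus_total
                   for g in grams) / len(grams)

    return sorted(set(doc_tokens), key=score, reverse=True)[:top_k]
```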
  16. Lichtenstein, A.; Plank, M.; Neumann, J.: TIB's portal for audiovisual media : combining manual and automatic indexing (2014) 0.01
    0.010803895 = product of:
      0.054019473 = sum of:
        0.054019473 = product of:
          0.10803895 = sum of:
            0.10803895 = weight(_text_:german in 1981) [ClassicSimilarity], result of:
              0.10803895 = score(doc=1981,freq=2.0), product of:
                0.24051933 = queryWeight, product of:
                  5.808009 = idf(docFreq=360, maxDocs=44218)
                  0.041411664 = queryNorm
                0.4491903 = fieldWeight in 1981, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.808009 = idf(docFreq=360, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=1981)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Abstract
    The German National Library of Science and Technology (TIB) developed a Web-based platform for audiovisual media. The audiovisual portal optimizes access to scientific videos such as computer animations and lecture and conference recordings. TIB's AV-Portal combines traditional cataloging and automatic indexing of audiovisual media. The article describes metadata standards for audiovisual media and introduces the TIB's metadata schema in comparison to other metadata standards for non-textual materials. Additionally, we give an overview of multimedia retrieval technologies used for the Portal and present the AV-Portal in detail as well as the additional value for libraries and their users.
  17. Goller, C.; Löning, J.; Will, T.; Wolff, W.: Automatic document classification : a thorough evaluation of various methods (2000) 0.01
    0.009260482 = product of:
      0.046302408 = sum of:
        0.046302408 = product of:
          0.092604816 = sum of:
            0.092604816 = weight(_text_:german in 5480) [ClassicSimilarity], result of:
              0.092604816 = score(doc=5480,freq=2.0), product of:
                0.24051933 = queryWeight, product of:
                  5.808009 = idf(docFreq=360, maxDocs=44218)
                  0.041411664 = queryNorm
                0.38502026 = fieldWeight in 5480, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.808009 = idf(docFreq=360, maxDocs=44218)
                  0.046875 = fieldNorm(doc=5480)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Abstract
    (Automatic) document classification is generally defined as content-based assignment of one or more predefined categories to documents. Usually, machine learning, statistical pattern recognition, or neural network approaches are used to construct classifiers automatically. In this paper we thoroughly evaluate a wide variety of these methods on a document classification task for German text. We evaluate different feature construction and selection methods and various classifiers. Our main results are: (1) feature selection is necessary not only to reduce learning and classification time, but also to avoid overfitting (even for Support Vector Machines); (2) surprisingly, our morphological analysis does not improve classification quality compared to a letter 5-gram approach; (3) Support Vector Machines are significantly better than all other classification methods
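
A minimal reconstruction of the letter-5-gram-plus-SVM setup using scikit-learn (an assumption on my part; the paper predates this library, and max_features is only a crude stand-in for the feature-selection methods it evaluates):

```python
# Letter 5-gram features + linear SVM, mirroring findings (2) and (3).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(5, 5),
                    max_features=20000),   # crude feature selection
    LinearSVC(),
)
train_texts = ["Der Vertrag wurde gekündigt", "Das Gerät misst den Druck"]
train_labels = ["law", "engineering"]
clf.fit(train_texts, train_labels)
print(clf.predict(["Die Kündigungsfrist beträgt drei Monate"]))
```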
  18. Voorhees, E.M.: Implementing agglomerative hierarchic clustering algorithms for use in document retrieval (1986) 0.01
    0.008977135 = product of:
      0.044885673 = sum of:
        0.044885673 = product of:
          0.089771345 = sum of:
            0.089771345 = weight(_text_:22 in 402) [ClassicSimilarity], result of:
              0.089771345 = score(doc=402,freq=2.0), product of:
                0.1450166 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.041411664 = queryNorm
                0.61904186 = fieldWeight in 402, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.125 = fieldNorm(doc=402)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Source
    Information processing and management. 22(1986) no.6, S.465-476
  19. Salton, G.: Automatic processing of foreign language documents (1985) 0.01
    0.008730865 = product of:
      0.043654326 = sum of:
        0.043654326 = product of:
          0.08730865 = sum of:
            0.08730865 = weight(_text_:german in 3650) [ClassicSimilarity], result of:
              0.08730865 = score(doc=3650,freq=4.0), product of:
                0.24051933 = queryWeight, product of:
                  5.808009 = idf(docFreq=360, maxDocs=44218)
                  0.041411664 = queryNorm
                0.36300057 = fieldWeight in 3650, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  5.808009 = idf(docFreq=360, maxDocs=44218)
                  0.03125 = fieldNorm(doc=3650)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Abstract
    The attempt to computerize a process, such as indexing, abstracting, classifying, or retrieving information, begins with an analysis of the process into its intellectual and nonintellectual components. That part of the process which is amenable to computerization is mechanical or algorithmic. What is not is intellectual or creative and requires human intervention. Gerard Salton has been an innovator, experimenter, and promoter in the area of mechanized information systems since the early 1960s. He has been particularly ingenious at analyzing the process of information retrieval into its algorithmic components. He received a doctorate in applied mathematics from Harvard University before moving to the computer science department at Cornell, where he developed a prototype automatic retrieval system called SMART. Working with this system he and his students contributed for over a decade to our theoretical understanding of the retrieval process. On a more practical level, they have contributed design criteria for operating retrieval systems. The following selection presents one of the early descriptions of the SMART system; it is valuable as it shows the direction automatic retrieval methods were to take beyond simple word-matching techniques. These include various word normalization techniques to improve recall, for instance, the separation of words into stems and affixes; the correlation and clustering, using statistical association measures, of related terms; and the identification, using a concept thesaurus, of synonymous, broader, narrower, and sibling terms. They include, as well, techniques, both linguistic and statistical, to deal with the thorny problem of how to automatically extract from texts index terms that consist of more than one word. They include weighting techniques and various document-request matching algorithms. Significant among the latter are those which produce a retrieval output of citations ranked in relevance order. During the 1970s, Salton and his students went on to further refine these various techniques, particularly the weighting and statistical association measures. Many of their early innovations seem commonplace today. Some of their later techniques are still ahead of their time and await technological developments for implementation. The particular focus of the selection that follows is on the evaluation of a particular component of the SMART system, a multilingual thesaurus. By mapping English language expressions and their German equivalents to a common concept number, the thesaurus permitted the automatic processing of German language documents against English language queries and vice versa. The results of the evaluation, as it turned out, were somewhat inconclusive. However, this SMART experiment suggested in a bold and optimistic way how one might proceed to answer such complex questions as: What is meant by retrieval-language compatibility? How is it to be achieved, and how evaluated?
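
A toy version of the bilingual concept thesaurus described above: English and German expressions map to shared concept numbers, so queries and documents in either language meet in concept space (all entries hypothetical):

```python
# Hypothetical bilingual thesaurus entries: expression -> concept number.
THESAURUS = {
    "information retrieval": 101, "informationswiederauffindung": 101,
    "indexing": 102, "indexierung": 102,
    "thesaurus": 103,
}

def concept_vector(text):
    """Map a text to the set of concept numbers its expressions trigger."""
    t = text.lower()
    return {c for term, c in THESAURUS.items() if term in t}

query = concept_vector("automatic indexing for information retrieval")
doc = concept_vector("Automatische Indexierung und Informationswiederauffindung")
print(query & doc)   # shared concepts -> {101, 102}
```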
  20. Fuhr, N.; Niewelt, B.: Ein Retrievaltest mit automatisch indexierten Dokumenten (1984) 0.01
    0.0078549925 = product of:
      0.03927496 = sum of:
        0.03927496 = product of:
          0.07854992 = sum of:
            0.07854992 = weight(_text_:22 in 262) [ClassicSimilarity], result of:
              0.07854992 = score(doc=262,freq=2.0), product of:
                0.1450166 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.041411664 = queryNorm
                0.5416616 = fieldWeight in 262, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.109375 = fieldNorm(doc=262)
          0.5 = coord(1/2)
      0.2 = coord(1/5)
    
    Date
    20.10.2000 12:22:23

Languages

  • e 33
  • d 18
  • ru 1

Types

  • a 46
  • el 4
  • x 4
  • m 1