Search (50 results, page 1 of 3)

  • year_i:[2000 TO 2010}
  • theme_ss:"Computerlinguistik"
  1. Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.36
    0.36031273 = product of:
      0.6305472 = sum of:
        0.047687992 = product of:
          0.14306398 = sum of:
            0.14306398 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
              0.14306398 = score(doc=562,freq=2.0), product of:
                0.25455406 = queryWeight, product of:
                  8.478011 = idf(docFreq=24, maxDocs=44218)
                  0.03002521 = queryNorm
                0.56201804 = fieldWeight in 562, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  8.478011 = idf(docFreq=24, maxDocs=44218)
                  0.046875 = fieldNorm(doc=562)
          0.33333334 = coord(1/3)
        0.14306398 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
          0.14306398 = score(doc=562,freq=2.0), product of:
            0.25455406 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.03002521 = queryNorm
            0.56201804 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
        0.03496567 = weight(_text_:classification in 562) [ClassicSimilarity], result of:
          0.03496567 = score(doc=562,freq=6.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.3656675 = fieldWeight in 562, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
        0.07153199 = product of:
          0.14306398 = sum of:
            0.14306398 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
              0.14306398 = score(doc=562,freq=2.0), product of:
                0.25455406 = queryWeight, product of:
                  8.478011 = idf(docFreq=24, maxDocs=44218)
                  0.03002521 = queryNorm
                0.56201804 = fieldWeight in 562, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  8.478011 = idf(docFreq=24, maxDocs=44218)
                  0.046875 = fieldNorm(doc=562)
          0.5 = coord(1/2)
        0.14306398 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
          0.14306398 = score(doc=562,freq=2.0), product of:
            0.25455406 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.03002521 = queryNorm
            0.56201804 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
        0.03496567 = weight(_text_:classification in 562) [ClassicSimilarity], result of:
          0.03496567 = score(doc=562,freq=6.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.3656675 = fieldWeight in 562, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
        0.14306398 = weight(_text_:2f in 562) [ClassicSimilarity], result of:
          0.14306398 = score(doc=562,freq=2.0), product of:
            0.25455406 = queryWeight, product of:
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.03002521 = queryNorm
            0.56201804 = fieldWeight in 562, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              8.478011 = idf(docFreq=24, maxDocs=44218)
              0.046875 = fieldNorm(doc=562)
        0.0122040035 = product of:
          0.024408007 = sum of:
            0.024408007 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
              0.024408007 = score(doc=562,freq=2.0), product of:
                0.10514317 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.03002521 = queryNorm
                0.23214069 = fieldWeight in 562, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046875 = fieldNorm(doc=562)
          0.5 = coord(1/2)
      0.5714286 = coord(8/14)
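
    The indented tree above is Lucene's "explain" debug output for its classic TF-IDF similarity: each weight(...) leaf multiplies a queryWeight (idf x queryNorm) by a fieldWeight (tf x idf x fieldNorm), and coord(8/14) scales the summed clause scores by the fraction of query clauses that matched. A minimal Python sketch reproducing the 0.14306398 leaves, with every constant read directly off the tree above:

      import math

      def classic_leaf_score(freq, doc_freq, max_docs, query_norm, field_norm):
          # Lucene ClassicSimilarity: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))
          tf = math.sqrt(freq)                               # 1.4142135 for freq=2.0
          idf = 1.0 + math.log(max_docs / (doc_freq + 1.0))  # 8.478011 for docFreq=24
          query_weight = idf * query_norm                    # 0.25455406
          field_weight = tf * idf * field_norm               # 0.56201804
          return query_weight * field_weight                 # 0.14306398

      print(classic_leaf_score(2.0, 24, 44218, 0.03002521, 0.046875))
      print(classic_leaf_score(6.0, 4974, 44218, 0.03002521, 0.046875))  # 0.03496567

    The same function also reproduces the classification leaves (freq=6.0, docFreq=4974) as 0.03496567.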
    
    Abstract
    Document representations for text classification are typically based on the classical Bag-Of-Words paradigm. This approach comes with deficiencies that motivate the integration of features on a higher semantic level than single words. In this paper we propose an enhancement of the classical document representation through concepts extracted from background knowledge. Boosting is used for actual classification. Experimental evaluations on two well known text corpora support our approach through consistent improvement of the results.
    Content
    Cf.: http://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CEAQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.91.4940%26rep%3Drep1%26type%3Dpdf&ei=dOXrUMeIDYHDtQahsIGACg&usg=AFQjCNHFWVh6gNPvnOrOS9R3rkrXCNVD-A&sig2=5I2F5evRfMnsttSgFF9g7Q&bvm=bv.1357316858,d.Yms.
    Date
    8. 1.2013 10:22:32
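    The abstract above pairs boosting with weak learners over term and concept features. The concept extraction from background knowledge is beyond this record; the sketch below shows only the boosting-over-terms baseline, with a toy corpus and labels invented for illustration (scikit-learn's default weak learner for AdaBoost is a depth-1 decision stump, i.e. a test on a single term):

      from sklearn.ensemble import AdaBoostClassifier
      from sklearn.feature_extraction.text import CountVectorizer

      docs = ["stocks fell sharply today", "the team won the match",
              "markets rallied on earnings", "the striker scored twice"]  # invented
      labels = ["finance", "sports", "finance", "sports"]                 # invented

      vec = CountVectorizer().fit(docs)
      X = vec.transform(docs)                  # bag-of-words term features
      clf = AdaBoostClassifier(n_estimators=50).fit(X, labels)
      print(clf.predict(vec.transform(["the team scored a goal"])))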
  2. Argamon, S.; Whitelaw, C.; Chase, P.; Hota, S.R.; Garg, N.; Levitan, S.: Stylistic text classification using functional lexical features (2007) 0.02
    0.021394843 = product of:
      0.0998426 = sum of:
        0.03496567 = weight(_text_:classification in 280) [ClassicSimilarity], result of:
          0.03496567 = score(doc=280,freq=6.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.3656675 = fieldWeight in 280, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=280)
        0.03496567 = weight(_text_:classification in 280) [ClassicSimilarity], result of:
          0.03496567 = score(doc=280,freq=6.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.3656675 = fieldWeight in 280, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=280)
        0.029911257 = product of:
          0.059822515 = sum of:
            0.059822515 = weight(_text_:texts in 280) [ClassicSimilarity], result of:
              0.059822515 = score(doc=280,freq=2.0), product of:
                0.16460659 = queryWeight, product of:
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.03002521 = queryNorm
                0.36342722 = fieldWeight in 280, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.046875 = fieldNorm(doc=280)
          0.5 = coord(1/2)
      0.21428572 = coord(3/14)
    
    Abstract
    Most text analysis and retrieval work to date has focused on the topic of a text, that is, what it is about. However, a text also contains much useful information in its style, or how it is written. This includes information about its author, its purpose, the feelings it is meant to evoke, and more. This article develops a new type of lexical feature for use in stylistic text classification, based on taxonomies of the various semantic functions of certain choice words or phrases. We demonstrate the usefulness of such features for the stylistic text classification tasks of determining author identity and nationality, the gender of literary characters, a text's sentiment (positive/negative evaluation), and the rhetorical character of scientific journal articles. We further show how the use of functional features aids in gaining insight into stylistic differences among different kinds of texts.
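    The functional lexical features are frequencies of semantic classes drawn from hand-built taxonomies rather than of the words themselves. A minimal sketch; the TAXONOMY mapping below is a tiny invented stand-in for the paper's much richer taxonomies:

      # Invented miniature taxonomy: word -> functional class.
      TAXONOMY = {
          "excellent": "APPRAISAL_POS", "awful": "APPRAISAL_NEG",
          "perhaps": "MODALITY_HEDGE", "certainly": "MODALITY_BOOST",
      }

      def functional_features(text):
          """Relative frequency of each functional class, ignoring the words themselves."""
          tokens = [t.strip(".,") for t in text.lower().split()]
          feats = {}
          for t in tokens:
              cls = TAXONOMY.get(t)
              if cls:
                  feats[cls] = feats.get(cls, 0) + 1
          return {cls: n / len(tokens) for cls, n in feats.items()}

      print(functional_features("Perhaps the plot is excellent, perhaps certainly awful."))

    Such class-frequency vectors would then feed any standard classifier.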
  3. Ibekwe-SanJuan, F.; SanJuan, E.: From term variants to research topics (2002) 0.02
    0.01774993 = product of:
      0.08283301 = sum of:
        0.023791125 = weight(_text_:classification in 1853) [ClassicSimilarity], result of:
          0.023791125 = score(doc=1853,freq=4.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.24880521 = fieldWeight in 1853, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1853)
        0.023791125 = weight(_text_:classification in 1853) [ClassicSimilarity], result of:
          0.023791125 = score(doc=1853,freq=4.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.24880521 = fieldWeight in 1853, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1853)
        0.035250753 = product of:
          0.07050151 = sum of:
            0.07050151 = weight(_text_:texts in 1853) [ClassicSimilarity], result of:
              0.07050151 = score(doc=1853,freq=4.0), product of:
                0.16460659 = queryWeight, product of:
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.03002521 = queryNorm
                0.42830306 = fieldWeight in 1853, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1853)
          0.5 = coord(1/2)
      0.21428572 = coord(3/14)
    
    Abstract
    In a scientific and technological watch (STW) task, an expert user needs to survey the evolution of research topics in his area of specialisation in order to detect interesting changes. The majority of methods proposing evaluation metrics for STW (bibliometrics and scientometrics studies) rely solely on statistical data analysis methods (co-citation analysis, co-word analysis). Such methods usually work on structured databases where the units of analysis (words, keywords) have already been attributed to documents by human indexers. The advent of huge amounts of unstructured textual data has made it necessary to integrate natural language processing (NLP) techniques to first extract meaningful units from texts. We propose a method for STW which is NLP-oriented. The method not only analyses texts linguistically in order to extract terms from them, but also uses linguistic relations (syntactic variations) as the basis for clustering. Terms and variation relations are formalised as weighted digraphs which the clustering algorithm, CPCL (Classification by Preferential Clustered Link), seeks to reduce in order to produce classes. These classes ideally represent the research topics present in the corpus. The results of the classification are subjected to validation by an expert in STW.
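    CPCL itself is not specified in this record; the sketch below only illustrates the underlying data structure, a weighted digraph of variation links between terms, reduced here by simple thresholded union-find rather than by preferential clustered links (all links and weights are invented):

      from collections import defaultdict

      # Invented variation links between terms: (term, variant, weight).
      edges = [
          ("text classification", "classification of texts", 0.9),
          ("text classification", "automatic text classification", 0.8),
          ("term extraction", "extraction of terms", 0.85),
      ]

      parent = {}

      def find(x):
          parent.setdefault(x, x)
          while parent[x] != x:
              parent[x] = parent[parent[x]]   # path halving
              x = parent[x]
          return x

      for a, b, w in edges:
          if w >= 0.8:                        # keep only strong variation links
              parent[find(a)] = find(b)       # merge the two terms' classes

      classes = defaultdict(set)
      for a, b, _ in edges:
          classes[find(a)].add(a)
          classes[find(b)].add(b)
      print(dict(classes))                    # candidate topic classes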
  4. Bird, S.; Dale, R.; Dorr, B.; Gibson, B.; Joseph, M.; Kan, M.-Y.; Lee, D.; Powley, B.; Radev, D.; Tan, Y.F.: ¬The ACL Anthology Reference Corpus : a reference dataset for bibliographic research in computational linguistics (2008) 0.02
    0.01608469 = product of:
      0.07506188 = sum of:
        0.023310447 = weight(_text_:classification in 2804) [ClassicSimilarity], result of:
          0.023310447 = score(doc=2804,freq=6.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.24377833 = fieldWeight in 2804, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03125 = fieldNorm(doc=2804)
        0.028440988 = weight(_text_:bibliographic in 2804) [ClassicSimilarity], result of:
          0.028440988 = score(doc=2804,freq=4.0), product of:
            0.11688946 = queryWeight, product of:
              3.893044 = idf(docFreq=2449, maxDocs=44218)
              0.03002521 = queryNorm
            0.24331525 = fieldWeight in 2804, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.893044 = idf(docFreq=2449, maxDocs=44218)
              0.03125 = fieldNorm(doc=2804)
        0.023310447 = weight(_text_:classification in 2804) [ClassicSimilarity], result of:
          0.023310447 = score(doc=2804,freq=6.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.24377833 = fieldWeight in 2804, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03125 = fieldNorm(doc=2804)
      0.21428572 = coord(3/14)
    
    Abstract
    The ACL Anthology is a digital archive of conference and journal papers in natural language processing and computational linguistics. Its primary purpose is to serve as a reference repository of research results, but we believe that it can also be an object of study and a platform for research in its own right. We describe an enriched and standardized reference corpus derived from the ACL Anthology that can be used for research in scholarly document processing. This corpus, which we call the ACL Anthology Reference Corpus (ACL ARC), brings together the recent activities of a number of research groups around the world. Our goal is to make the corpus widely available, and to encourage other researchers to use it as a standard testbed for experiments in both bibliographic and bibliometric research.
    Content
    See also: Automatic Term Recognition (ATR) is a research task that deals with the identification of domain-specific terms. Terms, in simple words, are textual realizations of significant concepts in an expertise domain. Additionally, domain-specific terms may be classified into a number of categories, in which each category represents a significant concept. A term classification task is often defined on top of an ATR procedure to perform such categorization. For instance, in the biomedical domain, terms can be classified as drugs, proteins, and genes. This is a reference dataset for terminology extraction and classification research in computational linguistics. It is a set of manually annotated terms in English that are extracted from the ACL Anthology Reference Corpus (ACL ARC). The ACL ARC is a canonicalised and frozen subset of scientific publications in the domain of Human Language Technologies (HLT). It consists of 10,921 articles from 1965 to 2006. The dataset, called ACL RD-TEC, comprises more than 69,000 candidate terms that are manually annotated as valid and invalid terms. Furthermore, valid terms are classified as technology and non-technology terms. Technology terms refer to a method, process, or in general a technological concept in the domain of HLT, e.g. machine translation, word sense disambiguation, and language modelling. On the other hand, non-technology terms refer to important concepts other than technological; examples of such terms in the domain of HLT are multilingual lexicon, corpora, word sense, and language model. The dataset is created to serve as a gold standard for the comparison of algorithms for term recognition and classification. [http://catalog.elra.info/product_info.php?products_id=1236].
  5. Melzer, C.: ¬Der Maschine anpassen : PC-Spracherkennung - Programme sind mittlerweile alltagsreif (2005) 0.02
    0.015459906 = product of:
      0.10821934 = sum of:
        0.10110034 = weight(_text_:henry in 4044) [ClassicSimilarity], result of:
          0.10110034 = score(doc=4044,freq=4.0), product of:
            0.23560001 = queryWeight, product of:
              7.84674 = idf(docFreq=46, maxDocs=44218)
              0.03002521 = queryNorm
            0.42911857 = fieldWeight in 4044, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              7.84674 = idf(docFreq=46, maxDocs=44218)
              0.02734375 = fieldNorm(doc=4044)
        0.0071190023 = product of:
          0.014238005 = sum of:
            0.014238005 = weight(_text_:22 in 4044) [ClassicSimilarity], result of:
              0.014238005 = score(doc=4044,freq=2.0), product of:
                0.10514317 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.03002521 = queryNorm
                0.1354154 = fieldWeight in 4044, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.02734375 = fieldNorm(doc=4044)
          0.5 = coord(1/2)
      0.14285715 = coord(2/14)
    
    Content
    A cheaper option is IBM's "Via Voice Standard". The software costs about 50 euros but has considerable weaknesses in its learning ability; it still performs better, however, than "Voice Office Premium 10", which costs a good three times as much and was the only one of the six programs in the test to receive merely a "satisfactory" rating. "You don't read so much about speech recognition any more because it works," says Dorothee Wiegand of the Hanover-based computer magazine "c't". The technology, for example ScanSoft's "Dragon Naturally Speaking", is mature: "Speech recognition is above all statistics, the evaluation of endless word possibilities. The real problem used to be the hardware," says Wiegand. Now that even basic home computers are fast and powerful, developers have far more room to work with. Even older computers can cope with the systems; they just take a little longer. "Every byte makes speech recognition a little faster, but it is no less accurate otherwise," confirms Kristina Henry of linguatec in Munich. For that company's products, too, "practising and speaking clearly matter more than any hardware". Even voices from dictation machines are recognised clearly, Henry assures: "We want to go a step further and make dictation on the move possible." The user could then dial a number, record a text in the car, say, and find it "typed" at home. In principle, speech recognition software is now a fit for the private computer as well. What is clear, though, is that even the best-spoken text needs post-editing. The user also needs patience: just as the system learns, the person has to adapt their pronunciation and pace to the system. The results are then remarkable, and recognition errors like "Sexterminvereinbarung" instead of "zwecks Terminvereinbarung" ("for the purpose of arranging an appointment") are a thing of the past.
    Date
    3. 5.1997 8:44:22
  6. Peng, F.; Huang, X.: Machine learning for Asian language text classification (2007) 0.02
    0.015199591 = product of:
      0.10639714 = sum of:
        0.05319857 = weight(_text_:classification in 831) [ClassicSimilarity], result of:
          0.05319857 = score(doc=831,freq=20.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.55634534 = fieldWeight in 831, product of:
              4.472136 = tf(freq=20.0), with freq of:
                20.0 = termFreq=20.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=831)
        0.05319857 = weight(_text_:classification in 831) [ClassicSimilarity], result of:
          0.05319857 = score(doc=831,freq=20.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.55634534 = fieldWeight in 831, product of:
              4.472136 = tf(freq=20.0), with freq of:
                20.0 = termFreq=20.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=831)
      0.14285715 = coord(2/14)
    
    Abstract
    Purpose - The purpose of this research is to compare several machine learning techniques on the task of Asian language text classification, for languages such as Chinese and Japanese where no word boundary information is available in written text. The paper advocates a simple language-modeling-based approach for this task. Design/methodology/approach - Naïve Bayes, maximum entropy model, support vector machines, and language modeling approaches were implemented and applied to Chinese and Japanese text classification. To investigate the influence of word segmentation, different word segmentation approaches were applied to Chinese text, and a segmentation-based approach was compared with the non-segmentation-based approach. Findings - There were two findings: the experiments show that statistical language modeling can significantly outperform standard techniques, given the same set of features; and it was found that classification with word-level features normally yields improved classification performance, but that classification performance is not monotonically related to segmentation accuracy. In particular, classification performance may initially improve with increased segmentation accuracy, but eventually stops improving, and can in fact even decrease, after a certain level of segmentation accuracy. Practical implications - Applying the findings to real web text classification is ongoing work. Originality/value - The paper is very relevant to Chinese and Japanese information processing, e.g. webpage classification and web search.
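    Because written Chinese and Japanese carry no word boundaries, a character n-gram language model sidesteps segmentation entirely: a text is assigned to whichever class's model gives it the highest likelihood. A minimal add-one-smoothed bigram sketch of that idea (toy English data stands in for the CJK corpora):

      import math
      from collections import Counter, defaultdict

      def bigrams(text):
          return [text[i:i + 2] for i in range(len(text) - 1)]

      train = [("stock market prices rise", "finance"),      # invented toy data
               ("football match final score", "sports"),
               ("bank interest rates fall", "finance"),
               ("tennis player wins title", "sports")]

      counts = defaultdict(Counter)
      for text, label in train:
          counts[label].update(bigrams(text))
      vocab = {g for c in counts.values() for g in c}

      def classify(text):
          def loglik(label):
              c, total = counts[label], sum(counts[label].values())
              # Laplace-smoothed character-bigram log-likelihood
              return sum(math.log((c[g] + 1) / (total + len(vocab)))
                         for g in bigrams(text))
          return max(counts, key=loglik)

      print(classify("market rates"))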
  7. Sebastiani, F.: Machine learning in automated text categorization (2002) 0.02
    0.015061316 = product of:
      0.07028614 = sum of:
        0.02018744 = weight(_text_:classification in 3389) [ClassicSimilarity], result of:
          0.02018744 = score(doc=3389,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.21111822 = fieldWeight in 3389, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=3389)
        0.02018744 = weight(_text_:classification in 3389) [ClassicSimilarity], result of:
          0.02018744 = score(doc=3389,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.21111822 = fieldWeight in 3389, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=3389)
        0.029911257 = product of:
          0.059822515 = sum of:
            0.059822515 = weight(_text_:texts in 3389) [ClassicSimilarity], result of:
              0.059822515 = score(doc=3389,freq=2.0), product of:
                0.16460659 = queryWeight, product of:
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.03002521 = queryNorm
                0.36342722 = fieldWeight in 3389, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.046875 = fieldNorm(doc=3389)
          0.5 = coord(1/2)
      0.21428572 = coord(3/14)
    
    Abstract
    The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting of the manual definition of a classifier by domain experts) are very good effectiveness, considerable savings in terms of expert labor, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
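    The survey's three problems, document representation, classifier construction, and classifier evaluation, map directly onto a standard pipeline. A minimal sketch with invented toy data (one common choice per stage; the survey covers many alternatives):

      from sklearn.feature_extraction.text import TfidfVectorizer  # representation
      from sklearn.svm import LinearSVC                            # construction
      from sklearn.metrics import f1_score                         # evaluation
      from sklearn.pipeline import make_pipeline

      train_docs = ["stocks fell", "great goal scored",
                    "bond yields rose", "championship match tonight"]
      train_y = ["finance", "sports", "finance", "sports"]
      test_docs = ["yields fell", "goal in the match"]
      test_y = ["finance", "sports"]

      model = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(train_docs, train_y)
      print(f1_score(test_y, model.predict(test_docs), average="macro"))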
  8. Witschel, H.F.: Terminology extraction and automatic indexing : comparison and qualitative evaluation of methods (2005) 0.01
    0.014763533 = product of:
      0.06889649 = sum of:
        0.016822865 = weight(_text_:classification in 1842) [ClassicSimilarity], result of:
          0.016822865 = score(doc=1842,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.17593184 = fieldWeight in 1842, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1842)
        0.016822865 = weight(_text_:classification in 1842) [ClassicSimilarity], result of:
          0.016822865 = score(doc=1842,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.17593184 = fieldWeight in 1842, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1842)
        0.035250753 = product of:
          0.07050151 = sum of:
            0.07050151 = weight(_text_:texts in 1842) [ClassicSimilarity], result of:
              0.07050151 = score(doc=1842,freq=4.0), product of:
                0.16460659 = queryWeight, product of:
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.03002521 = queryNorm
                0.42830306 = fieldWeight in 1842, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1842)
          0.5 = coord(1/2)
      0.21428572 = coord(3/14)
    
    Abstract
    Many terminology engineering processes involve the task of automatic terminology extraction: before the terminology of a given domain can be modelled, organised or standardised, important concepts (or terms) of this domain have to be identified and fed into terminological databases. These serve in further steps as a starting point for compiling dictionaries, thesauri or maybe even terminological ontologies for the domain. For the extraction of the initial concepts, extraction methods are needed that operate on specialised language texts. On the other hand, many machine learning or information retrieval applications require automatic indexing techniques. In Machine Learning applications concerned with the automatic clustering or classification of texts, often feature vectors are needed that describe the contents of a given text briefly but meaningfully. These feature vectors typically consist of a fairly small set of index terms together with weights indicating their importance. Short but meaningful descriptions of document contents as provided by good index terms are also useful to humans: some knowledge management applications (e.g. topic maps) use them as a set of basic concepts (topics). The author believes that the tasks of terminology extraction and automatic indexing have much in common and can thus benefit from the same set of basic algorithms. It is the goal of this paper to outline some methods that may be used in both contexts, but also to find the discriminating factors between the two tasks that call for the variation of parameters or application of different techniques. The discussion of these methods will be based on statistical, syntactical and especially morphological properties of (index) terms. The paper is concluded by the presentation of some qualitative and quantitative results comparing statistical and morphological methods.
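    The shared statistical core of the two tasks, weighting candidate (index) terms and keeping the top few per document, can be sketched with plain tf-idf; the toy corpus below is invented, and real systems would add the syntactic and morphological filters the paper discusses:

      import math
      from collections import Counter

      docs = ["terminology extraction finds domain specific terms",
              "automatic indexing selects good index terms",
              "feature vectors describe document contents briefly"]
      tokenized = [d.split() for d in docs]
      df = Counter(t for toks in tokenized for t in set(toks))  # document frequency

      def top_index_terms(tokens, k=3):
          tf = Counter(tokens)
          scores = {t: tf[t] * math.log(len(docs) / df[t]) for t in tf}
          return sorted(scores, key=scores.get, reverse=True)[:k]

      for toks in tokenized:
          print(top_index_terms(toks))   # short, meaningful per-document descriptors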
  9. Goller, C.; Löning, J.; Will, T.; Wolff, W.: Automatic document classification : a thourough evaluation of various methods (2000) 0.01
    0.014128265 = product of:
      0.09889785 = sum of:
        0.049448926 = weight(_text_:classification in 5480) [ClassicSimilarity], result of:
          0.049448926 = score(doc=5480,freq=12.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.5171319 = fieldWeight in 5480, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=5480)
        0.049448926 = weight(_text_:classification in 5480) [ClassicSimilarity], result of:
          0.049448926 = score(doc=5480,freq=12.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.5171319 = fieldWeight in 5480, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=5480)
      0.14285715 = coord(2/14)
    
    Abstract
    (Automatic) document classification is generally defined as content-based assignment of one or more predefined categories to documents. Usually, machine learning, statistical pattern recognition, or neural network approaches are used to construct classifiers automatically. In this paper we thoroughly evaluate a wide variety of these methods on a document classification task for German text. We evaluate different feature construction and selection methods and various classifiers. Our main results are: (1) feature selection is necessary not only to reduce learning and classification time, but also to avoid overfitting (even for Support Vector Machines); (2) surprisingly, our morphological analysis does not improve classification quality compared to a letter 5-gram approach; (3) Support Vector Machines are significantly better than all other classification methods.
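    Result (1), feature selection to cut training time and overfitting, is commonly implemented by scoring each term against the class labels, for instance with a chi-square filter. A small sketch (toy German snippets invented to echo the paper's German-text setting; scikit-learn >= 1.0 assumed for get_feature_names_out):

      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.feature_selection import SelectKBest, chi2

      docs = ["der markt fällt stark", "tor in der nachspielzeit",
              "zinsen und anleihen steigen", "das team gewinnt das spiel"]
      labels = ["finanzen", "sport", "finanzen", "sport"]

      vec = CountVectorizer()
      X = vec.fit_transform(docs)
      keep = SelectKBest(chi2, k=4).fit(X, labels).get_support(indices=True)
      print(vec.get_feature_names_out()[keep])  # the 4 most class-predictive terms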
  10. Hull, D.; Ait-Mokhtar, S.; Chuat, M.; Eisele, A.; Gaussier, E.; Grefenstette, G.; Isabelle, P.; Samulesson, C.; Segand, F.: Language technologies and patent search and classification (2001) 0.01
    0.01153568 = product of:
      0.08074976 = sum of:
        0.04037488 = weight(_text_:classification in 6318) [ClassicSimilarity], result of:
          0.04037488 = score(doc=6318,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.42223644 = fieldWeight in 6318, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.09375 = fieldNorm(doc=6318)
        0.04037488 = weight(_text_:classification in 6318) [ClassicSimilarity], result of:
          0.04037488 = score(doc=6318,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.42223644 = fieldWeight in 6318, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.09375 = fieldNorm(doc=6318)
      0.14285715 = coord(2/14)
    
  11. Moens, M.F.; Dumortier, J.: Use of a text grammar for generating highlight abstracts of magazine articles (2000) 0.01
    0.011293716 = product of:
      0.07905601 = sum of:
        0.029704956 = weight(_text_:subject in 4540) [ClassicSimilarity], result of:
          0.029704956 = score(doc=4540,freq=2.0), product of:
            0.10738805 = queryWeight, product of:
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.03002521 = queryNorm
            0.27661324 = fieldWeight in 4540, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.576596 = idf(docFreq=3361, maxDocs=44218)
              0.0546875 = fieldNorm(doc=4540)
        0.04935105 = product of:
          0.0987021 = sum of:
            0.0987021 = weight(_text_:texts in 4540) [ClassicSimilarity], result of:
              0.0987021 = score(doc=4540,freq=4.0), product of:
                0.16460659 = queryWeight, product of:
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.03002521 = queryNorm
                0.5996243 = fieldWeight in 4540, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=4540)
          0.5 = coord(1/2)
      0.14285715 = coord(2/14)
    
    Abstract
    Browsing a database of article abstracts is one way to select and buy relevant magazine articles online. Our research contributes to the design and development of text grammars for abstracting texts in unlimited subject domains. We developed a system that parses texts based on the text grammar of a specific text type and that extracts sentences and statements which are relevant for inclusion in the abstracts. The system employs knowledge of the discourse patterns that are typical of news stories. The results are encouraging and demonstrate the importance of discourse structures in text summarisation.
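    A highlight abstract of a news story can be approximated by exploiting the genre's discourse pattern, in which the essential statements come first, plus simple cue phrases. A crude position-and-cue scoring sketch (cue words and weights are invented; the paper's text grammars are far richer):

      CUES = {"announced", "said", "will"}   # invented cue words

      def highlight_abstract(sentences, k=2):
          def score(i, sent):
              position = 1.0 / (i + 1)       # earlier sentences are more salient
              cues = sum(w.lower().strip(".,") in CUES for w in sent.split())
              return position + 0.5 * cues
          ranked = sorted(range(len(sentences)),
                          key=lambda i: score(i, sentences[i]), reverse=True)
          return [sentences[i] for i in sorted(ranked[:k])]  # keep original order

      story = ["The ministry announced a new budget on Monday.",
               "Observers had expected cuts.",
               "Details will follow next week."]
      print(highlight_abstract(story))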
  12. Ruiz, M.E.; Srinivasan, P.: Combining machine learning and hierarchical indexing structures for text categorization (2001) 0.01
    0.009516451 = product of:
      0.06661515 = sum of:
        0.033307575 = weight(_text_:classification in 1595) [ClassicSimilarity], result of:
          0.033307575 = score(doc=1595,freq=4.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.34832728 = fieldWeight in 1595, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1595)
        0.033307575 = weight(_text_:classification in 1595) [ClassicSimilarity], result of:
          0.033307575 = score(doc=1595,freq=4.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.34832728 = fieldWeight in 1595, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1595)
      0.14285715 = coord(2/14)
    
    Source
    Advances in classification research, vol.10: proceedings of the 10th ASIS SIG/CR Classification Research Workshop. Ed.: Albrechtsen, H. and J.E. Mai
  13. Martínez, F.; Martín, M.T.; Rivas, V.M.; Díaz, M.C.; Ureña, L.A.: Using neural networks for multiword recognition in IR (2003) 0.01
    0.008156957 = product of:
      0.057098698 = sum of:
        0.028549349 = weight(_text_:classification in 2777) [ClassicSimilarity], result of:
          0.028549349 = score(doc=2777,freq=4.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.29856625 = fieldWeight in 2777, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=2777)
        0.028549349 = weight(_text_:classification in 2777) [ClassicSimilarity], result of:
          0.028549349 = score(doc=2777,freq=4.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.29856625 = fieldWeight in 2777, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=2777)
      0.14285715 = coord(2/14)
    
    Abstract
    In this paper, a supervised neural network has been used to classify pairs of terms as being multiwords or non-multiwords. Classification is based on the values yielded by different estimators currently available in the literature, which are used as inputs to the neural network. Lists of multiwords and non-multiwords have been built to train the net. Afterward, many other pairs of terms have been classified using the trained net. The results obtained in this classification have then been used to perform information retrieval tasks. Experiments show that detecting multiwords results in better performance of the IR methods.
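    The estimators fed into the network are standard association measures computed for each candidate pair; the trained net itself is not reproducible from this record. A sketch of one such estimator, pointwise mutual information, over an invented toy corpus:

      import math
      from collections import Counter

      corpus = ("information retrieval systems rank documents . "
                "information retrieval evaluates rankings . "
                "systems process documents").split()

      unigrams = Counter(corpus)
      bigrams = Counter(zip(corpus, corpus[1:]))
      N = len(corpus)

      def pmi(w1, w2):
          """Pointwise mutual information of an adjacent pair."""
          p12 = bigrams[(w1, w2)] / (N - 1)
          p1, p2 = unigrams[w1] / N, unigrams[w2] / N
          return math.log2(p12 / (p1 * p2))

      print(pmi("information", "retrieval"))  # strongly associated: multiword candidate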
  14. Kettunen, K.: Reductive and generative approaches to management of morphological variation of keywords in monolingual information retrieval : an overview (2009) 0.01
    0.008156957 = product of:
      0.057098698 = sum of:
        0.028549349 = weight(_text_:classification in 2835) [ClassicSimilarity], result of:
          0.028549349 = score(doc=2835,freq=4.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.29856625 = fieldWeight in 2835, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=2835)
        0.028549349 = weight(_text_:classification in 2835) [ClassicSimilarity], result of:
          0.028549349 = score(doc=2835,freq=4.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.29856625 = fieldWeight in 2835, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=2835)
      0.14285715 = coord(2/14)
    
    Abstract
    Purpose - The purpose of this article is to discuss the advantages and disadvantages of various means of managing the morphological variation of keywords in monolingual information retrieval. Design/methodology/approach - The authors present a compilation of query results from 11 mostly European languages and a new general classification of the language-dependent techniques for managing morphological variation. Variants of the different techniques are compared in some detail in terms of retrieval effectiveness and other criteria. The paper consists mainly of an overview of the different management methods for keyword variation in information retrieval. Typical retrieval results for 11 languages and a new classification of keyword management methods are also presented. Findings - The main results of the paper are an overall comparison of reductive and generative keyword management methods in terms of retrieval effectiveness and other broader criteria. Originality/value - The paper is of value to anyone who wants to get an overall picture of the keyword management techniques used in IR.
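    The two families in the classification work in opposite directions: reductive methods (stemming, lemmatisation) normalise index and query terms to one form, while generative methods expand the query with inflected variants and leave the index untouched. A minimal sketch of both directions, with an invented suffix list and deliberately crude over-generation:

      # Reductive: collapse surface variants to a crude stem.
      SUFFIXES = ("es", "s", "ing", "ed")   # invented, not a real stemmer

      def stem(word):
          for suf in SUFFIXES:
              if word.endswith(suf) and len(word) > len(suf) + 2:
                  return word[: -len(suf)]
          return word

      # Generative: expand a query keyword into surface variants instead.
      def generate_variants(word):
          # Over-generation (e.g. "searchs") is harmless: it simply fails to match.
          return {word, word + "s", word + "es", word + "ing", word + "ed"}

      print(stem("searches"), stem("searching"))  # both reduce to 'search'
      print(generate_variants("search"))          # query-expansion set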
  15. Stock, W.G.: Textwortmethode : Norbert Henrichs zum 65. (3) (2000) 0.01
    0.007690453 = product of:
      0.053833168 = sum of:
        0.026916584 = weight(_text_:classification in 4891) [ClassicSimilarity], result of:
          0.026916584 = score(doc=4891,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.28149095 = fieldWeight in 4891, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0625 = fieldNorm(doc=4891)
        0.026916584 = weight(_text_:classification in 4891) [ClassicSimilarity], result of:
          0.026916584 = score(doc=4891,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.28149095 = fieldWeight in 4891, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0625 = fieldNorm(doc=4891)
      0.14285715 = coord(2/14)
    
    Abstract
    Only a few documentation methods are associated with the names of their developers. The exceptions are Melvil Dewey (DDC), S.R. Ranganathan (Colon Classification), and Norbert Henrichs. His text-word method (Textwortmethode) enables the indexing and retrieval of literature from fields that lack a universally accepted technical terminology, i.e. many of the social sciences and humanities, philosophy first among them. Henrichs designed the text-word method in the late 1960s for use in electronic philosophy documentation. This makes him not only one of the pioneers of applying electronic data processing in information practice, but also the pioneer in documenting specialist languages whose terminology is not rigidly fixed.
  16. Mustafa el Hadi, W.: Human language technology and its role in information access and management (2003) 0.01
    0.0067974646 = product of:
      0.04758225 = sum of:
        0.023791125 = weight(_text_:classification in 5524) [ClassicSimilarity], result of:
          0.023791125 = score(doc=5524,freq=4.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.24880521 = fieldWeight in 5524, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5524)
        0.023791125 = weight(_text_:classification in 5524) [ClassicSimilarity], result of:
          0.023791125 = score(doc=5524,freq=4.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.24880521 = fieldWeight in 5524, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5524)
      0.14285715 = coord(2/14)
    
    Content
    Contribution to a special issue on "Knowledge organization and classification in international information retrieval"
    Source
    Cataloging and classification quarterly. 37(2003) nos.1/2, pp.131-151
  17. Atlam, E.-S.; Morita, K.; Fuketa, M.; Aoe, J.-i.: ¬A new method for selecting English field association terms of compound words and its knowledge representation (2002) 0.01
    0.00576784 = product of:
      0.04037488 = sum of:
        0.02018744 = weight(_text_:classification in 2590) [ClassicSimilarity], result of:
          0.02018744 = score(doc=2590,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.21111822 = fieldWeight in 2590, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=2590)
        0.02018744 = weight(_text_:classification in 2590) [ClassicSimilarity], result of:
          0.02018744 = score(doc=2590,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.21111822 = fieldWeight in 2590, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=2590)
      0.14285715 = coord(2/14)
    
    Abstract
    This paper presents a strategy for building a morphological machine dictionary of English that infers the meaning of derivations by considering morphological affixes and their semantic classification. Derivations are grouped into a frame that is accessible to a semantic stem and knowledge base. This paper also proposes an efficient method for selecting compound Field Association (FA) terms from a large pool of single FA terms for some specialized fields. For single FA terms, five levels of association and two ranks are defined, based on stability and inheritance. About 85% of redundant compound FA terms can be removed effectively by using the levels and ranks proposed in this paper. Recall averages of 60-80% are achieved, depending on the type of text. The proposed methods are applied to 22,000 relationships between verbs and nouns extracted from a large tagged corpus.
  18. Pirkola, A.: Morphological typology of languages for IR (2001) 0.01
    0.00576784 = product of:
      0.04037488 = sum of:
        0.02018744 = weight(_text_:classification in 4476) [ClassicSimilarity], result of:
          0.02018744 = score(doc=4476,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.21111822 = fieldWeight in 4476, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=4476)
        0.02018744 = weight(_text_:classification in 4476) [ClassicSimilarity], result of:
          0.02018744 = score(doc=4476,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.21111822 = fieldWeight in 4476, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=4476)
      0.14285715 = coord(2/14)
    
    Abstract
    This paper presents a morphological classification of languages from the IR perspective. Linguistic typology research has shown that the morphological complexity of every language in the world can be described by two variables, index of synthesis and index of fusion. These variables provide a theoretical basis for IR research handling morphological issues. A common theoretical framework is needed in particular because of the increasing significance of cross-language retrieval research and CLIR systems processing different languages. The paper elaborates the linguistic morphological typology for the purposes of IR research. It studies how the indexes of synthesis and fusion could be used as practical tools in mono- and cross-lingual IR research. The need for semantic and syntactic typologies is discussed. The paper also reviews studies made in different languages on the effects of morphology and stemming in IR.
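    The index of synthesis is, in essence, the average number of morphemes per word in running text (Greenberg's measure); given morpheme-segmented samples it is a one-line computation, sketched below with invented segmentations:

      # Invented morpheme-segmented samples ('-' separates morphemes).
      samples = {
          "english": "the dog-s walk-ed home",
          "finnish": "talo-i-ssa-mme-kin",   # 'also in our houses'
      }

      def index_of_synthesis(segmented):
          words = segmented.split()
          morphemes = sum(len(w.split("-")) for w in words)
          return morphemes / len(words)

      for lang, text in samples.items():
          print(lang, round(index_of_synthesis(text), 2))  # english 1.5, finnish 5.0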
  19. Fautsch, C.; Savoy, J.: Algorithmic stemmers or morphological analysis? : an evaluation (2009) 0.01
    0.00576784 = product of:
      0.04037488 = sum of:
        0.02018744 = weight(_text_:classification in 2950) [ClassicSimilarity], result of:
          0.02018744 = score(doc=2950,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.21111822 = fieldWeight in 2950, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=2950)
        0.02018744 = weight(_text_:classification in 2950) [ClassicSimilarity], result of:
          0.02018744 = score(doc=2950,freq=2.0), product of:
            0.09562149 = queryWeight, product of:
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.03002521 = queryNorm
            0.21111822 = fieldWeight in 2950, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.1847067 = idf(docFreq=4974, maxDocs=44218)
              0.046875 = fieldNorm(doc=2950)
      0.14285715 = coord(2/14)
    
    Abstract
    It is important in information retrieval (IR), information extraction, and classification tasks that morphologically related forms are conflated under the same stem (using a stemmer) or lemma (using a morphological analyzer). To achieve this for the English language, algorithmic stemming or various morphological analysis approaches have been suggested. Based on Cross-Language Evaluation Forum test collections containing 284 queries and various IR models, this article evaluates these word-normalization proposals. Stemming improves the mean average precision significantly by around 7%, while performance differences are not significant when comparing various algorithmic stemmers, or algorithmic stemmers and morphological analysis. Accounting for thesaurus class numbers during indexing does not modify overall retrieval performance. Finally, we demonstrate that including a stop word list, even one containing only around 10 terms, might significantly improve retrieval performance, depending on the IR model.
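    The reported ~7% improvement is in mean average precision (MAP): the mean over queries of the precision averaged at each relevant retrieved document. A minimal sketch of the metric itself (document ids and rankings invented):

      def average_precision(ranked, relevant):
          hits, precision_sum = 0, 0.0
          for i, doc in enumerate(ranked, start=1):
              if doc in relevant:
                  hits += 1
                  precision_sum += hits / i   # precision at this relevant hit
          return precision_sum / len(relevant) if relevant else 0.0

      def mean_average_precision(runs):
          return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

      # Toy rankings for two queries.
      runs = [(["d3", "d1", "d7"], {"d1", "d7"}),
              (["d2", "d5", "d4"], {"d2"})]
      print(mean_average_precision(runs))   # 0.7916...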
  20. L'Homme, D.; L'Homme, M.-C.; Lemay, C.: Benchmarking the performance of two Part-of-Speech (POS) taggers for terminological purposes (2002) 0.00
    0.004273037 = product of:
      0.059822515 = sum of:
        0.059822515 = product of:
          0.11964503 = sum of:
            0.11964503 = weight(_text_:texts in 1855) [ClassicSimilarity], result of:
              0.11964503 = score(doc=1855,freq=8.0), product of:
                0.16460659 = queryWeight, product of:
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.03002521 = queryNorm
                0.72685444 = fieldWeight in 1855, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  5.4822793 = idf(docFreq=499, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1855)
          0.5 = coord(1/2)
      0.071428575 = coord(1/14)
    
    Abstract
    Part-of-Speech (POS) taggers are used in an increasing number of terminology applications. However, terminologists do not know exactly how they perform on specialized texts, since most POS taggers have been trained on "general" corpora, that is, corpora containing all sorts of undifferentiated texts. In this article, we evaluate the performance of two POS taggers on French and English medical texts. The taggers are TnT (a statistical tagger developed at Saarland University (Brants 2000)) and WinBrill (the Windows version of the tagger initially developed by Eric Brill (1992)). Ten extracts from medical texts were submitted to the taggers and the outputs scanned manually. Results pertain to the accuracy of tagging in terms of correctly and incorrectly tagged words. We also study the handling of unknown words from different viewpoints.
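    Tagging accuracy here is simply the share of correctly tagged tokens, usually reported separately for words unseen in training. A minimal scoring sketch over invented gold/predicted tag pairs:

      def tagging_accuracy(gold, predicted, known_vocab):
          """Overall and unknown-word accuracy of a POS tagger's output."""
          total = correct = unk_total = unk_correct = 0
          for (word, gold_tag), (_, pred_tag) in zip(gold, predicted):
              total += 1
              correct += gold_tag == pred_tag
              if word not in known_vocab:
                  unk_total += 1
                  unk_correct += gold_tag == pred_tag
          return correct / total, (unk_correct / unk_total if unk_total else None)

      gold = [("the", "DET"), ("aorta", "NOUN"), ("dilates", "VERB")]
      pred = [("the", "DET"), ("aorta", "ADJ"), ("dilates", "VERB")]
      overall, unknown = tagging_accuracy(gold, pred, known_vocab={"the", "dilates"})
      print(overall, unknown)   # 0.666..., 0.0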

Languages

  • e 41
  • d 10
  • m 1

Types

  • a 43
  • m 5
  • el 2
  • s 2
  • x 1
  • More… Less…