Search (53 results, page 1 of 3)

  • theme_ss:"Computerlinguistik"
  • language_ss:"e"
  1. Hotho, A.; Bloehdorn, S.: Data Mining 2004 : Text classification by boosting weak learners based on terms and concepts (2004) 0.10
    0.10162766 = sum of:
      0.08091931 = product of:
        0.24275793 = sum of:
          0.24275793 = weight(_text_:3a in 562) [ClassicSimilarity], result of:
            0.24275793 = score(doc=562,freq=2.0), product of:
              0.43193975 = queryWeight, product of:
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.05094824 = queryNorm
              0.56201804 = fieldWeight in 562, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.046875 = fieldNorm(doc=562)
        0.33333334 = coord(1/3)
      0.020708349 = product of:
        0.041416697 = sum of:
          0.041416697 = weight(_text_:22 in 562) [ClassicSimilarity], result of:
            0.041416697 = score(doc=562,freq=2.0), product of:
              0.17841205 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.05094824 = queryNorm
              0.23214069 = fieldWeight in 562, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.046875 = fieldNorm(doc=562)
        0.5 = coord(1/2)
    
    Content
     Cf.: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.4940&rep=rep1&type=pdf. (For a worked recomputation of the relevance breakdown above, see the sketch following the result list.)
    Date
    8. 1.2013 10:22:32
  2. Noever, D.; Ciolino, M.: ¬The Turing deception (2022) 0.04
    0.040459655 = product of:
      0.08091931 = sum of:
        0.08091931 = product of:
          0.24275793 = sum of:
            0.24275793 = weight(_text_:3a in 862) [ClassicSimilarity], result of:
              0.24275793 = score(doc=862,freq=2.0), product of:
                0.43193975 = queryWeight, product of:
                  8.478011 = idf(docFreq=24, maxDocs=44218)
                  0.05094824 = queryNorm
                0.56201804 = fieldWeight in 862, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  8.478011 = idf(docFreq=24, maxDocs=44218)
                  0.046875 = fieldNorm(doc=862)
          0.33333334 = coord(1/3)
      0.5 = coord(1/2)
    
    Source
     https://arxiv.org/abs/2212.06721
  3. Hayes, P.J.; Knecht, L.E.; Cellio, M.J.: ¬A news story categorization system (1988) 0.04
    0.03866488 = product of:
      0.07732976 = sum of:
        0.07732976 = product of:
          0.15465952 = sum of:
            0.15465952 = weight(_text_:news in 1954) [ClassicSimilarity], result of:
              0.15465952 = score(doc=1954,freq=2.0), product of:
                0.26705483 = queryWeight, product of:
                  5.2416887 = idf(docFreq=635, maxDocs=44218)
                  0.05094824 = queryNorm
                0.57913023 = fieldWeight in 1954, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.2416887 = idf(docFreq=635, maxDocs=44218)
                  0.078125 = fieldNorm(doc=1954)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
  4. Panicheva, P.; Cardiff, J.; Rosso, P.: Identifying subjective statements in news titles using a personal sense annotation framework (2013) 0.03
    0.03280824 = product of:
      0.06561648 = sum of:
        0.06561648 = product of:
          0.13123296 = sum of:
            0.13123296 = weight(_text_:news in 968) [ClassicSimilarity], result of:
              0.13123296 = score(doc=968,freq=4.0), product of:
                0.26705483 = queryWeight, product of:
                  5.2416887 = idf(docFreq=635, maxDocs=44218)
                  0.05094824 = queryNorm
                0.49140832 = fieldWeight in 968, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  5.2416887 = idf(docFreq=635, maxDocs=44218)
                  0.046875 = fieldNorm(doc=968)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Subjective language contains information about private states. The goal of subjective language identification is to determine that a private state is expressed, without considering its polarity or specific emotion. A component of word meaning, "Personal Sense," has clear potential in the field of subjective language identification, as it reflects a meaning of words in terms of unique personal experience and carries personal characteristics. In this paper we investigate how Personal Sense can be harnessed for the purpose of identifying subjectivity in news titles. In the process, we develop a new Personal Sense annotation framework for annotating and classifying subjectivity, polarity, and emotion. The Personal Sense framework yields high performance in a fine-grained subsentence subjectivity classification. Our experiments demonstrate lexico-syntactic features to be useful for the identification of subjectivity indicators and the targets that receive the subjective Personal Sense.
  5. AL-Smadi, M.; Jaradat, Z.; AL-Ayyoub, M.; Jararweh, Y.: Paraphrase identification and semantic text similarity analysis in Arabic news tweets using lexical, syntactic, and semantic features (2017) 0.03
    0.03280824 = product of:
      0.06561648 = sum of:
        0.06561648 = product of:
          0.13123296 = sum of:
            0.13123296 = weight(_text_:news in 5095) [ClassicSimilarity], result of:
              0.13123296 = score(doc=5095,freq=4.0), product of:
                0.26705483 = queryWeight, product of:
                  5.2416887 = idf(docFreq=635, maxDocs=44218)
                  0.05094824 = queryNorm
                0.49140832 = fieldWeight in 5095, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  5.2416887 = idf(docFreq=635, maxDocs=44218)
                  0.046875 = fieldNorm(doc=5095)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
     The rapid growth in digital information has raised considerable challenges, in particular when it comes to automated content analysis. Social media such as Twitter share a lot of their users' information about their events, opinions, personalities, etc. Paraphrase Identification (PI) is concerned with recognizing whether two texts have the same/similar meaning, whereas Semantic Text Similarity (STS) is concerned with the degree of that similarity. This research proposes a state-of-the-art approach for paraphrase identification and semantic text similarity analysis in Arabic news tweets. The approach adopts several phases of text processing, feature extraction and text classification. Lexical, syntactic, and semantic features are extracted to overcome the weaknesses and limitations of current technologies in solving these tasks for the Arabic language. Maximum Entropy (MaxEnt) and Support Vector Regression (SVR) classifiers are trained using these features and are evaluated using a dataset prepared for this research. The experimental results show that the approach achieves good results in comparison to the baseline results.
  6. Pritchard-Schoch, T.: Natural language comes of age (1993) 0.03
    0.030931905 = product of:
      0.06186381 = sum of:
        0.06186381 = product of:
          0.12372762 = sum of:
            0.12372762 = weight(_text_:news in 2570) [ClassicSimilarity], result of:
              0.12372762 = score(doc=2570,freq=2.0), product of:
                0.26705483 = queryWeight, product of:
                  5.2416887 = idf(docFreq=635, maxDocs=44218)
                  0.05094824 = queryNorm
                0.4633042 = fieldWeight in 2570, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.2416887 = idf(docFreq=635, maxDocs=44218)
                  0.0625 = fieldNorm(doc=2570)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
     Discusses natural languages and the natural language implementation of Westlaw's full-text legal documents, Westlaw Is Natural (WIN). Natural language is not artificial intelligence but a hybrid of linguistics, mathematics and statistics. Provides 3 classes of retrieval models. Explains how Westlaw processes an English query. Assesses WIN. Covers WIN enhancements; the natural language features of Congressional Quarterly's Washington Alert using a document for a query; the Personal Librarian front-end search software; and DowQuest from Dow Jones News/Retrieval. Considers whether natural language encourages fuzzy thinking and whether Boolean logic will still be needed.
  7. Warner, A.J.: Natural language processing (1987) 0.03
    0.027611133 = product of:
      0.055222265 = sum of:
        0.055222265 = product of:
          0.11044453 = sum of:
            0.11044453 = weight(_text_:22 in 337) [ClassicSimilarity], result of:
              0.11044453 = score(doc=337,freq=2.0), product of:
                0.17841205 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.05094824 = queryNorm
                0.61904186 = fieldWeight in 337, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.125 = fieldNorm(doc=337)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Source
    Annual review of information science and technology. 22(1987), S.79-108
  8. Malo, P.; Sinha, A.; Korhonen, P.; Wallenius, J.; Takala, P.: Good debt or bad debt : detecting semantic orientations in economic texts (2014) 0.03
    0.0273402 = product of:
      0.0546804 = sum of:
        0.0546804 = product of:
          0.1093608 = sum of:
            0.1093608 = weight(_text_:news in 1226) [ClassicSimilarity], result of:
              0.1093608 = score(doc=1226,freq=4.0), product of:
                0.26705483 = queryWeight, product of:
                  5.2416887 = idf(docFreq=635, maxDocs=44218)
                  0.05094824 = queryNorm
                0.40950692 = fieldWeight in 1226, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  5.2416887 = idf(docFreq=635, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1226)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    The use of robo-readers to analyze news texts is an emerging technology trend in computational finance. Recent research has developed sophisticated financial polarity lexicons for investigating how financial sentiments relate to future company performance. However, based on experience from fields that commonly analyze sentiment, it is well known that the overall semantic orientation of a sentence may differ from that of individual words. This article investigates how semantic orientations can be better detected in financial and economic news by accommodating the overall phrase-structure information and domain-specific use of language. Our three main contributions are the following: (a) a human-annotated finance phrase bank that can be used for training and evaluating alternative models; (b) a technique to enhance financial lexicons with attributes that help to identify expected direction of events that affect sentiment; and (c) a linearized phrase-structure model for detecting contextual semantic orientations in economic texts. The relevance of the newly added lexicon features and the benefit of using the proposed learning algorithm are demonstrated in a comparative study against general sentiment models as well as the popular word frequency models used in recent financial studies. The proposed framework is parsimonious and avoids the explosion in feature space caused by the use of conventional n-gram features.
  9. Mock, K.J.; Vemuri, V.R.: Information filtering via hill climbing, WordNet, and index patterns (1997) 0.03
    0.027065417 = product of:
      0.054130834 = sum of:
        0.054130834 = product of:
          0.10826167 = sum of:
            0.10826167 = weight(_text_:news in 1517) [ClassicSimilarity], result of:
              0.10826167 = score(doc=1517,freq=2.0), product of:
                0.26705483 = queryWeight, product of:
                  5.2416887 = idf(docFreq=635, maxDocs=44218)
                  0.05094824 = queryNorm
                0.40539116 = fieldWeight in 1517, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.2416887 = idf(docFreq=635, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=1517)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
     The INFOS (Intelligent News Filtering Organizational System) project is designed to reduce the user's search burden by automatically categorising data as relevant or irrelevant based upon user interests. These predictions are learned automatically based upon features taken from input articles and collaborative features derived from other users. The filtering is performed by a hybrid technique that combines elements of a keyword-based hill climbing method, knowledge-based conceptual representation via WordNet, and partial parsing via index patterns. The hybrid system integrating all these approaches combines the benefits of each while maintaining robustness and scalability.
  10. Moens, M.F.; Dumortier, J.: Use of a text grammar for generating highlight abstracts of magazine articles (2000) 0.03
    0.027065417 = product of:
      0.054130834 = sum of:
        0.054130834 = product of:
          0.10826167 = sum of:
            0.10826167 = weight(_text_:news in 4540) [ClassicSimilarity], result of:
              0.10826167 = score(doc=4540,freq=2.0), product of:
                0.26705483 = queryWeight, product of:
                  5.2416887 = idf(docFreq=635, maxDocs=44218)
                  0.05094824 = queryNorm
                0.40539116 = fieldWeight in 4540, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.2416887 = idf(docFreq=635, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=4540)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Browsing a database of article abstracts is one way to select and buy relevant magazine articles online. Our research contributes to the design and development of text grammars for abstracting texts in unlimited subject domains. We developed a system that parses texts based on the text grammar of a specific text type and that extracts sentences and statements which are relevant for inclusion in the abstracts. The system employs knowledge of the discourse patterns that are typical of news stories. The results are encouraging and demonstrate the importance of discourse structures in text summarisation.
  11. Al-Khatib, K.; Ghosal, T.; Hou, Y.; Waard, A. de; Freitag, D.: Argument mining for scholarly document processing : taking stock and looking ahead (2021) 0.03
    0.027065417 = product of:
      0.054130834 = sum of:
        0.054130834 = product of:
          0.10826167 = sum of:
            0.10826167 = weight(_text_:news in 568) [ClassicSimilarity], result of:
              0.10826167 = score(doc=568,freq=2.0), product of:
                0.26705483 = queryWeight, product of:
                  5.2416887 = idf(docFreq=635, maxDocs=44218)
                  0.05094824 = queryNorm
                0.40539116 = fieldWeight in 568, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.2416887 = idf(docFreq=635, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=568)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
     Argument mining targets structures in natural language related to interpretation and persuasion. Most scholarly discourse involves interpreting experimental evidence and attempting to persuade other scientists to adopt the same conclusions, which could benefit from argument mining techniques. However, while various argument mining studies have addressed student essays and news articles, those that target scientific discourse are still scarce. This paper surveys existing work in argument mining of scholarly discourse, and provides an overview of current models, data, tasks, and applications. We identify a number of key challenges confronting argument mining in the scientific domain, and suggest some possible solutions and future directions.
  12. McMahon, J.G.; Smith, F.J.: Improved statistical language model performance with automatically generated word hierarchies (1996) 0.02
    0.02415974 = product of:
      0.04831948 = sum of:
        0.04831948 = product of:
          0.09663896 = sum of:
            0.09663896 = weight(_text_:22 in 3164) [ClassicSimilarity], result of:
              0.09663896 = score(doc=3164,freq=2.0), product of:
                0.17841205 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.05094824 = queryNorm
                0.5416616 = fieldWeight in 3164, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.109375 = fieldNorm(doc=3164)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Source
    Computational linguistics. 22(1996) no.2, S.217-248
  13. Ruge, G.: ¬A spreading activation network for automatic generation of thesaurus relationships (1991) 0.02
    0.02415974 = product of:
      0.04831948 = sum of:
        0.04831948 = product of:
          0.09663896 = sum of:
            0.09663896 = weight(_text_:22 in 4506) [ClassicSimilarity], result of:
              0.09663896 = score(doc=4506,freq=2.0), product of:
                0.17841205 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.05094824 = queryNorm
                0.5416616 = fieldWeight in 4506, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.109375 = fieldNorm(doc=4506)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    8.10.2000 11:52:22
  14. Somers, H.: Example-based machine translation : Review article (1999) 0.02
    0.02415974 = product of:
      0.04831948 = sum of:
        0.04831948 = product of:
          0.09663896 = sum of:
            0.09663896 = weight(_text_:22 in 6672) [ClassicSimilarity], result of:
              0.09663896 = score(doc=6672,freq=2.0), product of:
                0.17841205 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.05094824 = queryNorm
                0.5416616 = fieldWeight in 6672, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.109375 = fieldNorm(doc=6672)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    31. 7.1996 9:22:19
  15. New tools for human translators (1997) 0.02
    0.02415974 = product of:
      0.04831948 = sum of:
        0.04831948 = product of:
          0.09663896 = sum of:
            0.09663896 = weight(_text_:22 in 1179) [ClassicSimilarity], result of:
              0.09663896 = score(doc=1179,freq=2.0), product of:
                0.17841205 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.05094824 = queryNorm
                0.5416616 = fieldWeight in 1179, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.109375 = fieldNorm(doc=1179)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    31. 7.1996 9:22:19
  16. Baayen, R.H.; Lieber, H.: Word frequency distributions and lexical semantics (1997) 0.02
    0.02415974 = product of:
      0.04831948 = sum of:
        0.04831948 = product of:
          0.09663896 = sum of:
            0.09663896 = weight(_text_:22 in 3117) [ClassicSimilarity], result of:
              0.09663896 = score(doc=3117,freq=2.0), product of:
                0.17841205 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.05094824 = queryNorm
                0.5416616 = fieldWeight in 3117, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.109375 = fieldNorm(doc=3117)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    28. 2.1999 10:48:22
  17. Byrne, C.C.; McCracken, S.A.: ¬An adaptive thesaurus employing semantic distance, relational inheritance and nominal compound interpretation for linguistic support of information retrieval (1999) 0.02
    0.020708349 = product of:
      0.041416697 = sum of:
        0.041416697 = product of:
          0.082833394 = sum of:
            0.082833394 = weight(_text_:22 in 4483) [ClassicSimilarity], result of:
              0.082833394 = score(doc=4483,freq=2.0), product of:
                0.17841205 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.05094824 = queryNorm
                0.46428138 = fieldWeight in 4483, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.09375 = fieldNorm(doc=4483)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    15. 3.2000 10:22:37
  18. Boleda, G.; Evert, S.: Multiword expressions : a pain in the neck of lexical semantics (2009) 0.02
    0.020708349 = product of:
      0.041416697 = sum of:
        0.041416697 = product of:
          0.082833394 = sum of:
            0.082833394 = weight(_text_:22 in 4888) [ClassicSimilarity], result of:
              0.082833394 = score(doc=4888,freq=2.0), product of:
                0.17841205 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.05094824 = queryNorm
                0.46428138 = fieldWeight in 4888, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.09375 = fieldNorm(doc=4888)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Date
    1. 3.2013 14:56:22
  19. Li, W.; Wong, K.-F.; Yuan, C.: Toward automatic Chinese temporal information extraction (2001) 0.02
    0.01933244 = product of:
      0.03866488 = sum of:
        0.03866488 = product of:
          0.07732976 = sum of:
            0.07732976 = weight(_text_:news in 6029) [ClassicSimilarity], result of:
              0.07732976 = score(doc=6029,freq=2.0), product of:
                0.26705483 = queryWeight, product of:
                  5.2416887 = idf(docFreq=635, maxDocs=44218)
                  0.05094824 = queryNorm
                0.28956512 = fieldWeight in 6029, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.2416887 = idf(docFreq=635, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=6029)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Over the past few years, temporal information processing and temporal database management have increasingly become hot topics. Nevertheless, only a few researchers have investigated these areas in the Chinese language. This lays down the objective of our research: to exploit Chinese language processing techniques for temporal information extraction and concept reasoning. In this article, we first study the mechanism for expressing time in Chinese. On the basis of the study, we then design a general frame structure for maintaining the extracted temporal concepts and propose a system for extracting time-dependent information from Hong Kong financial news. In the system, temporal knowledge is represented by different types of temporal concepts (TTC) and different temporal relations, including absolute and relative relations, which are used to correlate between action times and reference times. In analyzing a sentence, the algorithm first determines the situation related to the verb. This in turn will identify the type of temporal concept associated with the verb. After that, the relevant temporal information is extracted and the temporal relations are derived. These relations link relevant concept frames together in chronological order, which in turn provide the knowledge to fulfill users' queries, e.g., for question-answering (i.e., Q&A) applications
  20. Khoo, C.S.G.; Dai, D.; Loh, T.E.: Using statistical and contextual information to identify two- and three-character words in Chinese text (2002) 0.02
    0.01933244 = product of:
      0.03866488 = sum of:
        0.03866488 = product of:
          0.07732976 = sum of:
            0.07732976 = weight(_text_:news in 5206) [ClassicSimilarity], result of:
              0.07732976 = score(doc=5206,freq=2.0), product of:
                0.26705483 = queryWeight, product of:
                  5.2416887 = idf(docFreq=635, maxDocs=44218)
                  0.05094824 = queryNorm
                0.28956512 = fieldWeight in 5206, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  5.2416887 = idf(docFreq=635, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5206)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
     Khoo, Dai, and Loh examine new statistical methods for the identification of two- and three-character words in Chinese text. Some meaningful Chinese words are simple (independent units of one or more characters in a sentence that have independent meaning), but others are compounds of two or more simple words. For manual segmentation they utilize the Modern Chinese Word Segmentation for Application of Information Processing, with some modifications to focus on meaningful words. About 37% of meaningful words are longer than two characters, indicating a need to handle three- and four-character words. Four hundred sentences from news articles were manually broken into overlapping bi-grams and tri-grams. Using logistic regression, the log of the odds that such bi/tri-grams were meaningful words was calculated. Variables like relative frequency, document frequency, local frequency, and contextual and positional information were incorporated in the model only if the concordance measure improved by at least 2% with their addition. For two- and three-character words, the relative frequency of adjacent characters and the document frequency of overlapping bi-grams were found to be significant. Using measures of recall and precision, where correct automatic segmentation is normalized either by manual segmentation or by automatic segmentation, the contextual information formula for two-character words provides significantly better results than previous formulations, and using both the two- and three-character formulations in combination significantly improves the two-character results.
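     A minimal sketch of the segmentation idea summarized above (overlapping character n-grams scored with logistic regression), not the authors' implementation: the feature values, labels, and example sentence below are invented placeholders, and scikit-learn stands in for whatever tooling the study used.

       from sklearn.linear_model import LogisticRegression

       def overlapping_ngrams(sentence, n):
           """Return all overlapping character n-grams of a sentence."""
           return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]

       # Hypothetical per-bi-gram features, e.g. [relative frequency, document
       # frequency]; label 1 means manual segmentation treats it as a word.
       X = [[0.80, 0.60], [0.10, 0.20], [0.70, 0.50], [0.05, 0.10]]
       y = [1, 0, 1, 0]

       model = LogisticRegression().fit(X, y)  # models the log-odds of "is a meaningful word"
       print(overlapping_ngrams("香港金融新聞", 2))         # overlapping bi-grams
       print(model.predict_proba([[0.75, 0.55]])[0, 1])  # estimated P(meaningful word)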

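The relevance figures attached to each result above are Lucene-style "explain" traces of the ClassicSimilarity (tf-idf) formula: each term clause contributes queryWeight * fieldWeight, where queryWeight = idf * queryNorm, fieldWeight = tf(freq) * idf * fieldNorm, and tf(freq) = sqrt(freq), all scaled by the coord factor. Below is a minimal sketch that recomputes the first summand of result 1 from the values printed there; only the numeric inputs come from the listing, the helper function is my own.

  from math import sqrt

  def classic_similarity_clause(freq, idf, query_norm, field_norm, coord):
      """Recompute one term clause of a Lucene ClassicSimilarity 'explain' trace."""
      query_weight = idf * query_norm               # 0.43193975 for the "3a" clause
      field_weight = sqrt(freq) * idf * field_norm  # tf(freq) = sqrt(freq)
      return query_weight * field_weight * coord

  # Values copied from result 1, weight(_text_:3a in 562):
  score = classic_similarity_clause(freq=2.0, idf=8.478011,
                                    query_norm=0.05094824,
                                    field_norm=0.046875, coord=1 / 3)
  print(score)  # ~0.08091931, the first summand of result 1's total score

The same arithmetic reproduces the other clauses; for example, the freq=4.0 "news" clauses use tf = sqrt(4.0) = 2.0, which gives the fieldWeight of 0.49140832 shown for results 4 and 5.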
Types

  • a 43
  • s 5
  • m 4
  • el 2
  • p 2
  • x 1