Document (#28628)

Author
Dunning, T.
Title
Statistical identification of language
Source
http://citeseer.ist.psu.edu/cache/papers/cs/36/http:zSzzSzwww.comp.lancs.ac.ukzSzcomputingzSzresearchzSzucrelzSzpaperszSzlingdet.pdf/dunning94statistical.pdf
Year
1994
Series
Technical report CRL MCCS-94-273, Computing Research Lab, New Mexico State University
Abstract
A statistically based program has been written which learns to distinguish between languages. The amount of training text that such a program needs is surprisingly small, and the amount of text needed to make an identification is also quite small. The program incorporates no linguistic presuppositions other than the assumption that text can be encoded as a string of bytes. Such a program can be used to determine which language small bits of text are in. It also shows a potential for what might be called 'statistical philology' in that it may be applied directly to phonetic transcriptions to help elucidate family trees among language dialects. A variant of this program has been shown to be useful as a quality control in biochemistry. In this application, genetic sequences are assumed to be expressions in a language peculiar to the organism from which the sequence is taken. Thus language identification becomes species identification.
Theme
Computational linguistics
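
Illustration (not part of the catalog record): the abstract describes a program that learns to tell languages apart from small amounts of text, treating text purely as a string of bytes. The following is a minimal sketch of one way such a classifier could work, assuming a byte-bigram model with add-one smoothing; it is not necessarily the method used in the report. The function names (`train`, `log_prob`, `identify`) and the tiny training strings are hypothetical.

```python
import math
from collections import Counter


def train(text: bytes, order: int = 2):
    """Count byte n-grams and the (n-1)-byte contexts that precede them."""
    spans = range(len(text) - order + 1)
    ngrams = Counter(text[i:i + order] for i in spans)
    contexts = Counter(text[i:i + order - 1] for i in spans)
    return ngrams, contexts


def log_prob(sample: bytes, model, order: int = 2, alphabet: int = 256) -> float:
    """Log-probability of the sample under the model (add-one smoothing)."""
    ngrams, contexts = model
    total = 0.0
    for i in range(len(sample) - order + 1):
        gram = sample[i:i + order]
        ctx = gram[:-1]
        # Smoothing is an assumption of this sketch: unseen n-grams get
        # a small non-zero probability instead of zero.
        total += math.log((ngrams[gram] + 1) / (contexts[ctx] + alphabet))
    return total


def identify(sample: bytes, models: dict) -> str:
    """Return the label whose model gives the sample the highest score."""
    return max(models, key=lambda label: log_prob(sample, models[label]))


# Hypothetical, deliberately tiny training data.
models = {
    "english": train(b"the quick brown fox jumps over the lazy dog "
                     b"and the cat sat on the mat with the hat"),
    "spanish": train(b"el perro perezoso duerme en la casa "
                     b"y el gato come en la alfombra"),
}

guess_en = identify(b"the dog sat on the mat", models)
guess_es = identify(b"el gato duerme en la casa", models)
```

Because the model only sees bytes, the same sketch would apply unchanged to any byte-encoded sequence, which is the property the abstract relies on when it mentions genetic sequences.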

Similar documents (content)

  1. Huang, X.; Peng, F.; An, A.; Schuurmans, D.: Dynamic Web log session identification with statistical language models (2004) 0.08
    0.084932156 = sum of:
      0.084932156 = product of:
        0.530826 = sum of:
          0.014327003 = weight(abstract_txt:which in 3096) [ClassicSimilarity], result of:
            0.014327003 = score(doc=3096,freq=1.0), product of:
              0.06287404 = queryWeight, product of:
                1.1836 = boost
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.018212624 = queryNorm
              0.22786833 = fieldWeight in 3096, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.078125 = fieldNorm(doc=3096)
          0.092876054 = weight(abstract_txt:statistical in 3096) [ClassicSimilarity], result of:
            0.092876054 = score(doc=3096,freq=2.0), product of:
              0.15156418 = queryWeight, product of:
                1.5004512 = boost
                5.5462847 = idf(docFreq=468, maxDocs=44218)
                0.018212624 = queryNorm
              0.6127837 = fieldWeight in 3096, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.5462847 = idf(docFreq=468, maxDocs=44218)
                0.078125 = fieldNorm(doc=3096)
          0.15739343 = weight(abstract_txt:language in 3096) [ClassicSimilarity], result of:
            0.15739343 = score(doc=3096,freq=5.0), product of:
              0.21543607 = queryWeight, product of:
                2.828478 = boost
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.018212624 = queryNorm
              0.7305806 = fieldWeight in 3096, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.078125 = fieldNorm(doc=3096)
          0.26622945 = weight(abstract_txt:identification in 3096) [ClassicSimilarity], result of:
            0.26622945 = score(doc=3096,freq=3.0), product of:
              0.33662346 = queryWeight, product of:
                3.1623564 = boost
                5.8446846 = idf(docFreq=347, maxDocs=44218)
                0.018212624 = queryNorm
              0.79088205 = fieldWeight in 3096, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.8446846 = idf(docFreq=347, maxDocs=44218)
                0.078125 = fieldNorm(doc=3096)
        0.16 = coord(4/25)
    
  2. Liu, X.; Croft, W.B.: Statistical language modeling for information retrieval (2004) 0.08
    0.082950234 = sum of:
      0.082950234 = product of:
        0.41475117 = sum of:
          0.055042855 = weight(abstract_txt:sequences in 4277) [ClassicSimilarity], result of:
            0.055042855 = score(doc=4277,freq=1.0), product of:
              0.13564257 = queryWeight, product of:
                1.0037062 = boost
                7.4202213 = idf(docFreq=71, maxDocs=44218)
                0.018212624 = queryNorm
              0.40579337 = fieldWeight in 4277, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.4202213 = idf(docFreq=71, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4277)
          0.010028902 = weight(abstract_txt:which in 4277) [ClassicSimilarity], result of:
            0.010028902 = score(doc=4277,freq=1.0), product of:
              0.06287404 = queryWeight, product of:
                1.1836 = boost
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.018212624 = queryNorm
              0.15950784 = fieldWeight in 4277, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4277)
          0.12162863 = weight(abstract_txt:statistical in 4277) [ClassicSimilarity], result of:
            0.12162863 = score(doc=4277,freq=7.0), product of:
              0.15156418 = queryWeight, product of:
                1.5004512 = boost
                5.5462847 = idf(docFreq=468, maxDocs=44218)
                0.018212624 = queryNorm
              0.8024893 = fieldWeight in 4277, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                5.5462847 = idf(docFreq=468, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4277)
          0.050398286 = weight(abstract_txt:text in 4277) [ClassicSimilarity], result of:
            0.050398286 = score(doc=4277,freq=2.0), product of:
              0.16114464 = queryWeight, product of:
                2.1879961 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.018212624 = queryNorm
              0.31275186 = fieldWeight in 4277, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4277)
          0.17765248 = weight(abstract_txt:language in 4277) [ClassicSimilarity], result of:
            0.17765248 = score(doc=4277,freq=13.0), product of:
              0.21543607 = queryWeight, product of:
                2.828478 = boost
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.018212624 = queryNorm
              0.8246181 = fieldWeight in 4277, product of:
                3.6055512 = tf(freq=13.0), with freq of:
                  13.0 = termFreq=13.0
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.0546875 = fieldNorm(doc=4277)
        0.2 = coord(5/25)
    
  3. Wu, Y.-f.B.; Li, Q.; Bot, R.S.; Chen, X.: Finding nuggets in documents : a machine learning approach (2006) 0.08
    0.080833845 = sum of:
      0.080833845 = product of:
        0.40416923 = sum of:
          0.0114616025 = weight(abstract_txt:which in 5290) [ClassicSimilarity], result of:
            0.0114616025 = score(doc=5290,freq=1.0), product of:
              0.06287404 = queryWeight, product of:
                1.1836 = boost
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.018212624 = queryNorm
              0.18229467 = fieldWeight in 5290, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.0625 = fieldNorm(doc=5290)
          0.070095174 = weight(abstract_txt:small in 5290) [ClassicSimilarity], result of:
            0.070095174 = score(doc=5290,freq=1.0), product of:
              0.2102648 = queryWeight, product of:
                2.1644747 = boost
                5.333859 = idf(docFreq=579, maxDocs=44218)
                0.018212624 = queryNorm
              0.3333662 = fieldWeight in 5290, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.333859 = idf(docFreq=579, maxDocs=44218)
                0.0625 = fieldNorm(doc=5290)
          0.057598043 = weight(abstract_txt:text in 5290) [ClassicSimilarity], result of:
            0.057598043 = score(doc=5290,freq=2.0), product of:
              0.16114464 = queryWeight, product of:
                2.1879961 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.018212624 = queryNorm
              0.3574307 = fieldWeight in 5290, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=5290)
          0.122966126 = weight(abstract_txt:identification in 5290) [ClassicSimilarity], result of:
            0.122966126 = score(doc=5290,freq=1.0), product of:
              0.33662346 = queryWeight, product of:
                3.1623564 = boost
                5.8446846 = idf(docFreq=347, maxDocs=44218)
                0.018212624 = queryNorm
              0.3652928 = fieldWeight in 5290, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8446846 = idf(docFreq=347, maxDocs=44218)
                0.0625 = fieldNorm(doc=5290)
          0.14204827 = weight(abstract_txt:program in 5290) [ClassicSimilarity], result of:
            0.14204827 = score(doc=5290,freq=1.0), product of:
              0.39922222 = queryWeight, product of:
                3.8503566 = boost
                5.6930003 = idf(docFreq=404, maxDocs=44218)
                0.018212624 = queryNorm
              0.35581252 = fieldWeight in 5290, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.6930003 = idf(docFreq=404, maxDocs=44218)
                0.0625 = fieldNorm(doc=5290)
        0.2 = coord(5/25)
    
  4. Caseiro, D.: Automatic language identification bibliography : Last Update: 20 September 1999 (1999) 0.07
    0.07171076 = sum of:
      0.07171076 = product of:
        0.8963845 = sum of:
          0.2815539 = weight(abstract_txt:language in 1842) [ClassicSimilarity], result of:
            0.2815539 = score(doc=1842,freq=1.0), product of:
              0.21543607 = queryWeight, product of:
                2.828478 = boost
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.018212624 = queryNorm
              1.3069023 = fieldWeight in 1842, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.3125 = fieldNorm(doc=1842)
          0.6148306 = weight(abstract_txt:identification in 1842) [ClassicSimilarity], result of:
            0.6148306 = score(doc=1842,freq=1.0), product of:
              0.33662346 = queryWeight, product of:
                3.1623564 = boost
                5.8446846 = idf(docFreq=347, maxDocs=44218)
                0.018212624 = queryNorm
              1.8264639 = fieldWeight in 1842, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8446846 = idf(docFreq=347, maxDocs=44218)
                0.3125 = fieldNorm(doc=1842)
        0.08 = coord(2/25)
    
  5. Shaalan, K.; Raza, H.: NERA: Named Entity Recognition for Arabic (2009) 0.07
    0.070130385 = sum of:
      0.070130385 = product of:
        0.35065192 = sum of:
          0.010028902 = weight(abstract_txt:which in 2953) [ClassicSimilarity], result of:
            0.010028902 = score(doc=2953,freq=1.0), product of:
              0.06287404 = queryWeight, product of:
                1.1836 = boost
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.018212624 = queryNorm
              0.15950784 = fieldWeight in 2953, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.9167147 = idf(docFreq=6503, maxDocs=44218)
                0.0546875 = fieldNorm(doc=2953)
          0.054279357 = weight(abstract_txt:amount in 2953) [ClassicSimilarity], result of:
            0.054279357 = score(doc=2953,freq=1.0), product of:
              0.16931489 = queryWeight, product of:
                1.5858831 = boost
                5.8620763 = idf(docFreq=341, maxDocs=44218)
                0.018212624 = queryNorm
              0.3205823 = fieldWeight in 2953, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8620763 = idf(docFreq=341, maxDocs=44218)
                0.0546875 = fieldNorm(doc=2953)
          0.03563697 = weight(abstract_txt:text in 2953) [ClassicSimilarity], result of:
            0.03563697 = score(doc=2953,freq=1.0), product of:
              0.16114464 = queryWeight, product of:
                2.1879961 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.018212624 = queryNorm
              0.22114895 = fieldWeight in 2953, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0546875 = fieldNorm(doc=2953)
          0.09854387 = weight(abstract_txt:language in 2953) [ClassicSimilarity], result of:
            0.09854387 = score(doc=2953,freq=4.0), product of:
              0.21543607 = queryWeight, product of:
                2.828478 = boost
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.018212624 = queryNorm
              0.45741582 = fieldWeight in 2953, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.1820874 = idf(docFreq=1834, maxDocs=44218)
                0.0546875 = fieldNorm(doc=2953)
          0.1521628 = weight(abstract_txt:identification in 2953) [ClassicSimilarity], result of:
            0.1521628 = score(doc=2953,freq=2.0), product of:
              0.33662346 = queryWeight, product of:
                3.1623564 = boost
                5.8446846 = idf(docFreq=347, maxDocs=44218)
                0.018212624 = queryNorm
              0.45202672 = fieldWeight in 2953, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.8446846 = idf(docFreq=347, maxDocs=44218)
                0.0546875 = fieldNorm(doc=2953)
        0.2 = coord(5/25)
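
Illustration (not part of the catalog output): the score trees above follow Lucene's ClassicSimilarity (TF-IDF) formula, where tf is the square root of the term frequency, idf is 1 + ln(maxDocs / (docFreq + 1)), each term contributes queryWeight × fieldWeight, and the sum is scaled by the coord factor. The sketch below recomputes the score of result 1 from the values printed in its tree. The helper names are hypothetical, and small differences in the last digits are expected because Lucene works in single precision.

```python
import math

MAX_DOCS = 44218          # maxDocs, as printed in the tree
QUERY_NORM = 0.018212624  # queryNorm, as printed in the tree


def idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    """ClassicSimilarity idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq: float, doc_freq: int, boost: float, field_norm: float) -> float:
    """Score of one matching term: queryWeight * fieldWeight."""
    i = idf(doc_freq)
    query_weight = boost * i * QUERY_NORM
    field_weight = math.sqrt(freq) * i * field_norm  # tf = sqrt(freq)
    return query_weight * field_weight


# (freq, docFreq, boost, fieldNorm) for the four matching terms of doc 3096:
# "which", "statistical", "language", "identification".
terms = [
    (1.0, 6503, 1.1836, 0.078125),
    (2.0, 468, 1.5004512, 0.078125),
    (5.0, 1834, 2.828478, 0.078125),
    (3.0, 347, 3.1623564, 0.078125),
]

term_sum = sum(term_score(*t) for t in terms)
coord = 4 / 25            # 4 of the 25 query terms matched
score = term_sum * coord

print(f"sum of term scores: {term_sum:.6f}")
print(f"final score:        {score:.6f}")
```

The coord factor explains why result 4, with a much larger sum (0.8963845) from only two matching terms, still ranks below results with five matching terms: its sum is multiplied by 2/25 rather than 5/25.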