Document (#34125)

Author
Seki, K.
Mostafa, J.
Title
Gene ontology annotation as text categorization : an empirical study
Source
Information processing and management. 44(2008) no.5, p.1754-1770
Year
2008
Abstract
Gene ontology (GO) consists of three structured controlled vocabularies, i.e., GO domains, developed for describing attributes of gene products, and its annotation is crucial to provide a common gateway to access different model organism databases. This paper explores an effective application of text categorization methods to this highly practical problem in biology. As a first step, we attempt to tackle the automatic GO annotation task posed in the Text Retrieval Conference (TREC) 2004 Genomics Track. Given a pair of genes and an article reference where the genes appear, the task simulates assigning GO domain codes. We approach the problem with careful consideration of the specialized terminology and pay special attention to various forms of gene synonyms, so as to exhaustively locate the occurrences of the target gene. We extract the words around the spotted gene occurrences and use them to represent the gene for GO domain code annotation. We regard the task as a text categorization problem and adopt a variant of kNN with supervised term weighting schemes, which places our method among the top-performing systems in the TREC official evaluation. Furthermore, we investigate different feature selection policies in conjunction with the treatment of terms associated with negative instances. Our experiments reveal that round-robin feature space allocation combined with the elimination of negative terms substantially improves performance as GO terms become specific.
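
The pipeline outlined in the abstract (locate gene synonyms, collect the surrounding words, classify with kNN) can be illustrated with a small Python sketch. This is not the authors' implementation: the window size, the plain term counts (used here in place of the supervised term weighting the abstract mentions), the single-token synonym matching, the cosine-weighted vote, and all toy sentences and training bags are assumptions made up for illustration. The labels MF, CC and BP stand for the three GO domains (molecular function, cellular component, biological process).

```python
import math
import re
from collections import Counter


def context_words(text, synonyms, window=3):
    """Count words within `window` tokens of any (single-token) gene synonym."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    names = {s.lower() for s in synonyms}
    bag = Counter()
    for i, tok in enumerate(tokens):
        if tok in names:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            bag.update(t for t in tokens[lo:hi] if t not in names)
    return bag


def cosine(a, b):
    """Cosine similarity between two bags of words."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_predict(query, training, k=3):
    """Return the label with the largest similarity-weighted vote among the k nearest bags."""
    nearest = sorted(training, key=lambda ex: cosine(query, ex[0]), reverse=True)[:k]
    votes = Counter()
    for bag, label in nearest:
        votes[label] += cosine(query, bag)
    return votes.most_common(1)[0][0]


# Hypothetical training bags, one per GO domain (invented for this sketch).
training = [
    (Counter({"binds": 1, "dna": 1, "kinase": 1}), "MF"),
    (Counter({"nucleus": 1, "membrane": 1, "localized": 1}), "CC"),
    (Counter({"apoptosis": 1, "signaling": 1, "pathway": 1}), "BP"),
]

# Hypothetical article sentence and synonym list for the target gene.
text = "The protein p53 is localized to the nucleus whereas TP53 mutants stay at the membrane."
query = context_words(text, ["p53", "TP53"])
prediction = knn_predict(query, training)
print(prediction)
```

Matching on a synonym list rather than a single gene name is what lets both "p53" and "TP53" contribute context words in this toy example.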

Similar documents (author)

  1. Mostafa, J.: Digital image representation and access (1994) 5.52
    5.5185485 = sum of:
      5.5185485 = weight(author_txt:mostafa in 1171) [ClassicSimilarity], result of:
        5.5185485 = fieldWeight in 1171, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.829678 = idf(docFreq=16, maxDocs=42740)
          0.625 = fieldNorm(doc=1171)
    
  2. Mostafa, S.P.: Enfoques paradigmaticos de bibliotecologia : unidade na diversidad na unidad (1996) 5.52
    5.5185485 = sum of:
      5.5185485 = weight(author_txt:mostafa in 830) [ClassicSimilarity], result of:
        5.5185485 = fieldWeight in 830, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.829678 = idf(docFreq=16, maxDocs=42740)
          0.625 = fieldNorm(doc=830)
    
  3. Mostafa, J.: Document search interface design : background and introduction to special topic section (2004) 5.52
    5.5185485 = sum of:
      5.5185485 = weight(author_txt:mostafa in 3504) [ClassicSimilarity], result of:
        5.5185485 = fieldWeight in 3504, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.829678 = idf(docFreq=16, maxDocs=42740)
          0.625 = fieldNorm(doc=3504)
    
  4. Mostafa, J.: Bessere Suchmaschinen für das Web (2006) 5.52
    5.5185485 = sum of:
      5.5185485 = weight(author_txt:mostafa in 872) [ClassicSimilarity], result of:
        5.5185485 = fieldWeight in 872, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.829678 = idf(docFreq=16, maxDocs=42740)
          0.625 = fieldNorm(doc=872)
    
  5. Sugimoto, C.R.; Mostafa, J.: A note of concern and context : on careful use of terminologies (2018) 4.41
    4.414839 = sum of:
      4.414839 = weight(author_txt:mostafa in 7278) [ClassicSimilarity], result of:
        4.414839 = fieldWeight in 7278, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.829678 = idf(docFreq=16, maxDocs=42740)
          0.5 = fieldNorm(doc=7278)
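
The author-match scores listed above can be recomputed from the values in the explain trees. The following Python sketch assumes the definitions that fit the numbers shown here: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), and fieldWeight = tf * idf * fieldNorm. The function names are illustrative and are not Lucene's API.

```python
import math


def idf(doc_freq, max_docs):
    """Inverse document frequency as it appears in the explain output."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(freq):
    """Term frequency factor: square root of the raw count."""
    return math.sqrt(freq)


def field_weight(freq, doc_freq, max_docs, field_norm):
    """fieldWeight = tf * idf * fieldNorm."""
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


# Entries 1 to 4: one author in the field, fieldNorm = 0.625
w_single_author = field_weight(1.0, 16, 42740, 0.625)
# Entry 5: two authors in the field, fieldNorm = 0.5
w_two_authors = field_weight(1.0, 16, 42740, 0.5)

print(round(w_single_author, 4))  # about 5.5185
print(round(w_two_authors, 4))    # about 4.4148
```

The only factor that separates entry 5 from the others is the field norm, which is smaller because its author field is longer.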
    

Similar documents (content)

  1. Ling, X.; Jiang, J.; He, X.; Mei, Q.; Zhai, C.; Schatz, B.: Generating gene summaries from biomedical literature : a study of semi-structured summarization (2007) 0.30
    0.29545683 = sum of:
      0.29545683 = product of:
        1.4772841 = sum of:
          0.06572882 = weight(abstract_txt:genomics in 2947) [ClassicSimilarity], result of:
            0.06572882 = score(doc=2947,freq=1.0), product of:
              0.119105265 = queryWeight, product of:
                1.1533684 = boost
                8.829678 = idf(docFreq=16, maxDocs=42740)
                0.011695481 = queryNorm
              0.55185485 = fieldWeight in 2947, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.829678 = idf(docFreq=16, maxDocs=42740)
                0.0625 = fieldNorm(doc=2947)
          0.0076183863 = weight(abstract_txt:with in 2947) [ClassicSimilarity], result of:
            0.0076183863 = score(doc=2947,freq=1.0), product of:
              0.04841639 = queryWeight, product of:
                1.6443102 = boost
                2.5176222 = idf(docFreq=9369, maxDocs=42740)
                0.011695481 = queryNorm
              0.15735139 = fieldWeight in 2947, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.5176222 = idf(docFreq=9369, maxDocs=42740)
                0.0625 = fieldNorm(doc=2947)
          0.03588231 = weight(abstract_txt:text in 2947) [ClassicSimilarity], result of:
            0.03588231 = score(doc=2947,freq=2.0), product of:
              0.100236066 = queryWeight, product of:
                2.1161408 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.011695481 = queryNorm
              0.35797805 = fieldWeight in 2947, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.0625 = fieldNorm(doc=2947)
          0.20878929 = weight(abstract_txt:genes in 2947) [ClassicSimilarity], result of:
            0.20878929 = score(doc=2947,freq=2.0), product of:
              0.25737473 = queryWeight, product of:
                2.3977313 = boost
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.011695481 = queryNorm
              0.81122684 = fieldWeight in 2947, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.0625 = fieldNorm(doc=2947)
          1.1592653 = weight(abstract_txt:gene in 2947) [ClassicSimilarity], result of:
            1.1592653 = score(doc=2947,freq=9.0), product of:
              0.74216557 = queryWeight, product of:
                7.6173162 = boost
                8.330686 = idf(docFreq=27, maxDocs=42740)
                0.011695481 = queryNorm
              1.5620036 = fieldWeight in 2947, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                8.330686 = idf(docFreq=27, maxDocs=42740)
                0.0625 = fieldNorm(doc=2947)
        0.2 = coord(5/25)
    
  2. Rapp, B.A.; Wheeler, D.L.: Bioinformatics resources from the National Center for Biotechnology Information : an integrated foundation for discovery (2005) 0.18
    0.18443583 = sum of:
      0.18443583 = product of:
        0.9221791 = sum of:
          0.020835303 = weight(abstract_txt:domain in 266) [ClassicSimilarity], result of:
            0.020835303 = score(doc=266,freq=1.0), product of:
              0.0697649 = queryWeight, product of:
                1.2483491 = boost
                4.7784038 = idf(docFreq=976, maxDocs=42740)
                0.011695481 = queryNorm
              0.29865023 = fieldWeight in 266, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7784038 = idf(docFreq=976, maxDocs=42740)
                0.0625 = fieldNorm(doc=266)
          0.0076183863 = weight(abstract_txt:with in 266) [ClassicSimilarity], result of:
            0.0076183863 = score(doc=266,freq=1.0), product of:
              0.04841639 = queryWeight, product of:
                1.6443102 = boost
                2.5176222 = idf(docFreq=9369, maxDocs=42740)
                0.011695481 = queryNorm
              0.15735139 = fieldWeight in 266, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.5176222 = idf(docFreq=9369, maxDocs=42740)
                0.0625 = fieldNorm(doc=266)
          0.20878929 = weight(abstract_txt:genes in 266) [ClassicSimilarity], result of:
            0.20878929 = score(doc=266,freq=2.0), product of:
              0.25737473 = queryWeight, product of:
                2.3977313 = boost
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.011695481 = queryNorm
              0.81122684 = fieldWeight in 266, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.0625 = fieldNorm(doc=266)
          0.13845323 = weight(abstract_txt:annotation in 266) [ClassicSimilarity], result of:
            0.13845323 = score(doc=266,freq=1.0), product of:
              0.3106818 = queryWeight, product of:
                3.7255504 = boost
                7.130291 = idf(docFreq=92, maxDocs=42740)
                0.011695481 = queryNorm
              0.4456432 = fieldWeight in 266, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.130291 = idf(docFreq=92, maxDocs=42740)
                0.0625 = fieldNorm(doc=266)
          0.54648286 = weight(abstract_txt:gene in 266) [ClassicSimilarity], result of:
            0.54648286 = score(doc=266,freq=2.0), product of:
              0.74216557 = queryWeight, product of:
                7.6173162 = boost
                8.330686 = idf(docFreq=27, maxDocs=42740)
                0.011695481 = queryNorm
              0.7363355 = fieldWeight in 266, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.330686 = idf(docFreq=27, maxDocs=42740)
                0.0625 = fieldNorm(doc=266)
        0.2 = coord(5/25)
    
  3. Sy, M.-F.; Ranwez, S.; Montmain, J.; Ragnault, A.; Crampes, M.; Ranwez, V.: User centered and ontology based information retrieval system for life sciences (2012) 0.11
    0.11214729 = sum of:
      0.11214729 = product of:
        0.5607364 = sum of:
          0.025782371 = weight(abstract_txt:domain in 2700) [ClassicSimilarity], result of:
            0.025782371 = score(doc=2700,freq=2.0), product of:
              0.0697649 = queryWeight, product of:
                1.2483491 = boost
                4.7784038 = idf(docFreq=976, maxDocs=42740)
                0.011695481 = queryNorm
              0.3695608 = fieldWeight in 2700, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.7784038 = idf(docFreq=976, maxDocs=42740)
                0.0546875 = fieldNorm(doc=2700)
          0.05822599 = weight(abstract_txt:ontology in 2700) [ClassicSimilarity], result of:
            0.05822599 = score(doc=2700,freq=4.0), product of:
              0.095313914 = queryWeight, product of:
                1.4591358 = boost
                5.5852485 = idf(docFreq=435, maxDocs=42740)
                0.011695481 = queryNorm
              0.6108866 = fieldWeight in 2700, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.5852485 = idf(docFreq=435, maxDocs=42740)
                0.0546875 = fieldNorm(doc=2700)
          0.009427273 = weight(abstract_txt:with in 2700) [ClassicSimilarity], result of:
            0.009427273 = score(doc=2700,freq=2.0), product of:
              0.04841639 = queryWeight, product of:
                1.6443102 = boost
                2.5176222 = idf(docFreq=9369, maxDocs=42740)
                0.011695481 = queryNorm
              0.19471242 = fieldWeight in 2700, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.5176222 = idf(docFreq=9369, maxDocs=42740)
                0.0546875 = fieldNorm(doc=2700)
          0.12918179 = weight(abstract_txt:genes in 2700) [ClassicSimilarity], result of:
            0.12918179 = score(doc=2700,freq=1.0), product of:
              0.25737473 = queryWeight, product of:
                2.3977313 = boost
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.011695481 = queryNorm
              0.501921 = fieldWeight in 2700, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.177984 = idf(docFreq=11, maxDocs=42740)
                0.0546875 = fieldNorm(doc=2700)
          0.33811903 = weight(abstract_txt:gene in 2700) [ClassicSimilarity], result of:
            0.33811903 = score(doc=2700,freq=1.0), product of:
              0.74216557 = queryWeight, product of:
                7.6173162 = boost
                8.330686 = idf(docFreq=27, maxDocs=42740)
                0.011695481 = queryNorm
              0.45558438 = fieldWeight in 2700, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.330686 = idf(docFreq=27, maxDocs=42740)
                0.0546875 = fieldNorm(doc=2700)
        0.2 = coord(5/25)
    
  4. Hemminger, B.M.; Saelim, B.; Sullivan, P.F.; Vision, T.J.: Comparison of full-text searching to metadata searching for genes in two biomedical literature cohorts (2007) 0.11
    0.106858134 = sum of:
      0.106858134 = product of:
        0.5342907 = sum of:
          0.020835303 = weight(abstract_txt:domain in 3328) [ClassicSimilarity], result of:
            0.020835303 = score(doc=3328,freq=1.0), product of:
              0.0697649 = queryWeight, product of:
                1.2483491 = boost
                4.7784038 = idf(docFreq=976, maxDocs=42740)
                0.011695481 = queryNorm
              0.29865023 = fieldWeight in 3328, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7784038 = idf(docFreq=976, maxDocs=42740)
                0.0625 = fieldNorm(doc=3328)
          0.040141717 = weight(abstract_txt:feature in 3328) [ClassicSimilarity], result of:
            0.040141717 = score(doc=3328,freq=1.0), product of:
              0.10801922 = queryWeight, product of:
                1.5533456 = boost
                5.945863 = idf(docFreq=303, maxDocs=42740)
                0.011695481 = queryNorm
              0.37161642 = fieldWeight in 3328, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.945863 = idf(docFreq=303, maxDocs=42740)
                0.0625 = fieldNorm(doc=3328)
          0.010774026 = weight(abstract_txt:with in 3328) [ClassicSimilarity], result of:
            0.010774026 = score(doc=3328,freq=2.0), product of:
              0.04841639 = queryWeight, product of:
                1.6443102 = boost
                2.5176222 = idf(docFreq=9369, maxDocs=42740)
                0.011695481 = queryNorm
              0.22252847 = fieldWeight in 3328, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.5176222 = idf(docFreq=9369, maxDocs=42740)
                0.0625 = fieldNorm(doc=3328)
          0.07611788 = weight(abstract_txt:text in 3328) [ClassicSimilarity], result of:
            0.07611788 = score(doc=3328,freq=9.0), product of:
              0.100236066 = queryWeight, product of:
                2.1161408 = boost
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.011695481 = queryNorm
              0.7593861 = fieldWeight in 3328, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                4.0500593 = idf(docFreq=2023, maxDocs=42740)
                0.0625 = fieldNorm(doc=3328)
          0.38642174 = weight(abstract_txt:gene in 3328) [ClassicSimilarity], result of:
            0.38642174 = score(doc=3328,freq=1.0), product of:
              0.74216557 = queryWeight, product of:
                7.6173162 = boost
                8.330686 = idf(docFreq=27, maxDocs=42740)
                0.011695481 = queryNorm
              0.52066785 = fieldWeight in 3328, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.330686 = idf(docFreq=27, maxDocs=42740)
                0.0625 = fieldNorm(doc=3328)
        0.2 = coord(5/25)
    
  5. Almeida Campos, M.L. de; Machado Campos, M.L.; Dávila, A.M.R.; Espanha Gomes, H.; Campos, L.M.; Lira e Oliveira, L. de: Information sciences methodological aspects applied to ontology reuse tools : a study based on genomic annotations in the domain of trypanosomatides (2013) 0.11
    0.10563053 = sum of:
      0.10563053 = product of:
        0.52815264 = sum of:
          0.020835303 = weight(abstract_txt:domain in 2636) [ClassicSimilarity], result of:
            0.020835303 = score(doc=2636,freq=1.0), product of:
              0.0697649 = queryWeight, product of:
                1.2483491 = boost
                4.7784038 = idf(docFreq=976, maxDocs=42740)
                0.011695481 = queryNorm
              0.29865023 = fieldWeight in 2636, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7784038 = idf(docFreq=976, maxDocs=42740)
                0.0625 = fieldNorm(doc=2636)
          0.094107404 = weight(abstract_txt:ontology in 2636) [ClassicSimilarity], result of:
            0.094107404 = score(doc=2636,freq=8.0), product of:
              0.095313914 = queryWeight, product of:
                1.4591358 = boost
                5.5852485 = idf(docFreq=435, maxDocs=42740)
                0.011695481 = queryNorm
              0.98734176 = fieldWeight in 2636, product of:
                2.828427 = tf(freq=8.0), with freq of:
                  8.0 = termFreq=8.0
                5.5852485 = idf(docFreq=435, maxDocs=42740)
                0.0625 = fieldNorm(doc=2636)
          0.019169793 = weight(abstract_txt:terms in 2636) [ClassicSimilarity], result of:
            0.019169793 = score(doc=2636,freq=1.0), product of:
              0.07554617 = queryWeight, product of:
                1.5909972 = boost
                4.05999 = idf(docFreq=2003, maxDocs=42740)
                0.011695481 = queryNorm
              0.25374937 = fieldWeight in 2636, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.05999 = idf(docFreq=2003, maxDocs=42740)
                0.0625 = fieldNorm(doc=2636)
          0.0076183863 = weight(abstract_txt:with in 2636) [ClassicSimilarity], result of:
            0.0076183863 = score(doc=2636,freq=1.0), product of:
              0.04841639 = queryWeight, product of:
                1.6443102 = boost
                2.5176222 = idf(docFreq=9369, maxDocs=42740)
                0.011695481 = queryNorm
              0.15735139 = fieldWeight in 2636, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.5176222 = idf(docFreq=9369, maxDocs=42740)
                0.0625 = fieldNorm(doc=2636)
          0.38642174 = weight(abstract_txt:gene in 2636) [ClassicSimilarity], result of:
            0.38642174 = score(doc=2636,freq=1.0), product of:
              0.74216557 = queryWeight, product of:
                7.6173162 = boost
                8.330686 = idf(docFreq=27, maxDocs=42740)
                0.011695481 = queryNorm
              0.52066785 = fieldWeight in 2636, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.330686 = idf(docFreq=27, maxDocs=42740)
                0.0625 = fieldNorm(doc=2636)
        0.2 = coord(5/25)
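
The content-similarity totals above combine per-term scores with a coordination factor. The Python sketch below recomputes the total for entry 1 (doc 2947) from the boosts, frequencies, document frequencies, queryNorm and fieldNorm printed in its explain tree. It assumes score = queryWeight * fieldWeight, with queryWeight = boost * idf * queryNorm, fieldWeight = sqrt(freq) * idf * fieldNorm, and idf = 1 + ln(maxDocs / (docFreq + 1)). Names are illustrative and are not Lucene's API.

```python
import math

MAX_DOCS = 42740
QUERY_NORM = 0.011695481


def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency as it appears in the explain output."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(boost, freq, doc_freq, field_norm):
    """Score of one query term in one document: queryWeight * fieldWeight."""
    i = idf(doc_freq)
    query_weight = boost * i * QUERY_NORM
    field_weight = math.sqrt(freq) * i * field_norm
    return query_weight * field_weight


# (term, boost, freq, docFreq) copied from the explain tree of doc 2947
terms = [
    ("genomics", 1.1533684, 1.0, 16),
    ("with", 1.6443102, 1.0, 9369),
    ("text", 2.1161408, 2.0, 2023),
    ("genes", 2.3977313, 2.0, 11),
    ("gene", 7.6173162, 9.0, 27),
]

coord = 5 / 25  # 5 of the 25 query terms matched
total = coord * sum(term_score(b, f, df, 0.0625) for _, b, f, df in terms)

print(round(total, 4))  # about 0.2955
```

Most of the total comes from the heavily boosted term "gene", which occurs nine times in the matched abstract.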