Document (#38644)

Author
Montalvo, S.
Martínez, R.
Fresno, V.
Delgado, A.
Title
Exploiting named entities for bilingual news clustering
Source
Journal of the Association for Information Science and Technology. 66(2015) no.2, pp.363-376
Year
2015
Abstract
In this article, we present a new algorithm for clustering a bilingual collection of comparable news items into groups of specific topics. Our hypothesis is that named entities (NEs) are more informative than other features in the news when clustering fine-grained topics. The algorithm does not need any information about the number of clusters as input, and carries out the clustering based only on the named entities shared by the news items. This proposal is evaluated using different data sets and outperforms other state-of-the-art algorithms, thereby supporting the plausibility of the approach. In addition, because the applicability of our approach depends on the possibility of identifying equivalent named entities among the news, we propose a heuristic system to identify equivalent named entities in the same and different languages, which obtains good performance.
Content
Cf.: http://onlinelibrary.wiley.com/doi/10.1002/asi.23175/abstract.

Similar documents (author)

  1. Delgado, A.D.; Martínez, R.; Montalvo, S.; Fresno, V.: Person name disambiguation in the Web using adaptive threshold clustering (2017) 3.92
    3.9245107 = sum of:
      3.9245107 = sum of:
        1.3228564 = weight(author_txt:martínez in 5695) [ClassicSimilarity], result of:
          1.3228564 = score(doc=5695,freq=1.0), product of:
            0.53728914 = queryWeight, product of:
              7.8787007 = idf(docFreq=43, maxDocs=42740)
              0.06819514 = queryNorm
            2.462094 = fieldWeight in 5695, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              7.8787007 = idf(docFreq=43, maxDocs=42740)
              0.3125 = fieldNorm(doc=5695)
        2.6016543 = weight(author_txt:delgado in 5695) [ClassicSimilarity], result of:
          2.6016543 = score(doc=5695,freq=1.0), product of:
            0.84339815 = queryWeight, product of:
              1.2528882 = boost
              9.871131 = idf(docFreq=5, maxDocs=42740)
              0.06819514 = queryNorm
            3.0847285 = fieldWeight in 5695, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              9.871131 = idf(docFreq=5, maxDocs=42740)
              0.3125 = fieldNorm(doc=5695)
    
  2. Delgado, Y. Hidalgo- => Hidalgo-Delgado, Y.: 2.21
    2.2075768 = sum of:
      2.2075768 = product of:
        4.4151535 = sum of:
          4.4151535 = weight(author_txt:delgado in 5706) [ClassicSimilarity], result of:
            4.4151535 = score(doc=5706,freq=2.0), product of:
              0.84339815 = queryWeight, product of:
                1.2528882 = boost
                9.871131 = idf(docFreq=5, maxDocs=42740)
                0.06819514 = queryNorm
              5.2349577 = fieldWeight in 5706, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.871131 = idf(docFreq=5, maxDocs=42740)
                0.375 = fieldNorm(doc=5706)
        0.5 = coord(1/2)
    
  3. Thelwall, M.; Delgado, M.M.: Arts and humanities research evaluation : no metrics please, just data (2015) 2.08
    2.0813234 = sum of:
      2.0813234 = product of:
        4.162647 = sum of:
          4.162647 = weight(author_txt:delgado in 4314) [ClassicSimilarity], result of:
            4.162647 = score(doc=4314,freq=1.0), product of:
              0.84339815 = queryWeight, product of:
                1.2528882 = boost
                9.871131 = idf(docFreq=5, maxDocs=42740)
                0.06819514 = queryNorm
              4.9355655 = fieldWeight in 4314, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.871131 = idf(docFreq=5, maxDocs=42740)
                0.5 = fieldNorm(doc=4314)
        0.5 = coord(1/2)
    
  4. Leiva-Mederos, A.; Senso, J.A.; Hidalgo-Delgado, Y.; Hipola, P.: Working framework of semantic interoperability for CRIS with heterogeneous data sources (2017) 1.30
    1.3008271 = sum of:
      1.3008271 = product of:
        2.6016543 = sum of:
          2.6016543 = weight(author_txt:delgado in 5707) [ClassicSimilarity], result of:
            2.6016543 = score(doc=5707,freq=1.0), product of:
              0.84339815 = queryWeight, product of:
                1.2528882 = boost
                9.871131 = idf(docFreq=5, maxDocs=42740)
                0.06819514 = queryNorm
              3.0847285 = fieldWeight in 5707, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.871131 = idf(docFreq=5, maxDocs=42740)
                0.3125 = fieldNorm(doc=5707)
        0.5 = coord(1/2)
    
  5. Arellano, F.F. Martínez => Martínez-Arellano, F.F.: 1.12
    1.1224808 = sum of:
      1.1224808 = product of:
        2.2449615 = sum of:
          2.2449615 = weight(author_txt:martínez in 52) [ClassicSimilarity], result of:
            2.2449615 = score(doc=52,freq=2.0), product of:
              0.53728914 = queryWeight, product of:
                7.8787007 = idf(docFreq=43, maxDocs=42740)
                0.06819514 = queryNorm
              4.178312 = fieldWeight in 52, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.8787007 = idf(docFreq=43, maxDocs=42740)
                0.375 = fieldNorm(doc=52)
        0.5 = coord(1/2)
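
The author-match explanations above can be reproduced by hand. What follows is a minimal sketch, assuming the standard Lucene ClassicSimilarity definitions idf = 1 + ln(maxDocs / (docFreq + 1)) and tf = sqrt(freq), both of which agree with the printed values. The function names (`idf`, `tf`, `term_score`) are hypothetical helpers for illustration, not Lucene API calls; all numeric inputs are copied from entries 1 and 2.

```python
import math

MAX_DOCS = 42740          # maxDocs as printed in the explanations
QUERY_NORM = 0.06819514   # queryNorm of the author query, as printed


def idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(freq: float) -> float:
    """Term frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


def term_score(freq, doc_freq, field_norm, boost=1.0, query_norm=QUERY_NORM):
    """Score of one term in one document: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * query_norm
    field_weight = tf(freq) * term_idf * field_norm
    return query_weight * field_weight


# Entry 1 (doc 5695): both query terms match, so the two term scores are summed.
martinez = term_score(freq=1.0, doc_freq=43, field_norm=0.3125)
delgado = term_score(freq=1.0, doc_freq=5, field_norm=0.3125, boost=1.2528882)
entry1 = martinez + delgado

# Entry 2 (doc 5706): only "delgado" matches, so coord(1/2) halves the sum.
entry2 = term_score(freq=2.0, doc_freq=5, field_norm=0.375, boost=1.2528882) * (1 / 2)

print(f"entry 1: {entry1:.4f}")
print(f"entry 2: {entry2:.4f}")
```

Small differences in the last digits are expected, because the catalog's figures are single-precision floats while Python uses double precision.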
    

Similar documents (content)

  1. Chen, H.-H.; Kuo, J.-J.; Huang, S.-J.; Lin, C.-J.; Wung, H.-C.: A summarization system for Chinese news from multiple sources (2003) 0.26
    0.25629425 = sum of:
      0.25629425 = product of:
        0.9153366 = sum of:
          0.014239543 = weight(abstract_txt:other in 3116) [ClassicSimilarity], result of:
            0.014239543 = score(doc=3116,freq=1.0), product of:
              0.051561967 = queryWeight, product of:
                1.0319799 = boost
                3.5348954 = idf(docFreq=3387, maxDocs=42740)
                0.014134539 = queryNorm
              0.2761637 = fieldWeight in 3116, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.5348954 = idf(docFreq=3387, maxDocs=42740)
                0.078125 = fieldNorm(doc=3116)
          0.081181206 = weight(abstract_txt:informative in 3116) [ClassicSimilarity], result of:
            0.081181206 = score(doc=3116,freq=2.0), product of:
              0.10366108 = queryWeight, product of:
                1.034664 = boost
                7.0881796 = idf(docFreq=96, maxDocs=42740)
                0.014134539 = queryNorm
              0.7831406 = fieldWeight in 3116, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.0881796 = idf(docFreq=96, maxDocs=42740)
                0.078125 = fieldNorm(doc=3116)
          0.016257672 = weight(abstract_txt:different in 3116) [ClassicSimilarity], result of:
            0.016257672 = score(doc=3116,freq=1.0), product of:
              0.056325406 = queryWeight, product of:
                1.0785956 = boost
                3.694571 = idf(docFreq=2887, maxDocs=42740)
                0.014134539 = queryNorm
              0.28863835 = fieldWeight in 3116, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.694571 = idf(docFreq=2887, maxDocs=42740)
                0.078125 = fieldNorm(doc=3116)
          0.06608297 = weight(abstract_txt:heuristic in 3116) [ClassicSimilarity], result of:
            0.06608297 = score(doc=3116,freq=1.0), product of:
              0.113862775 = queryWeight, product of:
                1.0843822 = boost
                7.428784 = idf(docFreq=68, maxDocs=42740)
                0.014134539 = queryNorm
              0.58037376 = fieldWeight in 3116, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.428784 = idf(docFreq=68, maxDocs=42740)
                0.078125 = fieldNorm(doc=3116)
          0.16351223 = weight(abstract_txt:entities in 3116) [ClassicSimilarity], result of:
            0.16351223 = score(doc=3116,freq=1.0), product of:
              0.3561877 = queryWeight, product of:
                4.2886043 = boost
                5.8759933 = idf(docFreq=325, maxDocs=42740)
                0.014134539 = queryNorm
              0.45906198 = fieldWeight in 3116, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8759933 = idf(docFreq=325, maxDocs=42740)
                0.078125 = fieldNorm(doc=3116)
          0.3130714 = weight(abstract_txt:news in 3116) [ClassicSimilarity], result of:
            0.3130714 = score(doc=3116,freq=3.0), product of:
              0.3808032 = queryWeight, product of:
                4.4343176 = boost
                6.0756416 = idf(docFreq=266, maxDocs=42740)
                0.014134539 = queryNorm
              0.8221344 = fieldWeight in 3116, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.0756416 = idf(docFreq=266, maxDocs=42740)
                0.078125 = fieldNorm(doc=3116)
          0.26099154 = weight(abstract_txt:named in 3116) [ClassicSimilarity], result of:
            0.26099154 = score(doc=3116,freq=1.0), product of:
              0.48647782 = queryWeight, product of:
                5.011965 = boost
                6.8671 = idf(docFreq=120, maxDocs=42740)
                0.014134539 = queryNorm
              0.53649217 = fieldWeight in 3116, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.8671 = idf(docFreq=120, maxDocs=42740)
                0.078125 = fieldNorm(doc=3116)
        0.28 = coord(7/25)
    
  2. Husevag, A.-S.R.: Named entities in indexing : a case study of TV subtitles and metadata records (2016) 0.17
    0.16746591 = sum of:
      0.16746591 = product of:
        1.046662 = sum of:
          0.019509207 = weight(abstract_txt:different in 5106) [ClassicSimilarity], result of:
            0.019509207 = score(doc=5106,freq=1.0), product of:
              0.056325406 = queryWeight, product of:
                1.0785956 = boost
                3.694571 = idf(docFreq=2887, maxDocs=42740)
                0.014134539 = queryNorm
              0.34636605 = fieldWeight in 5106, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.694571 = idf(docFreq=2887, maxDocs=42740)
                0.09375 = fieldNorm(doc=5106)
          0.27748942 = weight(abstract_txt:entities in 5106) [ClassicSimilarity], result of:
            0.27748942 = score(doc=5106,freq=2.0), product of:
              0.3561877 = queryWeight, product of:
                4.2886043 = boost
                5.8759933 = idf(docFreq=325, maxDocs=42740)
                0.014134539 = queryNorm
              0.7790539 = fieldWeight in 5106, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.8759933 = idf(docFreq=325, maxDocs=42740)
                0.09375 = fieldNorm(doc=5106)
          0.30674607 = weight(abstract_txt:news in 5106) [ClassicSimilarity], result of:
            0.30674607 = score(doc=5106,freq=2.0), product of:
              0.3808032 = queryWeight, product of:
                4.4343176 = boost
                6.0756416 = idf(docFreq=266, maxDocs=42740)
                0.014134539 = queryNorm
              0.8055239 = fieldWeight in 5106, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.0756416 = idf(docFreq=266, maxDocs=42740)
                0.09375 = fieldNorm(doc=5106)
          0.44291732 = weight(abstract_txt:named in 5106) [ClassicSimilarity], result of:
            0.44291732 = score(doc=5106,freq=2.0), product of:
              0.48647782 = queryWeight, product of:
                5.011965 = boost
                6.8671 = idf(docFreq=120, maxDocs=42740)
                0.014134539 = queryNorm
              0.9104574 = fieldWeight in 5106, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.8671 = idf(docFreq=120, maxDocs=42740)
                0.09375 = fieldNorm(doc=5106)
        0.16 = coord(4/25)
    
  3. Li, C.; Sun, A.: Extracting fine-grained location with temporal awareness in tweets : a two-stage approach (2017) 0.14
    0.14320455 = sum of:
      0.14320455 = product of:
        0.44751424 = sum of:
          0.009967681 = weight(abstract_txt:other in 5687) [ClassicSimilarity], result of:
            0.009967681 = score(doc=5687,freq=1.0), product of:
              0.051561967 = queryWeight, product of:
                1.0319799 = boost
                3.5348954 = idf(docFreq=3387, maxDocs=42740)
                0.014134539 = queryNorm
              0.1933146 = fieldWeight in 5687, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.5348954 = idf(docFreq=3387, maxDocs=42740)
                0.0546875 = fieldNorm(doc=5687)
          0.042264614 = weight(abstract_txt:outperforms in 5687) [ClassicSimilarity], result of:
            0.042264614 = score(doc=5687,freq=1.0), product of:
              0.107211486 = queryWeight, product of:
                1.0522336 = boost
                7.2085433 = idf(docFreq=85, maxDocs=42740)
                0.014134539 = queryNorm
              0.39421722 = fieldWeight in 5687, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.2085433 = idf(docFreq=85, maxDocs=42740)
                0.0546875 = fieldNorm(doc=5687)
          0.04289224 = weight(abstract_txt:exploiting in 5687) [ClassicSimilarity], result of:
            0.04289224 = score(doc=5687,freq=1.0), product of:
              0.108270265 = queryWeight, product of:
                1.0574166 = boost
                7.24405 = idf(docFreq=82, maxDocs=42740)
                0.014134539 = queryNorm
              0.396159 = fieldWeight in 5687, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.24405 = idf(docFreq=82, maxDocs=42740)
                0.0546875 = fieldNorm(doc=5687)
          0.062899135 = weight(abstract_txt:fine in 5687) [ClassicSimilarity], result of:
            0.062899135 = score(doc=5687,freq=2.0), product of:
              0.11091999 = queryWeight, product of:
                1.0702776 = boost
                7.332157 = idf(docFreq=75, maxDocs=42740)
                0.014134539 = queryNorm
              0.5670676 = fieldWeight in 5687, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.332157 = idf(docFreq=75, maxDocs=42740)
                0.0546875 = fieldNorm(doc=5687)
          0.011380372 = weight(abstract_txt:different in 5687) [ClassicSimilarity], result of:
            0.011380372 = score(doc=5687,freq=1.0), product of:
              0.056325406 = queryWeight, product of:
                1.0785956 = boost
                3.694571 = idf(docFreq=2887, maxDocs=42740)
                0.014134539 = queryNorm
              0.20204686 = fieldWeight in 5687, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.694571 = idf(docFreq=2887, maxDocs=42740)
                0.0546875 = fieldNorm(doc=5687)
          0.012113953 = weight(abstract_txt:approach in 5687) [ClassicSimilarity], result of:
            0.012113953 = score(doc=5687,freq=1.0), product of:
              0.058720622 = queryWeight, product of:
                1.1012903 = boost
                3.772308 = idf(docFreq=2671, maxDocs=42740)
                0.014134539 = queryNorm
              0.2062981 = fieldWeight in 5687, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.772308 = idf(docFreq=2671, maxDocs=42740)
                0.0546875 = fieldNorm(doc=5687)
          0.08330215 = weight(abstract_txt:grained in 5687) [ClassicSimilarity], result of:
            0.08330215 = score(doc=5687,freq=2.0), product of:
              0.13376758 = queryWeight, product of:
                1.1753492 = boost
                8.051972 = idf(docFreq=36, maxDocs=42740)
                0.014134539 = queryNorm
              0.62273794 = fieldWeight in 5687, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.051972 = idf(docFreq=36, maxDocs=42740)
                0.0546875 = fieldNorm(doc=5687)
          0.18269408 = weight(abstract_txt:named in 5687) [ClassicSimilarity], result of:
            0.18269408 = score(doc=5687,freq=1.0), product of:
              0.48647782 = queryWeight, product of:
                5.011965 = boost
                6.8671 = idf(docFreq=120, maxDocs=42740)
                0.014134539 = queryNorm
              0.37554452 = fieldWeight in 5687, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.8671 = idf(docFreq=120, maxDocs=42740)
                0.0546875 = fieldNorm(doc=5687)
        0.32 = coord(8/25)
    
  4. Xu, S.; Zhai, D.; Wang, F.; An, X.; Pang, H.; Sun, Y.: A novel method for topic linkages between scientific publications and patents (2019) 0.13
    0.13451824 = sum of:
      0.13451824 = product of:
        0.56049263 = sum of:
          0.013844517 = weight(abstract_txt:approach in 1361) [ClassicSimilarity], result of:
            0.013844517 = score(doc=1361,freq=1.0), product of:
              0.058720622 = queryWeight, product of:
                1.1012903 = boost
                3.772308 = idf(docFreq=2671, maxDocs=42740)
                0.014134539 = queryNorm
              0.23576926 = fieldWeight in 1361, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.772308 = idf(docFreq=2671, maxDocs=42740)
                0.0625 = fieldNorm(doc=1361)
          0.03462364 = weight(abstract_txt:topics in 1361) [ClassicSimilarity], result of:
            0.03462364 = score(doc=1361,freq=1.0), product of:
              0.10819003 = queryWeight, product of:
                1.4948586 = boost
                5.1204185 = idf(docFreq=693, maxDocs=42740)
                0.014134539 = queryNorm
              0.32002616 = fieldWeight in 1361, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.1204185 = idf(docFreq=693, maxDocs=42740)
                0.0625 = fieldNorm(doc=1361)
          0.048268653 = weight(abstract_txt:algorithm in 1361) [ClassicSimilarity], result of:
            0.048268653 = score(doc=1361,freq=1.0), product of:
              0.13501506 = queryWeight, product of:
                1.6699275 = boost
                5.7200913 = idf(docFreq=380, maxDocs=42740)
                0.014134539 = queryNorm
              0.3575057 = fieldWeight in 1361, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7200913 = idf(docFreq=380, maxDocs=42740)
                0.0625 = fieldNorm(doc=1361)
          0.1241528 = weight(abstract_txt:clustering in 1361) [ClassicSimilarity], result of:
            0.1241528 = score(doc=1361,freq=1.0), product of:
              0.31933984 = queryWeight, product of:
                3.6320186 = boost
                6.220473 = idf(docFreq=230, maxDocs=42740)
                0.014134539 = queryNorm
              0.38877955 = fieldWeight in 1361, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.220473 = idf(docFreq=230, maxDocs=42740)
                0.0625 = fieldNorm(doc=1361)
          0.13080978 = weight(abstract_txt:entities in 1361) [ClassicSimilarity], result of:
            0.13080978 = score(doc=1361,freq=1.0), product of:
              0.3561877 = queryWeight, product of:
                4.2886043 = boost
                5.8759933 = idf(docFreq=325, maxDocs=42740)
                0.014134539 = queryNorm
              0.36724958 = fieldWeight in 1361, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8759933 = idf(docFreq=325, maxDocs=42740)
                0.0625 = fieldNorm(doc=1361)
          0.20879324 = weight(abstract_txt:named in 1361) [ClassicSimilarity], result of:
            0.20879324 = score(doc=1361,freq=1.0), product of:
              0.48647782 = queryWeight, product of:
                5.011965 = boost
                6.8671 = idf(docFreq=120, maxDocs=42740)
                0.014134539 = queryNorm
              0.42919374 = fieldWeight in 1361, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.8671 = idf(docFreq=120, maxDocs=42740)
                0.0625 = fieldNorm(doc=1361)
        0.24 = coord(6/25)
    
  5. Shaalan, K.; Raza, H.: NERA: Named Entity Recognition for Arabic (2009) 0.12
    0.11847814 = sum of:
      0.11847814 = product of:
        0.7404884 = sum of:
          0.016094275 = weight(abstract_txt:different in 4954) [ClassicSimilarity], result of:
            0.016094275 = score(doc=4954,freq=2.0), product of:
              0.056325406 = queryWeight, product of:
                1.0785956 = boost
                3.694571 = idf(docFreq=2887, maxDocs=42740)
                0.014134539 = queryNorm
              0.2857374 = fieldWeight in 4954, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.694571 = idf(docFreq=2887, maxDocs=42740)
                0.0546875 = fieldNorm(doc=4954)
          0.012113953 = weight(abstract_txt:approach in 4954) [ClassicSimilarity], result of:
            0.012113953 = score(doc=4954,freq=1.0), product of:
              0.058720622 = queryWeight, product of:
                1.1012903 = boost
                3.772308 = idf(docFreq=2671, maxDocs=42740)
                0.014134539 = queryNorm
              0.2062981 = fieldWeight in 4954, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.772308 = idf(docFreq=2671, maxDocs=42740)
                0.0546875 = fieldNorm(doc=4954)
          0.22891712 = weight(abstract_txt:entities in 4954) [ClassicSimilarity], result of:
            0.22891712 = score(doc=4954,freq=4.0), product of:
              0.3561877 = queryWeight, product of:
                4.2886043 = boost
                5.8759933 = idf(docFreq=325, maxDocs=42740)
                0.014134539 = queryNorm
              0.6426868 = fieldWeight in 4954, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.8759933 = idf(docFreq=325, maxDocs=42740)
                0.0546875 = fieldNorm(doc=4954)
          0.4833631 = weight(abstract_txt:named in 4954) [ClassicSimilarity], result of:
            0.4833631 = score(doc=4954,freq=7.0), product of:
              0.48647782 = queryWeight, product of:
                5.011965 = boost
                6.8671 = idf(docFreq=120, maxDocs=42740)
                0.014134539 = queryNorm
              0.9935974 = fieldWeight in 4954, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                6.8671 = idf(docFreq=120, maxDocs=42740)
                0.0546875 = fieldNorm(doc=4954)
        0.16 = coord(4/25)
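
For the content-based matches, the final score is the sum of the matching term scores scaled by the coordination factor, that is, the share of the 25 query terms found in the document. The sketch below recomputes entry 2 (doc 5106) under the same assumed ClassicSimilarity formulas (idf = 1 + ln(maxDocs / (docFreq + 1)), tf = sqrt(freq)). The table layout and the helper name `coord_score` are illustrative assumptions; the numbers are copied from the explanation above.

```python
import math

MAX_DOCS = 42740
QUERY_NORM = 0.014134539   # queryNorm of the content query, as printed
QUERY_TERMS = 25           # denominator of coord(n/25)

# (term, freq, docFreq, boost) for each query term matched in doc 5106.
MATCHES_5106 = [
    ("different", 1.0, 2887, 1.0785956),
    ("entities", 2.0, 325, 4.2886043),
    ("news", 2.0, 266, 4.4343176),
    ("named", 2.0, 120, 5.011965),
]
FIELD_NORM_5106 = 0.09375


def coord_score(matches, field_norm, query_terms=QUERY_TERMS):
    """Sum queryWeight * fieldWeight over matched terms, then apply coord."""
    total = 0.0
    for _term, freq, doc_freq, boost in matches:
        term_idf = 1.0 + math.log(MAX_DOCS / (doc_freq + 1))
        query_weight = boost * term_idf * QUERY_NORM
        field_weight = math.sqrt(freq) * term_idf * field_norm
        total += query_weight * field_weight
    coord = len(matches) / query_terms   # 4 of 25 terms matched here
    return total, coord, total * coord


raw_sum, coord, final = coord_score(MATCHES_5106, FIELD_NORM_5106)
print(f"sum={raw_sum:.4f} coord={coord:.2f} score={final:.4f}")
```

This also shows why entry 1 outranks entry 2 despite a smaller raw sum (0.9153366 against 1.046662): it matches 7 of the 25 query terms rather than 4, so its coordination factor is larger.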