Document (#38344)

Author
Iorio, A.D.
Peroni, S.
Poggi, F.
Vitali, F.
Title
Dealing with structural patterns of XML documents
Source
Journal of the Association for Information Science and Technology. 65(2014) no.9, pp. 1884-1900
Year
2014
Abstract
Evaluating collections of XML documents without paying attention to the schema they were written in may give interesting insights into the expected characteristics of a markup language, as well as any regularities that may span vocabularies and languages and that are more fundamental and frequent than plain content models. In this paper we explore the idea of structural patterns in XML vocabularies by examining the characteristics of elements as they are used, rather than as they are defined. We introduce from the ground up a formal theory of 8 plus 3 structural patterns for XML elements, and verify their identifiability in a number of different XML vocabularies. The results allowed the creation of visualization and content extraction tools that are completely independent of the schema and work without any previous knowledge of the semantics and organization of the XML vocabulary of the documents.
Object
XML

Similar documents (content)

  1. Zhu, B.; Chen, H.: Information visualization (2004) 0.14
    0.14251855 = sum of:
      0.14251855 = product of:
        0.44537047 = sum of:
          0.13109714 = weight(abstract_txt:visualization in 274) [ClassicSimilarity], result of:
            0.13109714 = score(doc=274,freq=18.0), product of:
              0.12686913 = queryWeight, product of:
                1.0054669 = boost
                6.2350655 = idf(docFreq=231, maxDocs=43556)
                0.020237047 = queryNorm
              1.0333258 = fieldWeight in 274, product of:
                4.2426405 = tf(freq=18.0), with freq of:
                  18.0 = termFreq=18.0
                6.2350655 = idf(docFreq=231, maxDocs=43556)
                0.0390625 = fieldNorm(doc=274)
          0.0089492 = weight(abstract_txt:that in 274) [ClassicSimilarity], result of:
            0.0089492 = score(doc=274,freq=3.0), product of:
              0.05553594 = queryWeight, product of:
                1.1522256 = boost
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.020237047 = queryNorm
              0.1611425 = fieldWeight in 274, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.0390625 = fieldNorm(doc=274)
          0.015112007 = weight(abstract_txt:than in 274) [ClassicSimilarity], result of:
            0.015112007 = score(doc=274,freq=1.0), product of:
              0.099222325 = queryWeight, product of:
                1.2575045 = boost
                3.8989954 = idf(docFreq=2398, maxDocs=43556)
                0.020237047 = queryNorm
              0.1523045 = fieldWeight in 274, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.8989954 = idf(docFreq=2398, maxDocs=43556)
                0.0390625 = fieldNorm(doc=274)
          0.073001355 = weight(abstract_txt:without in 274) [ClassicSimilarity], result of:
            0.073001355 = score(doc=274,freq=4.0), product of:
              0.17861937 = queryWeight, product of:
                1.6872098 = boost
                5.2313323 = idf(docFreq=632, maxDocs=43556)
                0.020237047 = queryNorm
              0.40869784 = fieldWeight in 274, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.2313323 = idf(docFreq=632, maxDocs=43556)
                0.0390625 = fieldNorm(doc=274)
          0.037315726 = weight(abstract_txt:elements in 274) [ClassicSimilarity], result of:
            0.037315726 = score(doc=274,freq=1.0), product of:
              0.18126859 = queryWeight, product of:
                1.6996759 = boost
                5.2699842 = idf(docFreq=608, maxDocs=43556)
                0.020237047 = queryNorm
              0.20585877 = fieldWeight in 274, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2699842 = idf(docFreq=608, maxDocs=43556)
                0.0390625 = fieldNorm(doc=274)
          0.035403024 = weight(abstract_txt:they in 274) [ClassicSimilarity], result of:
            0.035403024 = score(doc=274,freq=3.0), product of:
              0.13891363 = queryWeight, product of:
                1.8223126 = boost
                3.7668197 = idf(docFreq=2737, maxDocs=43556)
                0.020237047 = queryNorm
              0.25485638 = fieldWeight in 274, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.7668197 = idf(docFreq=2737, maxDocs=43556)
                0.0390625 = fieldNorm(doc=274)
          0.046347138 = weight(abstract_txt:documents in 274) [ClassicSimilarity], result of:
            0.046347138 = score(doc=274,freq=3.0), product of:
              0.16623913 = queryWeight, product of:
                1.9935038 = boost
                4.1206813 = idf(docFreq=1921, maxDocs=43556)
                0.020237047 = queryNorm
              0.278798 = fieldWeight in 274, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.1206813 = idf(docFreq=1921, maxDocs=43556)
                0.0390625 = fieldNorm(doc=274)
          0.09814486 = weight(abstract_txt:patterns in 274) [ClassicSimilarity], result of:
            0.09814486 = score(doc=274,freq=3.0), product of:
              0.27413407 = queryWeight, product of:
                2.5599527 = boost
                5.291562 = idf(docFreq=595, maxDocs=43556)
                0.020237047 = queryNorm
              0.35801774 = fieldWeight in 274, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.291562 = idf(docFreq=595, maxDocs=43556)
                0.0390625 = fieldNorm(doc=274)
        0.32 = coord(8/25)
    
  2. Fairthorne, R.A.: Temporal structure in bibliographic classification (1985) 0.14
    0.1351208 = sum of:
      0.1351208 = product of:
        0.3753355 = sum of:
          0.0303986 = weight(abstract_txt:interesting in 4649) [ClassicSimilarity], result of:
            0.0303986 = score(doc=4649,freq=1.0), product of:
              0.12549324 = queryWeight, product of:
                6.201164 = idf(docFreq=239, maxDocs=43556)
                0.020237047 = queryNorm
              0.24223296 = fieldWeight in 4649, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.201164 = idf(docFreq=239, maxDocs=43556)
                0.0390625 = fieldNorm(doc=4649)
          0.043375753 = weight(abstract_txt:ground in 4649) [ClassicSimilarity], result of:
            0.043375753 = score(doc=4649,freq=1.0), product of:
              0.15905572 = queryWeight, product of:
                1.1258084 = boost
                6.9813223 = idf(docFreq=109, maxDocs=43556)
                0.020237047 = queryNorm
              0.2727079 = fieldWeight in 4649, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9813223 = idf(docFreq=109, maxDocs=43556)
                0.0390625 = fieldNorm(doc=4649)
          0.012656081 = weight(abstract_txt:that in 4649) [ClassicSimilarity], result of:
            0.012656081 = score(doc=4649,freq=6.0), product of:
              0.05553594 = queryWeight, product of:
                1.1522256 = boost
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.020237047 = queryNorm
              0.22788993 = fieldWeight in 4649, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.0390625 = fieldNorm(doc=4649)
          0.021371605 = weight(abstract_txt:than in 4649) [ClassicSimilarity], result of:
            0.021371605 = score(doc=4649,freq=2.0), product of:
              0.099222325 = queryWeight, product of:
                1.2575045 = boost
                3.8989954 = idf(docFreq=2398, maxDocs=43556)
                0.020237047 = queryNorm
              0.21539108 = fieldWeight in 4649, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.8989954 = idf(docFreq=2398, maxDocs=43556)
                0.0390625 = fieldNorm(doc=4649)
          0.029884404 = weight(abstract_txt:characteristics in 4649) [ClassicSimilarity], result of:
            0.029884404 = score(doc=4649,freq=1.0), product of:
              0.15632354 = queryWeight, product of:
                1.5783998 = boost
                4.8939576 = idf(docFreq=886, maxDocs=43556)
                0.020237047 = queryNorm
              0.19117022 = fieldWeight in 4649, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8939576 = idf(docFreq=886, maxDocs=43556)
                0.0390625 = fieldNorm(doc=4649)
          0.036500677 = weight(abstract_txt:without in 4649) [ClassicSimilarity], result of:
            0.036500677 = score(doc=4649,freq=1.0), product of:
              0.17861937 = queryWeight, product of:
                1.6872098 = boost
                5.2313323 = idf(docFreq=632, maxDocs=43556)
                0.020237047 = queryNorm
              0.20434892 = fieldWeight in 4649, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2313323 = idf(docFreq=632, maxDocs=43556)
                0.0390625 = fieldNorm(doc=4649)
          0.028906448 = weight(abstract_txt:they in 4649) [ClassicSimilarity], result of:
            0.028906448 = score(doc=4649,freq=2.0), product of:
              0.13891363 = queryWeight, product of:
                1.8223126 = boost
                3.7668197 = idf(docFreq=2737, maxDocs=43556)
                0.020237047 = queryNorm
              0.20808935 = fieldWeight in 4649, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.7668197 = idf(docFreq=2737, maxDocs=43556)
                0.0390625 = fieldNorm(doc=4649)
          0.10363536 = weight(abstract_txt:documents in 4649) [ClassicSimilarity], result of:
            0.10363536 = score(doc=4649,freq=15.0), product of:
              0.16623913 = queryWeight, product of:
                1.9935038 = boost
                4.1206813 = idf(docFreq=1921, maxDocs=43556)
                0.020237047 = queryNorm
              0.62341136 = fieldWeight in 4649, product of:
                3.8729835 = tf(freq=15.0), with freq of:
                  15.0 = termFreq=15.0
                4.1206813 = idf(docFreq=1921, maxDocs=43556)
                0.0390625 = fieldNorm(doc=4649)
          0.0686066 = weight(abstract_txt:schema in 4649) [ClassicSimilarity], result of:
            0.0686066 = score(doc=4649,freq=1.0), product of:
              0.27204362 = queryWeight, product of:
                2.082208 = boost
                6.456056 = idf(docFreq=185, maxDocs=43556)
                0.020237047 = queryNorm
              0.2521897 = fieldWeight in 4649, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.456056 = idf(docFreq=185, maxDocs=43556)
                0.0390625 = fieldNorm(doc=4649)
        0.36 = coord(9/25)
    
  3. Jia, J.: From data to knowledge : the relationships between vocabularies, linked data and knowledge graphs (2021) 0.11
    0.11036253 = sum of:
      0.11036253 = product of:
        0.55181265 = sum of:
          0.06808394 = weight(abstract_txt:frequent in 2393) [ClassicSimilarity], result of:
            0.06808394 = score(doc=2393,freq=1.0), product of:
              0.15703668 = queryWeight, product of:
                1.1186401 = boost
                6.9368706 = idf(docFreq=114, maxDocs=43556)
                0.020237047 = queryNorm
              0.4335544 = fieldWeight in 2393, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9368706 = idf(docFreq=114, maxDocs=43556)
                0.0625 = fieldNorm(doc=2393)
          0.008266917 = weight(abstract_txt:that in 2393) [ClassicSimilarity], result of:
            0.008266917 = score(doc=2393,freq=1.0), product of:
              0.05553594 = queryWeight, product of:
                1.1522256 = boost
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.020237047 = queryNorm
              0.14885707 = fieldWeight in 2393, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.0625 = fieldNorm(doc=2393)
          0.032703914 = weight(abstract_txt:they in 2393) [ClassicSimilarity], result of:
            0.032703914 = score(doc=2393,freq=1.0), product of:
              0.13891363 = queryWeight, product of:
                1.8223126 = boost
                3.7668197 = idf(docFreq=2737, maxDocs=43556)
                0.020237047 = queryNorm
              0.23542623 = fieldWeight in 2393, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.7668197 = idf(docFreq=2737, maxDocs=43556)
                0.0625 = fieldNorm(doc=2393)
          0.15523902 = weight(abstract_txt:schema in 2393) [ClassicSimilarity], result of:
            0.15523902 = score(doc=2393,freq=2.0), product of:
              0.27204362 = queryWeight, product of:
                2.082208 = boost
                6.456056 = idf(docFreq=185, maxDocs=43556)
                0.020237047 = queryNorm
              0.57064015 = fieldWeight in 2393, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.456056 = idf(docFreq=185, maxDocs=43556)
                0.0625 = fieldNorm(doc=2393)
          0.28751883 = weight(abstract_txt:vocabularies in 2393) [ClassicSimilarity], result of:
            0.28751883 = score(doc=2393,freq=5.0), product of:
              0.346045 = queryWeight, product of:
                2.8761845 = boost
                5.9452305 = idf(docFreq=309, maxDocs=43556)
                0.020237047 = queryNorm
              0.8308712 = fieldWeight in 2393, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.9452305 = idf(docFreq=309, maxDocs=43556)
                0.0625 = fieldNorm(doc=2393)
        0.2 = coord(5/25)
    
  4. Wusteman, J.: Document Type Definition (DTD) (2009) 0.11
    0.10963509 = sum of:
      0.10963509 = product of:
        0.54817545 = sum of:
          0.14127442 = weight(abstract_txt:markup in 764) [ClassicSimilarity], result of:
            0.14127442 = score(doc=764,freq=2.0), product of:
              0.15474246 = queryWeight, product of:
                1.1104387 = boost
                6.886012 = idf(docFreq=120, maxDocs=43556)
                0.020237047 = queryNorm
              0.9129648 = fieldWeight in 764, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.886012 = idf(docFreq=120, maxDocs=43556)
                0.09375 = fieldNorm(doc=764)
          0.012400377 = weight(abstract_txt:that in 764) [ClassicSimilarity], result of:
            0.012400377 = score(doc=764,freq=1.0), product of:
              0.05553594 = queryWeight, product of:
                1.1522256 = boost
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.020237047 = queryNorm
              0.22328562 = fieldWeight in 764, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.09375 = fieldNorm(doc=764)
          0.045087937 = weight(abstract_txt:content in 764) [ClassicSimilarity], result of:
            0.045087937 = score(doc=764,freq=1.0), product of:
              0.114716895 = queryWeight, product of:
                1.3521302 = boost
                4.1923904 = idf(docFreq=1788, maxDocs=43556)
                0.020237047 = queryNorm
              0.3930366 = fieldWeight in 764, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1923904 = idf(docFreq=1788, maxDocs=43556)
                0.09375 = fieldNorm(doc=764)
          0.06422048 = weight(abstract_txt:documents in 764) [ClassicSimilarity], result of:
            0.06422048 = score(doc=764,freq=1.0), product of:
              0.16623913 = queryWeight, product of:
                1.9935038 = boost
                4.1206813 = idf(docFreq=1921, maxDocs=43556)
                0.020237047 = queryNorm
              0.38631386 = fieldWeight in 764, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1206813 = idf(docFreq=1921, maxDocs=43556)
                0.09375 = fieldNorm(doc=764)
          0.28519225 = weight(abstract_txt:schema in 764) [ClassicSimilarity], result of:
            0.28519225 = score(doc=764,freq=3.0), product of:
              0.27204362 = queryWeight, product of:
                2.082208 = boost
                6.456056 = idf(docFreq=185, maxDocs=43556)
                0.020237047 = queryNorm
              1.0483328 = fieldWeight in 764, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.456056 = idf(docFreq=185, maxDocs=43556)
                0.09375 = fieldNorm(doc=764)
        0.2 = coord(5/25)
    
  5. Graus, D.; Odijk, D.; Rijke, M. de: The birth of collective memories : analyzing emerging entities in text streams (2018) 0.11
    0.10860748 = sum of:
      0.10860748 = product of:
        0.45253116 = sum of:
          0.02024973 = weight(abstract_txt:that in 538) [ClassicSimilarity], result of:
            0.02024973 = score(doc=538,freq=6.0), product of:
              0.05553594 = queryWeight, product of:
                1.1522256 = boost
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.020237047 = queryNorm
              0.36462387 = fieldWeight in 538, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                2.3817132 = idf(docFreq=10938, maxDocs=43556)
                0.0625 = fieldNorm(doc=538)
          0.13744295 = weight(abstract_txt:span in 538) [ClassicSimilarity], result of:
            0.13744295 = score(doc=538,freq=2.0), product of:
              0.19908702 = queryWeight, product of:
                1.259538 = boost
                7.8106017 = idf(docFreq=47, maxDocs=43556)
                0.020237047 = queryNorm
              0.69036615 = fieldWeight in 538, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.8106017 = idf(docFreq=47, maxDocs=43556)
                0.0625 = fieldNorm(doc=538)
          0.05840108 = weight(abstract_txt:without in 538) [ClassicSimilarity], result of:
            0.05840108 = score(doc=538,freq=1.0), product of:
              0.17861937 = queryWeight, product of:
                1.6872098 = boost
                5.2313323 = idf(docFreq=632, maxDocs=43556)
                0.020237047 = queryNorm
              0.32695827 = fieldWeight in 538, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2313323 = idf(docFreq=632, maxDocs=43556)
                0.0625 = fieldNorm(doc=538)
          0.06540783 = weight(abstract_txt:they in 538) [ClassicSimilarity], result of:
            0.06540783 = score(doc=538,freq=4.0), product of:
              0.13891363 = queryWeight, product of:
                1.8223126 = boost
                3.7668197 = idf(docFreq=2737, maxDocs=43556)
                0.020237047 = queryNorm
              0.47085246 = fieldWeight in 538, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.7668197 = idf(docFreq=2737, maxDocs=43556)
                0.0625 = fieldNorm(doc=538)
          0.042813655 = weight(abstract_txt:documents in 538) [ClassicSimilarity], result of:
            0.042813655 = score(doc=538,freq=1.0), product of:
              0.16623913 = queryWeight, product of:
                1.9935038 = boost
                4.1206813 = idf(docFreq=1921, maxDocs=43556)
                0.020237047 = queryNorm
              0.25754258 = fieldWeight in 538, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1206813 = idf(docFreq=1921, maxDocs=43556)
                0.0625 = fieldNorm(doc=538)
          0.12821591 = weight(abstract_txt:patterns in 538) [ClassicSimilarity], result of:
            0.12821591 = score(doc=538,freq=2.0), product of:
              0.27413407 = queryWeight, product of:
                2.5599527 = boost
                5.291562 = idf(docFreq=595, maxDocs=43556)
                0.020237047 = queryNorm
              0.46771243 = fieldWeight in 538, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.291562 = idf(docFreq=595, maxDocs=43556)
                0.0625 = fieldNorm(doc=538)
        0.24 = coord(6/25)
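
Note on reading the score trees above: they appear to follow the classic Lucene TF-IDF scheme, in which each matching term contributes queryWeight * fieldWeight, the contributions are summed, and the sum is scaled by the coord factor (matched terms / query terms). The Python sketch below recomputes the figures for entry 1 from the values printed in the listing. The function names are hypothetical, and the idf formula 1 + ln(maxDocs / (docFreq + 1)) is an assumption inferred from the printed numbers, not something taken from the system's configuration.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Assumed idf form; it reproduces the idf values printed in the listing.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq: float, idf: float, boost: float,
               query_norm: float, field_norm: float) -> float:
    tf = math.sqrt(freq)                    # tf(freq=18.0) -> 4.2426405
    query_weight = boost * idf * query_norm
    field_weight = tf * idf * field_norm
    return query_weight * field_weight


def document_score(term_scores: list[float], matched: int, total: int) -> float:
    # coord(matched/total) scales the summed term contributions.
    return sum(term_scores) * (matched / total)


# Values copied from entry 1, term "visualization" in doc 274.
idf_visualization = classic_idf(doc_freq=231, max_docs=43556)
visualization_score = term_score(
    freq=18.0,
    idf=idf_visualization,
    boost=1.0054669,
    query_norm=0.020237047,
    field_norm=0.0390625,
)

# The eight term contributions listed for entry 1, combined with coord(8/25).
entry1_terms = [
    0.13109714, 0.0089492, 0.015112007, 0.073001355,
    0.037315726, 0.035403024, 0.046347138, 0.09814486,
]
entry1_total = document_score(entry1_terms, matched=8, total=25)

print(round(visualization_score, 5), round(entry1_total, 5))
```

The same three functions can be applied to the other entries by substituting their freq, docFreq, boost, fieldNorm and coord values; terms without a printed boost line can be treated as having boost 1.0.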