Document (#38346)

Author
Iorio, A.D.
Peroni, S.
Poggi, F.
Vitali, F.
Title
Dealing with structural patterns of XML documents
Source
Journal of the Association for Information Science and Technology. 65(2014) no.9, p.1884-1900
Year
2014
Abstract
Evaluating collections of XML documents without paying attention to the schema they were written in may give interesting insights into the expected characteristics of a markup language, as well as into any regularities that span vocabularies and languages and that are more fundamental and frequent than plain content models. In this paper we explore the idea of structural patterns in XML vocabularies by examining the characteristics of elements as they are used, rather than as they are defined. We introduce from the ground up a formal theory of 8 plus 3 structural patterns for XML elements, and verify their identifiability in a number of different XML vocabularies. The results allowed the creation of visualization and content extraction tools that are completely independent of the schema and require no previous knowledge of the semantics and organization of the XML vocabulary of the documents.
Object
XML
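
The abstract's idea of examining elements "as they are used, rather than as they are defined" can be illustrated with a small sketch. It scans a document without any schema and records, per element name, whether instances were observed to contain text, child elements, both, or neither. The four labels and the function names (`observe_content`, `classify`) are assumptions made for illustration only; they are not the 8 plus 3 patterns of the paper, which the record does not enumerate.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def observe_content(xml_text):
    """Record, per element name, what kind of content was actually observed."""
    root = ET.fromstring(xml_text)
    usage = defaultdict(lambda: {"text": False, "elements": False})
    for el in root.iter():
        seen = usage[el.tag]
        children = list(el)
        if children:
            seen["elements"] = True
        # Text directly inside el: its own .text plus the .tail of each child.
        chunks = [el.text] + [child.tail for child in children]
        if any(chunk and chunk.strip() for chunk in chunks):
            seen["text"] = True
    return dict(usage)


def classify(seen):
    """Map observed usage to one of four illustrative (hypothetical) labels."""
    if seen["text"] and seen["elements"]:
        return "mixed"
    if seen["elements"]:
        return "elements-only"
    if seen["text"]:
        return "text-only"
    return "empty"


sample = "<doc><p>Hello <b>world</b>!</p><br/><sec><p>x</p></sec></doc>"
classes = {tag: classify(seen) for tag, seen in observe_content(sample).items()}
for tag, label in classes.items():
    print(tag, label)
```

Because the classification aggregates over all instances of an element name, `p` is labelled by its richest observed usage even though one of its instances holds only text.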

Similar documents (content)

  1. Zhu, B.; Chen, H.: Information visualization (2004) 0.14
    0.14240074 = sum of:
      0.14240074 = product of:
        0.44500232 = sum of:
          0.13121739 = weight(abstract_txt:visualization in 4276) [ClassicSimilarity], result of:
            0.13121739 = score(doc=4276,freq=18.0), product of:
              0.12711269 = queryWeight, product of:
                1.0053582 = boost
                6.228827 = idf(docFreq=236, maxDocs=44218)
                0.020298399 = queryNorm
              1.0322919 = fieldWeight in 4276, product of:
                4.2426405 = tf(freq=18.0), with freq of:
                  18.0 = termFreq=18.0
                6.228827 = idf(docFreq=236, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4276)
          0.008846565 = weight(abstract_txt:that in 4276) [ClassicSimilarity], result of:
            0.008846565 = score(doc=4276,freq=3.0), product of:
              0.055182565 = queryWeight, product of:
                1.147329 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.020298399 = queryNorm
              0.1603145 = fieldWeight in 4276, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4276)
          0.015125782 = weight(abstract_txt:than in 4276) [ClassicSimilarity], result of:
            0.015125782 = score(doc=4276,freq=1.0), product of:
              0.09941243 = queryWeight, product of:
                1.2573661 = boost
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.020298399 = queryNorm
              0.15215182 = fieldWeight in 4276, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4276)
          0.073067866 = weight(abstract_txt:without in 4276) [ClassicSimilarity], result of:
            0.073067866 = score(doc=4276,freq=4.0), product of:
              0.17896155 = queryWeight, product of:
                1.687024 = boost
                5.2260876 = idf(docFreq=645, maxDocs=44218)
                0.020298399 = queryNorm
              0.4082881 = fieldWeight in 4276, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.2260876 = idf(docFreq=645, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4276)
          0.037265435 = weight(abstract_txt:elements in 4276) [ClassicSimilarity], result of:
            0.037265435 = score(doc=4276,freq=1.0), product of:
              0.1813425 = queryWeight, product of:
                1.6982092 = boost
                5.260737 = idf(docFreq=623, maxDocs=44218)
                0.020298399 = queryNorm
              0.20549753 = fieldWeight in 4276, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.260737 = idf(docFreq=623, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4276)
          0.03512537 = weight(abstract_txt:they in 4276) [ClassicSimilarity], result of:
            0.03512537 = score(doc=4276,freq=3.0), product of:
              0.13836706 = queryWeight, product of:
                1.8167843 = boost
                3.7520406 = idf(docFreq=2820, maxDocs=44218)
                0.020298399 = queryNorm
              0.25385645 = fieldWeight in 4276, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.7520406 = idf(docFreq=2820, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4276)
          0.046550225 = weight(abstract_txt:documents in 4276) [ClassicSimilarity], result of:
            0.046550225 = score(doc=4276,freq=3.0), product of:
              0.1669424 = queryWeight, product of:
                1.9955856 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.020298399 = queryNorm
              0.27884004 = fieldWeight in 4276, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4276)
          0.09780373 = weight(abstract_txt:patterns in 4276) [ClassicSimilarity], result of:
            0.09780373 = score(doc=4276,freq=3.0), product of:
              0.27385607 = queryWeight, product of:
                2.5559256 = boost
                5.2785225 = idf(docFreq=612, maxDocs=44218)
                0.020298399 = queryNorm
              0.3571355 = fieldWeight in 4276, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.2785225 = idf(docFreq=612, maxDocs=44218)
                0.0390625 = fieldNorm(doc=4276)
        0.32 = coord(8/25)
    
  2. Fairthorne, R.A.: Temporal structure in bibliographic classification (1985) 0.11
    0.11027961 = sum of:
      0.11027961 = product of:
        0.3446238 = sum of:
          0.04349124 = weight(abstract_txt:ground in 3651) [ClassicSimilarity], result of:
            0.04349124 = score(doc=3651,freq=1.0), product of:
              0.15954626 = queryWeight, product of:
                1.1263405 = boost
                6.9783883 = idf(docFreq=111, maxDocs=44218)
                0.020298399 = queryNorm
              0.2725933 = fieldWeight in 3651, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9783883 = idf(docFreq=111, maxDocs=44218)
                0.0390625 = fieldNorm(doc=3651)
          0.012510933 = weight(abstract_txt:that in 3651) [ClassicSimilarity], result of:
            0.012510933 = score(doc=3651,freq=6.0), product of:
              0.055182565 = queryWeight, product of:
                1.147329 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.020298399 = queryNorm
              0.22671895 = fieldWeight in 3651, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.0390625 = fieldNorm(doc=3651)
          0.021391086 = weight(abstract_txt:than in 3651) [ClassicSimilarity], result of:
            0.021391086 = score(doc=3651,freq=2.0), product of:
              0.09941243 = queryWeight, product of:
                1.2573661 = boost
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.020298399 = queryNorm
              0.21517517 = fieldWeight in 3651, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.8950868 = idf(docFreq=2444, maxDocs=44218)
                0.0390625 = fieldNorm(doc=3651)
          0.029748581 = weight(abstract_txt:characteristics in 3651) [ClassicSimilarity], result of:
            0.029748581 = score(doc=3651,freq=1.0), product of:
              0.15605329 = queryWeight, product of:
                1.5753529 = boost
                4.8801513 = idf(docFreq=912, maxDocs=44218)
                0.020298399 = queryNorm
              0.19063091 = fieldWeight in 3651, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8801513 = idf(docFreq=912, maxDocs=44218)
                0.0390625 = fieldNorm(doc=3651)
          0.036533933 = weight(abstract_txt:without in 3651) [ClassicSimilarity], result of:
            0.036533933 = score(doc=3651,freq=1.0), product of:
              0.17896155 = queryWeight, product of:
                1.687024 = boost
                5.2260876 = idf(docFreq=645, maxDocs=44218)
                0.020298399 = queryNorm
              0.20414405 = fieldWeight in 3651, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2260876 = idf(docFreq=645, maxDocs=44218)
                0.0390625 = fieldNorm(doc=3651)
          0.028679743 = weight(abstract_txt:they in 3651) [ClassicSimilarity], result of:
            0.028679743 = score(doc=3651,freq=2.0), product of:
              0.13836706 = queryWeight, product of:
                1.8167843 = boost
                3.7520406 = idf(docFreq=2820, maxDocs=44218)
                0.020298399 = queryNorm
              0.20727292 = fieldWeight in 3651, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.7520406 = idf(docFreq=2820, maxDocs=44218)
                0.0390625 = fieldNorm(doc=3651)
          0.10408948 = weight(abstract_txt:documents in 3651) [ClassicSimilarity], result of:
            0.10408948 = score(doc=3651,freq=15.0), product of:
              0.1669424 = queryWeight, product of:
                1.9955856 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.020298399 = queryNorm
              0.62350535 = fieldWeight in 3651, product of:
                3.8729835 = tf(freq=15.0), with freq of:
                  15.0 = termFreq=15.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0390625 = fieldNorm(doc=3651)
          0.06817882 = weight(abstract_txt:schema in 3651) [ClassicSimilarity], result of:
            0.06817882 = score(doc=3651,freq=1.0), product of:
              0.2712658 = queryWeight, product of:
                2.0770116 = boost
                6.434197 = idf(docFreq=192, maxDocs=44218)
                0.020298399 = queryNorm
              0.25133583 = fieldWeight in 3651, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.434197 = idf(docFreq=192, maxDocs=44218)
                0.0390625 = fieldNorm(doc=3651)
        0.32 = coord(8/25)
    
  3. Jia, J.: From data to knowledge : the relationships between vocabularies, linked data and knowledge graphs (2021) 0.11
    0.10943323 = sum of:
      0.10943323 = product of:
        0.5471661 = sum of:
          0.06828761 = weight(abstract_txt:frequent in 106) [ClassicSimilarity], result of:
            0.06828761 = score(doc=106,freq=1.0), product of:
              0.15755543 = queryWeight, product of:
                1.1192912 = boost
                6.9347134 = idf(docFreq=116, maxDocs=44218)
                0.020298399 = queryNorm
              0.4334196 = fieldWeight in 106, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9347134 = idf(docFreq=116, maxDocs=44218)
                0.0625 = fieldNorm(doc=106)
          0.008172107 = weight(abstract_txt:that in 106) [ClassicSimilarity], result of:
            0.008172107 = score(doc=106,freq=1.0), product of:
              0.055182565 = queryWeight, product of:
                1.147329 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.020298399 = queryNorm
              0.1480922 = fieldWeight in 106, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.0625 = fieldNorm(doc=106)
          0.032447428 = weight(abstract_txt:they in 106) [ClassicSimilarity], result of:
            0.032447428 = score(doc=106,freq=1.0), product of:
              0.13836706 = queryWeight, product of:
                1.8167843 = boost
                3.7520406 = idf(docFreq=2820, maxDocs=44218)
                0.020298399 = queryNorm
              0.23450254 = fieldWeight in 106, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.7520406 = idf(docFreq=2820, maxDocs=44218)
                0.0625 = fieldNorm(doc=106)
          0.15427104 = weight(abstract_txt:schema in 106) [ClassicSimilarity], result of:
            0.15427104 = score(doc=106,freq=2.0), product of:
              0.2712658 = queryWeight, product of:
                2.0770116 = boost
                6.434197 = idf(docFreq=192, maxDocs=44218)
                0.020298399 = queryNorm
              0.568708 = fieldWeight in 106, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.434197 = idf(docFreq=192, maxDocs=44218)
                0.0625 = fieldNorm(doc=106)
          0.28398797 = weight(abstract_txt:vocabularies in 106) [ClassicSimilarity], result of:
            0.28398797 = score(doc=106,freq=5.0), product of:
              0.34365484 = queryWeight, product of:
                2.8631773 = boost
                5.913062 = idf(docFreq=324, maxDocs=44218)
                0.020298399 = queryNorm
              0.82637554 = fieldWeight in 106, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.913062 = idf(docFreq=324, maxDocs=44218)
                0.0625 = fieldNorm(doc=106)
        0.2 = coord(5/25)
    
  4. Wusteman, J.: Document Type Definition (DTD) (2009) 0.11
    0.109356895 = sum of:
      0.109356895 = product of:
        0.54678446 = sum of:
          0.14174843 = weight(abstract_txt:markup in 3766) [ClassicSimilarity], result of:
            0.14174843 = score(doc=3766,freq=2.0), product of:
              0.15529117 = queryWeight, product of:
                1.1112193 = boost
                6.8847027 = idf(docFreq=122, maxDocs=44218)
                0.020298399 = queryNorm
              0.91279125 = fieldWeight in 3766, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.8847027 = idf(docFreq=122, maxDocs=44218)
                0.09375 = fieldNorm(doc=3766)
          0.01225816 = weight(abstract_txt:that in 3766) [ClassicSimilarity], result of:
            0.01225816 = score(doc=3766,freq=1.0), product of:
              0.055182565 = queryWeight, product of:
                1.147329 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.020298399 = queryNorm
              0.22213829 = fieldWeight in 3766, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.09375 = fieldNorm(doc=3766)
          0.044861984 = weight(abstract_txt:content in 3766) [ClassicSimilarity], result of:
            0.044861984 = score(doc=3766,freq=1.0), product of:
              0.114482805 = queryWeight, product of:
                1.3493093 = boost
                4.17991 = idf(docFreq=1838, maxDocs=44218)
                0.020298399 = queryNorm
              0.39186656 = fieldWeight in 3766, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.17991 = idf(docFreq=1838, maxDocs=44218)
                0.09375 = fieldNorm(doc=3766)
          0.06450189 = weight(abstract_txt:documents in 3766) [ClassicSimilarity], result of:
            0.06450189 = score(doc=3766,freq=1.0), product of:
              0.1669424 = queryWeight, product of:
                1.9955856 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.020298399 = queryNorm
              0.38637212 = fieldWeight in 3766, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.09375 = fieldNorm(doc=3766)
          0.283414 = weight(abstract_txt:schema in 3766) [ClassicSimilarity], result of:
            0.283414 = score(doc=3766,freq=3.0), product of:
              0.2712658 = queryWeight, product of:
                2.0770116 = boost
                6.434197 = idf(docFreq=192, maxDocs=44218)
                0.020298399 = queryNorm
              1.0447834 = fieldWeight in 3766, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.434197 = idf(docFreq=192, maxDocs=44218)
                0.09375 = fieldNorm(doc=3766)
        0.2 = coord(5/25)
    
  5. Graus, D.; Odijk, D.; Rijke, M. de: The birth of collective memories : analyzing emerging entities in text streams (2018) 0.11
    0.10870126 = sum of:
      0.10870126 = product of:
        0.45292193 = sum of:
          0.020017494 = weight(abstract_txt:that in 4252) [ClassicSimilarity], result of:
            0.020017494 = score(doc=4252,freq=6.0), product of:
              0.055182565 = queryWeight, product of:
                1.147329 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.020298399 = queryNorm
              0.36275032 = fieldWeight in 4252, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.0625 = fieldNorm(doc=4252)
          0.13878375 = weight(abstract_txt:span in 4252) [ClassicSimilarity], result of:
            0.13878375 = score(doc=4252,freq=2.0), product of:
              0.2006417 = queryWeight, product of:
                1.2630979 = boost
                7.825686 = idf(docFreq=47, maxDocs=44218)
                0.020298399 = queryNorm
              0.69169945 = fieldWeight in 4252, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.825686 = idf(docFreq=47, maxDocs=44218)
                0.0625 = fieldNorm(doc=4252)
          0.058454294 = weight(abstract_txt:without in 4252) [ClassicSimilarity], result of:
            0.058454294 = score(doc=4252,freq=1.0), product of:
              0.17896155 = queryWeight, product of:
                1.687024 = boost
                5.2260876 = idf(docFreq=645, maxDocs=44218)
                0.020298399 = queryNorm
              0.32663047 = fieldWeight in 4252, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2260876 = idf(docFreq=645, maxDocs=44218)
                0.0625 = fieldNorm(doc=4252)
          0.064894855 = weight(abstract_txt:they in 4252) [ClassicSimilarity], result of:
            0.064894855 = score(doc=4252,freq=4.0), product of:
              0.13836706 = queryWeight, product of:
                1.8167843 = boost
                3.7520406 = idf(docFreq=2820, maxDocs=44218)
                0.020298399 = queryNorm
              0.46900508 = fieldWeight in 4252, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.7520406 = idf(docFreq=2820, maxDocs=44218)
                0.0625 = fieldNorm(doc=4252)
          0.04300126 = weight(abstract_txt:documents in 4252) [ClassicSimilarity], result of:
            0.04300126 = score(doc=4252,freq=1.0), product of:
              0.1669424 = queryWeight, product of:
                1.9955856 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.020298399 = queryNorm
              0.2575814 = fieldWeight in 4252, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.0625 = fieldNorm(doc=4252)
          0.12777026 = weight(abstract_txt:patterns in 4252) [ClassicSimilarity], result of:
            0.12777026 = score(doc=4252,freq=2.0), product of:
              0.27385607 = queryWeight, product of:
                2.5559256 = boost
                5.2785225 = idf(docFreq=612, maxDocs=44218)
                0.020298399 = queryNorm
              0.4665599 = fieldWeight in 4252, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.2785225 = idf(docFreq=612, maxDocs=44218)
                0.0625 = fieldNorm(doc=4252)
        0.24 = coord(6/25)
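
The score breakdowns above are in the explanation format of Lucene's ClassicSimilarity (TF-IDF). As a minimal sketch, assuming the standard ClassicSimilarity definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), the arithmetic of one term weight and of a document's final score can be reproduced as follows. The function names are hypothetical, and the boost, queryNorm, and fieldNorm values are simply copied from the listing rather than derived.

```python
import math


def idf(doc_freq, max_docs):
    """Inverse document frequency as used by ClassicSimilarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    """Score of one matching term: queryWeight * fieldWeight."""
    tf = math.sqrt(freq)
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm
    field_weight = tf * term_idf * field_norm
    return query_weight * field_weight


def document_score(term_scores, matched_terms, query_terms):
    """Sum of term scores scaled by the coordination factor coord(m/n)."""
    return sum(term_scores) * (matched_terms / query_terms)


# "visualization" in document 4276 (first entry of the listing).
visualization = term_score(
    freq=18.0, doc_freq=236, max_docs=44218,
    boost=1.0053582, query_norm=0.020298399, field_norm=0.0390625,
)

# Final score of document 4276: its eight term scores times coord(8/25).
doc_4276 = document_score(
    [0.13121739, 0.008846565, 0.015125782, 0.073067866,
     0.037265435, 0.03512537, 0.046550225, 0.09780373],
    matched_terms=8, query_terms=25,
)
print(visualization, doc_4276)
```

The coordination factor explains why documents with high individual term weights can still rank modestly: entries 3 and 4 match only 5 of the 25 query terms, so their summed weights are scaled by 0.2.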