Document (#30810)

Author
Baumgartner, R.
Title
Methoden und Werkzeuge zur Webdatenextraktion
Source
Semantic Web: Wege zur vernetzten Wissensgesellschaft. Ed. by T. Pellegrini and A. Blumauer
Imprint
Berlin : Springer
Year
2006
Pages
pp. 419-435
Series
X.media.press
Abstract
The World Wide Web can be regarded as the largest "database" known to us. Unfortunately, today's Web is largely designed for presentation to human users and consists of very heterogeneous data sets. Moreover, the Web lacks facilities for querying information in a structured way and aggregated across different sources. Today's Web is therefore not suited to automatic machine processing. To make effective use of Web data nonetheless, languages, methods, and tools for extracting and aggregating these data have been developed. This article gives an overview and a categorization of different approaches to data extraction from the Web. Several example scenarios in B2B data exchange, in business intelligence, and in particular in the generation of data for Semantic Web ontologies illustrate the effective use of these technologies.
Theme
Data Mining

Similar documents (content)

  1. Frohner, H.: Social Tagging : Grundlagen, Anwendungen, Auswirkungen auf Wissensorganisation und soziale Strukturen der User (2010) 0.13
    0.13349222 = sum of:
      0.13349222 = product of:
        0.5562176 = sum of:
          0.07367788 = weight(abstract_txt:ansätzen in 1724) [ClassicSimilarity], result of:
            0.07367788 = score(doc=1724,freq=1.0), product of:
              0.1550614 = queryWeight, product of:
                1.0044793 = boost
                7.6024475 = idf(docFreq=57, maxDocs=42740)
                0.020305295 = queryNorm
              0.47515297 = fieldWeight in 1724, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6024475 = idf(docFreq=57, maxDocs=42740)
                0.0625 = fieldNorm(doc=1724)
          0.074184686 = weight(abstract_txt:heterogenen in 1724) [ClassicSimilarity], result of:
            0.074184686 = score(doc=1724,freq=1.0), product of:
              0.15577166 = queryWeight, product of:
                1.0067772 = boost
                7.619839 = idf(docFreq=56, maxDocs=42740)
                0.020305295 = queryNorm
              0.47623995 = fieldWeight in 1724, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.619839 = idf(docFreq=56, maxDocs=42740)
                0.0625 = fieldNorm(doc=1724)
          0.086668536 = weight(abstract_txt:effektiv in 1724) [ClassicSimilarity], result of:
            0.086668536 = score(doc=1724,freq=1.0), product of:
              0.17279051 = queryWeight, product of:
                1.0603496 = boost
                8.025305 = idf(docFreq=37, maxDocs=42740)
                0.020305295 = queryNorm
              0.50158155 = fieldWeight in 1724, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.025305 = idf(docFreq=37, maxDocs=42740)
                0.0625 = fieldNorm(doc=1724)
          0.2087168 = weight(abstract_txt:kategorisierung in 1724) [ClassicSimilarity], result of:
            0.2087168 = score(doc=1724,freq=2.0), product of:
              0.24639991 = queryWeight, product of:
                1.2662207 = boost
                9.583449 = idf(docFreq=7, maxDocs=42740)
                0.020305295 = queryNorm
              0.8470652 = fieldWeight in 1724, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.583449 = idf(docFreq=7, maxDocs=42740)
                0.0625 = fieldNorm(doc=1724)
          0.06998818 = weight(abstract_txt:daten in 1724) [ClassicSimilarity], result of:
            0.06998818 = score(doc=1724,freq=2.0), product of:
              0.14984034 = queryWeight, product of:
                1.3964279 = boost
                5.2844644 = idf(docFreq=588, maxDocs=42740)
                0.020305295 = queryNorm
              0.46708506 = fieldWeight in 1724, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.2844644 = idf(docFreq=588, maxDocs=42740)
                0.0625 = fieldNorm(doc=1724)
          0.04298151 = weight(abstract_txt:dieser in 1724) [ClassicSimilarity], result of:
            0.04298151 = score(doc=1724,freq=1.0), product of:
              0.15613747 = queryWeight, product of:
                1.7458355 = boost
                4.4044785 = idf(docFreq=1419, maxDocs=42740)
                0.020305295 = queryNorm
              0.2752799 = fieldWeight in 1724, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4044785 = idf(docFreq=1419, maxDocs=42740)
                0.0625 = fieldNorm(doc=1724)
        0.24 = coord(6/25)
    
  2. Bittner, E.: Auskunfts- und Informationssysteme in Datex-J (1994) 0.10
    0.10232289 = sum of:
      0.10232289 = product of:
        0.51161444 = sum of:
          0.09087077 = weight(abstract_txt:leider in 7155) [ClassicSimilarity], result of:
            0.09087077 = score(doc=7155,freq=1.0), product of:
              0.15368155 = queryWeight, product of:
                7.568546 = idf(docFreq=59, maxDocs=42740)
                0.020305295 = queryNorm
              0.5912926 = fieldWeight in 7155, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.568546 = idf(docFreq=59, maxDocs=42740)
                0.078125 = fieldNorm(doc=7155)
          0.092730865 = weight(abstract_txt:heterogenen in 7155) [ClassicSimilarity], result of:
            0.092730865 = score(doc=7155,freq=1.0), product of:
              0.15577166 = queryWeight, product of:
                1.0067772 = boost
                7.619839 = idf(docFreq=56, maxDocs=42740)
                0.020305295 = queryNorm
              0.59529996 = fieldWeight in 7155, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.619839 = idf(docFreq=56, maxDocs=42740)
                0.078125 = fieldNorm(doc=7155)
          0.10078108 = weight(abstract_txt:strukturiert in 7155) [ClassicSimilarity], result of:
            0.10078108 = score(doc=7155,freq=1.0), product of:
              0.16466133 = queryWeight, product of:
                1.0351063 = boost
                7.834249 = idf(docFreq=45, maxDocs=42740)
                0.020305295 = queryNorm
              0.6120507 = fieldWeight in 7155, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.834249 = idf(docFreq=45, maxDocs=42740)
                0.078125 = fieldNorm(doc=7155)
          0.15401396 = weight(abstract_txt:illustrieren in 7155) [ClassicSimilarity], result of:
            0.15401396 = score(doc=7155,freq=1.0), product of:
              0.21846355 = queryWeight, product of:
                1.192281 = boost
                9.023833 = idf(docFreq=13, maxDocs=42740)
                0.020305295 = queryNorm
              0.704987 = fieldWeight in 7155, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.023833 = idf(docFreq=13, maxDocs=42740)
                0.078125 = fieldNorm(doc=7155)
          0.07321776 = weight(abstract_txt:verschiedenen in 7155) [ClassicSimilarity], result of:
            0.07321776 = score(doc=7155,freq=1.0), product of:
              0.16765888 = queryWeight, product of:
                1.4771255 = boost
                5.589846 = idf(docFreq=433, maxDocs=42740)
                0.020305295 = queryNorm
              0.43670672 = fieldWeight in 7155, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.589846 = idf(docFreq=433, maxDocs=42740)
                0.078125 = fieldNorm(doc=7155)
        0.2 = coord(5/25)
    
  3. Weigel, U.: Internet - (k)ein Netz mit doppeltem Boden? : T.1: Eine erste Annäherung; T.2: Dienste; T.3: World-Wide Web (1994) 0.08
    0.0830939 = sum of:
      0.0830939 = product of:
        1.0386738 = sum of:
          0.35144526 = weight(abstract_txt:verschiedenen in 127) [ClassicSimilarity], result of:
            0.35144526 = score(doc=127,freq=1.0), product of:
              0.16765888 = queryWeight, product of:
                1.4771255 = boost
                5.589846 = idf(docFreq=433, maxDocs=42740)
                0.020305295 = queryNorm
              2.0961924 = fieldWeight in 127, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.589846 = idf(docFreq=433, maxDocs=42740)
                0.375 = fieldNorm(doc=127)
          0.6872285 = weight(abstract_txt:werkzeuge in 127) [ClassicSimilarity], result of:
            0.6872285 = score(doc=127,freq=1.0), product of:
              0.26217356 = queryWeight, product of:
                1.8471347 = boost
                6.9900618 = idf(docFreq=106, maxDocs=42740)
                0.020305295 = queryNorm
              2.621273 = fieldWeight in 127, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9900618 = idf(docFreq=106, maxDocs=42740)
                0.375 = fieldNorm(doc=127)
        0.08 = coord(2/25)
    
  4. Röhle, T.: Die Demontage der Gatekeeper : relationale Perspektiven zur Macht der Suchmaschinen (2009) 0.08
    0.08267561 = sum of:
      0.08267561 = product of:
        0.41337806 = sum of:
          0.07367788 = weight(abstract_txt:ansätzen in 2024) [ClassicSimilarity], result of:
            0.07367788 = score(doc=2024,freq=1.0), product of:
              0.1550614 = queryWeight, product of:
                1.0044793 = boost
                7.6024475 = idf(docFreq=57, maxDocs=42740)
                0.020305295 = queryNorm
              0.47515297 = fieldWeight in 2024, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6024475 = idf(docFreq=57, maxDocs=42740)
                0.0625 = fieldNorm(doc=2024)
          0.08062487 = weight(abstract_txt:strukturiert in 2024) [ClassicSimilarity], result of:
            0.08062487 = score(doc=2024,freq=1.0), product of:
              0.16466133 = queryWeight, product of:
                1.0351063 = boost
                7.834249 = idf(docFreq=45, maxDocs=42740)
                0.020305295 = queryNorm
              0.48964056 = fieldWeight in 2024, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.834249 = idf(docFreq=45, maxDocs=42740)
                0.0625 = fieldNorm(doc=2024)
          0.05857421 = weight(abstract_txt:verschiedenen in 2024) [ClassicSimilarity], result of:
            0.05857421 = score(doc=2024,freq=1.0), product of:
              0.16765888 = queryWeight, product of:
                1.4771255 = boost
                5.589846 = idf(docFreq=433, maxDocs=42740)
                0.020305295 = queryNorm
              0.34936538 = fieldWeight in 2024, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.589846 = idf(docFreq=433, maxDocs=42740)
                0.0625 = fieldNorm(doc=2024)
          0.08596302 = weight(abstract_txt:dieser in 2024) [ClassicSimilarity], result of:
            0.08596302 = score(doc=2024,freq=4.0), product of:
              0.15613747 = queryWeight, product of:
                1.7458355 = boost
                4.4044785 = idf(docFreq=1419, maxDocs=42740)
                0.020305295 = queryNorm
              0.5505598 = fieldWeight in 2024, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.4044785 = idf(docFreq=1419, maxDocs=42740)
                0.0625 = fieldNorm(doc=2024)
          0.11453809 = weight(abstract_txt:werkzeuge in 2024) [ClassicSimilarity], result of:
            0.11453809 = score(doc=2024,freq=1.0), product of:
              0.26217356 = queryWeight, product of:
                1.8471347 = boost
                6.9900618 = idf(docFreq=106, maxDocs=42740)
                0.020305295 = queryNorm
              0.43687886 = fieldWeight in 2024, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9900618 = idf(docFreq=106, maxDocs=42740)
                0.0625 = fieldNorm(doc=2024)
        0.2 = coord(5/25)
    
  5. Krüger, S.: Wissen ist Macht : Portale weisen den Weg und öffnen Türen (2001) 0.07
    0.07216646 = sum of:
      0.07216646 = product of:
        0.30069357 = sum of:
          0.046048675 = weight(abstract_txt:ansätzen in 653) [ClassicSimilarity], result of:
            0.046048675 = score(doc=653,freq=1.0), product of:
              0.1550614 = queryWeight, product of:
                1.0044793 = boost
                7.6024475 = idf(docFreq=57, maxDocs=42740)
                0.020305295 = queryNorm
              0.2969706 = fieldWeight in 653, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6024475 = idf(docFreq=57, maxDocs=42740)
                0.0390625 = fieldNorm(doc=653)
          0.05039054 = weight(abstract_txt:strukturiert in 653) [ClassicSimilarity], result of:
            0.05039054 = score(doc=653,freq=1.0), product of:
              0.16466133 = queryWeight, product of:
                1.0351063 = boost
                7.834249 = idf(docFreq=45, maxDocs=42740)
                0.020305295 = queryNorm
              0.30602536 = fieldWeight in 653, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.834249 = idf(docFreq=45, maxDocs=42740)
                0.0390625 = fieldNorm(doc=653)
          0.03660888 = weight(abstract_txt:verschiedenen in 653) [ClassicSimilarity], result of:
            0.03660888 = score(doc=653,freq=1.0), product of:
              0.16765888 = queryWeight, product of:
                1.4771255 = boost
                5.589846 = idf(docFreq=433, maxDocs=42740)
                0.020305295 = queryNorm
              0.21835336 = fieldWeight in 653, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.589846 = idf(docFreq=433, maxDocs=42740)
                0.0390625 = fieldNorm(doc=653)
          0.058068547 = weight(abstract_txt:methoden in 653) [ClassicSimilarity], result of:
            0.058068547 = score(doc=653,freq=2.0), product of:
              0.18098931 = queryWeight, product of:
                1.5347251 = boost
                5.8078184 = idf(docFreq=348, maxDocs=42740)
                0.020305295 = queryNorm
              0.32083964 = fieldWeight in 653, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.8078184 = idf(docFreq=348, maxDocs=42740)
                0.0390625 = fieldNorm(doc=653)
          0.037990645 = weight(abstract_txt:dieser in 653) [ClassicSimilarity], result of:
            0.037990645 = score(doc=653,freq=2.0), product of:
              0.15613747 = queryWeight, product of:
                1.7458355 = boost
                4.4044785 = idf(docFreq=1419, maxDocs=42740)
                0.020305295 = queryNorm
              0.24331537 = fieldWeight in 653, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.4044785 = idf(docFreq=1419, maxDocs=42740)
                0.0390625 = fieldNorm(doc=653)
          0.0715863 = weight(abstract_txt:werkzeuge in 653) [ClassicSimilarity], result of:
            0.0715863 = score(doc=653,freq=1.0), product of:
              0.26217356 = queryWeight, product of:
                1.8471347 = boost
                6.9900618 = idf(docFreq=106, maxDocs=42740)
                0.020305295 = queryNorm
              0.2730493 = fieldWeight in 653, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9900618 = idf(docFreq=106, maxDocs=42740)
                0.0390625 = fieldNorm(doc=653)
        0.24 = coord(6/25)
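
The score trees above follow the usual TF-IDF pattern of Lucene's ClassicSimilarity: each matching term contributes queryWeight times fieldWeight, the contributions are summed, and the sum is scaled by the coord factor (matched terms out of 25 query terms). The following Python sketch recomputes the score of the first entry (doc 1724) from the numbers shown in its tree. It assumes tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which agree with the values listed. The function and constant names are illustrative and not part of the catalog or of Lucene.

```python
import math

# Values copied from the explain tree for doc 1724.
QUERY_NORM = 0.020305295
MAX_DOCS = 42740
FIELD_NORM_1724 = 0.0625

# (term, freq in the abstract, docFreq, boost), one row per matching term.
TERMS_1724 = [
    ("ansätzen", 1.0, 57, 1.0044793),
    ("heterogenen", 1.0, 56, 1.0067772),
    ("effektiv", 1.0, 37, 1.0603496),
    ("kategorisierung", 2.0, 7, 1.2662207),
    ("daten", 2.0, 588, 1.3964279),
    ("dieser", 1.0, 1419, 1.7458355),
]


def idf(doc_freq, max_docs):
    """Inverse document frequency (assumed ClassicSimilarity form)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_weight(freq, doc_freq, boost, field_norm, query_norm, max_docs):
    """Contribution of one term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


def score_1724():
    """Sum the term contributions and apply coord(6/25)."""
    total = sum(
        term_weight(freq, doc_freq, boost, FIELD_NORM_1724, QUERY_NORM, MAX_DOCS)
        for _, freq, doc_freq, boost in TERMS_1724
    )
    return total * (6 / 25)


if __name__ == "__main__":
    # Should come out close to the 0.13349222 shown for entry 1.
    print(round(score_1724(), 6))
```

The same functions can be reused for the other entries by swapping in their fieldNorm, term rows, and coord factor. Where a tree shows no boost line (as for "leider" in doc 7155), a boost of 1.0 reproduces the listed queryWeight.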