Document (#30807)

Author
Baumgartner, R.
Title
Methoden und Werkzeuge zur Webdatenextraktion
Source
Semantic Web: Wege zur vernetzten Wissensgesellschaft. Ed. by T. Pellegrini and A. Blumauer
Imprint
Berlin : Springer
Year
2006
Pages
pp. 419-435
Series
X.media.press
Abstract
The World Wide Web can be regarded as the largest "database" known to us. Unfortunately, today's Web is largely designed for presentation to human users and consists of very heterogeneous data sets. Moreover, the Web lacks facilities for querying information in a structured way and aggregated across different sources. Today's Web is therefore not suited to automatic machine processing. To make effective use of Web data nonetheless, languages, methods, and tools for extracting and aggregating these data have been developed. This article gives an overview and a categorization of different approaches to data extraction from the Web. Several example scenarios in B2B data exchange, in business intelligence, and in particular in the generation of data for Semantic Web ontologies illustrate the effective use of these technologies.
Theme
Data Mining

Similar documents (content)

  1. Frohner, H.: Social Tagging : Grundlagen, Anwendungen, Auswirkungen auf Wissensorganisation und soziale Strukturen der User (2010) 0.13
    0.12902063 = sum of:
      0.12902063 = product of:
        0.537586 = sum of:
          0.07249154 = weight(abstract_txt:heterogenen in 1721) [ClassicSimilarity], result of:
            0.07249154 = score(doc=1721,freq=1.0), product of:
              0.15252817 = queryWeight, product of:
                7.604265 = idf(docFreq=58, maxDocs=43556)
                0.020058239 = queryNorm
              0.47526658 = fieldWeight in 1721, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.604265 = idf(docFreq=58, maxDocs=43556)
                0.0625 = fieldNorm(doc=1721)
          0.07298154 = weight(abstract_txt:ansätzen in 1721) [ClassicSimilarity], result of:
            0.07298154 = score(doc=1721,freq=1.0), product of:
              0.15321472 = queryWeight, product of:
                1.002248 = boost
                7.62136 = idf(docFreq=57, maxDocs=43556)
                0.020058239 = queryNorm
              0.476335 = fieldWeight in 1721, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.62136 = idf(docFreq=57, maxDocs=43556)
                0.0625 = fieldNorm(doc=1721)
          0.08340676 = weight(abstract_txt:effektiv in 1721) [ClassicSimilarity], result of:
            0.08340676 = score(doc=1721,freq=1.0), product of:
              0.1674786 = queryWeight, product of:
                1.0478634 = boost
                7.9682307 = idf(docFreq=40, maxDocs=43556)
                0.020058239 = queryNorm
              0.49801442 = fieldWeight in 1721, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.9682307 = idf(docFreq=40, maxDocs=43556)
                0.0625 = fieldNorm(doc=1721)
          0.19892283 = weight(abstract_txt:kategorisierung in 1721) [ClassicSimilarity], result of:
            0.19892283 = score(doc=1721,freq=2.0), product of:
              0.23728572 = queryWeight, product of:
                1.2472708 = boost
                9.484578 = idf(docFreq=8, maxDocs=43556)
                0.020058239 = queryNorm
              0.83832616 = fieldWeight in 1721, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.484578 = idf(docFreq=8, maxDocs=43556)
                0.0625 = fieldNorm(doc=1721)
          0.06799375 = weight(abstract_txt:daten in 1721) [ClassicSimilarity], result of:
            0.06799375 = score(doc=1721,freq=2.0), product of:
              0.14615192 = queryWeight, product of:
                1.3843383 = boost
                5.2634377 = idf(docFreq=612, maxDocs=43556)
                0.020058239 = queryNorm
              0.46522656 = fieldWeight in 1721, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.2634377 = idf(docFreq=612, maxDocs=43556)
                0.0625 = fieldNorm(doc=1721)
          0.04178956 = weight(abstract_txt:dieser in 1721) [ClassicSimilarity], result of:
            0.04178956 = score(doc=1721,freq=1.0), product of:
              0.15237397 = queryWeight, product of:
                1.7311751 = boost
                4.388105 = idf(docFreq=1470, maxDocs=43556)
                0.020058239 = queryNorm
              0.27425656 = fieldWeight in 1721, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.388105 = idf(docFreq=1470, maxDocs=43556)
                0.0625 = fieldNorm(doc=1721)
        0.24 = coord(6/25)
    
  2. Weigel, U.: Internet - (k)ein Netz mit doppeltem Boden? : T.1: Eine erste Annäherung; T.2: Dienste; T.3: World-Wide Web (1994) 0.08
    0.08109567 = sum of:
      0.08109567 = product of:
        1.013696 = sum of:
          0.34316343 = weight(abstract_txt:verschiedenen in 124) [ClassicSimilarity], result of:
            0.34316343 = score(doc=124,freq=1.0), product of:
              0.16408479 = queryWeight, product of:
                1.466811 = boost
                5.5770097 = idf(docFreq=447, maxDocs=43556)
                0.020058239 = queryNorm
              2.0913787 = fieldWeight in 124, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5770097 = idf(docFreq=447, maxDocs=43556)
                0.375 = fieldNorm(doc=124)
          0.67053246 = weight(abstract_txt:werkzeuge in 124) [ClassicSimilarity], result of:
            0.67053246 = score(doc=124,freq=1.0), product of:
              0.2564568 = queryWeight, product of:
                1.8337793 = boost
                6.9722724 = idf(docFreq=110, maxDocs=43556)
                0.020058239 = queryNorm
              2.614602 = fieldWeight in 124, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9722724 = idf(docFreq=110, maxDocs=43556)
                0.375 = fieldNorm(doc=124)
        0.08 = coord(2/25)
    
  3. Röhle, T.: Die Demontage der Gatekeeper : relationale Perspektiven zur Macht der Suchmaschinen (2009) 0.08
    0.08107106 = sum of:
      0.08107106 = product of:
        0.40535527 = sum of:
          0.07298154 = weight(abstract_txt:ansätzen in 2021) [ClassicSimilarity], result of:
            0.07298154 = score(doc=2021,freq=1.0), product of:
              0.15321472 = queryWeight, product of:
                1.002248 = boost
                7.62136 = idf(docFreq=57, maxDocs=43556)
                0.020058239 = queryNorm
              0.476335 = fieldWeight in 2021, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.62136 = idf(docFreq=57, maxDocs=43556)
                0.0625 = fieldNorm(doc=2021)
          0.079845265 = weight(abstract_txt:strukturiert in 2021) [ClassicSimilarity], result of:
            0.079845265 = score(doc=2021,freq=1.0), product of:
              0.16267642 = queryWeight, product of:
                1.0327312 = boost
                7.8531613 = idf(docFreq=45, maxDocs=43556)
                0.020058239 = queryNorm
              0.49082258 = fieldWeight in 2021, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.8531613 = idf(docFreq=45, maxDocs=43556)
                0.0625 = fieldNorm(doc=2021)
          0.057193905 = weight(abstract_txt:verschiedenen in 2021) [ClassicSimilarity], result of:
            0.057193905 = score(doc=2021,freq=1.0), product of:
              0.16408479 = queryWeight, product of:
                1.466811 = boost
                5.5770097 = idf(docFreq=447, maxDocs=43556)
                0.020058239 = queryNorm
              0.3485631 = fieldWeight in 2021, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5770097 = idf(docFreq=447, maxDocs=43556)
                0.0625 = fieldNorm(doc=2021)
          0.08357912 = weight(abstract_txt:dieser in 2021) [ClassicSimilarity], result of:
            0.08357912 = score(doc=2021,freq=4.0), product of:
              0.15237397 = queryWeight, product of:
                1.7311751 = boost
                4.388105 = idf(docFreq=1470, maxDocs=43556)
                0.020058239 = queryNorm
              0.5485131 = fieldWeight in 2021, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.388105 = idf(docFreq=1470, maxDocs=43556)
                0.0625 = fieldNorm(doc=2021)
          0.111755416 = weight(abstract_txt:werkzeuge in 2021) [ClassicSimilarity], result of:
            0.111755416 = score(doc=2021,freq=1.0), product of:
              0.2564568 = queryWeight, product of:
                1.8337793 = boost
                6.9722724 = idf(docFreq=110, maxDocs=43556)
                0.020058239 = queryNorm
              0.43576702 = fieldWeight in 2021, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9722724 = idf(docFreq=110, maxDocs=43556)
                0.0625 = fieldNorm(doc=2021)
        0.2 = coord(5/25)
    
  4. Krüger, S.: Wissen ist Macht : Portale weisen den Weg und öffnen Türen (2001) 0.07
    0.07070893 = sum of:
      0.07070893 = product of:
        0.29462054 = sum of:
          0.04561346 = weight(abstract_txt:ansätzen in 735) [ClassicSimilarity], result of:
            0.04561346 = score(doc=735,freq=1.0), product of:
              0.15321472 = queryWeight, product of:
                1.002248 = boost
                7.62136 = idf(docFreq=57, maxDocs=43556)
                0.020058239 = queryNorm
              0.29770938 = fieldWeight in 735, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.62136 = idf(docFreq=57, maxDocs=43556)
                0.0390625 = fieldNorm(doc=735)
          0.049903292 = weight(abstract_txt:strukturiert in 735) [ClassicSimilarity], result of:
            0.049903292 = score(doc=735,freq=1.0), product of:
              0.16267642 = queryWeight, product of:
                1.0327312 = boost
                7.8531613 = idf(docFreq=45, maxDocs=43556)
                0.020058239 = queryNorm
              0.30676413 = fieldWeight in 735, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.8531613 = idf(docFreq=45, maxDocs=43556)
                0.0390625 = fieldNorm(doc=735)
          0.03574619 = weight(abstract_txt:verschiedenen in 735) [ClassicSimilarity], result of:
            0.03574619 = score(doc=735,freq=1.0), product of:
              0.16408479 = queryWeight, product of:
                1.466811 = boost
                5.5770097 = idf(docFreq=447, maxDocs=43556)
                0.020058239 = queryNorm
              0.21785194 = fieldWeight in 735, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5770097 = idf(docFreq=447, maxDocs=43556)
                0.0390625 = fieldNorm(doc=735)
          0.05657336 = weight(abstract_txt:methoden in 735) [ClassicSimilarity], result of:
            0.05657336 = score(doc=735,freq=2.0), product of:
              0.17686687 = queryWeight, product of:
                1.5228714 = boost
                5.7901587 = idf(docFreq=361, maxDocs=43556)
                0.020058239 = queryNorm
              0.3198641 = fieldWeight in 735, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.7901587 = idf(docFreq=361, maxDocs=43556)
                0.0390625 = fieldNorm(doc=735)
          0.036937103 = weight(abstract_txt:dieser in 735) [ClassicSimilarity], result of:
            0.036937103 = score(doc=735,freq=2.0), product of:
              0.15237397 = queryWeight, product of:
                1.7311751 = boost
                4.388105 = idf(docFreq=1470, maxDocs=43556)
                0.020058239 = queryNorm
              0.24241084 = fieldWeight in 735, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.388105 = idf(docFreq=1470, maxDocs=43556)
                0.0390625 = fieldNorm(doc=735)
          0.06984714 = weight(abstract_txt:werkzeuge in 735) [ClassicSimilarity], result of:
            0.06984714 = score(doc=735,freq=1.0), product of:
              0.2564568 = queryWeight, product of:
                1.8337793 = boost
                6.9722724 = idf(docFreq=110, maxDocs=43556)
                0.020058239 = queryNorm
              0.2723544 = fieldWeight in 735, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9722724 = idf(docFreq=110, maxDocs=43556)
                0.0390625 = fieldNorm(doc=735)
        0.24 = coord(6/25)
    
  5. Cejpek, J.: Wie die neuen Medien bewerten : die Informationswissenschaft als Wissenschaft mit Gewissen (1996) 0.07
    0.06901869 = sum of:
      0.06901869 = product of:
        0.5751558 = sum of:
          0.07313173 = weight(abstract_txt:dieser in 6342) [ClassicSimilarity], result of:
            0.07313173 = score(doc=6342,freq=1.0), product of:
              0.15237397 = queryWeight, product of:
                1.7311751 = boost
                4.388105 = idf(docFreq=1470, maxDocs=43556)
                0.020058239 = queryNorm
              0.47994897 = fieldWeight in 6342, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.388105 = idf(docFreq=1470, maxDocs=43556)
                0.109375 = fieldNorm(doc=6342)
          0.19557197 = weight(abstract_txt:werkzeuge in 6342) [ClassicSimilarity], result of:
            0.19557197 = score(doc=6342,freq=1.0), product of:
              0.2564568 = queryWeight, product of:
                1.8337793 = boost
                6.9722724 = idf(docFreq=110, maxDocs=43556)
                0.020058239 = queryNorm
              0.7625923 = fieldWeight in 6342, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9722724 = idf(docFreq=110, maxDocs=43556)
                0.109375 = fieldNorm(doc=6342)
          0.3064521 = weight(abstract_txt:heutige in 6342) [ClassicSimilarity], result of:
            0.3064521 = score(doc=6342,freq=1.0), product of:
              0.3459804 = queryWeight, product of:
                2.129932 = boost
                8.098284 = idf(docFreq=35, maxDocs=43556)
                0.020058239 = queryNorm
              0.8857498 = fieldWeight in 6342, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.098284 = idf(docFreq=35, maxDocs=43556)
                0.109375 = fieldNorm(doc=6342)
        0.12 = coord(3/25)
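
The score trees above all follow the same arithmetic, which can be sketched in a few lines of Python. This is a minimal sketch, assuming the classic Lucene TF-IDF formulation that the [ClassicSimilarity] labels suggest. The function names are hypothetical and are not part of any library. The idf formula is inferred from the listed values (for example, docFreq=58 with maxDocs=43556 gives 7.604265). The matched terms are tokens of the original German abstract text.

```python
import math


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency, inferred from the listing: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq: float, doc_freq: int, max_docs: int,
               query_norm: float, field_norm: float, boost: float = 1.0) -> float:
    """Score of one matching term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq, max_docs)
    # queryWeight = boost * idf * queryNorm (boost is 1.0 when the tree lists none)
    query_weight = boost * term_idf * query_norm
    # fieldWeight = tf * idf * fieldNorm, with tf = sqrt(termFreq)
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


def document_score(term_scores: list[float], matched: int, total: int) -> float:
    """Sum of the term scores scaled by coord = matched query terms / total query terms."""
    return sum(term_scores) * (matched / total)


if __name__ == "__main__":
    MAX_DOCS = 43556
    QUERY_NORM = 0.020058239

    # Entry 2 (doc=124): two matching terms, fieldNorm 0.375, coord(2/25).
    verschiedenen = term_score(1.0, 447, MAX_DOCS, QUERY_NORM, 0.375, boost=1.466811)
    werkzeuge = term_score(1.0, 110, MAX_DOCS, QUERY_NORM, 0.375, boost=1.8337793)
    print(document_score([verschiedenen, werkzeuge], matched=2, total=25))
```

Run as a script, this recomputes the score of entry 2 from its listed inputs. The result agrees with the 0.08109567 in the listing up to floating-point rounding, since Lucene works in single precision and Python in double precision.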