Search (27 results, page 1 of 2)

  • theme_ss:"Data Mining"
  • year_i:[2000 TO 2010}
  1. Chakrabarti, S.: Mining the Web : discovering knowledge from hypertext data (2003) 0.01
    0.0146638565 = product of:
      0.058655426 = sum of:
        0.023603994 = weight(_text_:computer in 2222) [ClassicSimilarity], result of:
          0.023603994 = score(doc=2222,freq=2.0), product of:
            0.1461475 = queryWeight, product of:
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.039991006 = queryNorm
            0.16150802 = fieldWeight in 2222, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.03125 = fieldNorm(doc=2222)
        0.03505143 = weight(_text_:network in 2222) [ClassicSimilarity], result of:
          0.03505143 = score(doc=2222,freq=2.0), product of:
            0.17809492 = queryWeight, product of:
              4.4533744 = idf(docFreq=1398, maxDocs=44218)
              0.039991006 = queryNorm
            0.1968132 = fieldWeight in 2222, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              4.4533744 = idf(docFreq=1398, maxDocs=44218)
              0.03125 = fieldNorm(doc=2222)
      0.25 = coord(2/8)
    
    Footnote
    Part I, Infrastructure, has two chapters: Chapter 2 on crawling the Web and Chapter 3 on Web search and information retrieval. The second part of the book, containing chapters 4, 5, and 6, is the centerpiece. This part specifically focuses on machine learning in the context of hypertext. Part III is a collection of applications that utilize the techniques described in earlier chapters. Chapter 7 is on social network analysis. Chapter 8 is on resource discovery. Chapter 9 is on the future of Web mining. Overall, this is a valuable reference book for researchers and developers in the field of Web mining. It should be particularly useful for those who would like to design and probably code their own computer programs out of the equations and pseudocode on most of the pages. For a student, the most valuable feature of the book is perhaps the formal and consistent treatment of concepts across the board. For what is behind and beyond the technical details, one has to either dig deeper into the bibliographic notes at the end of each chapter, or resort to more in-depth analysis of relevant subjects in the literature. If you are looking for success stories about Web mining or hard-won lessons from failures, this is not the book.
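The relevance figures attached to each result are standard Lucene "explain" output for the ClassicSimilarity (tf-idf) model. Below is a minimal sketch that recomputes the score of the first entry from the values shown in its explain tree; the formula (coord × Σ queryWeight × fieldWeight) is the usual ClassicSimilarity one, and only the numbers printed above are used.

```python
from math import sqrt

def classic_similarity_score(terms, coord_matched, coord_total):
    """Recompute a Lucene ClassicSimilarity score from explain-tree values.

    Each term is (raw_freq, idf, query_norm, field_norm); the per-term weight is
    queryWeight * fieldWeight = (idf * queryNorm) * (sqrt(freq) * idf * fieldNorm).
    """
    total = 0.0
    for freq, idf, query_norm, field_norm in terms:
        query_weight = idf * query_norm          # e.g. 3.6545093 * 0.039991006 = 0.1461475
        field_weight = sqrt(freq) * idf * field_norm
        total += query_weight * field_weight
    return total * (coord_matched / coord_total)  # coord(2/8) = 0.25 above

# Values copied from the explain output of result 1 (doc 2222):
terms = [
    (2.0, 3.6545093, 0.039991006, 0.03125),  # _text_:computer
    (2.0, 4.4533744, 0.039991006, 0.03125),  # _text_:network
]
print(classic_similarity_score(terms, 2, 8))  # ~0.0146638, i.e. the 0.01 shown
```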
  2. Thelwall, M.; Wilkinson, D.; Uppal, S.: Data mining emotion in social network communication : gender differences in MySpace (2009) 0.01
    0.011383288 = product of:
      0.0910663 = sum of:
        0.0910663 = weight(_text_:network in 3322) [ClassicSimilarity], result of:
          0.0910663 = score(doc=3322,freq=6.0), product of:
            0.17809492 = queryWeight, product of:
              4.4533744 = idf(docFreq=1398, maxDocs=44218)
              0.039991006 = queryNorm
            0.51133573 = fieldWeight in 3322, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              4.4533744 = idf(docFreq=1398, maxDocs=44218)
              0.046875 = fieldNorm(doc=3322)
      0.125 = coord(1/8)
    
    Abstract
    Despite the rapid growth in social network sites and in data mining for emotion (sentiment analysis), little research has tied the two together, and none has had social science goals. This article examines the extent to which emotion is present in MySpace comments, using a combination of data mining and content analysis, and exploring age and gender. A random sample of 819 public comments to or from U.S. users was manually classified for strength of positive and negative emotion. Two thirds of the comments expressed positive emotion, but a minority (20%) contained negative emotion, confirming that MySpace is an extraordinarily emotion-rich environment. Females are likely to give and receive more positive comments than are males, but there is no difference for negative comments. It is thus possible that females are more successful social network site users partly because of their greater ability to textually harness positive affect.
  3. Liu, W.; Weichselbraun, A.; Scharl, A.; Chang, E.: Semi-automatic ontology extension using spreading activation (2005) 0.01
    0.010843484 = product of:
      0.08674787 = sum of:
        0.08674787 = weight(_text_:network in 3028) [ClassicSimilarity], result of:
          0.08674787 = score(doc=3028,freq=4.0), product of:
            0.17809492 = queryWeight, product of:
              4.4533744 = idf(docFreq=1398, maxDocs=44218)
              0.039991006 = queryNorm
            0.48708782 = fieldWeight in 3028, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              4.4533744 = idf(docFreq=1398, maxDocs=44218)
              0.0546875 = fieldNorm(doc=3028)
      0.125 = coord(1/8)
    
    Abstract
    This paper describes a system to semi-automatically extend and refine ontologies by mining textual data from the Web sites of international online media. Expanding a seed ontology creates a semantic network through co-occurrence analysis, trigger phrase analysis, and disambiguation based on the WordNet lexical dictionary. Spreading activation then processes this semantic network to find the most probable candidates for inclusion in an extended ontology. Approaches to identifying hierarchical relationships such as subsumption, head noun analysis and WordNet consultation are used to confirm and classify the found relationships. Using a seed ontology on "climate change" as an example, this paper demonstrates how spreading activation improves the result by naturally integrating the mentioned methods.
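The spreading-activation step described above can be pictured as pulsing energy from the seed concepts through a weighted co-occurrence network and ranking the terms that accumulate the most activation as extension candidates. A minimal sketch of that idea follows; the graph, weights, seed terms, and decay parameter are invented for illustration and are not values from the paper.

```python
from collections import defaultdict

def spread_activation(graph, seeds, decay=0.5, iterations=3):
    """Propagate activation from seed concepts over a weighted co-occurrence graph.

    graph: {node: {neighbour: weight}}, weights normalised per node.
    Returns non-seed nodes ranked by accumulated activation.
    """
    activation = defaultdict(float)
    for s in seeds:
        activation[s] = 1.0
    for _ in range(iterations):
        pulse = defaultdict(float)
        for node, energy in list(activation.items()):
            for neighbour, weight in graph.get(node, {}).items():
                pulse[neighbour] += energy * weight * decay
        for node, energy in pulse.items():
            activation[node] += energy
    return sorted(((n, a) for n, a in activation.items() if n not in seeds),
                  key=lambda x: -x[1])

# Toy co-occurrence network around a "climate change" seed ontology.
graph = {
    "climate change": {"emission": 0.6, "greenhouse gas": 0.4},
    "emission": {"carbon dioxide": 0.7, "kyoto protocol": 0.3},
    "greenhouse gas": {"carbon dioxide": 0.8, "methane": 0.2},
}
print(spread_activation(graph, seeds=["climate change"]))
```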
  4. Information visualization in data mining and knowledge discovery (2002) 0.01
    0.007952074 = product of:
      0.031808294 = sum of:
        0.026390066 = weight(_text_:computer in 1789) [ClassicSimilarity], result of:
          0.026390066 = score(doc=1789,freq=10.0), product of:
            0.1461475 = queryWeight, product of:
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.039991006 = queryNorm
            0.18057145 = fieldWeight in 1789, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.015625 = fieldNorm(doc=1789)
        0.0054182294 = product of:
          0.010836459 = sum of:
            0.010836459 = weight(_text_:22 in 1789) [ClassicSimilarity], result of:
              0.010836459 = score(doc=1789,freq=2.0), product of:
                0.1400417 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.039991006 = queryNorm
                0.07738023 = fieldWeight in 1789, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.015625 = fieldNorm(doc=1789)
          0.5 = coord(1/2)
      0.25 = coord(2/8)
    
    Date
    23. 3.2008 19:10:22
    Footnote
    Rez. in: JASIST 54(2003) no.9, S.905-906 (C.A. Badurek): "Visual approaches for knowledge discovery in very large databases are a prime research need for information scientists focused on extracting meaningful information from the ever-growing stores of data from a variety of domains, including business, the geosciences, and satellite and medical imagery. This work presents a summary of research efforts in the fields of data mining, knowledge discovery, and data visualization with the goal of aiding the integration of research approaches and techniques from these major fields. The editors, leading computer scientists from academia and industry, present a collection of 32 papers from contributors who are incorporating visualization and data mining techniques through academic research as well as application development in industry and government agencies. Information Visualization focuses upon techniques to enhance the natural abilities of humans to visually understand data, in particular, large-scale data sets. It is primarily concerned with developing interactive graphical representations to enable users to more intuitively make sense of multidimensional data as part of the data exploration process. It includes research from computer science, psychology, human-computer interaction, statistics, and information science. Knowledge Discovery in Databases (KDD) most often refers to the process of mining databases for previously unknown patterns and trends in data. Data mining refers to the particular computational methods or algorithms used in this process. The data mining research field is most related to computational advances in database theory, artificial intelligence and machine learning. This work compiles research summaries from these main research areas in order to provide "a reference work containing the collection of thoughts and ideas of noted researchers from the fields of data mining and data visualization" (p. 8). It addresses these areas in three main sections: the first on data visualization, the second on KDD and model visualization, and the last on using visualization in the knowledge discovery process. The seven chapters of Part One focus upon methodologies and successful techniques from the field of Data Visualization. Hoffman and Grinstein (Chapter 2) give a particularly good overview of the field of data visualization and its potential application to data mining. An introduction to the terminology of data visualization, relation to perceptual and cognitive science, and discussion of the major visualization display techniques are presented. Discussion and illustration explain the usefulness and proper context of such data visualization techniques as scatter plots, 2D and 3D isosurfaces, glyphs, parallel coordinates, and radial coordinate visualizations. Remaining chapters present the need for standardization of visualization methods, discussion of user requirements in the development of tools, and examples of using information visualization in addressing research problems.
    With contributors almost exclusively from the computer science field, the intended audience of this work is heavily slanted towards a computer science perspective. However, it is highly readable and provides introductory material that would be useful to information scientists from a variety of domains. Yet, much interesting work in information visualization from other fields could have been included, giving the work more of an interdisciplinary perspective to complement their goals of integrating work in this area. Unfortunately, many of the application chapters are terse, shallow, and lack complementary illustrations of visualization techniques or user interfaces used. However, they do provide insight into the many applications being developed in this rapidly expanding field. The authors have successfully put together a highly useful reference text for the data mining and information visualization communities. Those interested in a good introduction and overview of complementary research areas in these fields will be satisfied with this collection of papers. The focus upon integrating data visualization with data mining complements texts in each of these fields, such as Advances in Knowledge Discovery and Data Mining (Fayyad et al., MIT Press) and Readings in Information Visualization: Using Vision to Think (Card et al., Morgan Kaufmann). This unique work is a good starting point for future interaction between researchers in the fields of data visualization and data mining and makes a good accompaniment for a course focused on integrating these areas or to the main reference texts in these fields."
  5. Keim, D.A.: Datenvisualisierung und Data Mining (2004) 0.01
    0.0052157952 = product of:
      0.041726362 = sum of:
        0.041726362 = weight(_text_:computer in 2931) [ClassicSimilarity], result of:
          0.041726362 = score(doc=2931,freq=4.0), product of:
            0.1461475 = queryWeight, product of:
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.039991006 = queryNorm
            0.28550854 = fieldWeight in 2931, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2931)
      0.125 = coord(1/8)
    
    Abstract
    The rapid technological development of the last two decades makes it possible today to persistently store enormous amounts of data on computers. Researchers at the University of Berkeley have calculated that roughly 1 exabyte (= 1 million terabytes) of data is generated every year, a large part of it in digital form. This means that more data will be generated in the next three years than in all of previous human history. The data are often recorded automatically by sensors and monitoring systems; everyday activities such as paying by credit card or using the telephone, for example, are logged by computers. Usually all available parameters are stored, which produces high-dimensional data sets. The data are collected because they contain valuable information that can provide a competitive advantage. Finding this valuable information in the large volumes of data, however, is no easy task. Today's database management systems can display only small subsets of these enormous data volumes: if the data are output in textual form, for instance, at most a few hundred lines can be shown on the screen. With millions of records, this is only a drop in the bucket.
  6. Medien-Informationsmanagement : Archivarische, dokumentarische, betriebswirtschaftliche, rechtliche und Berufsbild-Aspekte ; [Frühjahrstagung der Fachgruppe 7 im Jahr 2000 in Weimar und Folgetagung 2001 in Köln] (2003) 0.01
    0.005154173 = product of:
      0.041233383 = sum of:
        0.041233383 = sum of:
          0.024978695 = weight(_text_:resources in 1833) [ClassicSimilarity], result of:
            0.024978695 = score(doc=1833,freq=4.0), product of:
              0.14598069 = queryWeight, product of:
                3.650338 = idf(docFreq=3122, maxDocs=44218)
                0.039991006 = queryNorm
              0.17110959 = fieldWeight in 1833, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.650338 = idf(docFreq=3122, maxDocs=44218)
                0.0234375 = fieldNorm(doc=1833)
          0.016254688 = weight(_text_:22 in 1833) [ClassicSimilarity], result of:
            0.016254688 = score(doc=1833,freq=2.0), product of:
              0.1400417 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.039991006 = queryNorm
              0.116070345 = fieldWeight in 1833, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0234375 = fieldNorm(doc=1833)
      0.125 = coord(1/8)
    
    Date
    11. 5.2008 19:49:22
    LCSH
    Mass media / Archival resources / Congresses
    Subject
    Mass media / Archival resources / Congresses
  7. Fenstermacher, K.D.; Ginsburg, M.: Client-side monitoring for Web mining (2003) 0.00
    0.004425749 = product of:
      0.035405993 = sum of:
        0.035405993 = weight(_text_:computer in 1611) [ClassicSimilarity], result of:
          0.035405993 = score(doc=1611,freq=2.0), product of:
            0.1461475 = queryWeight, product of:
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.039991006 = queryNorm
            0.24226204 = fieldWeight in 1611, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.046875 = fieldNorm(doc=1611)
      0.125 = coord(1/8)
    
    Abstract
    "Garbage in, garbage out" is a well-known phrase in computer analysis, and one that comes to mind when mining Web data to draw conclusions about Web users. The challenge is that data analysts wish to infer patterns of client-side behavior from server-side data. However, because only a fraction of the user's actions ever reaches the Web server, analysts must rely an incomplete data. In this paper, we propose a client-side monitoring system that is unobtrusive and supports flexible data collection. Moreover, the proposed framework encompasses client-side applications beyond the Web browser. Expanding monitoring beyond the browser to incorporate standard office productivity tools enables analysts to derive a much richer and more accurate picture of user behavior an the Web.
  8. Gluck, M.: Multimedia exploratory data analysis for geospatial data mining : the case for augmented seriation (2001) 0.00
    0.004425749 = product of:
      0.035405993 = sum of:
        0.035405993 = weight(_text_:computer in 5214) [ClassicSimilarity], result of:
          0.035405993 = score(doc=5214,freq=2.0), product of:
            0.1461475 = queryWeight, product of:
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.039991006 = queryNorm
            0.24226204 = fieldWeight in 5214, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.046875 = fieldNorm(doc=5214)
      0.125 = coord(1/8)
    
    Abstract
    To prevent type-one error, statisticians tend to accept the possibility of type-two error, which leads to the rejection of hypotheses later shown to be true. In both Exploratory Data Analysis and data mining, the emphasis is more appropriately on the elimination of type-two error. Thus EDA methods, including its visualization tools, may be appropriate for Data Mining. Seriation creates a matrix of observations and variables, where the cells contain an icon whose size represents its value, and permits the movement of rows and columns in order to visually discern patterns. Augmented Seriation, a method of data mining, adds computer graphics, sound, color, and extra dimensions to the matrix so that the analyst has different modalities for pattern observation. Gluck has developed software for such analysis.
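A small sketch of the seriation idea described above, reduced to a plain reordering of a data matrix rather than Gluck's multimedia tool (that simplification is an assumption made only for illustration): rows and columns are permuted so that similar profiles end up adjacent and blocks of large values become easier to spot.

```python
import numpy as np

def seriate(matrix):
    """Greedy seriation: order rows and columns by their mean value so that
    large cells drift toward one corner of the matrix. A crude stand-in for
    interactively dragging rows/columns until a pattern becomes visible."""
    row_order = np.argsort(matrix.mean(axis=1))
    col_order = np.argsort(matrix.mean(axis=0))
    return matrix[np.ix_(row_order, col_order)], row_order, col_order

# Toy observations-by-variables matrix (cell values would be drawn as icon sizes).
data = np.array([[9, 1, 8],
                 [2, 7, 1],
                 [8, 2, 9]])
reordered, rows, cols = seriate(data)
print(reordered)   # rows/columns with similar profiles are now adjacent
```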
  9. Sánchez, D.; Chamorro-Martínez, J.; Vila, M.A.: Modelling subjectivity in visual perception of orientation for image retrieval (2003) 0.00
    0.004425749 = product of:
      0.035405993 = sum of:
        0.035405993 = weight(_text_:computer in 1067) [ClassicSimilarity], result of:
          0.035405993 = score(doc=1067,freq=2.0), product of:
            0.1461475 = queryWeight, product of:
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.039991006 = queryNorm
            0.24226204 = fieldWeight in 1067, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.046875 = fieldNorm(doc=1067)
      0.125 = coord(1/8)
    
    Abstract
    In this paper we combine computer vision and data mining techniques to model high-level concepts for image retrieval, on the basis of basic perceptual features of the human visual system. High-level concepts related to these features are learned and represented by means of a set of fuzzy association rules. The concepts so acquired can be used for image retrieval, with the advantage that it is not necessary to provide an image as a query. Instead, a query is formulated by using the labels that identify the learned concepts as search terms, and the retrieval process calculates the relevance of an image to the query by an inference mechanism. An additional feature of our methodology is that it can capture the user's subjectivity. For that purpose, fuzzy set theory is employed to measure the user's assessments about the fulfillment of a concept by an image.
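A minimal sketch of the retrieval-by-inference idea described above. The perceptual feature (dominant edge angle), the membership functions, and the single rule per concept are invented placeholders, not the rules actually learned in the paper.

```python
def triangular(x, a, b, c):
    """Triangular fuzzy membership function over [a, c] with peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Hypothetical rules: IF dominant edge angle is "near horizontal/vertical"
# THEN the image fulfils the corresponding high-level orientation concept.
CONCEPT_RULES = {
    "horizontal orientation": lambda f: triangular(f["edge_angle_deg"], -20, 0, 20),
    "vertical orientation":   lambda f: triangular(f["edge_angle_deg"], 70, 90, 110),
}

def relevance_to_query(image_features, query_concepts):
    """Degree to which an image satisfies all queried concepts (min-combination)."""
    return min(CONCEPT_RULES[c](image_features) for c in query_concepts)

image = {"edge_angle_deg": 5.0}   # low-level feature extracted from the image
print(relevance_to_query(image, ["horizontal orientation"]))  # ≈ 0.75
```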
  10. Hereth, J.; Stumme, G.; Wille, R.; Wille, U.: Conceptual knowledge discovery and data analysis (2000) 0.00
    0.0036881242 = product of:
      0.029504994 = sum of:
        0.029504994 = weight(_text_:computer in 5083) [ClassicSimilarity], result of:
          0.029504994 = score(doc=5083,freq=2.0), product of:
            0.1461475 = queryWeight, product of:
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.039991006 = queryNorm
            0.20188503 = fieldWeight in 5083, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5083)
      0.125 = coord(1/8)
    
    Series
    Lecture notes in computer science; vol.1867: Lecture notes in artificial intelligence
  11. Lam, W.; Yang, C.C.; Menczer, F.: Introduction to the special topic section on mining Web resources for enhancing information retrieval (2007) 0.00
    0.0036427265 = product of:
      0.029141812 = sum of:
        0.029141812 = product of:
          0.058283623 = sum of:
            0.058283623 = weight(_text_:resources in 600) [ClassicSimilarity], result of:
              0.058283623 = score(doc=600,freq=4.0), product of:
                0.14598069 = queryWeight, product of:
                  3.650338 = idf(docFreq=3122, maxDocs=44218)
                  0.039991006 = queryNorm
                0.39925572 = fieldWeight in 600, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.650338 = idf(docFreq=3122, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=600)
          0.5 = coord(1/2)
      0.125 = coord(1/8)
    
    Footnote
    Introduction to a special topic section on "Mining Web resources for enhancing information retrieval"
  12. Relational data mining (2001) 0.00
    0.003122337 = product of:
      0.024978695 = sum of:
        0.024978695 = product of:
          0.04995739 = sum of:
            0.04995739 = weight(_text_:resources in 1303) [ClassicSimilarity], result of:
              0.04995739 = score(doc=1303,freq=4.0), product of:
                0.14598069 = queryWeight, product of:
                  3.650338 = idf(docFreq=3122, maxDocs=44218)
                  0.039991006 = queryNorm
                0.34221917 = fieldWeight in 1303, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.650338 = idf(docFreq=3122, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1303)
          0.5 = coord(1/2)
      0.125 = coord(1/8)
    
    Abstract
    As the first book devoted to relational data mining, this coherently written multi-author monograph provides a thorough introduction and systematic overview of the area. The first part introduces the reader to the basics and principles of classical knowledge discovery in databases and inductive logic programming; subsequent chapters by leading experts assess the techniques in relational data mining in a principled and comprehensive way; finally, three chapters deal with advanced applications in various fields and refer the reader to resources for relational data mining. This book will become a valuable source of reference for R&D professionals active in relational data mining. Students as well as IT professionals and ambitious practitioners interested in learning about relational data mining will appreciate the book as a useful text and gentle introduction to this exciting new field.
    Theme
    Information Resources Management
  13. Cohen, D.J.: From Babel to knowledge : data mining large digital collections (2006) 0.00
    0.0029504993 = product of:
      0.023603994 = sum of:
        0.023603994 = weight(_text_:computer in 1178) [ClassicSimilarity], result of:
          0.023603994 = score(doc=1178,freq=2.0), product of:
            0.1461475 = queryWeight, product of:
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.039991006 = queryNorm
            0.16150802 = fieldWeight in 1178, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.03125 = fieldNorm(doc=1178)
      0.125 = coord(1/8)
    
    Abstract
    In Jorge Luis Borges's curious short story The Library of Babel, the narrator describes an endless collection of books stored from floor to ceiling in a labyrinth of countless hexagonal rooms. The pages of the library's books seem to contain random sequences of letters and spaces; occasionally a few intelligible words emerge in the sea of paper and ink. Nevertheless, readers diligently, and exasperatingly, scan the shelves for coherent passages. The narrator himself has wandered numerous rooms in search of enlightenment, but with resignation he simply awaits his death and burial - which Borges explains (with signature dark humor) consists of being tossed unceremoniously over the library's banister. Borges's nightmare, of course, is a cursed vision of the research methods of disciplines such as literature, history, and philosophy, where the careful reading of books, one after the other, is supposed to lead inexorably to knowledge and understanding. Computer scientists would approach Borges's library far differently. Employing the information theory that forms the basis for search engines and other computerized techniques for assessing in one fell swoop large masses of documents, they would quickly realize the collection's incoherence through sampling and statistical methods - and wisely start looking for the library's exit. These computational methods, which allow us to find patterns, determine relationships, categorize documents, and extract information from massive corpuses, will form the basis for new tools for research in the humanities and other disciplines in the coming decade. For the past three years I have been experimenting with how to provide such end-user tools - that is, tools that harness the power of vast electronic collections while hiding much of their complicated technical plumbing. In particular, I have made extensive use of the application programming interfaces (APIs) the leading search engines provide for programmers to query their databases directly (from server to server without using their web interfaces). In addition, I have explored how one might extract information from large digital collections, from the well-curated lexicographic database WordNet to the democratic (and poorly curated) online reference work Wikipedia. While processing these digital corpuses is currently an imperfect science, even now useful tools can be created by combining various collections and methods for searching and analyzing them. And more importantly, these nascent services suggest a future in which information can be gleaned from, and sense can be made out of, even imperfect digital libraries of enormous scale. A brief examination of two approaches to data mining large digital collections hints at this future, while also providing some lessons about how to get there.
  14. Wang, W.M.; Cheung, C.F.; Lee, W.B.; Kwok, S.K.: Mining knowledge from natural language texts using fuzzy associated concept mapping (2008) 0.00
    0.0029504993 = product of:
      0.023603994 = sum of:
        0.023603994 = weight(_text_:computer in 2121) [ClassicSimilarity], result of:
          0.023603994 = score(doc=2121,freq=2.0), product of:
            0.1461475 = queryWeight, product of:
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.039991006 = queryNorm
            0.16150802 = fieldWeight in 2121, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.03125 = fieldNorm(doc=2121)
      0.125 = coord(1/8)
    
    Abstract
    Natural Language Processing (NLP) techniques have been successfully used to automatically extract information from unstructured text through a detailed analysis of their content, often to satisfy particular information needs. In this paper, an automatic concept map construction technique, Fuzzy Association Concept Mapping (FACM), is proposed for the conversion of abstracted short texts into concept maps. The approach consists of a linguistic module and a recommendation module. The linguistic module is a text mining method that does not require the user to have any prior knowledge of NLP techniques. It incorporates rule-based reasoning (RBR) and case-based reasoning (CBR) for anaphoric resolution. It aims at extracting the propositions in text so as to construct a concept map automatically. The recommendation module is arrived at by adopting fuzzy set theories. It is an interactive process which provides suggestions of propositions for further human refinement of the automatically generated concept maps. The suggested propositions are relationships among the concepts which are not explicitly found in the paragraphs. This technique helps to stimulate individual reflection and generate new knowledge. Evaluation was carried out by using the Science Citation Index (SCI) abstract database and CNET News as test data, which are well-known databases whose text quality is assured. Experimental results show that the automatically generated concept maps conform to the outputs generated manually by domain experts, since the degree of difference between them is proportionally small. The method provides users with the ability to convert scientific and short texts into a structured format which can be easily processed by computer. Moreover, it provides knowledge workers with extra time to re-think their written text and to view their knowledge from another angle.
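As a rough illustration of the concept-map construction step described above (not the FACM linguistic module itself), a concept map can be held as a graph of (concept, linking phrase, concept) propositions; the sample propositions below are invented.

```python
from collections import defaultdict

def build_concept_map(propositions):
    """Store (concept, linking phrase, concept) propositions as an adjacency map."""
    concept_map = defaultdict(list)
    for subject, link, obj in propositions:
        concept_map[subject].append((link, obj))
    return concept_map

# Invented propositions of the kind a linguistic module might extract from an abstract.
propositions = [
    ("text mining", "extracts", "propositions"),
    ("propositions", "are assembled into", "concept map"),
    ("fuzzy set theory", "suggests", "additional propositions"),
]
for concept, edges in build_concept_map(propositions).items():
    for link, target in edges:
        print(f"{concept} --{link}--> {target}")
```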
  15. Wang, F.L.; Yang, C.C.: Mining Web data for Chinese segmentation (2007) 0.00
    0.0026019474 = product of:
      0.02081558 = sum of:
        0.02081558 = product of:
          0.04163116 = sum of:
            0.04163116 = weight(_text_:resources in 604) [ClassicSimilarity], result of:
              0.04163116 = score(doc=604,freq=4.0), product of:
                0.14598069 = queryWeight, product of:
                  3.650338 = idf(docFreq=3122, maxDocs=44218)
                  0.039991006 = queryNorm
                0.28518265 = fieldWeight in 604, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.650338 = idf(docFreq=3122, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=604)
          0.5 = coord(1/2)
      0.125 = coord(1/8)
    
    Abstract
    Modern information retrieval systems use keywords within documents as indexing terms for search of relevant documents. As Chinese is an ideographic character-based language, the words in the texts are not delimited by white spaces. Indexing of Chinese documents is impossible without a proper segmentation algorithm. Many Chinese segmentation algorithms have been proposed in the past. Traditional segmentation algorithms cannot operate without a large dictionary or a large corpus of training data. Nowadays, the Web has become the largest corpus that is ideal for Chinese segmentation. Although most search engines have problems in segmenting texts into proper words, they maintain huge databases of documents and frequencies of character sequences in the documents. Their databases are important potential resources for segmentation. In this paper, we propose a segmentation algorithm by mining Web data with the help of search engines. On the other hand, the Romanized pinyin of the Chinese language indicates boundaries of words in the text. Our algorithm is the first to utilize the Romanized pinyin for segmentation. It is the first unified segmentation algorithm for the Chinese language from different geographical areas, and it is also domain independent because of the nature of the Web. Experiments have been conducted on the datasets of a recent Chinese segmentation competition. The results show that our algorithm outperforms the traditional algorithms in terms of precision and recall. Moreover, our algorithm can effectively deal with the problems of segmentation ambiguity, new word (unknown word) detection, and stop words.
    Footnote
    Contribution to a special topic section "Mining Web resources for enhancing information retrieval"
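The core idea in the abstract above (preferring a segmentation whose candidate words are well attested in a very large corpus such as the Web) can be sketched as a dynamic programme over character positions. The toy frequency table below stands in for the character-sequence counts a search engine database would supply, and the probability-product scoring is a simplification, not the authors' actual algorithm.

```python
from functools import lru_cache

# Toy stand-in for character-sequence counts that a search engine database would supply.
COUNTS = {"数据": 900, "挖掘": 700, "数据挖掘": 1200, "数": 50, "据": 40, "挖": 30, "掘": 20}
TOTAL = sum(COUNTS.values())

def segment(text, max_word_len=4):
    """Pick the segmentation maximising the product of (smoothed) word probabilities."""
    def prob(word):
        return COUNTS.get(word, 0.01) / TOTAL   # unseen sequences get a tiny mass

    @lru_cache(maxsize=None)
    def best(i):
        if i == len(text):
            return 1.0, ()
        candidates = []
        for j in range(i + 1, min(len(text), i + max_word_len) + 1):
            word = text[i:j]
            rest_p, rest_words = best(j)
            candidates.append((prob(word) * rest_p, (word,) + rest_words))
        return max(candidates)

    return list(best(0)[1])

print(segment("数据挖掘"))   # ['数据挖掘'] under this toy probability table
```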
  16. Liu, Y.; Zhang, M.; Cen, R.; Ru, L.; Ma, S.: Data cleansing for Web information retrieval using query independent features (2007) 0.00
    0.0026019474 = product of:
      0.02081558 = sum of:
        0.02081558 = product of:
          0.04163116 = sum of:
            0.04163116 = weight(_text_:resources in 607) [ClassicSimilarity], result of:
              0.04163116 = score(doc=607,freq=4.0), product of:
                0.14598069 = queryWeight, product of:
                  3.650338 = idf(docFreq=3122, maxDocs=44218)
                  0.039991006 = queryNorm
                0.28518265 = fieldWeight in 607, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.650338 = idf(docFreq=3122, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=607)
          0.5 = coord(1/2)
      0.125 = coord(1/8)
    
    Abstract
    Understanding what kinds of Web pages are the most useful for Web search engine users is a critical task in Web information retrieval (IR). Most previous works used hyperlink analysis algorithms to solve this problem. However, little research has been focused on query-independent Web data cleansing for Web IR. In this paper, we first provide analysis of the differences between retrieval target pages and ordinary ones based on more than 30 million Web pages obtained from both the Text Retrieval Conference (TREC) and a widely used Chinese search engine, SOGOU (www.sogou.com). We further propose a learning-based data cleansing algorithm for reducing Web pages that are unlikely to be useful for user requests. We found that there exists a large proportion of low-quality Web pages in both the English and the Chinese Web page corpus, and retrieval target pages can be identified using query-independent features and cleansing algorithms. The experimental results showed that our algorithm is effective in reducing a large portion of Web pages with a small loss in retrieval target pages. It makes it possible for Web IR tools to meet a large fraction of users' needs with only a small part of pages on the Web. These results may help Web search engines make better use of their limited storage and computation resources to improve search performance.
    Footnote
    Contribution to a special topic section "Mining Web resources for enhancing information retrieval"
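A sketch of what a query-independent cleansing step of the kind described in the abstract above could look like. The features (text length, in-link count, URL depth), the hand-set weights, and the threshold are invented for illustration; the paper itself uses a learning-based algorithm and its own feature set.

```python
def low_quality_score(page):
    """Combine query-independent features into a crude 'unlikely to be a retrieval
    target' score; pages above the threshold would be dropped before indexing."""
    score = 0.0
    if page["text_length"] < 200:        # very little content
        score += 0.4
    if page["inlink_count"] < 2:         # hardly linked to
        score += 0.4
    if page["url_depth"] > 4:            # buried deep in the site hierarchy
        score += 0.2
    return score

pages = [
    {"url": "http://example.org/",            "text_length": 3500, "inlink_count": 120, "url_depth": 0},
    {"url": "http://example.org/a/b/c/d/e/x", "text_length": 90,   "inlink_count": 0,   "url_depth": 6},
]
kept = [p["url"] for p in pages if low_quality_score(p) < 0.5]
print(kept)   # only the first page survives this cleansing step
```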
  17. Seidenfaden, U.: Schürfen in Datenbergen : Data-Mining soll möglichst viel Information zu Tage fördern (2001) 0.00
    0.0025816867 = product of:
      0.020653494 = sum of:
        0.020653494 = weight(_text_:computer in 6923) [ClassicSimilarity], result of:
          0.020653494 = score(doc=6923,freq=2.0), product of:
            0.1461475 = queryWeight, product of:
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.039991006 = queryNorm
            0.14131951 = fieldWeight in 6923, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.6545093 = idf(docFreq=3109, maxDocs=44218)
              0.02734375 = fieldNorm(doc=6923)
      0.125 = coord(1/8)
    
    Content
    "Fast alles wird heute per Computer erfasst. Kaum einer überblickt noch die enormen Datenmengen, die sich in Unternehmen, Universitäten und Verwaltung ansammeln. Allein in den öffentlich zugänglichen Datenbanken der Genforscher fallen pro Woche rund 4,5 Gigabyte an neuer Information an. "Vom potentiellen Wissen in den Datenbanken wird bislang aber oft nur ein Teil genutzt", meint Stefan Wrobel vom Lehrstuhl für Wissensentdeckung und Maschinelles Lernen der Otto-von-Guericke-Universität in Magdeburg. Sein Doktorand Mark-Andre Krogel hat soeben mit einem neuen Verfahren zur Datenbankrecherche in San Francisco einen inoffiziellen Weltmeister-Titel in der Disziplin "Data-Mining" gewonnen. Dieser Daten-Bergbau arbeitet im Unterschied zur einfachen Datenbankabfrage, die sich einfacher statistischer Methoden bedient, zusätzlich mit künstlicher Intelligenz und Visualisierungsverfahren, um Querverbindungen zu finden. "Das erleichtert die Suche nach verborgenen Zusammenhängen im Datenmaterial ganz erheblich", so Wrobel. Die Wirtschaft setzt Data-Mining bereits ein, um das Kundenverhalten zu untersuchen und vorherzusagen. "Stellen sie sich ein Unternehmen mit einer breiten Produktpalette und einem großen Kundenstamm vor", erklärt Wrobel. "Es kann seinen Erfolg maximieren, wenn es Marketing-Post zielgerichtet an seine Kunden verschickt. Wer etwa gerade einen PC gekauft hat, ist womöglich auch an einem Drucker oder Scanner interessiert." In einigen Jahren könnte ein Analysemodul den Manager eines Unternehmens selbständig informieren, wenn ihm etwas Ungewöhnliches aufgefallen ist. Das muss nicht immer positiv für den Kunden sein. Data-Mining ließe sich auch verwenden, um die Lebensdauer von Geschäftsbeziehungen zu prognostizieren. Für Kunden mit geringen Kaufinteressen würden Reklamationen dann längere Bearbeitungszeiten nach sich ziehen. Im konkreten Projekt von Mark-Andre Krogel ging es um die Vorhersage von Protein-Funktionen. Proteine sind Eiweißmoleküle, die fast alle Stoffwechselvorgänge im menschlichen Körper steuern. Sie sind daher die primären Ziele von Wirkstoffen zur Behandlung von Erkrankungen. Das erklärt das große Interesse der Pharmaindustrie. Experimentelle Untersuchungen, die Aufschluss über die Aufgaben der über 100 000 Eiweißmoleküle im menschlichen Körper geben können, sind mit einem hohen Zeitaufwand verbunden. Die Forscher möchten deshalb die Zeit verkürzen, indem sie das vorhandene Datenmaterial mit Hilfe von Data-Mining auswerten. Aus der im Humangenomprojekt bereits entschlüsselten Abfolge der Erbgut-Bausteine lässt sich per Datenbankanalyse die Aneinanderreihung bestimmter Aminosäuren zu einem Protein vorhersagen. Andere Datenbanken wiederum enthalten Informationen, welche Struktur ein Protein mit einer bestimmten vorgegebenen Funktion haben könnte. Aus bereits bekannten Strukturelementen versuchen die Genforscher dann, auf die mögliche Funktion eines bislang noch unbekannten Eiweißmoleküls zu schließen.- Fakten Verschmelzen - Bei diesem theoretischen Ansatz kommt es darauf an, die in Datenbanken enthaltenen Informationen so zu verknüpfen, dass die Ergebnisse mit hoher Wahrscheinlichkeit mit der Realität übereinstimmen. "Im Rahmen des Wettbewerbs erhielten wir Tabellen als Vorgabe, in denen Gene und Chromosomen nach bestimmten Gesichtspunkten klassifiziert waren", erläutert Krogel. Von einigen Genen war bekannt, welche Proteine sie produzieren und welche Aufgabe diese Eiweißmoleküle besitzen. 
Diese Beispiele dienten dem von Krogel entwickelten Programm dann als Hilfe, für andere Gene vorherzusagen, welche Funktionen die von ihnen erzeugten Proteine haben. "Die Genauigkeit der Vorhersage lag bei den gestellten Aufgaben bei über 90 Prozent", stellt Krogel fest. Allerdings könne man in der Praxis nicht davon ausgehen, dass alle Informationen aus verschiedenen Datenbanken in einem einheitlichen Format vorliegen. Es gebe verschiedene Abfragesprachen der Datenbanken, und die Bezeichnungen von Eiweißmolekülen mit gleicher Aufgabe seien oftmals uneinheitlich. Die Magdeburger Informatiker arbeiten deshalb in der DFG-Forschergruppe "Informationsfusion" an Methoden, um die verschiedenen Datenquellen besser zu erschließen."
  18. Baeza-Yates, R.; Hurtado, C.; Mendoza, M.: Improving search engines by query clustering (2007) 0.00
    0.0025757966 = product of:
      0.020606373 = sum of:
        0.020606373 = product of:
          0.041212745 = sum of:
            0.041212745 = weight(_text_:resources in 601) [ClassicSimilarity], result of:
              0.041212745 = score(doc=601,freq=2.0), product of:
                0.14598069 = queryWeight, product of:
                  3.650338 = idf(docFreq=3122, maxDocs=44218)
                  0.039991006 = queryNorm
                0.28231642 = fieldWeight in 601, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.650338 = idf(docFreq=3122, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=601)
          0.5 = coord(1/2)
      0.125 = coord(1/8)
    
    Footnote
    Contribution to a special topic section "Mining Web resources for enhancing information retrieval"
  19. Perugini, S.; Ramakrishnan, N.: Mining Web functional dependencies for flexible information access (2007) 0.00
    0.0022078257 = product of:
      0.017662605 = sum of:
        0.017662605 = product of:
          0.03532521 = sum of:
            0.03532521 = weight(_text_:resources in 602) [ClassicSimilarity], result of:
              0.03532521 = score(doc=602,freq=2.0), product of:
                0.14598069 = queryWeight, product of:
                  3.650338 = idf(docFreq=3122, maxDocs=44218)
                  0.039991006 = queryNorm
                0.2419855 = fieldWeight in 602, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.650338 = idf(docFreq=3122, maxDocs=44218)
                  0.046875 = fieldNorm(doc=602)
          0.5 = coord(1/2)
      0.125 = coord(1/8)
    
    Footnote
    Contribution to a special topic section "Mining Web resources for enhancing information retrieval"
  20. Schwartz, F.; Fang, Y.C.: Citation data analysis on hydrogeology (2007) 0.00
    0.0020815579 = product of:
      0.016652463 = sum of:
        0.016652463 = product of:
          0.033304926 = sum of:
            0.033304926 = weight(_text_:resources in 433) [ClassicSimilarity], result of:
              0.033304926 = score(doc=433,freq=4.0), product of:
                0.14598069 = queryWeight, product of:
                  3.650338 = idf(docFreq=3122, maxDocs=44218)
                  0.039991006 = queryNorm
                0.22814612 = fieldWeight in 433, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  3.650338 = idf(docFreq=3122, maxDocs=44218)
                  0.03125 = fieldNorm(doc=433)
          0.5 = coord(1/2)
      0.125 = coord(1/8)
    
    Abstract
    This article explores the status of research in hydrogeology using data mining techniques. First we try to explain what citation analysis is and review some of the previous work on citation analysis. The main idea in this article is to address some common issues about citation numbers and the use of these data. To validate the use of citation numbers, we compare the citation patterns for Water Resources Research papers in the 1980s with those in the 1990s. The citation growths for highly cited authors from the 1980s are used to examine whether it is possible to predict the citation patterns for highly-cited authors in the 1990s. If the citation data prove to be steady and stable, these numbers then can be used to explore the evolution of science in hydrogeology. The famous quotation, "If you are not the lead dog, the scenery never changes," attributed to Lee Iacocca, points to the importance of an entrepreneurial spirit in all forms of endeavor. In the case of hydrogeological research, impact analysis makes it clear how important it is to be a pioneer. Statistical correlation coefficients are used to retrieve papers among a collection of 2,847 papers before and after 1991 sharing the same topics with 273 papers in 1991 in Water Resources Research. The numbers of papers before and after 1991 are then plotted against various levels of citations for papers in 1991 to compare the distributions of paper population before and after that year. The similarity metrics based on word counts can ensure that the "before" papers are like ancestors and "after" papers are descendants in the same type of research. This exercise gives us an idea of how many papers are populated before and after 1991 (1991 is chosen based on balanced numbers of papers before and after that year). In addition, the impact of papers is measured in terms of citation presented as "percentile," a relative measure based on rankings in one year, in order to minimize the effect of time.
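Two of the quantities mentioned above lend themselves to a short illustration: a word-count similarity for linking "before" and "after" papers to the 1991 set, and a percentile for expressing a paper's citation impact relative to its publication-year cohort. The sketch below uses cosine similarity over raw word counts and a simple rank-based percentile; both are plausible readings of the description, not the exact measures used in the article.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(text_a, text_b):
    """Cosine similarity between two documents represented by raw word counts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def citation_percentile(citations, cohort_citations):
    """Share of same-year papers whose citation count this paper equals or exceeds."""
    return 100.0 * sum(c <= citations for c in cohort_citations) / len(cohort_citations)

print(cosine_similarity("groundwater flow model", "regional groundwater flow simulation"))
print(citation_percentile(40, [0, 2, 5, 12, 40, 75]))   # ~83.3: above most of its cohort
```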

Languages

  • e 20
  • d 7

Types

  • a 22
  • m 5
  • s 3
  • el 1