Search (162 results, page 2 of 9)

  • theme_ss:"Data Mining"
  1. Wu, T.; Pottenger, W.M.: A semi-supervised active learning algorithm for information extraction from textual data (2005) 0.01
    0.00795386 = product of:
      0.01988465 = sum of:
        0.01021673 = weight(_text_:a in 3237) [ClassicSimilarity], result of:
          0.01021673 = score(doc=3237,freq=18.0), product of:
            0.053464882 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046368346 = queryNorm
            0.19109234 = fieldWeight in 3237, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3237)
        0.009667919 = product of:
          0.019335838 = sum of:
            0.019335838 = weight(_text_:information in 3237) [ClassicSimilarity], result of:
              0.019335838 = score(doc=3237,freq=12.0), product of:
                0.08139861 = queryWeight, product of:
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046368346 = queryNorm
                0.23754507 = fieldWeight in 3237, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3237)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
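
The breakdown above appears to follow Lucene's ClassicSimilarity (TF-IDF) scoring. As a minimal sketch, assuming tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), the arithmetic for doc 3237 can be reproduced in Python. The function names are illustrative only, and queryNorm and fieldNorm are copied from the listing rather than derived.

```python
import math

MAX_DOCS = 44218          # maxDocs as reported in the listing
QUERY_NORM = 0.046368346  # queryNorm copied from the listing, not recomputed


def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency, assuming the ClassicSimilarity formula."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, field_norm, query_norm=QUERY_NORM):
    """Score of one term in one document: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


def score_doc_3237():
    """Recombine the two term scores with the coord factors from the listing."""
    score_a = term_score(freq=18.0, doc_freq=37942, field_norm=0.0390625)
    # The "information" clause sits in a nested query with coord(1/2).
    score_info = 0.5 * term_score(freq=12.0, doc_freq=20772, field_norm=0.0390625)
    # Two of five top-level clauses matched: coord(2/5).
    return 0.4 * (score_a + score_info)


if __name__ == "__main__":
    print(score_doc_3237())
```

Run as a script, this should print a value close to the 0.00795386 reported for the first result, up to floating-point rounding (Lucene computes in single precision).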
    
    Abstract
    In this article we present a semi-supervised active learning algorithm for pattern discovery in information extraction from textual data. The patterns are reduced regular expressions composed of various characteristics of features useful in information extraction. Our major contribution is a semi-supervised learning algorithm that extracts information from a set of examples labeled as relevant or irrelevant to a given attribute. The approach is semi-supervised because it does not require precise labeling of the exact location of features in the training data. This significantly reduces the effort needed to develop a training set. An active learning algorithm is used to assist the semi-supervised learning algorithm to further reduce the training set development effort. The active learning algorithm is seeded with a single positive example of a given attribute. The context of the seed is used to automatically identify candidates for additional positive examples of the given attribute. Candidate examples are manually pruned during the active learning phase, and our semi-supervised learning algorithm automatically discovers reduced regular expressions for each attribute. We have successfully applied this learning technique in the extraction of textual features from police incident reports, university crime reports, and patents. The performance of our algorithm compares favorably with competitive extraction systems being used in criminal justice information systems.
    Source
    Journal of the American Society for Information Science and Technology. 56(2005) no.3, S.258-271
    Type
    a
  2. Ebrahimi, M.; ShafieiBavani, E.; Wong, R.; Chen, F.: Twitter user geolocation by filtering of highly mentioned users (2018) 0.01
    0.007848554 = product of:
      0.019621385 = sum of:
        0.012923255 = weight(_text_:a in 4286) [ClassicSimilarity], result of:
          0.012923255 = score(doc=4286,freq=20.0), product of:
            0.053464882 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046368346 = queryNorm
            0.24171482 = fieldWeight in 4286, product of:
              4.472136 = tf(freq=20.0), with freq of:
                20.0 = termFreq=20.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046875 = fieldNorm(doc=4286)
        0.0066981306 = product of:
          0.013396261 = sum of:
            0.013396261 = weight(_text_:information in 4286) [ClassicSimilarity], result of:
              0.013396261 = score(doc=4286,freq=4.0), product of:
                0.08139861 = queryWeight, product of:
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046368346 = queryNorm
                0.16457605 = fieldWeight in 4286, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046875 = fieldNorm(doc=4286)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Abstract
    Geolocated social media data provide a powerful source of information about places and regional human behavior. Because only a small amount of social media data have been geolocation-annotated, inference techniques play a substantial role to increase the volume of annotated data. Conventional research in this area has been based on the text content of posts from a given user or the social network of the user, with some recent crossovers between the text- and network-based approaches. This paper proposes a novel approach to categorize highly-mentioned users (celebrities) into Local and Global types, and consequently use Local celebrities as location indicators. A label propagation algorithm is then used over the refined social network for geolocation inference. Finally, we propose a hybrid approach by merging a text-based method as a back-off strategy into our network-based approach. Empirical experiments over three standard Twitter benchmark data sets demonstrate that our approach outperforms state-of-the-art user geolocation methods.
    Source
    Journal of the Association for Information Science and Technology. 69(2018) no.7, S.879-889
    Type
    a
  3. Gaizauskas, R.; Wilks, Y.: Information extraction : beyond document retrieval (1998) 0.01
    0.0077931583 = product of:
      0.019482896 = sum of:
        0.0100103095 = weight(_text_:a in 4716) [ClassicSimilarity], result of:
          0.0100103095 = score(doc=4716,freq=12.0), product of:
            0.053464882 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046368346 = queryNorm
            0.18723148 = fieldWeight in 4716, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046875 = fieldNorm(doc=4716)
        0.009472587 = product of:
          0.018945174 = sum of:
            0.018945174 = weight(_text_:information in 4716) [ClassicSimilarity], result of:
              0.018945174 = score(doc=4716,freq=8.0), product of:
                0.08139861 = queryWeight, product of:
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046368346 = queryNorm
                0.23274569 = fieldWeight in 4716, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046875 = fieldNorm(doc=4716)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Abstract
    In this paper we give a synoptic view of the growth of the text processing technology of information extraction (IE) whose function is to extract information about a pre-specified set of entities, relations or events from natural language texts and to record this information in structured representations called templates. Here we describe the nature of the IE task, review the history of the area from its origins in AI work in the 1960s and 70s till the present, discuss the techniques being used to carry out the task, describe application areas where IE systems are or are about to be at work, and conclude with a discussion of the challenges facing the area. What emerges is a picture of an exciting new text processing technology with a host of new applications, both on its own and in conjunction with other technologies, such as information retrieval, machine translation and data mining.
    Type
    a
  4. Baeza-Yates, R.; Hurtado, C.; Mendoza, M.: Improving search engines by query clustering (2007) 0.01
    0.007642546 = product of:
      0.019106366 = sum of:
        0.009535614 = weight(_text_:a in 601) [ClassicSimilarity], result of:
          0.009535614 = score(doc=601,freq=8.0), product of:
            0.053464882 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046368346 = queryNorm
            0.17835285 = fieldWeight in 601, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.0546875 = fieldNorm(doc=601)
        0.009570752 = product of:
          0.019141505 = sum of:
            0.019141505 = weight(_text_:information in 601) [ClassicSimilarity], result of:
              0.019141505 = score(doc=601,freq=6.0), product of:
                0.08139861 = queryWeight, product of:
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046368346 = queryNorm
                0.23515764 = fieldWeight in 601, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=601)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Abstract
    In this paper, we present a framework for clustering Web search engine queries whose aim is to identify groups of queries used to search for similar information on the Web. The framework is based on a novel term vector model of queries that integrates user selections and the content of selected documents extracted from the logs of a search engine. The query representation obtained allows us to treat query clustering similarly to standard document clustering. We study the application of the clustering framework to two problems: relevance ranking boosting and query recommendation. Finally, we evaluate with experiments the effectiveness of our approach.
    Footnote
    Contribution to the thematic focus "Mining Web resources for enhancing information retrieval"
    Source
    Journal of the American Society for Information Science and Technology. 58(2007) no.12, S.1793-1804
    Type
    a
  5. Chen, H.; Chau, M.: Web mining : machine learning for Web applications (2003) 0.01
    0.007505624 = product of:
      0.01876406 = sum of:
        0.008173384 = weight(_text_:a in 4242) [ClassicSimilarity], result of:
          0.008173384 = score(doc=4242,freq=8.0), product of:
            0.053464882 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046368346 = queryNorm
            0.15287387 = fieldWeight in 4242, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046875 = fieldNorm(doc=4242)
        0.010590675 = product of:
          0.02118135 = sum of:
            0.02118135 = weight(_text_:information in 4242) [ClassicSimilarity], result of:
              0.02118135 = score(doc=4242,freq=10.0), product of:
                0.08139861 = queryWeight, product of:
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046368346 = queryNorm
                0.2602176 = fieldWeight in 4242, product of:
                  3.1622777 = tf(freq=10.0), with freq of:
                    10.0 = termFreq=10.0
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046875 = fieldNorm(doc=4242)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Abstract
    With more than two billion pages created by millions of Web page authors and organizations, the World Wide Web is a tremendously rich knowledge base. The knowledge comes not only from the content of the pages themselves, but also from the unique characteristics of the Web, such as its hyperlink structure and its diversity of content and languages. Analysis of these characteristics often reveals interesting patterns and new knowledge. Such knowledge can be used to improve users' efficiency and effectiveness in searching for information on the Web, and also for applications unrelated to the Web, such as support for decision making or business management. The Web's size and its unstructured and dynamic content, as well as its multilingual nature, make the extraction of useful knowledge a challenging research problem. Furthermore, the Web generates a large amount of data in other formats that contain valuable information. For example, Web server logs' information about user access patterns can be used for information personalization or improving Web page design.
    Source
    Annual review of information science and technology. 38(2004), S.289-330
    Type
    a
  6. O'Brien, H.L.; Lebow, M.: Mixed-methods approach to measuring user experience in online news interactions (2013) 0.01
    0.007471291 = product of:
      0.018678227 = sum of:
        0.009010308 = weight(_text_:a in 1001) [ClassicSimilarity], result of:
          0.009010308 = score(doc=1001,freq=14.0), product of:
            0.053464882 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046368346 = queryNorm
            0.1685276 = fieldWeight in 1001, product of:
              3.7416575 = tf(freq=14.0), with freq of:
                14.0 = termFreq=14.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.0390625 = fieldNorm(doc=1001)
        0.009667919 = product of:
          0.019335838 = sum of:
            0.019335838 = weight(_text_:information in 1001) [ClassicSimilarity], result of:
              0.019335838 = score(doc=1001,freq=12.0), product of:
                0.08139861 = queryWeight, product of:
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046368346 = queryNorm
                0.23754507 = fieldWeight in 1001, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=1001)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Abstract
    When it comes to evaluating online information experiences, what metrics matter? We conducted a study in which 30 people browsed and selected content within an online news website. Data collected included psychometric scales (User Engagement, Cognitive Absorption, System Usability Scales), self-reported interest in news content, and performance metrics (i.e., reading time, browsing time, total time, number of pages visited, and use of recommended links); a subset of the participants had their physiological responses recorded during the interaction (i.e., heart rate, electrodermal activity, electromyogram). Findings demonstrated the concurrent validity of the psychometric scales and interest ratings and revealed that increased time on tasks, number of pages visited, and use of recommended links were not necessarily indicative of greater self-reported engagement, cognitive absorption, or perceived usability. Positive ratings of news content were associated with lower physiological activity. The implications of this research are twofold. First, we propose that user experience is a useful framework for studying online information interactions and will result in a broader conceptualization of information interaction and its evaluation. Second, we advocate a mixed-methods approach to measurement that employs a suite of metrics capable of capturing the pragmatic (e.g., usability) and hedonic (e.g., fun, engagement) aspects of information interactions. We underscore the importance of using multiple measures in information research, because our results emphasize that performance and physiological data must be interpreted in the context of users' subjective experiences.
    Source
    Journal of the American Society for Information Science and Technology. 64(2013) no.8, S.1543-1556
    Type
    a
  7. Chardonnens, A.; Hengchen, S.: Text mining for cultural heritage institutions : a 5-step method for cultural heritage institutions (2017) 0.01
    0.0073474604 = product of:
      0.01836865 = sum of:
        0.009437811 = weight(_text_:a in 646) [ClassicSimilarity], result of:
          0.009437811 = score(doc=646,freq=6.0), product of:
            0.053464882 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046368346 = queryNorm
            0.17652355 = fieldWeight in 646, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.0625 = fieldNorm(doc=646)
        0.0089308405 = product of:
          0.017861681 = sum of:
            0.017861681 = weight(_text_:information in 646) [ClassicSimilarity], result of:
              0.017861681 = score(doc=646,freq=4.0), product of:
                0.08139861 = queryWeight, product of:
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046368346 = queryNorm
                0.21943474 = fieldWeight in 646, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.0625 = fieldNorm(doc=646)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Source
    Everything changes, everything stays the same? - Understanding information spaces : Proceedings of the 15th International Symposium of Information Science (ISI 2017), Berlin/Germany, 13th - 15th March 2017. Eds.: M. Gäde, V. Trkulja u. V. Petras
    Type
    a
  8. Ekbia, H.; Mattioli, M.; Kouper, I.; Arave, G.; Ghazinejad, A.; Bowman, T.; Suri, V.R.; Tsou, A.; Weingart, S.; Sugimoto, C.R.: Big data, bigger dilemmas : a critical review (2015) 0.01
    0.00732971 = product of:
      0.018324275 = sum of:
        0.0127425 = weight(_text_:a in 2155) [ClassicSimilarity], result of:
          0.0127425 = score(doc=2155,freq=28.0), product of:
            0.053464882 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046368346 = queryNorm
            0.23833402 = fieldWeight in 2155, product of:
              5.2915025 = tf(freq=28.0), with freq of:
                28.0 = termFreq=28.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2155)
        0.0055817757 = product of:
          0.011163551 = sum of:
            0.011163551 = weight(_text_:information in 2155) [ClassicSimilarity], result of:
              0.011163551 = score(doc=2155,freq=4.0), product of:
                0.08139861 = queryWeight, product of:
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046368346 = queryNorm
                0.13714671 = fieldWeight in 2155, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2155)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Abstract
    The recent interest in Big Data has generated a broad range of new academic, corporate, and policy practices along with an evolving debate among its proponents, detractors, and skeptics. While the practices draw on a common set of tools, techniques, and technologies, most contributions to the debate come either from a particular disciplinary perspective or with a focus on a domain-specific issue. A close examination of these contributions reveals a set of common problematics that arise in various guises and in different places. It also demonstrates the need for a critical synthesis of the conceptual and practical dilemmas surrounding Big Data. The purpose of this article is to provide such a synthesis by drawing on relevant writings in the sciences, humanities, policy, and trade literature. In bringing these diverse literatures together, we aim to shed light on the common underlying issues that concern and affect all of these areas. By contextualizing the phenomenon of Big Data within larger socioeconomic developments, we also seek to provide a broader understanding of its drivers, barriers, and challenges. This approach allows us to identify attributes of Big Data that require more attention (autonomy, opacity, generativity, disparity, and futurity), leading to questions and ideas for moving beyond dilemmas.
    Series
    Advances in information science
    Source
    Journal of the Association for Information Science and Technology. 66(2015) no.8, S.1523-1545
    Type
    a
  9. Bell, D.A.; Guan, J.W.: Computational methods for rough classification and discovery (1998) 0.01
    0.0072560436 = product of:
      0.01814011 = sum of:
        0.012614433 = weight(_text_:a in 2909) [ClassicSimilarity], result of:
          0.012614433 = score(doc=2909,freq=14.0), product of:
            0.053464882 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046368346 = queryNorm
            0.23593865 = fieldWeight in 2909, product of:
              3.7416575 = tf(freq=14.0), with freq of:
                14.0 = termFreq=14.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2909)
        0.005525676 = product of:
          0.011051352 = sum of:
            0.011051352 = weight(_text_:information in 2909) [ClassicSimilarity], result of:
              0.011051352 = score(doc=2909,freq=2.0), product of:
                0.08139861 = queryWeight, product of:
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046368346 = queryNorm
                0.13576832 = fieldWeight in 2909, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=2909)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Abstract
    Rough set theory is a mathematical tool to deal with vagueness and uncertainty. To apply the theory, it needs to be associated with efficient and effective computational methods. A relation can be used to represent a decision table for use in decision making. By using this kind of table, rough set theory can be applied successfully to rough classification and knowledge discovery. Presents computational methods for using rough sets to identify classes in datasets, finding dependencies in relations, and discovering rules which are hidden in databases. Illustrates the methods with a running example from a database of car test results.
    Footnote
    Contribution to a special issue devoted to knowledge discovery and data mining
    Source
    Journal of the American Society for Information Science. 49(1998) no.5, S.403-414
    Type
    a
  10. Varathan, K.D.; Giachanou, A.; Crestani, F.: Comparative opinion mining : a review (2017) 0.01
    0.0072525083 = product of:
      0.018131271 = sum of:
        0.01129502 = weight(_text_:a in 3540) [ClassicSimilarity], result of:
          0.01129502 = score(doc=3540,freq=22.0), product of:
            0.053464882 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046368346 = queryNorm
            0.21126054 = fieldWeight in 3540, product of:
              4.690416 = tf(freq=22.0), with freq of:
                22.0 = termFreq=22.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3540)
        0.006836252 = product of:
          0.013672504 = sum of:
            0.013672504 = weight(_text_:information in 3540) [ClassicSimilarity], result of:
              0.013672504 = score(doc=3540,freq=6.0), product of:
                0.08139861 = queryWeight, product of:
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046368346 = queryNorm
                0.16796975 = fieldWeight in 3540, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3540)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Abstract
    Opinion mining refers to the use of natural language processing, text analysis, and computational linguistics to identify and extract subjective information in textual material. Opinion mining, also known as sentiment analysis, has received a lot of attention in recent times, as it provides a number of tools to analyze public opinion on a number of different topics. Comparative opinion mining is a subfield of opinion mining which deals with identifying and extracting information that is expressed in a comparative form (e.g., "paper X is better than the Y"). Comparative opinion mining plays a very important role when one tries to evaluate something because it provides a reference point for the comparison. This paper provides a review of the area of comparative opinion mining. It is the first review that covers specifically this topic as all previous reviews dealt mostly with general opinion mining. This survey covers comparative opinion mining from two different angles. One from the perspective of techniques and the other from the perspective of comparative opinion elements. It also incorporates preprocessing tools as well as data sets that were used by past researchers that can be useful to future researchers in the field of comparative opinion mining.
    Source
    Journal of the Association for Information Science and Technology. 68(2017) no.4, S.811-829
    Type
    a
  11. Liu, Y.; Zhang, M.; Cen, R.; Ru, L.; Ma, S.: Data cleansing for Web information retrieval using query independent features (2007) 0.01
    0.0072442205 = product of:
      0.018110551 = sum of:
        0.01021673 = weight(_text_:a in 607) [ClassicSimilarity], result of:
          0.01021673 = score(doc=607,freq=18.0), product of:
            0.053464882 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046368346 = queryNorm
            0.19109234 = fieldWeight in 607, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.0390625 = fieldNorm(doc=607)
        0.007893822 = product of:
          0.015787644 = sum of:
            0.015787644 = weight(_text_:information in 607) [ClassicSimilarity], result of:
              0.015787644 = score(doc=607,freq=8.0), product of:
                0.08139861 = queryWeight, product of:
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046368346 = queryNorm
                0.19395474 = fieldWeight in 607, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=607)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Abstract
    Understanding what kinds of Web pages are the most useful for Web search engine users is a critical task in Web information retrieval (IR). Most previous works used hyperlink analysis algorithms to solve this problem. However, little research has been focused on query-independent Web data cleansing for Web IR. In this paper, we first provide analysis of the differences between retrieval target pages and ordinary ones based on more than 30 million Web pages obtained from both the Text Retrieval Conference (TREC) and a widely used Chinese search engine, SOGOU (www.sogou.com). We further propose a learning-based data cleansing algorithm for reducing Web pages that are unlikely to be useful for user requests. We found that there exists a large proportion of low-quality Web pages in both the English and the Chinese Web page corpus, and retrieval target pages can be identified using query-independent features and cleansing algorithms. The experimental results showed that our algorithm is effective in reducing a large portion of Web pages with a small loss in retrieval target pages. It makes it possible for Web IR tools to meet a large fraction of users' needs with only a small part of pages on the Web. These results may help Web search engines make better use of their limited storage and computation resources to improve search performance.
    Footnote
    Contribution to the thematic focus "Mining Web resources for enhancing information retrieval"
    Source
    Journal of the American Society for Information Science and Technology. 58(2007) no.12, S.1884-1898
    Type
    a
  12. Knowledge discovery and data mining (1998) 0.01
    0.007058388 = product of:
      0.01764597 = sum of:
        0.008173384 = weight(_text_:a in 2898) [ClassicSimilarity], result of:
          0.008173384 = score(doc=2898,freq=2.0), product of:
            0.053464882 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046368346 = queryNorm
            0.15287387 = fieldWeight in 2898, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.09375 = fieldNorm(doc=2898)
        0.009472587 = product of:
          0.018945174 = sum of:
            0.018945174 = weight(_text_:information in 2898) [ClassicSimilarity], result of:
              0.018945174 = score(doc=2898,freq=2.0), product of:
                0.08139861 = queryWeight, product of:
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046368346 = queryNorm
                0.23274569 = fieldWeight in 2898, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.09375 = fieldNorm(doc=2898)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Footnote
    A special issue devoted to knowledge discovery and data mining
    Source
    Journal of the American Society for Information Science. 49(1998) no.5, S.397-470
  13. Chen, C.-C.; Chen, A.-P.: Using data mining technology to provide a recommendation service in the digital library (2007) 0.01
    0.0070422525 = product of:
      0.01760563 = sum of:
        0.010769378 = weight(_text_:a in 2533) [ClassicSimilarity], result of:
          0.010769378 = score(doc=2533,freq=20.0), product of:
            0.053464882 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046368346 = queryNorm
            0.20142901 = fieldWeight in 2533, product of:
              4.472136 = tf(freq=20.0), with freq of:
                20.0 = termFreq=20.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.0390625 = fieldNorm(doc=2533)
        0.006836252 = product of:
          0.013672504 = sum of:
            0.013672504 = weight(_text_:information in 2533) [ClassicSimilarity], result of:
              0.013672504 = score(doc=2533,freq=6.0), product of:
                0.08139861 = queryWeight, product of:
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046368346 = queryNorm
                0.16796975 = fieldWeight in 2533, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2533)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Abstract
    Purpose - Since library storage has been increasing day by day, it is difficult for readers to find the books which interest them as well as representative booklists. How to utilize meaningful information effectively to improve the service quality of the digital library appears to be very important. The purpose of this paper is to provide a recommendation system architecture to promote digital library services in electronic libraries. Design/methodology/approach - In the proposed architecture, a two-phase data mining process used by association rule and clustering methods is designed to generate a recommendation system. The process considers not only the relationship of a cluster of users but also the associations among the information accessed. Findings - The process considered not only the relationship of a cluster of users but also the associations among the information accessed. With the advanced filter, the recommendation supported by the proposed system architecture would be closely served to meet users' needs. Originality/value - This paper not only constructs a recommendation service for readers to search books from the web but takes the initiative in finding the most suitable books for readers as well. Furthermore, library managers are expected to purchase core and hot books from a limited budget to maintain and satisfy the requirements of readers along with promoting digital library services.
    Type
    a
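The score breakdown for result 13 above can be reproduced by hand. The following is a minimal sketch, assuming the standard Lucene ClassicSimilarity formulas (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))). The function names `idf` and `term_score` are illustrative and not part of any library. All input numbers are copied from the listing.

```python
import math

def idf(doc_freq, max_docs):
    # ClassicSimilarity inverse document frequency
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, max_docs, query_norm, field_norm):
    # score = queryWeight * fieldWeight, as in the explain tree
    tf = math.sqrt(freq)
    term_idf = idf(doc_freq, max_docs)
    query_weight = term_idf * query_norm          # idf * queryNorm
    field_weight = tf * term_idf * field_norm     # tf * idf * fieldNorm
    return query_weight * field_weight

# Values taken from the explain output for doc=2533 (result 13)
QUERY_NORM = 0.046368346
MAX_DOCS = 44218
FIELD_NORM = 0.0390625

# Term "a": freq=20, docFreq=37942
score_a = term_score(20.0, 37942, MAX_DOCS, QUERY_NORM, FIELD_NORM)

# Term "information": freq=6, docFreq=20772, inner coord(1/2)
score_info = term_score(6.0, 20772, MAX_DOCS, QUERY_NORM, FIELD_NORM) * 0.5

# Outer coord(2/5): two of five query clauses matched
total = (score_a + score_info) * (2 / 5)

print(round(total, 7))
```

The result should agree with the listed 0.0070422525 to about six significant digits. Lucene computes in 32-bit floats, so the last digits may differ from this double-precision sketch.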
  14. Saz, J.T.: Perspectivas en recuperacion y explotacion de informacion electronica : el 'data mining' (1997) 0.01
    0.0070104985 = product of:
      0.017526247 = sum of:
        0.009632425 = weight(_text_:a in 3723) [ClassicSimilarity], result of:
          0.009632425 = score(doc=3723,freq=4.0), product of:
            0.053464882 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046368346 = queryNorm
            0.18016359 = fieldWeight in 3723, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.078125 = fieldNorm(doc=3723)
        0.007893822 = product of:
          0.015787644 = sum of:
            0.015787644 = weight(_text_:information in 3723) [ClassicSimilarity], result of:
              0.015787644 = score(doc=3723,freq=2.0), product of:
                0.08139861 = queryWeight, product of:
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046368346 = queryNorm
                0.19395474 = fieldWeight in 3723, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.078125 = fieldNorm(doc=3723)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Abstract
    Presents the concept and the techniques identified by the term data mining. Explains the principles and phases of developing a data mining process, and the main types of data mining tools.
    Footnote
    Translation of the title: Perspectives on the retrieval and exploitation of electronic information: data mining
    Type
    a
  15. Li, J.; Zhang, P.; Cao, J.: External concept support for group support systems through Web mining (2009) 0.01
    0.0069366493 = product of:
      0.017341623 = sum of:
        0.009138121 = weight(_text_:a in 2806) [ClassicSimilarity], result of:
          0.009138121 = score(doc=2806,freq=10.0), product of:
            0.053464882 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046368346 = queryNorm
            0.1709182 = fieldWeight in 2806, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046875 = fieldNorm(doc=2806)
        0.008203502 = product of:
          0.016407004 = sum of:
            0.016407004 = weight(_text_:information in 2806) [ClassicSimilarity], result of:
              0.016407004 = score(doc=2806,freq=6.0), product of:
                0.08139861 = queryWeight, product of:
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046368346 = queryNorm
                0.20156369 = fieldWeight in 2806, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2806)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Abstract
    External information plays an important role in group decision-making processes, yet research about external information support for Group Support Systems (GSS) has been lacking. In this study, we propose an approach to build a concept space to provide external concept support for GSS users. Built on a Web mining algorithm, the approach can mine a concept space from the Web and retrieve related concepts from the concept space based on users' comments in a real-time manner. We conduct two experiments to evaluate the quality of the proposed approach and the effectiveness of the external concept support provided by this approach. The experiment results indicate that the concept space mined from the Web contained qualified concepts to stimulate divergent thinking. The results also demonstrate that external concept support in GSS greatly enhanced group productivity for idea generation tasks.
    Source
    Journal of the American Society for Information Science and Technology. 60(2009) no.5, S.1057-1070
    Type
    a
  16. Peters, G.; Gaese, V.: ¬Das DocCat-System in der Textdokumentation von G+J (2003) 0.01
    0.006913379 = product of:
      0.017283447 = sum of:
        0.0047189053 = weight(_text_:a in 1507) [ClassicSimilarity], result of:
          0.0047189053 = score(doc=1507,freq=6.0), product of:
            0.053464882 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046368346 = queryNorm
            0.088261776 = fieldWeight in 1507, product of:
              2.4494898 = tf(freq=6.0), with freq of:
                6.0 = termFreq=6.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.03125 = fieldNorm(doc=1507)
        0.012564542 = product of:
          0.025129084 = sum of:
            0.025129084 = weight(_text_:22 in 1507) [ClassicSimilarity], result of:
              0.025129084 = score(doc=1507,freq=2.0), product of:
                0.16237405 = queryWeight, product of:
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.046368346 = queryNorm
                0.15476047 = fieldWeight in 1507, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  3.5018296 = idf(docFreq=3622, maxDocs=44218)
                  0.03125 = fieldNorm(doc=1507)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Abstract
    We will first present the foundations of IBM's text mining system, and then describe the project in more detail and more clearly, since that is what we know well. We therefore have two parts: one on Heidelberg, one on Hamburg. First, the technology. Text mining is a technology developed by IBM that was assembled for us in a special configuration and programming. For a long time the project was called DocText Miner here; for some time now, at IBM's suggestion, it has been called DocCat, which is meant to be short for Document-Categoriser, and it is a pleasant and descriptive name. We begin with text mining as developed at IBM in Heidelberg. They regard automatic indexing as one instance, that is, one part, of text mining. Problems are shown along the way, and text mining is a method for structuring and searching large document collections, and for extracting information and, this being the ambitious claim, implicit relationships. The latter remains an open question. IBM does this quantitatively, empirically, approximately, and quickly, it has to be said. The goal, and this was very important for our project, is not to understand the text; rather, the result of these procedures is what is called "a bundle of words" or "a bag of words", that is, a set of meaning-bearing terms extracted from a text by algorithms, essentially by computational operations. There is a good deal of preliminary linguistic work, and some linguistics is involved, but it is not the basis of the whole thing. What they did for us is the annotation of press texts for our press database. For those who do not know it yet: Gruner + Jahr has maintained a text documentation department with a database since the early 1970s; it currently holds about 6.5 million documents, of which somewhat more than 1 million are full texts from 1993 onward.
For a long time the principle was that we indexed the documents stored in the database with subject keywords, and we continued this principle in a reduced form even after full text was introduced. These 6.5 million documents also include roughly 10 million facsimile pages, because we still keep the facsimiles as standard practice.
    Date
    22. 4.2003 11:45:36
    Type
    a
  17. Doran, D.; Gokhale, S.S.: ¬A classification framework for web robots (2012) 0.01
    0.0068851607 = product of:
      0.017212901 = sum of:
        0.010897844 = weight(_text_:a in 505) [ClassicSimilarity], result of:
          0.010897844 = score(doc=505,freq=8.0), product of:
            0.053464882 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046368346 = queryNorm
            0.20383182 = fieldWeight in 505, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.0625 = fieldNorm(doc=505)
        0.006315058 = product of:
          0.012630116 = sum of:
            0.012630116 = weight(_text_:information in 505) [ClassicSimilarity], result of:
              0.012630116 = score(doc=505,freq=2.0), product of:
                0.08139861 = queryWeight, product of:
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046368346 = queryNorm
                0.1551638 = fieldWeight in 505, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.0625 = fieldNorm(doc=505)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Abstract
    The behavior of modern web robots varies widely when they crawl for different purposes. In this article, we present a framework to classify these web robots from two orthogonal perspectives, namely, their functionality and the types of resources they consume. Applying the classification framework to a year-long access log from the UConn SoE web server, we present trends that point to significant differences in their crawling behavior.
    Source
    Journal of the American Society for Information Science and Technology. 63(2012) no.12, S.2549-2554
    Type
    a
  18. Wong, S.K.M.; Butz, C.J.; Xiang, X.: Automated database schema design using mined data dependencies (1998) 0.01
    0.0068817483 = product of:
      0.01720437 = sum of:
        0.011678694 = weight(_text_:a in 2897) [ClassicSimilarity], result of:
          0.011678694 = score(doc=2897,freq=12.0), product of:
            0.053464882 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046368346 = queryNorm
            0.21843673 = fieldWeight in 2897, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2897)
        0.005525676 = product of:
          0.011051352 = sum of:
            0.011051352 = weight(_text_:information in 2897) [ClassicSimilarity], result of:
              0.011051352 = score(doc=2897,freq=2.0), product of:
                0.08139861 = queryWeight, product of:
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046368346 = queryNorm
                0.13576832 = fieldWeight in 2897, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=2897)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Abstract
    Data dependencies are used in database schema design to enforce the correctness of a database as well as to reduce redundant data. These dependencies are usually determined from the semantics of the attributes and are then enforced upon the relations. Describes a bottom-up procedure for discovering multivalued dependencies in observed data without knowing a priori the relationships among the attributes. The proposed algorithm is an application of the technique designed for learning conditional independencies in probabilistic reasoning. A prototype system for automated database schema design has been implemented. Experiments were carried out to demonstrate both the effectiveness and efficiency of the method.
    Footnote
    Contribution to a special issue devoted to knowledge discovery and data mining
    Source
    Journal of the American Society for Information Science. 49(1998) no.5, S.455-470
    Type
    a
  19. Kong, S.; Ye, F.; Feng, L.; Zhao, Z.: Towards the prediction problems of bursting hashtags on Twitter (2015) 0.01
    0.0068817483 = product of:
      0.01720437 = sum of:
        0.011678694 = weight(_text_:a in 2338) [ClassicSimilarity], result of:
          0.011678694 = score(doc=2338,freq=12.0), product of:
            0.053464882 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046368346 = queryNorm
            0.21843673 = fieldWeight in 2338, product of:
              3.4641016 = tf(freq=12.0), with freq of:
                12.0 = termFreq=12.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.0546875 = fieldNorm(doc=2338)
        0.005525676 = product of:
          0.011051352 = sum of:
            0.011051352 = weight(_text_:information in 2338) [ClassicSimilarity], result of:
              0.011051352 = score(doc=2338,freq=2.0), product of:
                0.08139861 = queryWeight, product of:
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046368346 = queryNorm
                0.13576832 = fieldWeight in 2338, product of:
                  1.4142135 = tf(freq=2.0), with freq of:
                    2.0 = termFreq=2.0
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=2338)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Abstract
    Hundreds of thousands of hashtags are generated every day on Twitter. Only a few will burst and become trending topics. In this article, we provide the definition of a bursting hashtag and conduct a systematic study of a series of challenging prediction problems that span the entire life cycles of bursting hashtags. Around the problem of "how to build a system to predict bursting hashtags," we explore different types of features and present machine learning solutions. On real data sets from Twitter, experiments are conducted to evaluate the effectiveness of the proposed solutions and the contributions of features.
    Source
    Journal of the Association for Information Science and Technology. 66(2015) no.12, S.2566-2579
    Type
    a
  20. Haravu, L.J.; Neelameghan, A.: Text mining and data mining in knowledge organization and discovery : the making of knowledge-based products (2003) 0.01
    0.006821193 = product of:
      0.017052982 = sum of:
        0.01021673 = weight(_text_:a in 5653) [ClassicSimilarity], result of:
          0.01021673 = score(doc=5653,freq=18.0), product of:
            0.053464882 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046368346 = queryNorm
            0.19109234 = fieldWeight in 5653, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.0390625 = fieldNorm(doc=5653)
        0.006836252 = product of:
          0.013672504 = sum of:
            0.013672504 = weight(_text_:information in 5653) [ClassicSimilarity], result of:
              0.013672504 = score(doc=5653,freq=6.0), product of:
                0.08139861 = queryWeight, product of:
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.046368346 = queryNorm
                0.16796975 = fieldWeight in 5653, product of:
                  2.4494898 = tf(freq=6.0), with freq of:
                    6.0 = termFreq=6.0
                  1.7554779 = idf(docFreq=20772, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=5653)
          0.5 = coord(1/2)
      0.4 = coord(2/5)
    
    Abstract
    Discusses the importance of knowledge organization in the context of the information overload caused by the vast quantities of data and information accessible on internal and external networks of an organization. Defines the characteristics of a knowledge-based product. Elaborates on the techniques and applications of text mining in developing knowledge products. Presents two approaches, as case studies, to the making of knowledge products: (1) steps and processes in the planning, designing and development of a composite multilingual multimedia CD product, with the potential international, inter-cultural end users in view, and (2) application of natural language processing software in text mining. Using a text mining software, it is possible to link concept terms from a processed text to a related thesaurus, glossary, schedules of a classification scheme, and facet structured subject representations. Concludes that the products of text mining and data mining could be made more useful if the features of a faceted scheme for subject classification are incorporated into text mining techniques and products.
    Content
    Contribution to a special issue "Knowledge organization and classification in international information retrieval"
    Type
    a

Languages

  • e 128
  • d 33
  • sp 1

Types

  • a 141
  • el 15
  • m 15
  • s 15
  • x 1