Search (9 results, page 1 of 1)

  • author_ss:"Yang, Y."
  1. Yang, Y.; Liu, X.: ¬A re-examination of text categorization methods (1999) 0.00
    0.0026061484 = product of:
      0.0052122967 = sum of:
        0.0052122967 = product of:
          0.010424593 = sum of:
            0.010424593 = weight(_text_:a in 3386) [ClassicSimilarity], result of:
              0.010424593 = score(doc=3386,freq=12.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.21843673 = fieldWeight in 3386, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=3386)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    This paper reports a controlled study with statistical significance tests on five text categorization methods: the Support Vector Machines (SVM), a k-Nearest Neighbor (kNN) classifier, a neural network (NNet) approach, the Linear Least Squares Fit (LLSF) mapping and a Naive Bayes (NB) classifier. We focus on the robustness of these methods in dealing with a skewed category distribution, and their performance as a function of the training-set category frequency. Our results show that SVM, kNN and LLSF significantly outperform NNet and NB when the number of positive training instances per category is small (less than ten), and that all the methods perform comparably when the categories are sufficiently common (over 300 instances).
  2. Yang, Y.; Chute, C.G.A.: ¬A schematic analysis of the Unified Medical Language System (1992) 0.00
    0.002579418 = product of:
      0.005158836 = sum of:
        0.005158836 = product of:
          0.010317672 = sum of:
            0.010317672 = weight(_text_:a in 6445) [ClassicSimilarity], result of:
              0.010317672 = score(doc=6445,freq=4.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.2161963 = fieldWeight in 6445, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.09375 = fieldNorm(doc=6445)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Type
    a
  3. Yang, Y.; Wilbur, J.: Using corpus statistics to remove redundant words in text categorization (1996) 0.00
    0.0024128247 = product of:
      0.0048256493 = sum of:
        0.0048256493 = product of:
          0.009651299 = sum of:
            0.009651299 = weight(_text_:a in 4199) [ClassicSimilarity], result of:
              0.009651299 = score(doc=4199,freq=14.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.20223314 = fieldWeight in 4199, product of:
                  3.7416575 = tf(freq=14.0), with freq of:
                    14.0 = termFreq=14.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046875 = fieldNorm(doc=4199)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    This article studies aggressive word removal in text categorization to reduce the noise in free texts and to enhance the computational efficiency of categorization. We use a novel stop word identification method to automatically generate domain-specific stoplists which are much larger than a conventional domain-independent stoplist. In our tests with 3 categorization methods on text collections from different domains/applications, significant numbers of words were removed without sacrificing categorization effectiveness. In the test of the Expert Network method on CACM documents, for example, an 87% removal of unique words reduced the vocabulary of documents from 8,002 distinct words to 1,045 words, which resulted in a 63% time savings and a 74% memory savings in the computation of category ranking, with a 10% precision improvement on average over not using word removal. It is evident in this study that automated word removal based on corpus statistics has a practical and significant impact on the computational tractability of categorization methods in large databases.
    Type
    a
  4. Ortiz-Cordova, A.; Yang, Y.; Jansen, B.J.: External to internal search : associating searching on search engines with searching on sites (2015) 0.00
    0.002279905 = product of:
      0.00455981 = sum of:
        0.00455981 = product of:
          0.00911962 = sum of:
            0.00911962 = weight(_text_:a in 2675) [ClassicSimilarity], result of:
              0.00911962 = score(doc=2675,freq=18.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.19109234 = fieldWeight in 2675, product of:
                  4.2426405 = tf(freq=18.0), with freq of:
                    18.0 = termFreq=18.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2675)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    We analyze the transitions from external search, searching on web search engines, to internal search, searching on websites. We categorize 295,571 search episodes composed of a query submitted to web search engines and the subsequent queries submitted to a single website search by the same users. There are a total of 1,136,390 queries from all searches, of which 295,571 are external search queries and 840,819 are internal search queries. We algorithmically classify queries into states and then use n-grams to categorize search patterns. We cluster the searching episodes into major patterns and identify the most commonly occurring, which are: (1) Explorers (43% of all patterns) with a broad external search query and then broad internal search queries, (2) Navigators (15%) with an external search query containing a URL component and then specific internal search queries, and (3) Shifters (15%) with different, seemingly unrelated query types when transitioning from external to internal search. The implications of this research are that external search and internal search sessions are part of a single search episode and that online businesses can leverage these search episodes to more effectively target potential customers.
    Type
    a
  5. Mao, J.; Xu, W.; Yang, Y.; Wang, J.; Yuille, A.L.: Explain images with multimodal recurrent neural networks (2014) 0.00
    0.0022338415 = product of:
      0.004467683 = sum of:
        0.004467683 = product of:
          0.008935366 = sum of:
            0.008935366 = weight(_text_:a in 1557) [ClassicSimilarity], result of:
              0.008935366 = score(doc=1557,freq=12.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.18723148 = fieldWeight in 1557, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046875 = fieldNorm(doc=1557)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel sentence descriptions to explain the content of images. It directly models the probability distribution of generating a word given previous words and the image. Image descriptions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on three benchmark datasets (IAPR TC-12 [8], Flickr 8K [28], and Flickr 30K [13]). Our model outperforms the state-of-the-art generative method. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.
    Type
    a
  6. Wang, P.; Berry, M.W.; Yang, Y.: Mining longitudinal Web queries : trends and patterns (2003) 0.00
    0.0021279112 = product of:
      0.0042558224 = sum of:
        0.0042558224 = product of:
          0.008511645 = sum of:
            0.008511645 = weight(_text_:a in 6561) [ClassicSimilarity], result of:
              0.008511645 = score(doc=6561,freq=8.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.17835285 = fieldWeight in 6561, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0546875 = fieldNorm(doc=6561)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    This project analyzed 541,920 user queries submitted to and executed in an academic Website during a four-year period (May 1997 to May 2001) using a relational database. The purpose of the study is three-fold: (1) to understand Web users' query behavior; (2) to identify problems encountered by these Web users; (3) to develop appropriate techniques for optimization of query analysis and mining. The linguistic analyses focus on query structures, lexicon, and word associations using statistical measures such as Zipf distribution and mutual information. A data model with finest granularity is used for data storage and iterative analyses. Patterns and trends of querying behavior are identified and compared with previous studies.
    Type
    a
  7. He, D.; Brusilovsky, P.; Ahn, J.; Grady, J.; Farzan, R.; Peng, Y.; Yang, Y.; Rogati, M.: ¬An evaluation of adaptive filtering in the context of realistic task-based information exploration (2008) 0.00
    0.0018615347 = product of:
      0.0037230693 = sum of:
        0.0037230693 = product of:
          0.0074461387 = sum of:
            0.0074461387 = weight(_text_:a in 2048) [ClassicSimilarity], result of:
              0.0074461387 = score(doc=2048,freq=12.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.15602624 = fieldWeight in 2048, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2048)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Exploratory search is increasingly becoming an important research topic. Our interests focus on task-based information exploration, a specific type of exploratory search performed by a range of professional users, such as intelligence analysts. In this paper, we present an evaluation framework designed specifically for assessing and comparing performance of innovative information access tools created to support the work of intelligence analysts in the context of task-based information exploration. The motivation for the development of this framework came from our needs for testing systems in task-based information exploration, which cannot be satisfied by existing frameworks. The new framework is closely tied with the kind of tasks that intelligence analysts perform: complex, dynamic, multifaceted, and multistage. It views the user rather than the information system as the center of the evaluation, and examines how well users are served by the systems in their tasks. The evaluation framework examines the support of the systems at users' major information access stages, such as information foraging and sense-making. The framework is accompanied by a reference test collection that has 18 task scenarios and corresponding passage-level ground truth annotations. To demonstrate the usage of the framework and the reference test collection, we present a specific evaluation study on CAFÉ, an adaptive filtering engine designed for supporting task-based information exploration. This study is a successful use case of the framework, and the study indeed revealed various aspects of the information systems and their roles in supporting task-based information exploration.
    Type
    a
  8. Yang, Y.; Lu, Q.; Zhao, T.: ¬A delimiter-based general approach for Chinese term extraction (2009) 0.00
    0.0015199365 = product of:
      0.003039873 = sum of:
        0.003039873 = product of:
          0.006079746 = sum of:
            0.006079746 = weight(_text_:a in 3315) [ClassicSimilarity], result of:
              0.006079746 = score(doc=3315,freq=8.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.12739488 = fieldWeight in 3315, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3315)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    This article addresses a two-step approach for term extraction. In the first step on term candidate extraction, a new delimiter-based approach is proposed to identify features of the delimiters of term candidates rather than those of the term candidates themselves. This delimiter-based method is much more stable and domain independent than the previous approaches. In the second step on term verification, an algorithm using link analysis is applied to calculate the relevance between term candidates and the sentences from which the terms are extracted. All information is obtained from the working domain corpus without the need for prior domain knowledge. The approach is not targeted at any specific domain and there is no need for extensive training when applying it to new domains. In other words, the method is not domain dependent and it is especially useful for resource-limited domains. Evaluations of Chinese text in two different domains show quite significant improvements over existing techniques and also verify its efficiency and its relatively domain-independent nature. The proposed method is also very effective for extracting new terms so that it can serve as an efficient tool for updating domain knowledge, especially for expanding lexicons.
    Type
    a
  9. Wang, Y.; Tai, Y.; Yang, Y.: Determination of semantic types of tags in social tagging systems (2018) 0.00
    0.001289709 = product of:
      0.002579418 = sum of:
        0.002579418 = product of:
          0.005158836 = sum of:
            0.005158836 = weight(_text_:a in 4648) [ClassicSimilarity], result of:
              0.005158836 = score(doc=4648,freq=4.0), product of:
                0.04772363 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.041389145 = queryNorm
                0.10809815 = fieldWeight in 4648, product of:
                  2.0 = tf(freq=4.0), with freq of:
                    4.0 = termFreq=4.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046875 = fieldNorm(doc=4648)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    The purpose of this paper is to determine semantic types for tags in social tagging systems. In social tagging systems, the determination of the semantic type of tags plays an important role in tag classification, increasing the semantic information of tags and establishing mapping relations between tagged resources and a normed ontology. The research reported in this paper constructs the semantic type library that is needed based on the Unified Medical Language System (UMLS) and FrameNet and determines the semantic type of selected tags that have been pretreated via direct matching using the Semantic Navigator tool, the Semantic Type Word Sense Disambiguation (STWSD) tools in UMLS, and artificial matching. Finally, we verify the feasibility of the determination of semantic type for tags by empirical analysis.
    Type
    a
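
Note on the score breakdowns: each result above carries a ClassicSimilarity explanation tree whose numbers can be reproduced by hand. The sketch below recomputes the score of result 1 (doc=3386) from the values printed in its tree. It assumes the usual classic TF-IDF definitions, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which match the printed figures. The function names are hypothetical and chosen for illustration only, and queryNorm and fieldNorm are copied from the listing rather than derived.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency, assuming idf = 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))


def classic_score(freq, doc_freq, max_docs, query_norm, field_norm, coords=(0.5, 0.5)):
    """Recompute one leaf of the explanation tree and apply the coord factors."""
    tf = math.sqrt(freq)                     # tf(freq=12.0) -> 3.4641016
    idf = classic_idf(doc_freq, max_docs)    # idf(docFreq=37942, maxDocs=44218)
    query_weight = idf * query_norm          # queryWeight in the tree
    field_weight = tf * idf * field_norm     # fieldWeight in the tree
    score = query_weight * field_weight      # weight(_text_:a in 3386)
    for coord in coords:                     # two coord(1/2) factors in the tree
        score *= coord
    return score


# Values copied from the explanation of result 1 (doc=3386).
score = classic_score(
    freq=12.0,
    doc_freq=37942,
    max_docs=44218,
    query_norm=0.041389145,
    field_norm=0.0546875,
)
print(f"{score:.7f}")
```

The result should agree with the listed 0.0026061484 up to single-precision rounding. Because the term "a" occurs in most documents (docFreq=37942 of 44218), its idf is close to 1, so the ranking among these nine results is driven almost entirely by term frequency and field length normalization.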