Search (7 results, page 1 of 1)

  • author_ss:"Goharian, N."
  1. Mengle, S.; Goharian, N.: Passage detection using text classification (2009) 0.02
    0.020383961 = product of:
      0.040767923 = sum of:
        0.040767923 = sum of:
          0.009567685 = weight(_text_:a in 2765) [ClassicSimilarity], result of:
            0.009567685 = score(doc=2765,freq=16.0), product of:
              0.053105544 = queryWeight, product of:
                1.153047 = idf(docFreq=37942, maxDocs=44218)
                0.046056706 = queryNorm
              0.18016359 = fieldWeight in 2765, product of:
                4.0 = tf(freq=16.0), with freq of:
                  16.0 = termFreq=16.0
                1.153047 = idf(docFreq=37942, maxDocs=44218)
                0.0390625 = fieldNorm(doc=2765)
          0.03120024 = weight(_text_:22 in 2765) [ClassicSimilarity], result of:
            0.03120024 = score(doc=2765,freq=2.0), product of:
              0.16128273 = queryWeight, product of:
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.046056706 = queryNorm
              0.19345059 = fieldWeight in 2765, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.5018296 = idf(docFreq=3622, maxDocs=44218)
                0.0390625 = fieldNorm(doc=2765)
      0.5 = coord(1/2)
    
    Abstract
    Passages can be hidden within a text to circumvent their disallowed transfer. Such release of compartmentalized information is of concern to all corporate and governmental organizations. Passage retrieval is well studied; we posit, however, that passage detection is not. Passage retrieval is the determination of the degree of relevance of blocks of text, namely passages, comprising a document. Rather than determining the relevance of a document in its entirety, passage retrieval determines the relevance of the individual passages. As such, modified traditional information-retrieval techniques compare terms found in user queries with the individual passages to determine a similarity score for passages of interest. In passage detection, passages are classified into predetermined categories. More often than not, passage detection techniques are deployed to detect hidden paragraphs in documents. That is, to hide information, documents are injected with hidden text into passages. Rather than matching query terms against passages to determine their relevance, using text-mining techniques, the passages are classified. Those documents with hidden passages are defined as infected. Thus, simply stated, passage retrieval is the search for passages relevant to a user query, while passage detection is the classification of passages. That is, in passage detection, passages are labeled with one or more categories from a set of predetermined categories. We present a keyword-based dynamic passage approach (KDP) and demonstrate that KDP outperforms statistically significantly (99% confidence) the other document-splitting approaches by 12% to 18% in the passage detection and passage category-prediction tasks. Furthermore, we evaluate the effects of the feature selection, passage length, ambiguous passages, and finally training-data category distribution on passage-detection accuracy.
    Date
    22. 3.2009 19:14:43
    Type
    a
  2. Soldaini, L.; Yates, A.; Goharian, N.: Learning to reformulate long queries for clinical decision support (2017) 0.00
    0.0025370158 = product of:
      0.0050740317 = sum of:
        0.0050740317 = product of:
          0.010148063 = sum of:
            0.010148063 = weight(_text_:a in 3957) [ClassicSimilarity], result of:
              0.010148063 = score(doc=3957,freq=18.0), product of:
                0.053105544 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046056706 = queryNorm
                0.19109234 = fieldWeight in 3957, product of:
                  4.2426405 = tf(freq=18.0), with freq of:
                    18.0 = termFreq=18.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3957)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    The large volume of biomedical literature poses a serious problem for medical professionals, who are often struggling to keep current with it. At the same time, many health providers consider knowledge of the latest literature in their field a key component for successful clinical practice. In this work, we introduce two systems designed to help retrieve medical literature. Both receive a long, discursive clinical note as input query, and return highly relevant literature that could be used in support of clinical practice. The first system is an improved version of a method previously proposed by the authors; it combines pseudo relevance feedback and a domain-specific term filter to reformulate the query. The second is an approach that uses a deep neural network to reformulate a clinical note. Both approaches were evaluated on the 2014 and 2015 TREC CDS datasets; in our tests, they outperform the previously proposed method by up to 28% in inferred NDCG; furthermore, they are competitive with the state of the art, achieving up to 8% improvement in inferred NDCG.
    Type
    a
  3. Mengle, S.S.R.; Goharian, N.: Ambiguity measure feature-selection algorithm (2009) 0.00
    0.0022374375 = product of:
      0.004474875 = sum of:
        0.004474875 = product of:
          0.00894975 = sum of:
            0.00894975 = weight(_text_:a in 2804) [ClassicSimilarity], result of:
              0.00894975 = score(doc=2804,freq=14.0), product of:
                0.053105544 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046056706 = queryNorm
                0.1685276 = fieldWeight in 2804, product of:
                  3.7416575 = tf(freq=14.0), with freq of:
                    14.0 = termFreq=14.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2804)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    With the increasing number of digital documents, the ability to automatically classify those documents both efficiently and accurately is becoming more critical and difficult. One of the major problems in text classification is the high dimensionality of feature space. We present the ambiguity measure (AM) feature-selection algorithm, which selects the most unambiguous features from the feature set. Unambiguous features are those features whose presence in a document indicates a strong degree of confidence that a document belongs to only one specific category. We apply AM feature selection on a naïve Bayes text classifier. We favorably show the effectiveness of our approach in outperforming eight existing feature-selection methods, using five benchmark datasets with a statistical significance of at least 95% confidence. The support vector machine (SVM) text classifier is shown to perform consistently better than the naïve Bayes text classifier. The drawback, however, is the time complexity in training a model. We further explore the effect of using the AM feature-selection method on an SVM text classifier. Our results indicate that the training time for the SVM algorithm can be reduced by more than 50%, while still improving the accuracy of the text classifier. We favorably show the effectiveness of our approach by demonstrating that it statistically significantly (99% confidence) outperforms eight existing feature-selection methods using four standard benchmark datasets.
    Type
    a
  4. Cohan, A.; Young, S.; Yates, A.; Goharian, N.: Triaging content severity in online mental health forums (2017) 0.00
    0.0022374375 = product of:
      0.004474875 = sum of:
        0.004474875 = product of:
          0.00894975 = sum of:
            0.00894975 = weight(_text_:a in 3930) [ClassicSimilarity], result of:
              0.00894975 = score(doc=3930,freq=14.0), product of:
                0.053105544 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046056706 = queryNorm
                0.1685276 = fieldWeight in 3930, product of:
                  3.7416575 = tf(freq=14.0), with freq of:
                    14.0 = termFreq=14.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3930)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    In recent years, social media has become a significant resource for improving healthcare and mental health. Mental health forums are online communities where people express their issues, and seek help from moderators and other users. In such forums, there are often posts with severe content indicating that the user is in acute distress and there is a risk of attempted self-harm. Moderators need to respond to these severe posts in a timely manner to prevent potential self-harm. However, the large volume of daily posted content makes it difficult for the moderators to locate and respond to these critical posts. We propose an approach for triaging user content into four severity categories that are defined based on an indication of self-harm ideation. Our models are based on a feature-rich classification framework, which includes lexical, psycholinguistic, contextual, and topic modeling features. Our approaches improve over the state of the art in triaging the content severity in mental health forums by large margins (up to 17% improvement over the F-1 scores). Furthermore, using our proposed model, we analyze the mental state of users and we show that overall, long-term users of the forum demonstrate decreased severity of risk over time. Our analysis on the interaction of the moderators with the users further indicates that without an automatic way to identify critical content, it is indeed challenging for the moderators to provide timely response to the users in need.
    Type
    a
  5. Mengle, S.S.R.; Goharian, N.: Detecting relationships among categories using text classification (2010) 0.00
    0.0020714647 = product of:
      0.0041429293 = sum of:
        0.0041429293 = product of:
          0.008285859 = sum of:
            0.008285859 = weight(_text_:a in 3462) [ClassicSimilarity], result of:
              0.008285859 = score(doc=3462,freq=12.0), product of:
                0.053105544 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046056706 = queryNorm
                0.15602624 = fieldWeight in 3462, product of:
                  3.4641016 = tf(freq=12.0), with freq of:
                    12.0 = termFreq=12.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=3462)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Discovering relationships among concepts and categories is crucial in various information systems. The authors' objective was to discover such relationships among document categories. Traditionally, such relationships are represented in the form of a concept hierarchy, grouping some categories under the same parent category. Although the nature of hierarchy supports the identification of categories that may share the same parent, not all of these categories have a relationship with each other - other than sharing the same parent. However, some non-sibling categories are related to each other yet are not identified as such. The authors identify and build a relationship network (relationship-net) with categories as the vertices and relationships as the edges of this network. They demonstrate that using a relationship-net, some nonobvious category relationships are detected. Their approach capitalizes on the misclassification information generated during the process of text classification to identify potential relationships among categories and automatically generate relationship-nets. Their results demonstrate a statistically significant improvement over the current approach by up to 73% on 20 Newsgroups (20NG), up to 68% on 17 categories in the Open Directories Project (ODP17), and more than twice on ODP46 and Special Interest Group on Information Retrieval (SIGIR) data sets. Their results also indicate that using misclassification information stemming from passage classification as opposed to document classification statistically significantly improves the results on 20NG (8%), ODP17 (5%), ODP46 (73%), and SIGIR (117%) with respect to F1 measure. By assigning weights to relationships and by performing feature selection, results are further optimized.
    Type
    a
  6. Beitzel, S.M.; Jensen, E.C.; Chowdhury, A.; Grossman, D.; Frieder, O.; Goharian, N.: Fusion of effective retrieval strategies in the same information retrieval system (2004) 0.00
    0.0020296127 = product of:
      0.0040592253 = sum of:
        0.0040592253 = product of:
          0.008118451 = sum of:
            0.008118451 = weight(_text_:a in 2502) [ClassicSimilarity], result of:
              0.008118451 = score(doc=2502,freq=8.0), product of:
                0.053105544 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046056706 = queryNorm
                0.15287387 = fieldWeight in 2502, product of:
                  2.828427 = tf(freq=8.0), with freq of:
                    8.0 = termFreq=8.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046875 = fieldNorm(doc=2502)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    Prior efforts have shown that under certain situations retrieval effectiveness may be improved via the use of data fusion techniques. Although these improvements have been observed from the fusion of result sets from several distinct information retrieval systems, it has often been thought that fusing different document retrieval strategies in a single information retrieval system will lead to similar improvements. In this study, we show that this is not the case. We hold constant systemic differences such as parsing, stemming, phrase processing, and relevance feedback, and fuse result sets generated from highly effective retrieval strategies in the same information retrieval system. From this, we show that data fusion of highly effective retrieval strategies alone shows little or no improvement in retrieval effectiveness. Furthermore, we present a detailed analysis of the performance of modern data fusion approaches, and demonstrate the reasons why they do not perform well when applied to this problem. Detailed results and analyses are included to support our conclusions.
    Type
    a
  7. Urbain, J.; Goharian, N.; Frieder, O.: Probabilistic passage models for semantic search of genomics literature (2008) 0.00
    0.0018909799 = product of:
      0.0037819599 = sum of:
        0.0037819599 = product of:
          0.0075639198 = sum of:
            0.0075639198 = weight(_text_:a in 2380) [ClassicSimilarity], result of:
              0.0075639198 = score(doc=2380,freq=10.0), product of:
                0.053105544 = queryWeight, product of:
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.046056706 = queryNorm
                0.14243183 = fieldWeight in 2380, product of:
                  3.1622777 = tf(freq=10.0), with freq of:
                    10.0 = termFreq=10.0
                  1.153047 = idf(docFreq=37942, maxDocs=44218)
                  0.0390625 = fieldNorm(doc=2380)
          0.5 = coord(1/2)
      0.5 = coord(1/2)
    
    Abstract
    We explore unsupervised learning techniques for extracting semantic information about biomedical concepts and topics, and introduce a passage retrieval model for using these semantics in context to improve genomics literature search. Our contributions include a new passage retrieval model based on an undirected graphical model (Markov Random Fields), and new methods for modeling passage-concepts, document-topics, and passage-terms as potential functions within the model. Each potential function includes distributional evidence to disambiguate topics, concepts, and terms in context. The joint distribution across potential functions in the graph represents the probability of a passage being relevant to a biologist's information need. Relevance ranking within each potential function simplifies normalization across potential functions and eliminates the need for tuning of passage retrieval model parameters. Our dimensional indexing model facilitates efficient aggregation of topic, concept, and term distributions. The proposed passage-retrieval model improves search results in the presence of varying levels of semantic evidence, outperforming models of query terms, concepts, or document topics alone. Our results exceed the state-of-the-art for automatic document retrieval by 14.46% (0.3554 vs. 0.3105) and passage retrieval by 15.57% (0.1128 vs. 0.0976) as assessed by the TREC 2007 Genomics Track, and automatic document retrieval by 18.56% (0.3424 vs. 0.2888) as assessed by the TREC 2005 Genomics Track. Automatic document retrieval results for TREC 2007 and TREC 2005 are statistically significant at the 95% confidence level (p = .0359 and .0253, respectively). Passage retrieval is significant at the 90% confidence level (p = 0.0893).
    Type
    a
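
The score breakdowns in the results above follow the TF-IDF arithmetic of Lucene's ClassicSimilarity. The sketch below recomputes the scores of results 1 and 2 from the numbers printed in the listing. It is a minimal illustration, not the search engine's own code: the function names are hypothetical, and the formulas tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)) are the ClassicSimilarity defaults, which appear consistent with the values shown. Because the listing prints single-precision values, the recomputed numbers should agree only to about six significant digits.

```python
import math

# Constants copied from the score explanations in the listing.
MAX_DOCS = 44218
QUERY_NORM = 0.046056706


def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency, assuming the ClassicSimilarity default."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, field_norm, query_norm=QUERY_NORM):
    """Score of one query term in one document: queryWeight * fieldWeight."""
    tf = math.sqrt(freq)                       # tf(freq) in the listing
    term_idf = idf(doc_freq)
    query_weight = term_idf * query_norm       # queryWeight in the listing
    field_weight = tf * term_idf * field_norm  # fieldWeight in the listing
    return query_weight * field_weight


# Result 1 (doc 2765): terms "a" and "22" are summed, then scaled by coord(1/2).
score_a = term_score(freq=16.0, doc_freq=37942, field_norm=0.0390625)
score_22 = term_score(freq=2.0, doc_freq=3622, field_norm=0.0390625)
result_1 = (score_a + score_22) * 0.5

# Result 2 (doc 3957): only term "a" matches, and coord(1/2) is applied twice.
result_2 = term_score(freq=18.0, doc_freq=37942, field_norm=0.0390625) * 0.5 * 0.5

print(f"result 1: {result_1:.9f} (listing reports 0.020383961)")
print(f"result 2: {result_2:.10f} (listing reports 0.0025370158)")
```

Results 3 to 7 follow the same pattern as result 2, differing only in the term frequency and, for result 6, the field norm (0.046875).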