Search (9 results, page 1 of 1)

Carterette, B.; Can, F.: Comparing inverted files and signature files for searching a large lexicon (2005) 0.00
```
0.0023499418 = product of:
  0.0046998835 = sum of:
    0.0046998835 = product of:
      0.009399767 = sum of:
        0.009399767 = weight(_text_:a in 1029) [ClassicSimilarity], result of:
          0.009399767 = score(doc=1029,freq=16.0), product of:
            0.043477926 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.037706986 = queryNorm
            0.2161963 = fieldWeight in 1029, product of:
              4.0 = tf(freq=16.0), with freq of:
                16.0 = termFreq=16.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046875 = fieldNorm(doc=1029)
      0.5 = coord(1/2)
  0.5 = coord(1/2)
```
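The score breakdown above follows Lucene's ClassicSimilarity (TF-IDF), where `tf = sqrt(freq)` and `idf = 1 + ln(maxDocs / (docFreq + 1))`. The following is a minimal sketch of that arithmetic, not the search engine's own code: the function name `classic_similarity_score` and its parameter names are hypothetical, and the two `coord(1/2)` factors are simply copied from the nesting shown in this listing.

```python
import math


def classic_similarity_score(freq, doc_freq, max_docs, field_norm,
                             query_norm, coords=(0.5, 0.5)):
    """Recompute one explain tree from its leaf values (illustrative only)."""
    tf = math.sqrt(freq)                              # tf(freq=...)
    idf = 1.0 + math.log(max_docs / (doc_freq + 1))   # idf(docFreq, maxDocs)
    query_weight = idf * query_norm                   # queryWeight
    field_weight = tf * idf * field_norm              # fieldWeight
    score = query_weight * field_weight               # weight(_text_:a in doc)
    for coord in coords:                              # assumed: one factor per coord(1/2) line
        score *= coord
    return score


# Leaf values taken from the first result (doc=1029).
first_hit = classic_similarity_score(
    freq=16.0,
    doc_freq=37942,
    max_docs=44218,
    field_norm=0.046875,
    query_norm=0.037706986,
)
print(f"{first_hit:.10f}")
```

With the leaf values of doc 1029, the result agrees with the listed 0.0023499418 up to floating-point rounding (Lucene computes in 32-bit floats, Python in 64-bit). Because `idf` and `queryNorm` are identical for every hit on this page, the ranking is driven entirely by `tf` and `fieldNorm`.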
Abstract

Signature files and inverted files are well-known index structures. In this paper we undertake a direct comparison of the two for searching for partially-specified queries in a large lexicon stored in main memory. Using n-grams to index lexicon terms, a bit-sliced signature file can be compressed to a smaller size than an inverted file if each n-gram sets only one bit in the term signature. With a signature width less than half the number of unique n-grams in the lexicon, the signature file method is about as fast as the inverted file method, and significantly smaller. Greater flexibility in memory usage and faster index generation time make signature files appropriate for searching large lexicons or other collections in an environment where memory is at a premium.

Type

a
Aksoy, C.; Can, F.; Kocberber, S.: Novelty detection for topic tracking (2012) 0.00
```
0.0022962927 = product of:
  0.0045925854 = sum of:
    0.0045925854 = product of:
      0.009185171 = sum of:
        0.009185171 = weight(_text_:a in 51) [ClassicSimilarity], result of:
          0.009185171 = score(doc=51,freq=22.0), product of:
            0.043477926 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.037706986 = queryNorm
            0.21126054 = fieldWeight in 51, product of:
              4.690416 = tf(freq=22.0), with freq of:
                22.0 = termFreq=22.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.0390625 = fieldNorm(doc=51)
      0.5 = coord(1/2)
  0.5 = coord(1/2)
```
Abstract

Multisource web news portals provide various advantages such as richness in news content and an opportunity to follow developments from different perspectives. However, in such environments, news variety and quantity can have an overwhelming effect. New-event detection and topic-tracking studies address this problem. They examine news streams and organize stories according to their events; however, several tracking stories of an event/topic may contain no new information (i.e., no novelty). We study the novelty detection (ND) problem on the tracking news of a particular topic. For this purpose, we build a Turkish ND test collection called BilNov-2005 and propose the usage of three ND methods: a cosine-similarity (CS)-based method, a language-model (LM)-based method, and a cover-coefficient (CC)-based method. For the LM-based ND method, we show that a simpler smoothing approach, Dirichlet smoothing, can have similar performance to a more complex smoothing approach, Shrinkage smoothing. We introduce a baseline that shows the performance of a system with random novelty decisions. In addition, a category-based threshold learning method is used for the first time in ND literature. The experimental results show that the LM-based ND method significantly outperforms the CS- and CC-based methods, and category-based threshold learning achieves promising results when compared to general threshold learning.

Type

a
Can, F.; Kocberber, S.; Balcik, E.; Kaynak, C.; Ocalan, H.C.: Information retrieval on Turkish texts (2008) 0.00
```
0.0021674242 = product of:
  0.0043348484 = sum of:
    0.0043348484 = product of:
      0.008669697 = sum of:
        0.008669697 = weight(_text_:a in 1373) [ClassicSimilarity], result of:
          0.008669697 = score(doc=1373,freq=10.0), product of:
            0.043477926 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.037706986 = queryNorm
            0.19940455 = fieldWeight in 1373, product of:
              3.1622777 = tf(freq=10.0), with freq of:
                10.0 = termFreq=10.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.0546875 = fieldNorm(doc=1373)
      0.5 = coord(1/2)
  0.5 = coord(1/2)
```
Abstract

In this study, we investigate information retrieval (IR) on Turkish texts using a large-scale test collection that contains 408,305 documents and 72 ad hoc queries. We examine the effects of several stemming options and query-document matching functions on retrieval performance. We show that a simple word truncation approach, a word truncation approach that uses language-dependent corpus statistics, and an elaborate lemmatizer-based stemmer provide similar retrieval effectiveness in Turkish IR. We investigate the effects of a range of search conditions on the retrieval performance; these include scalability issues, query and document length effects, and the use of a stopword list in indexing.

Type

a
Toraman, C.; Can, F.: Discovering story chains : a framework based on zigzagged search and news actors (2017) 0.00
```
0.0020770747 = product of:
  0.0041541494 = sum of:
    0.0041541494 = product of:
      0.008308299 = sum of:
        0.008308299 = weight(_text_:a in 3963) [ClassicSimilarity], result of:
          0.008308299 = score(doc=3963,freq=18.0), product of:
            0.043477926 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.037706986 = queryNorm
            0.19109234 = fieldWeight in 3963, product of:
              4.2426405 = tf(freq=18.0), with freq of:
                18.0 = termFreq=18.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3963)
      0.5 = coord(1/2)
  0.5 = coord(1/2)
```
Abstract

A story chain is a set of related news articles that reveal how different events are connected. This study presents a framework for discovering story chains, given an input document, in a text collection. The framework has 3 complementary parts that i) scan the collection, ii) measure the similarity between chain-member candidates and the chain, and iii) measure similarity among news articles. For scanning, we apply a novel text-mining method that uses a zigzagged search that reinvestigates past documents based on the updated chain. We also utilize social networks of news actors to reveal connections among news articles. We conduct 2 user studies in terms of 4 effectiveness measures: relevance, coverage, coherence, and ability to disclose relations. The first user study compares several versions of the framework, by varying parameters, to set a guideline for use. The second compares the framework with 3 baselines. The results show that our method provides statistically significant improvement in effectiveness in 61% of pairwise comparisons, with medium or large effect size; in the remainder, none of the baselines significantly outperforms our method.

Type

a
Can, F.; Kocberber, S.; Baglioglu, O.; Kardas, S.; Ocalan, H.C.; Uyar, E.: New event detection and topic tracking in Turkish (2010) 0.00
```
0.0018318077 = product of:
  0.0036636153 = sum of:
    0.0036636153 = product of:
      0.0073272306 = sum of:
        0.0073272306 = weight(_text_:a in 3442) [ClassicSimilarity], result of:
          0.0073272306 = score(doc=3442,freq=14.0), product of:
            0.043477926 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.037706986 = queryNorm
            0.1685276 = fieldWeight in 3442, product of:
              3.7416575 = tf(freq=14.0), with freq of:
                14.0 = termFreq=14.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.0390625 = fieldNorm(doc=3442)
      0.5 = coord(1/2)
  0.5 = coord(1/2)
```
Abstract

Topic detection and tracking (TDT) applications aim to organize the temporally ordered stories of a news stream according to the events. Two major problems in TDT are new event detection (NED) and topic tracking (TT). These problems focus on finding the first stories of new events and identifying all subsequent stories on a certain topic defined by a small number of sample stories. In this work, we introduce the first large-scale TDT test collection for Turkish, and investigate the NED and TT problems in this language. We present our test-collection-construction approach, which is inspired by the TDT research initiative. We show that in TDT for Turkish with some similarity measures, a simple word truncation stemming method can compete with a lemmatizer-based stemming approach. Our findings show that, contrary to our earlier observations on Turkish information retrieval, in NED word stopping has an impact on effectiveness. We demonstrate that the confidence scores of two different similarity measures can be combined in a straightforward manner for higher effectiveness. The influence of several similarity measures on effectiveness is also investigated. We show that it is possible to deploy TT applications in Turkish that can be used in operational settings.

Type

a
Kucukyilmaz, T.; Cambazoglu, B.B.; Aykanat, C.; Can, F.: Chat mining : Predicting user and message attributes in computer-mediated communication (2008) 0.00
```
0.0016616598 = product of:
  0.0033233196 = sum of:
    0.0033233196 = product of:
      0.006646639 = sum of:
        0.006646639 = weight(_text_:a in 2099) [ClassicSimilarity], result of:
          0.006646639 = score(doc=2099,freq=8.0), product of:
            0.043477926 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.037706986 = queryNorm
            0.15287387 = fieldWeight in 2099, product of:
              2.828427 = tf(freq=8.0), with freq of:
                8.0 = termFreq=8.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046875 = fieldNorm(doc=2099)
      0.5 = coord(1/2)
  0.5 = coord(1/2)
```
Abstract

The focus of this paper is to investigate the possibility of predicting several user and message attributes in text-based, real-time, online messaging services. For this purpose, a large collection of chat messages is examined. The applicability of various supervised classification techniques for extracting information from the chat messages is evaluated. Two competing models are used for defining the chat mining problem. A term-based approach is used to investigate the user and message attributes in the context of vocabulary use while a style-based approach is used to examine the chat messages according to the variations in the authors' writing styles. Among 100 authors, the identity of an author is correctly predicted with 99.7% accuracy. Moreover, the reverse problem is exploited, and the effect of author attributes on computer-mediated communications is discussed.

Type

a
Can, F.: On the efficiency of best-match cluster searches (1994) 0.00
```
0.0013707994 = product of:
  0.0027415988 = sum of:
    0.0027415988 = product of:
      0.0054831975 = sum of:
        0.0054831975 = weight(_text_:a in 7294) [ClassicSimilarity], result of:
          0.0054831975 = score(doc=7294,freq=4.0), product of:
            0.043477926 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.037706986 = queryNorm
            0.12611452 = fieldWeight in 7294, product of:
              2.0 = tf(freq=4.0), with freq of:
                4.0 = termFreq=4.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.0546875 = fieldNorm(doc=7294)
      0.5 = coord(1/2)
  0.5 = coord(1/2)
```
Abstract

The efficiency of various cluster-based retrieval (CBR) strategies is analyzed. The possibility of combining CBR and inverted index search (IIS) is investigated. A method for combining the two approaches is proposed and shown to be cost effective in terms of paging and CPU time. In the new method, the selection of documents from the best-matching clusters is done using the inverted index for all documents. Although this is counterintuitive to the concept of best-match CBR, the observations prove that it is much more efficient than conventional approaches. In the experiments, the effects of the number of selected clusters, page size, centroid length, and matching functions are considered. The experiments show that the storage overhead of the new method would be moderately higher than that of IIS.

Type

a

Can, F.: Incremental clustering for dynamic information processing (1993) 0.00

0.0011077732 = product of:
  0.0022155463 = sum of:
    0.0022155463 = product of:
      0.0044310926 = sum of:
        0.0044310926 = weight(_text_:a in 6627) [ClassicSimilarity], result of:
          0.0044310926 = score(doc=6627,freq=2.0), product of:
            0.043477926 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.037706986 = queryNorm
            0.10191591 = fieldWeight in 6627, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.0625 = fieldNorm(doc=6627)
      0.5 = coord(1/2)
  0.5 = coord(1/2)

Type: a

Can, F.; Nuray, R.; Sevdik, A.B.: Automatic performance evaluation of Web search engines (2004) 0.00

8.308299E-4 = product of:
  0.0016616598 = sum of:
    0.0016616598 = product of:
      0.0033233196 = sum of:
        0.0033233196 = weight(_text_:a in 2570) [ClassicSimilarity], result of:
          0.0033233196 = score(doc=2570,freq=2.0), product of:
            0.043477926 = queryWeight, product of:
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.037706986 = queryNorm
            0.07643694 = fieldWeight in 2570, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              1.153047 = idf(docFreq=37942, maxDocs=44218)
              0.046875 = fieldNorm(doc=2570)
      0.5 = coord(1/2)
  0.5 = coord(1/2)

Type: a
