This database contains more than 40,000 documents on topics in descriptive cataloguing, subject indexing, and information retrieval.
© 2015 W. Gödert, TH Köln, Institut für Informationswissenschaft / Powered by litecat, BIS Oldenburg (as of 23 December 2017)
1. Zhao, M. ; Yan, E. ; Li, K.: Data set mentions and citations : a content analysis of full-text publications.
In: Journal of the Association for Information Science and Technology. 69(2018) no.1, pp. 32-46.
Abstract: This study provides evidence of data set mentions and citations in multiple disciplines based on a content analysis of 600 publications in PLoS One. We find that data set mentions and citations varied greatly among disciplines in terms of how data sets were collected, referenced, and curated. While a majority of articles provided free access to data, formal ways of data attribution such as DOIs and data citations were used in a limited number of articles. In addition, data reuse took place in less than 30% of the publications that used data, suggesting that researchers are still inclined to create and use their own data sets, rather than reusing previously curated data. This paper provides a comprehensive understanding of how data sets are used in science and helps institutions and publishers make useful data policies.
Content: Cf. http://onlinelibrary.wiley.com/doi/10.1002/asi.23919/full.
Object: PLoS One
2. Li, K.W. ; Yang, C.C.: Conceptual analysis of parallel corpus collected from the Web.
In: Journal of the American Society for Information Science and Technology. 57(2006) no.5, pp. 632-644.
Abstract: As illustrated by the World Wide Web, the volume of information in languages other than English has grown significantly in recent years. This highlights the importance of multilingual corpora. Much effort has been devoted to the compilation of multilingual corpora for the purpose of cross-lingual information retrieval and machine translation. Existing parallel corpora mostly involve European languages, such as English-French and English-Spanish. There is still a lack of parallel corpora between European and Asian languages. In the authors' previous work, an alignment method to identify one-to-one Chinese and English title pairs was developed to construct an English-Chinese parallel corpus automatically from the World Wide Web, and 100% precision and 87% recall were obtained. Careful analysis of these results has helped the authors to understand how the alignment method can be improved. A conceptual analysis was conducted, which includes the analysis of conceptual equivalence and conceptual information alternation in the aligned and nonaligned English-Chinese title pairs that are obtained by the alignment method. The result of the analysis not only reflects the characteristics of parallel corpora, but also gives insight into the strengths and weaknesses of the alignment method. In particular, conceptual alternation, such as omission and addition, is found to have a significant impact on the performance of the alignment method.
Note: Contribution to a special topic section on multilingual information systems
Subject area: Multilingual problems
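The precision and recall figures quoted in the abstract above follow the usual set-based definitions. The sketch below only illustrates that arithmetic; the function name `precision_recall` and the toy title-pair identifiers are hypothetical and are not taken from the paper.

```python
def precision_recall(predicted_pairs, gold_pairs):
    """Return (precision, recall) for a set of predicted title pairs.

    precision = correct predictions / all predictions
    recall    = correct predictions / all gold-standard pairs
    """
    predicted = set(predicted_pairs)
    gold = set(gold_pairs)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: four true title pairs, of which the aligner finds three.
gold = {("en1", "zh1"), ("en2", "zh2"), ("en3", "zh3"), ("en4", "zh4")}
predicted = {("en1", "zh1"), ("en2", "zh2"), ("en3", "zh3")}
precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

In this toy case every predicted pair is correct but one true pair is missed, which is the same pattern (perfect precision, lower recall) that the abstract reports.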
3. Li, K.W. ; Yang, C.C.: Automatic crosslingual thesaurus generated from the Hong Kong SAR Police Department Web Corpus for Crime Analysis.
In: Journal of the American Society for Information Science and Technology. 56(2005) no.3, pp. 272-281.
Abstract: For the sake of national security, very large volumes of data and information are generated and gathered daily. Much of this data and information is written in different languages, stored in different locations, and may be seemingly unconnected. Crosslingual semantic interoperability is a major challenge in generating an overview of this disparate data and information so that it can be analyzed, shared, searched, and summarized. The recent terrorist attacks and the tragic events of September 11, 2001 have prompted increased attention on national security and criminal analysis. Many Asian countries and cities, such as Japan, Taiwan, and Singapore, have been advised that they may become the next targets of terrorist attacks. Semantic interoperability has been a focus in digital library research. Traditional information retrieval (IR) approaches normally require a document to share some common keywords with the query. Generating the associations for the related terms between the two term spaces of users and documents is an important issue. The problem can be viewed as the creation of a thesaurus. Apart from this, terrorists and criminals may communicate through letters, e-mails, and faxes in languages other than English. The translation ambiguity significantly exacerbates the retrieval problem. The problem is expanded to crosslingual semantic interoperability. In this paper, we focus on the English/Chinese crosslingual semantic interoperability problem. However, the developed techniques are not limited to English and Chinese languages but can be applied to many other languages. English and Chinese are popular languages in the Asian region. Much information about national security or crime is communicated in these languages. An efficient automatically generated thesaurus between these languages is important to crosslingual information retrieval between English and Chinese languages.
To facilitate crosslingual information retrieval, a corpus-based approach uses the term co-occurrence statistics in parallel or comparable corpora to construct a statistical translation model to cross the language boundary. In this paper, the text-based approach to align English/Chinese Hong Kong Police press release documents from the Web is first presented. We also introduce an algorithmic approach to generate a robust knowledge base based on statistical correlation analysis of the semantics (knowledge) embedded in the bilingual press release corpus. The research output consisted of a thesaurus-like, semantic network knowledge base, which can aid in semantics-based crosslingual information management and retrieval.
Note: Contribution to a special issue on 'Intelligence and security informatics'
Subject area: Multilingual problems ; Design and application of the thesaurus principle ; Semantic interoperability
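The abstract above mentions using term co-occurrence statistics in a parallel corpus to link terms across languages. As a rough illustration of that general idea, the sketch below scores English/Chinese term pairs with the Dice coefficient over aligned document pairs. The Dice measure, the function name `dice_associations`, and the toy press-release terms are assumptions made for this example; the abstract does not say which correlation measure the authors use.

```python
from collections import Counter
from itertools import product


def dice_associations(aligned_docs):
    """Score every (English term, Chinese term) pair by Dice coefficient.

    aligned_docs: iterable of (english_terms, chinese_terms), one tuple per
    aligned document pair. Frequencies are counted per document pair.
    """
    en_df = Counter()    # number of document pairs containing each English term
    zh_df = Counter()    # number of document pairs containing each Chinese term
    pair_df = Counter()  # number of document pairs containing both terms
    for en_terms, zh_terms in aligned_docs:
        en_set, zh_set = set(en_terms), set(zh_terms)
        en_df.update(en_set)
        zh_df.update(zh_set)
        pair_df.update(product(en_set, zh_set))
    return {
        (en, zh): 2 * both / (en_df[en] + zh_df[zh])
        for (en, zh), both in pair_df.items()
    }


# Hypothetical, already-segmented toy corpus of aligned press releases.
docs = [
    (["police", "robbery"], ["警方", "劫案"]),
    (["police", "fraud"], ["警方", "詐騙"]),
    (["fraud", "arrest"], ["詐騙", "拘捕"]),
]
scores = dice_associations(docs)
for (en, zh), score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(en, zh, round(score, 2))
```

Term pairs that always appear together in aligned documents receive a score of 1, while pairs that co-occur only occasionally score lower; a thesaurus-like network could keep the strongest links for each term.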
4. Yang, C.C. ; Li, K.W.: ¬A heuristic method based on a statistical approach for Chinese text segmentation.
In: Journal of the American Society for Information Science and Technology. 56(2005) no.13, pp. 1438-1447.
Abstract: The authors propose a heuristic method for Chinese automatic text segmentation based on a statistical approach. This method is developed based on statistical information about the association among adjacent characters in Chinese text. Mutual information of bi-grams and significant estimation of tri-grams are utilized. A heuristic method with six rules is then proposed to determine the segmentation points in a Chinese sentence. No dictionary is required in this method. Chinese text segmentation is important in Chinese text indexing and thus greatly affects the performance of Chinese information retrieval. Due to the lack of delimiters of words in Chinese text, Chinese text segmentation is more difficult than English text segmentation. Besides, segmentation ambiguities and occurrences of out-of-vocabulary words (i.e., unknown words) are the major challenges in Chinese segmentation. Many research studies dealing with the problem of word segmentation have focused on the resolution of segmentation ambiguities. The problem of unknown word identification has not drawn much attention. The experimental result shows that the proposed heuristic method is promising to segment the unknown words as well as the known words. The authors further investigated the distribution of the errors of commission and the errors of omission caused by the proposed heuristic method and benchmarked the proposed heuristic method with a previously proposed technique, boundary detection. It is found that the heuristic method outperformed the boundary detection method.
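The abstract above relies on the mutual information of adjacent characters to find word boundaries without a dictionary. The sketch below shows one very simplified way such a statistic could drive segmentation: adjacent characters are merged when their pointwise mutual information reaches a threshold. It is not the authors' six-rule heuristic and it ignores tri-grams; the function names, the toy corpus, and the threshold value are all assumptions for illustration.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in raw sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    return unigrams, bigrams


def mutual_information(a, b, unigrams, bigrams):
    """Pointwise mutual information (base 2) of the adjacent characters a, b."""
    joint = bigrams[a + b]
    if joint == 0:
        return float("-inf")  # pair never observed: treat as a boundary
    p_ab = joint / sum(bigrams.values())
    p_a = unigrams[a] / sum(unigrams.values())
    p_b = unigrams[b] / sum(unigrams.values())
    return math.log2(p_ab / (p_a * p_b))


def segment(sentence, unigrams, bigrams, threshold):
    """Merge adjacent characters whose mutual information reaches the threshold."""
    if not sentence:
        return []
    words = [sentence[0]]
    for prev, cur in zip(sentence, sentence[1:]):
        if mutual_information(prev, cur, unigrams, bigrams) >= threshold:
            words[-1] += cur   # strong association: same word
        else:
            words.append(cur)  # weak association: segmentation point
    return words


# Hypothetical toy corpus; a real model would need far more text.
corpus = ["香港警方", "香港政府", "警方政府"]
unigrams, bigrams = train_counts(corpus)
print(segment("香港警方", unigrams, bigrams, threshold=2.5))
```

Because the decision depends only on character statistics, a sketch like this can group character sequences that no dictionary lists, which is the property the abstract highlights for unknown words.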
5. Yang, C.C. ; Li, K.W.: Automatic construction of English/Chinese parallel corpora.
In: Journal of the American Society for Information Science and Technology. 54(2003) no.8, pp. 730-742.
Abstract: As the demand for global information increases significantly, multilingual corpora have become a valuable linguistic resource for applications to cross-lingual information retrieval and natural language processing. In order to cross the boundaries that exist between different languages, dictionaries are the most typical tools. However, the general-purpose dictionary is less sensitive in both genre and domain. It is also impractical to manually construct tailored bilingual dictionaries or sophisticated multilingual thesauri for large applications. Corpus-based approaches, which do not have the limitation of dictionaries, provide a statistical translation model with which to cross the language boundary. There are many domain-specific parallel or comparable corpora that are employed in machine translation and cross-lingual information retrieval. Most of these are corpora between Indo-European languages, such as English/French and English/Spanish. Asian/Indo-European corpora, especially English/Chinese corpora, are relatively sparse. The objective of the present research is to construct an English/Chinese parallel corpus automatically from the World Wide Web. In this paper, an alignment method is presented which is based on dynamic programming to identify the one-to-one Chinese and English title pairs. The method includes alignment at title level, word level and character level. The longest common subsequence (LCS) is applied to find the most reliable Chinese translation of an English word. As one word for a language may translate into two or more words repetitively in another language, the edit operation, deletion, is used to resolve redundancy. A score function is then proposed to determine the optimal title pairs. Experiments have been conducted to investigate the performance of the proposed method using the daily press release articles by the Hong Kong SAR government as the test bed. The precision of the result is 0.998 while the recall is 0.806.
The release articles and speech articles, published by Hongkong & Shanghai Banking Corporation Limited, are also used to test our method; the precision is 1.00 and the recall is 0.948.
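The abstract above names the longest common subsequence (LCS) and dynamic programming as building blocks of the alignment method. The sketch below is the generic textbook LCS algorithm, shown only to make the term concrete; how the authors combine it with their score function is not described in the abstract, and the example strings are hypothetical.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of sequences a and b as a list."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover the subsequence.
    result = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


# Hypothetical example: characters shared, in order, by a candidate
# translation and a Chinese title.
candidate = "香港特別行政區政府"
title = "香港政府新聞公報"
print("".join(longest_common_subsequence(candidate, title)))
```

The table has one cell per pair of prefixes, so the running time and memory grow with the product of the two lengths, which is unproblematic for short strings such as titles.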
6. Dilevko, J. ; Dali, K.: ¬The challenge of building multilingual collections in Canadian public libraries.
In: Library resources and technical services. 46(2002) no.4, pp. 116-137.
Abstract: A Web-based survey was conducted to determine the extent to which Canadian public libraries are collecting multilingual materials (foreign languages other than English and French), the methods that they use to select these materials, and whether public librarians are sufficiently prepared to provide their multilingual clientele with an adequate range of materials and services. There is room for improvement with regard to collection development of multilingual materials in Canadian public libraries, as well as in educating staff about keeping multilingual collections current, diverse, and of sufficient interest to potential users to keep such materials circulating. The main constraints preventing public libraries from developing better multilingual collections are addressed, and recommendations for improving the state of multilingual holdings are provided.
Subject area: Multilingual problems
Field of application: Public libraries
7. Broccoli, K. ; Ravenswaay, G.V.: Web indexing : anchors away!
In: Beyond book indexing: how to get started in Web indexing, embedded indexing and other computer-based media. Ed. by D. Brenner and M. Rowland.
Phoenix, AZ : American Society of Indexers / Information Today, 2000. pp. 37-42.
Abstract: In this chapter we turn to embedded indexing for the Internet, frequently called Web indexing. We will define Web indexes; describe the structure of entries for Web indexes; present some of the challenges that Web indexers face; and compare Web indexes to search engines. One of the difficulties in defining Web indexes is their relative newness. The first pages were placed on the World Wide Web in 1991 when Tim Berners-Lee, its founder, uploaded four files. We are in a period of transition, moving from using well-established forms of writing and communications to others that are still in their infancy. Paramount among these is the Web. For indexers, this is an uncharted voyage where we must jettison firmly established ideas while developing new ones. Where the voyage will end is anyone's guess.
Subject area: Indexes ; Internet
8. Yee, K.-P. ; Swearingen, K. ; Li, K. ; Hearst, M.: Faceted metadata for image search and browsing.
Abstract: There are currently two dominant interface types for searching and browsing large image collections: keyword-based search, and searching by overall similarity to sample images. We present an alternative based on enabling users to navigate along conceptual dimensions that describe the images. The interface makes use of hierarchical faceted metadata and dynamically generated query previews. A usability study, in which 32 art history students explored a collection of 35,000 fine arts images, compares this approach to a standard image search interface. Despite the unfamiliarity and power of the interface (attributes that often lead to rejection of new search interfaces), the study results show that 90% of the participants preferred the metadata approach overall, 97% said that it helped them learn more about the collection, 75% found it more flexible, and 72% found it easier to use than a standard baseline system. These results indicate that a category-based approach is a successful way to provide access to image collections.
Content: See also: http://flamenco.berkeley.edu/.
Subject area: Images ; User studies
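The abstract above describes navigation along faceted metadata with dynamically generated query previews. The sketch below illustrates one plausible reading of a query preview: for the current facet selection, count how many matching items carry each facet value. It uses flat rather than hierarchical facets, and the function names, the record layout, and the toy image records are assumptions for this example rather than details of the system described.

```python
from collections import Counter, defaultdict


def matches(item, selection):
    """True if the item carries every facet value chosen so far."""
    return all(value in item.get(facet, ()) for facet, value in selection.items())


def query_previews(items, selection):
    """Count, per facet, how many matching items carry each value."""
    previews = defaultdict(Counter)
    for item in items:
        if matches(item, selection):
            for facet, values in item.items():
                previews[facet].update(values)
    return previews


# Hypothetical image records: each facet maps to a set of assigned values.
items = [
    {"medium": {"oil"}, "period": {"19th century"}, "subject": {"landscape"}},
    {"medium": {"oil"}, "period": {"17th century"}, "subject": {"portrait"}},
    {"medium": {"woodcut"}, "period": {"19th century"}, "subject": {"landscape"}},
]
previews = query_previews(items, {"medium": "oil"})
for facet, counts in previews.items():
    print(facet, dict(counts))
```

Showing these counts next to each facet value lets a user see how many images a further refinement would return before selecting it.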