Literatur zur Informationserschließung (Literature on Information Indexing)
This database contains more than 40,000 documents on topics in descriptive cataloguing, subject indexing, and information retrieval.
© 2015 W. Gödert, TH Köln, Institut für Informationswissenschaft
(As of: 4 June 2021)
Results 1–20 of 91
-
1. Lee, Y.-Y. ; Ke, H. ; Yen, T.-Y. ; Huang, H.-H. ; Chen, H.-H.: Combining and learning word embedding with WordNet for semantic relatedness and similarity measurement.
In: Journal of the Association for Information Science and Technology. 71(2020) no.6, pp. 657-670.
Abstract: In this research, we propose 3 different approaches to measure the semantic relatedness between 2 words: (i) boost the performance of GloVe word embedding model via removing or transforming abnormal dimensions; (ii) linearly combine the information extracted from WordNet and word embeddings; and (iii) utilize word embedding and 12 linguistic information extracted from WordNet as features for Support Vector Regression. We conducted our experiments on 8 benchmark data sets, and computed Spearman correlations between the outputs of our methods and the ground truth. We report our results together with 3 state-of-the-art approaches. The experimental results show that our method can outperform state-of-the-art approaches in all the selected English benchmark data sets.
Content: https://asistdl.onlinelibrary.wiley.com/doi/10.1002/asi.24289.
Topic area: Semantic environment in indexing and retrieval
Object: WordNet
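Illustration for record 1: the abstract mentions linearly combining WordNet information with word embeddings and scoring the result by Spearman correlation against ground truth. The following is only a minimal sketch of that general idea, not the authors' implementation. The function names, the weight `alpha`, and the assumption that both inputs are already per-word-pair similarity scores on comparable scales are all hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def combine(embedding_sim: Sequence[float],
            wordnet_sim: Sequence[float],
            alpha: float) -> List[float]:
    """Hypothetical linear combination: alpha * embedding + (1 - alpha) * WordNet."""
    return [alpha * e + (1 - alpha) * w for e, w in zip(embedding_sim, wordnet_sim)]
```

In such a setup, `alpha` would typically be tuned on held-out word pairs, and `spearman(human_scores, combine(...))` would be compared with the correlation of each source alone.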
-
2. Chen, H. ; Baptista Nunes, J.M. ; Ragsdell, G. ; An, X.: Somatic and cultural knowledge : drivers of a habitus-driven model of tacit knowledge acquisition.
In: Journal of documentation. 75(2019) no.5, pp. 927-953.
Abstract: Purpose: The purpose of this paper is to identify and explain the role of individual learning and development in acquiring tacit knowledge in the context of the inexorable and intense continuous change (technological and otherwise) that characterizes our society today, and also to investigate the software (SW) sector, which is at the core of contemporary continuous change and is a paradigm of effective and intrinsic knowledge sharing (KS). This makes the SW sector unique and different from others where KS is so hard to implement. Design/methodology/approach: The study employed an inductive qualitative approach based on a multi-case study approach, composed of three successful SW companies in China. These companies are representative of the fabric of the sector, namely a small- and medium-sized enterprise, a large private company and a large state-owned enterprise. The fieldwork included 44 participants who were interviewed using a semi-structured script. The interview data were coded and interpreted following the Straussian grounded theory pattern of open coding, axial coding and selective coding. The process of interviewing was stopped when theoretical saturation was achieved after a careful process of theoretical sampling. Findings: The findings of this research suggest that individual learning and development are deemed to be the fundamental feature for professional success and survival in the continuously changing environment of the SW industry today. However, individual learning was described by the participants as much more than a mere individual process. It involves a collective and participatory effort within the organization and the sector as a whole, and a KS process that transcends organizational, cultural and national borders. Individuals in particular are mostly motivated by the pressing need to face and adapt to the dynamic and changeable environments of today's digital society that is led by the sector. Software practitioners are continuously in need of learning, refreshing and accumulating tacit knowledge, partly because it is required by their companies, but also due to a sound awareness of continuous technical and technological changes that seem only to increase with the advances of information technology. This led to a clear theoretical understanding that the continuous change that faces the sector has led to individual acquisition of culture and somatic knowledge that in turn lay the foundation for not only the awareness of the need for continuous individual professional development but also for the creation of habitus related to KS and continuous learning. Originality/value: The study reported in this paper shows that there is a theoretical link between the existence of conducive organizational and sector-wide somatic and cultural knowledge, and the success of KS practices that lead to individual learning and development. Therefore, the theory proposed suggests that somatic and cultural knowledge are crucial drivers for the creation of habitus of individual tacit knowledge acquisition. The paper further proposes a habitus-driven individual development (HDID) Theoretical Model that can be of use to both academics and practitioners interested in fostering and developing processes of KS and individual development in knowledge-intensive organizations.
Content: Cf.: https://doi.org/10.1108/JD-03-2018-0044.
Topic area: Knowledge representation
-
3. Huang, H.-H. ; Wang, J.-J. ; Chen, H.-H.: Implicit opinion analysis : extraction and polarity labelling.
In: Journal of the Association for Information Science and Technology. 68(2017) no.9, pp. 2076-2087.
Abstract: Opinion words are crucial information for sentiment analysis. In some text, however, opinion words are absent or highly ambiguous. The resulting implicit opinions are more difficult to extract and label than explicit ones. In this paper, cutting-edge machine-learning approaches - deep neural network and word-embedding - are adopted for implicit opinion mining at the snippet and clause levels. Hotel reviews written in Chinese are collected and annotated as the experimental data set. Results show the convolutional neural network models not only outperform traditional support vector machine models, but also capture hidden knowledge within the raw text. The strength of word-embedding is also analyzed.
Content: Cf.: http://onlinelibrary.wiley.com/doi/10.1002/asi.23835/full.
-
4. Chen, H. ; Beaudoin, C.E. ; Hong, H.: Teen online information disclosure : empirical testing of a protection motivation and social capital model.
In: Journal of the Association for Information Science and Technology. 67(2016) no.12, pp. 2871-2881.
Abstract: With bases in protection motivation theory and social capital theory, this study investigates teen and parental factors that determine teens' online privacy concerns, online privacy protection behaviors, and subsequent online information disclosure on social network sites. With secondary data from a 2012 survey (N = 622), the final well-fitting structural equation model revealed that teen online privacy concerns were primarily influenced by parental interpersonal trust and parental concerns about teens' online privacy, whereas teen privacy protection behaviors were primarily predicted by teen cost-benefit appraisal of online interactions. In turn, teen online privacy concerns predicted increased privacy protection behaviors and lower teen information disclosure. Finally, restrictive and instructive parental mediation exerted differential influences on teens' privacy protection behaviors and online information disclosure.
Content: Cf.: http://onlinelibrary.wiley.com/doi/10.1002/asi.23567/full.
-
5. Lee, L.-H. ; Juan, Y.-C. ; Tseng, W.-L. ; Chen, H.-H. ; Tseng, Y.-H.: Mining browsing behaviors for objectionable content filtering.
In: Journal of the Association for Information Science and Technology. 66(2015) no.5, pp. 930-942.
Abstract: This article explores users' browsing intents to predict the category of a user's next access during web surfing and applies the results to filter objectionable content, such as pornography, gambling, violence, and drugs. Users' access trails in terms of category sequences in click-through data are employed to mine users' web browsing behaviors. Contextual relationships of URL categories are learned by the hidden Markov model. The top-level domains (TLDs) extracted from URLs themselves and the corresponding categories are caught by the TLD model. Given a URL to be predicted, its TLD and current context are empirically combined in an aggregation model. In addition to the uses of the current context, the predictions of the URL accessed previously in different contexts by various users are also considered by majority rule to improve the aggregation model. Large-scale experiments show that the advanced aggregation approach achieves promising performance while maintaining an acceptably low false positive rate. Different strategies are introduced to integrate the model with the blacklist it generates for filtering objectionable web pages without analyzing their content. In practice, this is complementary to the existing content analysis from users' behavioral perspectives.
Content: Cf.: http://onlinelibrary.wiley.com/doi/10.1002/asi.23217/abstract.
-
6. Jiang, S. ; Gao, Q. ; Chen, H. ; Roco, M.C.: The roles of sharing, transfer, and public funding in nanotechnology knowledge-diffusion networks.
In: Journal of the Association for Information Science and Technology. 66(2015) no.5, pp. 1017-1029.
Abstract: Understanding the knowledge-diffusion networks of patent inventors can help governments and businesses effectively use their investment to stimulate commercial science and technology development. Such inventor networks are usually large and complex. This study proposes a multidimensional network analysis framework that utilizes Exponential Random Graph Models (ERGMs) to simultaneously model knowledge-sharing and knowledge-transfer processes, examine their interactions, and evaluate the impacts of network structures and public funding on knowledge-diffusion networks. Experiments are conducted on a longitudinal data set that covers 2 decades (1991-2010) of nanotechnology-related US Patent and Trademark Office (USPTO) patents. The results show that knowledge sharing and knowledge transfer are closely interrelated. High degree centrality or boundary inventors play significant roles in the network, and National Science Foundation (NSF) public funding positively affects knowledge sharing despite its small fraction in overall funding and upstream research topics.
Content: Cf.: http://onlinelibrary.wiley.com/doi/10.1002/asi.23223/abstract.
Discipline: Patent information
-
7. Ku, Y. ; Chiu, C. ; Zhang, Y. ; Chen, H. ; Su, H.: Text mining self-disclosing health information for public health service.
In: Journal of the Association for Information Science and Technology. 65(2014) no.5, pp. 928-947.
Abstract: Understanding specific patterns or knowledge of self-disclosing health information could support public health surveillance and healthcare. This study aimed to develop an analytical framework to identify self-disclosing health information with unusual messages on web forums by leveraging advanced text-mining techniques. To demonstrate the performance of the proposed analytical framework, we conducted an experimental study on 2 major human immunodeficiency virus (HIV)/acquired immune deficiency syndrome (AIDS) forums in Taiwan. The experimental results show that the classification accuracy increased significantly (up to 83.83%) when using features selected by the information gain technique. The results also show the importance of adopting domain-specific features in analyzing unusual messages on web forums. This study has practical implications for the prevention and support of HIV/AIDS healthcare. For example, public health agencies can re-allocate resources and deliver services to people who need help via social media sites. In addition, individuals can also join a social media site to get better suggestions and support from each other.
Discipline: Medicine
-
8. Benjamin, V. ; Chen, H. ; Zimbra, D.: Bridging the virtual and real : the relationship between web content, linkage, and geographical proximity of social movements.
In: Journal of the Association for Information Science and Technology. 65(2014) no.11, pp. 2210-2222.
Abstract: As the Internet becomes ubiquitous, it has advanced to more closely represent aspects of the real world. Due to this trend, researchers in various disciplines have become interested in studying relationships between real-world phenomena and their virtual representations. One such area of emerging research seeks to study relationships between real-world and virtual activism of social movement organizations (SMOs). In particular, SMOs holding extreme social perspectives are often studied due to their tendency to have robust virtual presences to circumvent real-world social barriers preventing information dissemination. However, many previous studies have been limited in scope because they utilize manual data-collection and analysis methods. They also often have failed to consider the real-world aspects of groups that partake in virtual activism. We utilize automated data-collection and analysis methods to identify significant relationships between aspects of SMO virtual communities and their respective real-world locations and ideological perspectives. Our results also demonstrate that the interconnectedness of SMO virtual communities is affected specifically by aspects of the real world. These observations provide insight into the behaviors of SMOs within virtual environments, suggesting that the virtual communities of SMOs are strongly affected by aspects of the real world.
Topic area: Internet
-
9. Liu, J.S. ; Chen, H.-H. ; Ho, M.H.-C. ; Li, Y.-C.: Citations with different levels of relevancy : tracing the main paths of legal opinions.
In: Journal of the Association for Information Science and Technology. 65(2014) no.12, pp. 2479-2488.
Abstract: This study explores the effect of considering citation relevancy in main path analysis. Traditional citation-based analyses treat all citations equally even though there can be various reasons and different levels of relevancy for one document to reference another. Taking the relevancy level into consideration is intuitively advantageous because it adopts more accurate information and will thus make the results of a citation-based analysis more trustworthy. This is nevertheless a challenging task. We are aware of no citation-based analysis that has taken the relevancy level into consideration. The difficulty lies in the fact that the existing patent or patent citation database provides no readily available relevancy level information. We overcome this issue by obtaining citation relevancy information from a legal database that has relevancy level ranked by legal experts. This paper selects trademark dilution, a legal concept that has been the subject of many lawsuit cases, as the target for exploration. We apply main path analysis, taking citation relevancy into consideration, and verify the results against a set of test cases that are mentioned in an authoritative trademark book. The findings show that relevancy information helps main path analysis uncover legal cases of higher importance. Nevertheless, in terms of the number of significant cases retrieved, relevancy information does not seem to make a noticeable difference.
-
10. Yang, M. ; Kiang, M. ; Chen, H. ; Li, Y.: Artificial immune system for illicit content identification in social media.
In: Journal of the American Society for Information Science and Technology. 63(2012) no.2, pp. 256-269.
Abstract: Social media is frequently used as a platform for the exchange of information and opinions as well as propaganda dissemination. But online content can be misused for the distribution of illicit information, such as violent postings in web forums. Illicit content is highly distributed in social media, while non-illicit content is unspecific and topically diverse. It is costly and time consuming to label a large amount of illicit content (positive examples) and non-illicit content (negative examples) to train classification systems. Nevertheless, it is relatively easy to obtain large volumes of unlabeled content in social media. In this article, an artificial immune system-based technique is presented to address the difficulties in the illicit content identification in social media. Inspired by the positive selection principle in the immune system, we designed a novel labeling heuristic based on partially supervised learning to extract high-quality positive and negative examples from unlabeled datasets. The empirical evaluation results from two large hate group web forums suggest that our proposed approach generally outperforms the benchmark techniques and exhibits more stable performance.
Topic area: Internet
-
11. Lee, L.-H. ; Chen, H.-H.: Mining search intents for collaborative cyberporn filtering.
In: Journal of the American Society for Information Science and Technology. 63(2012) no.2, pp. 366-376.
Abstract: This article presents a search-intent-based method to generate pornographic blacklists for collaborative cyberporn filtering. A novel porn-detection framework that can find newly appearing pornographic web pages by mining search query logs is proposed. First, suspected queries are identified along with their clicked URLs by an automatically constructed lexicon. Then, a candidate URL is determined if the number of clicks satisfies majority voting rules. Finally, a candidate whose URL contains at least one categorical keyword will be included in a blacklist. Several experiments are conducted on an MSN search porn dataset to demonstrate the effectiveness of our method. The resulting blacklist generated by our search-intent-based method achieves high precision (0.701) while maintaining a favorably low false-positive rate (0.086). The experiments of a real-life filtering simulation reveal that our proposed method with its accumulative update strategy can achieve 44.15% of a macro-averaging blocking rate, when the update frequency is set to 1 day. In addition, the overblocking rates are less than 9% with time change due to the strong advantages of our search-intent-based method. This user-behavior-oriented method can be easily applied to search engines for incorporating only implicit collective intelligence from query logs without other efforts. In practice, it is complementary to intelligent content analysis for keeping up with the changing trails of objectionable websites from users' perspectives.
Topic area: Internet
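Illustration for record 11: the abstract outlines three steps (flag suspected queries with a lexicon, keep a clicked URL as a candidate if its clicks satisfy a majority rule, and blacklist a candidate only if its URL contains a categorical keyword). The sketch below is one possible reading of those steps, not the published method. The abstract does not state the exact voting rule, so the `min_share` threshold, the whitespace tokenization, and all names and sample data are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple


def build_blacklist(click_log: Iterable[Tuple[str, str]],
                    lexicon: Set[str],
                    url_keywords: Set[str],
                    min_share: float = 0.5) -> Set[str]:
    """Return URLs that pass a (hypothetical) majority rule and keyword check.

    click_log holds (query, clicked_url) pairs.
    """
    suspect_clicks = defaultdict(int)  # clicks that followed a suspected query
    total_clicks = defaultdict(int)    # all clicks per URL

    for query, url in click_log:
        total_clicks[url] += 1
        # Step 1: a query is suspected if any of its tokens is in the lexicon.
        if any(token in lexicon for token in query.lower().split()):
            suspect_clicks[url] += 1

    blacklist = set()
    for url, total in total_clicks.items():
        # Step 2 (assumed rule): more than min_share of clicks came from suspected queries.
        is_candidate = suspect_clicks[url] / total > min_share
        # Step 3: the URL string itself must contain a categorical keyword.
        has_keyword = any(keyword in url.lower() for keyword in url_keywords)
        if is_candidate and has_keyword:
            blacklist.add(url)
    return blacklist


# Hypothetical toy log with neutral placeholder terms.
log = [
    ("casino bonus", "http://casino-games.example/a"),
    ("casino slots", "http://casino-games.example/a"),
    ("weather", "http://casino-games.example/a"),
    ("casino history", "http://encyclopedia.example/"),
    ("weather", "http://news.example/"),
]
print(build_blacklist(log, lexicon={"casino"}, url_keywords={"casino"}))
```

The keyword check in step 3 is what keeps a page such as the encyclopedia entry out of the list even though every click on it followed a suspected query.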
-
12. Qu, B. ; Cong, G. ; Li, C. ; Sun, A. ; Chen, H.: An evaluation of classification models for question topic categorization.
In: Journal of the American Society for Information Science and Technology. 63(2012) no.5, pp. 889-903.
Abstract: We study the problem of question topic classification using a very large real-world Community Question Answering (CQA) dataset from Yahoo! Answers. The dataset comprises 3.9 million questions and these questions are organized into more than 1,000 categories in a hierarchy. To the best of our knowledge, this is the first systematic evaluation of the performance of different classification methods on question topic classification as well as short texts. Specifically, we empirically evaluate the following in classifying questions into CQA categories: (a) the usefulness of n-gram features and bag-of-word features; (b) the performance of three standard classification algorithms (naive Bayes, maximum entropy, and support vector machines); (c) the performance of the state-of-the-art hierarchical classification algorithms; (d) the effect of training data size on performance; and (e) the effectiveness of the different components of CQA data, including subject, content, asker, and the best answer. The experimental results show what aspects are important for question topic classification in terms of both effectiveness and efficiency. We believe that the experimental findings from this study will be useful in real-world classification problems.
Topic area: Automatic classification
-
13. Chen, S.-j. ; Zeng, M.L. ; Chen, H.-h.: Alignment of conceptual structures in controlled vocabularies in the domain of Chinese art : a discussion of issues and patterns.
In: Categories, contexts and relations in knowledge organization: Proceedings of the Twelfth International ISKO Conference 6-9 August 2012, Mysore, India. Eds.: Neelameghan, A. and K.S. Raghavan. Würzburg : Ergon Verlag, 2012. pp. 249-255.
(Advances in knowledge organization; vol.13)
Abstract: Based on our recent sub-project of the Chinese AAT-Taiwan Project, this paper reports issues regarding the alignment of the controlled vocabularies in the domain of Chinese art. The conceptual structures of the concepts for Chinese art in the National Palace Museum (NPM) Vocabularies and the Art & Architecture Thesaurus (AAT) are studied and patterns were identified in the effort of achieving semantic interoperability. The findings presented in the paper are meaningful to the research on the semantic interoperability of multilingual KOS, especially when dealing with cultural-related concepts that cannot be exactly aligned in vocabularies due to the discrepancies in the conceptual structures.
Discipline: Art
-
14. Suakkaphong, N. ; Zhang, Z. ; Chen, H.: Disease named entity recognition using semisupervised learning and conditional random fields.
In: Journal of the American Society for Information Science and Technology. 62(2011) no.4, pp. 727-737.
Abstract: Information extraction is an important text-mining task that aims at extracting prespecified types of information from large text collections and making them available in structured representations such as databases. In the biomedical domain, information extraction can be applied to help biologists make the most use of their digital-literature archives. Currently, there are large amounts of biomedical literature that contain rich information about biomedical substances. Extracting such knowledge requires a good named entity recognition technique. In this article, we combine conditional random fields (CRFs), a state-of-the-art sequence-labeling algorithm, with two semisupervised learning techniques, bootstrapping and feature sampling, to recognize disease names from biomedical literature. Two data-processing strategies for each technique also were analyzed: one sequentially processing unlabeled data partitions and another one processing unlabeled data partitions in a round-robin fashion. The experimental results showed the advantage of semisupervised learning techniques given limited labeled training data. Specifically, CRFs with bootstrapping implemented in sequential fashion outperformed strictly supervised CRFs for disease name recognition. The project was supported by NIH/NLM Grant R33 LM07299-01, 2002-2005.
Topic area: Data mining
-
15. Liu, X. ; Kaza, S. ; Zhang, P. ; Chen, H.: Determining inventor status and its effect on knowledge diffusion : a study on nanotechnology literature from China, Russia, and India.
In: Journal of the American Society for Information Science and Technology. 62(2011) no.6, pp. 1166-1176.
Abstract: In an increasingly global research landscape, it is important to identify the most prolific researchers in various institutions and their influence on the diffusion of knowledge. Knowledge diffusion within institutions is influenced by not just the status of individual researchers but also the collaborative culture that determines status. There are various methods to measure individual status, but few studies have compared them or explored the possible effects of different cultures on the status measures. In this article, we examine knowledge diffusion within science and technology-oriented research organizations. Using social network analysis metrics to measure individual status in large-scale coauthorship networks, we studied an individual's impact on the recombination of knowledge to produce innovation in nanotechnology. Data from the most productive and high-impact institutions in China (Chinese Academy of Sciences), Russia (Russian Academy of Sciences), and India (Indian Institutes of Technology) were used. We found that boundary-spanning individuals influenced knowledge diffusion in all countries. However, our results also indicate that cultural and institutional differences may influence knowledge diffusion.
Topic area: Informetrics
Discipline: Materials science
Country/Place: Chi ; Russia ; India
-
16. Hsu, M.-H. ; Chen, H.-H.: Efficient and effective prediction of social tags to enhance Web search.
In: Journal of the American Society for Information Science and Technology. 62(2011) no.8, pp. 1473-1487.
Abstract: As the web has grown into an integral part of daily life, social annotation has become a popular manner for web users to manage resources. This method of management has many potential applications, but it is limited in applicability by the cold-start problem, especially for new resources on the web. In this article, we study automatic tag prediction for web pages comprehensively and utilize the predicted tags to improve search performance. First, we explore the stabilizing phenomenon of tag usage in a social bookmarking system. Then, we propose a two-stage tag prediction approach, which is efficient and is effective in making use of early annotations from users. In the first stage, content-based ranking, candidate tags are selected and ranked to generate an initial tag list. In the second stage, random-walk re-ranking, we adopt a random-walk model that utilizes tag co-occurrence information to re-rank the initial list. The experimental results show that our algorithm effectively proposes appropriate tags for target web pages. In addition, we present a framework to incorporate tag prediction in a general web search. The experimental results of the web search validate the hypothesis that the proposed framework significantly enhances the typical retrieval model.
Topic area: Social tagging
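Illustration for record 16: the abstract describes a second stage in which an initial, content-based tag list is re-ranked by a random-walk model over tag co-occurrence information. The sketch below uses a generic random walk with restart as a stand-in, since the abstract does not give the authors' exact formulation. The `damping` value, the iteration count, the handling of tags without co-occurring neighbours, and the sample tags are all assumptions.

```python
from typing import Dict, List, Tuple


def random_walk_rerank(initial: Dict[str, float],
                       cooccur: Dict[Tuple[str, str], float],
                       damping: float = 0.85,
                       iterations: int = 50) -> Tuple[List[str], Dict[str, float]]:
    """Re-rank candidate tags with a random walk with restart (assumed variant)."""
    tags = list(initial)
    total = sum(initial.values())
    # Restart distribution: the normalized first-stage scores.
    prior = {t: initial[t] / total for t in tags}

    # Symmetric co-occurrence weights, restricted to the candidate tags.
    neighbours: Dict[str, Dict[str, float]] = {t: {} for t in tags}
    for (a, b), weight in cooccur.items():
        if a in neighbours and b in neighbours and a != b:
            neighbours[a][b] = neighbours[a].get(b, 0.0) + weight
            neighbours[b][a] = neighbours[b].get(a, 0.0) + weight

    scores = dict(prior)
    for _ in range(iterations):
        nxt = {t: (1 - damping) * prior[t] for t in tags}
        for t in tags:
            out = sum(neighbours[t].values())
            if out == 0:
                # A tag without neighbours keeps its own mass (assumption).
                nxt[t] += damping * scores[t]
                continue
            for other, weight in neighbours[t].items():
                nxt[other] += damping * scores[t] * weight / out
        scores = nxt

    ranking = sorted(tags, key=lambda t: scores[t], reverse=True)
    return ranking, scores


# Hypothetical example: "programming" co-occurs with both other candidates.
ranking, scores = random_walk_rerank(
    {"programming": 0.2, "python": 0.4, "tutorial": 0.4},
    {("programming", "python"): 1.0, ("programming", "tutorial"): 1.0},
)
print(ranking)
```

In this toy graph the tag that co-occurs with both other candidates gains score during the walk, which is the kind of effect a co-occurrence-based re-ranking stage is meant to produce.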
-
17. Tsai, M.-F. ; Chen, H.-H. ; Wang, Y.-T.: Learning a merge model for multilingual information retrieval.
In: Information processing and management. 47(2011) no.5, pp. 635-646.
Abstract: This paper proposes a learning approach for the merging process in multilingual information retrieval (MLIR). To conduct the learning approach, we present a number of features that may influence the MLIR merging process. These features are mainly extracted from three levels: query, document, and translation. After the feature extraction, we then use the FRank ranking algorithm to construct a merge model. To the best of our knowledge, this practice is the first attempt to use a learning-based ranking algorithm to construct a merge model for MLIR merging. In our experiments, three test collections for the task of crosslingual information retrieval (CLIR) in NTCIR3, 4, and 5 are employed to assess the performance of our proposed method. Moreover, several merging methods are also carried out for a comparison, including traditional merging methods, the 2-step merging strategy, and the merging method based on logistic regression. The experimental results show that our proposed method can significantly improve merging quality on two different types of datasets. In addition to the effectiveness, through the merge model generated by FRank, our method can further identify key factors that influence the merging process. This information might provide us more insight and understanding into MLIR merging.
Content: Contribution to a special topic section "Managing and Mining Multilingual Documents". Cf.: 10.1016/j.ipm.2009.12.002.
Topic area: Multilingual issues
-
18. Chau, M. ; Wong, C.H. ; Zhou, Y. ; Qin, J. ; Chen, H.: Evaluating the use of search engine development tools in IT education.
In: Journal of the American Society for Information Science and Technology. 61(2010) no.2, pp. 288-299.
Abstract: It is important for education in computer science and information systems to keep up to date with the latest development in technology. With the rapid development of the Internet and the Web, many schools have included Internet-related technologies, such as Web search engines and e-commerce, as part of their curricula. Previous research has shown that it is effective to use search engine development tools to facilitate students' learning. However, the effectiveness of these tools in the classroom has not been evaluated. In this article, we review the design of three search engine development tools, SpidersRUs, Greenstone, and Alkaline, followed by an evaluation study that compared the three tools in the classroom. In the study, 33 students were divided into 13 groups and each group used the three tools to develop three independent search engines in a class project. Our evaluation results showed that SpidersRUs performed better than the two other tools in overall satisfaction and the level of knowledge gained in their learning experience when using the tools for a class project on Internet applications development.
Topic area: Search engines ; Education
-
19. Huang, C. ; Fu, T. ; Chen, H.: Text-based video content classification for online video-sharing sites.
In: Journal of the American Society for Information Science and Technology. 61(2010) no.5, pp. 891-906.
Abstract: With the emergence of Web 2.0, sharing personal content, communicating ideas, and interacting with other online users in Web 2.0 communities have become daily routines for online users. User-generated data from Web 2.0 sites provide rich personal information (e.g., personal preferences and interests) and can be utilized to obtain insight about cyber communities and their social networks. Many studies have focused on leveraging user-generated information to analyze blogs and forums, but few studies have applied this approach to video-sharing Web sites. In this study, we propose a text-based framework for video content classification of online-video sharing Web sites. Different types of user-generated data (e.g., titles, descriptions, and comments) were used as proxies for online videos, and three types of text features (lexical, syntactic, and content-specific features) were extracted. Three feature-based classification techniques (C4.5, Naïve Bayes, and Support Vector Machine) were used to classify videos. To evaluate the proposed framework, user-generated data from candidate videos, which were identified by searching user-given keywords on YouTube, were first collected. Then, a subset of the collected data was randomly selected and manually tagged by users as our experiment data. The experimental results showed that the proposed approach was able to classify online videos based on users' interests with accuracy rates up to 87.2%, and all three types of text features contributed to discriminating videos. Support Vector Machine outperformed C4.5 and Naïve Bayes techniques in our experiments. In addition, our case study further demonstrated that accurate video-classification results are very useful for identifying implicit cyber communities on video-sharing Web sites.
Topic area: Social tagging ; Internet
Form covered: Videos
Object: Web 2.0
-
20. Fu, T. ; Abbasi, A. ; Chen, H.: A focused crawler for Dark Web forums.
In: Journal of the American Society for Information Science and Technology. 61(2010) no.6, pp. 1213-1231.
Abstract: The unprecedented growth of the Internet has given rise to the Dark Web, the problematic facet of the Web associated with cybercrime, hate, and extremism. Despite the need for tools to collect and analyze Dark Web forums, the covert nature of this part of the Internet makes traditional Web crawling techniques insufficient for capturing such content. In this study, we propose a novel crawling system designed to collect Dark Web forum content. The system uses a human-assisted accessibility approach to gain access to Dark Web forums. Several URL ordering features and techniques enable efficient extraction of forum postings. The system also includes an incremental crawler coupled with a recall-improvement mechanism intended to facilitate enhanced retrieval and updating of collected content. Experiments conducted to evaluate the effectiveness of the human-assisted accessibility approach and the recall-improvement-based, incremental-update procedure yielded favorable results. The human-assisted approach significantly improved access to Dark Web forums while the incremental crawler with recall improvement also outperformed standard periodic- and incremental-update approaches. Using the system, we were able to collect over 100 Dark Web forums from three regions. A case study encompassing link and content analysis of collected forums was used to illustrate the value and importance of gathering and analyzing content from such online communities.
Topic area: Internet ; Search engines