Literature on Information Organization (Informationserschließung)
This database contains more than 40,000 documents on topics from the areas of descriptive cataloging (Formalerschließung), subject indexing (Inhaltserschließung), and information retrieval.
© 2015 W. Gödert, TH Köln, Institut für Informationswissenschaft
Powered by litecat, BIS Oldenburg
(As of 28 April 2022)
Search results
Results 1–20 of 781
-
1 Andrushchenko, M. ; Sandberg, K. ; Turunen, R. ; Marjanen, J. ; Hatavara, M. ; Kurunmäki, J. ; Nummenmaa, T. ; Hyvärinen, M. ; Teräs, K. ; Peltonen, J. ; Nummenmaa, J.: Using parsed and annotated corpora to analyze parliamentarians' talk in Finland.
In: Journal of the Association for Information Science and Technology. 73(2022) no.2, S.288-302.
(JASIST special issue on digital humanities (DH): C. Methodological innovations, challenges, and new interest in DH)
Abstract: We present a search system for grammatically analyzed corpora of Finnish parliamentary records and interviews with former parliamentarians, annotated with metadata of talk structure and involved parliamentarians, and discuss their use through carefully chosen digital humanities case studies. We first introduce the construction, contents, and principles of use of the corpora. Then we discuss the application of the search system and the corpora to study how politicians talk about power, how ideological terms are used in political speech, and how to identify narratives in the data. All case studies stem from questions in the humanities and the social sciences, but rely on the grammatically parsed corpora in both identifying and quantifying passages of interest. Finally, the paper discusses the role of natural language processing methods for questions in the (digital) humanities. It makes the claim that a digital humanities inquiry into parliamentary speech and interviews with politicians cannot rely on computational humanities modeling alone, but needs to accommodate a range of perspectives, from simple searches through quantitative exploration to modeling. Furthermore, the digital humanities need a more thorough discussion about how the utilization of tools from information science and technology alters the research questions posed in the humanities.
Inhalt: Vgl.: https://asistdl.onlinelibrary.wiley.com/doi/10.1002/asi.24500.
Themenfeld: Computerlinguistik
Land/Ort: FIN
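The case studies above query a grammatically parsed corpus, for example to see how politicians talk about power. Below is a minimal sketch of that kind of dependency-based query, using spaCy on a few invented English sentences rather than the authors' Finnish parliamentary data; the model name and example sentences are illustrative assumptions only.

```python
# Sketch: find which verbs govern the noun "power" in a parsed corpus.
# Assumes the small English spaCy model is installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

# Placeholder "corpus"; the study uses Finnish parliamentary records instead.
corpus = [
    "The minister said the committee holds too much power.",
    "Parliament should share power with the regions.",
    "The opposition wants to limit the power of the cabinet.",
]

verb_counts = Counter()
for doc in nlp.pipe(corpus):
    for tok in doc:
        # Count verbs whose argument is the noun "power".
        if tok.lemma_ == "power" and tok.head.pos_ == "VERB":
            verb_counts[tok.head.lemma_] += 1

print(verb_counts.most_common())
```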
-
2 Suissa, O. ; Elmalech, A. ; Zhitomirsky-Geffet, M.: Text analysis using deep neural networks in digital humanities and information science.
In: Journal of the Association for Information Science and Technology. 73(2022) no.2, S.268-287.
(JASIST special issue on digital humanities (DH): C. Methodological innovations, challenges, and new interest in DH)
Abstract: Combining computational technologies and humanities is an ongoing effort aimed at making resources such as texts, images, audio, video, and other artifacts digitally available, searchable, and analyzable. In recent years, deep neural networks (DNN) dominate the field of automatic text analysis and natural language processing (NLP), in some cases presenting a super-human performance. DNNs are the state-of-the-art machine learning algorithms solving many NLP tasks that are relevant for Digital Humanities (DH) research, such as spell checking, language detection, entity extraction, author detection, question answering, and other tasks. These supervised algorithms learn patterns from a large number of "right" and "wrong" examples and apply them to new examples. However, using DNNs for analyzing the text resources in DH research presents two main challenges: (un)availability of training data and a need for domain adaptation. This paper explores these challenges by analyzing multiple use-cases of DH studies in recent literature and their possible solutions and lays out a practical decision model for DH experts for when and how to choose the appropriate deep learning approaches for their research. Moreover, in this paper, we aim to raise awareness of the benefits of utilizing deep learning models in the DH community.
Inhalt: Vgl.: https://asistdl.onlinelibrary.wiley.com/doi/10.1002/asi.24544.
Themenfeld: Computerlinguistik
Wissenschaftsfach: Literaturwissenschaft
-
3 Escolano, C. ; Costa-Jussà, M.R. ; Fonollosa, J.A.: From bilingual to multilingual neural-based machine translation by incremental training.
In: Journal of the Association for Information Science and Technology. 72(2021) no.2, S.190-203.
Abstract: A common intermediate language representation in neural machine translation can be used to extend bilingual systems by incremental training. We propose a new architecture based on introducing an interlingual loss as an additional training objective. By adding and forcing this interlingual loss, we can train multiple encoders and decoders for each language, sharing among them a common intermediate representation. Translation results on the low-resource tasks (Turkish-English and Kazakh-English tasks) show a BLEU improvement of up to 2.8 points. However, results on a larger dataset (Russian-English and Kazakh-English) show BLEU losses of a similar amount. While our system provides improvements only for the low-resource tasks in terms of translation quality, our system is capable of quickly deploying new language pairs without the need to retrain the rest of the system, which may be a game changer in some situations. Specifically, what is most relevant regarding our architecture is that it is capable of: reducing the number of production systems, with respect to the number of languages, from quadratic to linear; incrementally adding a new language to the system without retraining the languages already there; and allowing for translations from the new language to all the others present in the system.
Inhalt: Vgl.: https://asistdl.onlinelibrary.wiley.com/doi/10.1002/asi.24395.
Themenfeld: Computerlinguistik
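A schematic sketch of the training objective described above: each language gets its own encoder and decoder, and an additional interlingual loss pushes the encoder outputs for the two sides of a parallel sentence pair toward a shared intermediate representation. The toy GRU encoder, the one-step decoder, the layer sizes, and the random tensors are illustrative assumptions, not the authors' architecture.

```python
# Sketch: translation loss plus an interlingual loss that forces the
# encoders of two languages toward a shared intermediate representation.
import torch
import torch.nn as nn

d_model, vocab_src, vocab_tgt = 64, 1000, 1000

class Encoder(nn.Module):
    def __init__(self, vocab):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
    def forward(self, x):                 # x: (batch, seq)
        _, h = self.rnn(self.emb(x))      # h: (1, batch, d_model)
        return h.squeeze(0)               # sentence representation

class Decoder(nn.Module):
    def __init__(self, vocab):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab)
    def forward(self, h):                 # toy one-step decoder
        return self.proj(h)               # (batch, vocab)

enc_tr, enc_en = Encoder(vocab_src), Encoder(vocab_src)   # Turkish / English encoders
dec_en = Decoder(vocab_tgt)

src_tr = torch.randint(0, vocab_src, (8, 12))   # Turkish side of a parallel batch
src_en = torch.randint(0, vocab_src, (8, 12))   # English side of the same batch
tgt_en = torch.randint(0, vocab_tgt, (8,))      # toy target labels

h_tr, h_en = enc_tr(src_tr), enc_en(src_en)
translation_loss = nn.functional.cross_entropy(dec_en(h_tr), tgt_en)
interlingual_loss = nn.functional.mse_loss(h_tr, h_en)     # pull representations together
loss = translation_loss + interlingual_loss                # joint training objective
loss.backward()
```

Because the representation is shared, a new language pair can in principle be added by training only its encoder and decoder against this common space, which is the incremental-training property the abstract highlights.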
-
4 Giesselbach, S. ; Estler-Ziegler, T.: Dokumente schneller analysieren mit Künstlicher Intelligenz.
In: Mail an Inetbib vom 06.02.2021, von Tania Estler-Ziegler.
Abstract: Artificial intelligence (AI) and natural language understanding (NLU) are changing many aspects of our everyday life and the way we work. NLU gained particular prominence through voice assistants such as Siri, Alexa, and Google Now. NLU offers companies and institutions the potential to make processes more efficient and to derive added value from textual content. NLU solutions are able to index complex, unstructured documents by their content. For semantic text analysis, the NLU team at Fraunhofer IAIS has developed language models that are trained with deep learning methods. The NLU suite analyzes documents, extracts key data, and, if required, even produces a structured summary. Using these results, but also the content of the documents themselves, documents can be compared or texts with similar information can be found. AI-based language models are clearly superior to classical keyword indexing, because they do not only find texts containing predefined keywords but search intelligently for terms that occur in similar contexts or are used as synonyms. The talk situates the terms "artificial intelligence" and "natural language understanding" and outlines possibilities, limits, current research directions, and methods. Practical examples then demonstrate how NLU can be used for automated receipt processing, for cataloging large collections such as news and patents, and for the automated thematic grouping of social media posts and publications.
Inhalt: Vgl.: https://www.iais.fraunhofer.de/.
Anmerkung: Talk given as part of the Berliner Arbeitskreis Information (BAK) on 25.02.2021.
Themenfeld: Computerlinguistik ; Automatisches Indexieren
Wissenschaftsfach: Informatik ; Sprachwissenschaft
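The claim above that embedding-based language models retrieve texts using synonyms or related wording, not just predefined keywords, can be illustrated with a generic sentence-embedding library. The model name, query, and documents below are placeholder assumptions and are unrelated to the IAIS NLU suite itself.

```python
# Sketch: semantic similarity search vs. exact keyword matching.
# Assumes: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # any pretrained model works here

query = "invoice processing"
docs = [
    "Automated handling of incoming bills and receipts",
    "History of the public library in Cologne",
]

# Exact keyword matching misses the first document ...
print([d for d in docs if "invoice" in d.lower()])            # []

# ... while embedding similarity should rank it clearly higher.
scores = util.cos_sim(model.encode(query), model.encode(docs))
print(scores)
```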
-
5 Lee, G.E. ; Sun, A.: Understanding the stability of medical concept embeddings.
In: Journal of the Association for Information Science and Technology. 72(2021) no.3, S.346-356.
Abstract: Frequency is one of the major factors for training quality word embeddings. Several studies have recently discussed the stability of word embeddings in the general domain and suggested factors influencing the stability. In this work, we conduct a detailed analysis of the stability of concept embeddings in the medical domain, particularly in relation to concept frequency. The analysis reveals the surprisingly high stability of low-frequency concepts: low-frequency (<100) concepts have the same high stability as high-frequency (>1,000) concepts. To develop a deeper understanding of this finding, we propose a new factor, the noisiness of context words, which influences the stability of medical concept embeddings regardless of high or low frequency. We evaluate the proposed factor by showing the linear correlation with the stability of medical concept embeddings. The correlations are clear and consistent across various groups of medical concepts. Based on the linear relations, we make suggestions on ways to adjust the noisiness of context words for the improvement of stability. Finally, we demonstrate that the linear relation of the proposed factor extends to word embedding stability in the general domain.
Inhalt: Vgl.: https://asistdl.onlinelibrary.wiley.com/doi/10.1002/asi.24411.
Themenfeld: Computerlinguistik
Wissenschaftsfach: Medizin
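Embedding stability is often operationalized as the overlap between a term's nearest neighbours across independently trained models. The following sketch uses that common definition on a toy corpus with gensim; it is only an illustration of the kind of measurement discussed above, not the authors' exact procedure.

```python
# Sketch: stability of a term's embedding measured as nearest-neighbour
# overlap between two independently trained word2vec models.
# Assumes: pip install gensim
from gensim.models import Word2Vec

sentences = [
    ["patient", "fever", "cough", "infection"],
    ["patient", "diagnosis", "fever", "treatment"],
    ["infection", "treatment", "antibiotic", "fever"],
] * 200   # tiny toy corpus, repeated so word2vec has something to fit

def train(seed):
    return Word2Vec(sentences, vector_size=50, min_count=1, seed=seed, workers=1, epochs=5)

def stability(term, m1, m2, k=3):
    n1 = {w for w, _ in m1.wv.most_similar(term, topn=k)}
    n2 = {w for w, _ in m2.wv.most_similar(term, topn=k)}
    return len(n1 & n2) / k        # fraction of shared nearest neighbours

m1, m2 = train(seed=1), train(seed=2)
print(stability("fever", m1, m2))   # 1.0 = identical neighbourhoods, 0.0 = none shared
```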
-
6 Soni, S. ; Lerman, K. ; Eisenstein, J.: Follow the leader : documents on the leading edge of semantic change get more citations.
In: Journal of the Association for Information Science and Technology. 72(2021) no.4, S.478-492.
Abstract: Diachronic word embeddings (vector representations of words over time) offer remarkable insights into the evolution of language and provide a tool for quantifying sociocultural change from text documents. Prior work has used such embeddings to identify shifts in the meaning of individual words. However, simply knowing that a word has changed in meaning is insufficient to identify the instances of word usage that convey the historical meaning or the newer meaning. In this study, we link diachronic word embeddings to documents, by situating those documents as leaders or laggards with respect to ongoing semantic changes. Specifically, we propose a novel method to quantify the degree of semantic progressiveness in each word usage, and then show how these usages can be aggregated to obtain scores for each document. We analyze two large collections of documents, representing legal opinions and scientific articles. Documents that are scored as semantically progressive receive a larger number of citations, indicating that they are especially influential. Our work thus provides a new technique for identifying lexical semantic leaders and demonstrates a new link between progressive use of language and influence in a citation network.
Inhalt: Vgl.: https://asistdl.onlinelibrary.wiley.com/doi/10.1002/asi.24421.
Themenfeld: Computerlinguistik
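One way to read the scoring idea above: a usage of a semantically changing word counts as progressive when its context is closer to the word's newer embedding than to its older one, and document scores aggregate these usage scores. The numpy sketch below illustrates that reading with made-up vectors; the authors' actual formula may differ in detail.

```python
# Sketch: score a usage of a changing word as "progressive" if its context
# is closer to the word's newer embedding than to its older one.
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up diachronic embeddings of the target word at two time slices,
# plus embeddings of the context words of one particular usage.
w_old = np.array([1.0, 0.0, 0.0])
w_new = np.array([0.0, 1.0, 0.0])
context = [np.array([0.1, 0.9, 0.0]), np.array([0.0, 0.8, 0.2])]

ctx = np.mean(context, axis=0)                       # average context vector
progressiveness = cos(ctx, w_new) - cos(ctx, w_old)  # > 0: leading edge of the change
print(round(progressiveness, 3))

# Document-level score: aggregate (e.g. average) the usage scores of all
# semantically changing words that occur in the document.
```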
-
7 Hausser, R.: Language and nonlanguage cognition.
In: https://www.researchgate.net/publication/351747516_Language_and_Nonlanguage_Cognition.
Abstract: A basic distinction in agent-based data-driven Database Semantics (DBS) is between language and nonlanguage cognition. Language cognition transfers content between agents by means of raw data. Nonlanguage cognition maps between content and raw data inside the focus agent. Recognition applies a concept type to raw data, resulting in a concept token. In language recognition, the focus agent (hearer) takes raw language-data (surfaces) produced by another agent (speaker) as input, while nonlanguage recognition takes raw nonlanguage-data as input. In either case, the output is a content which is stored in the agent's onboard short term memory. Action adapts a concept type to a purpose, resulting in a token. In language action, the focus agent (speaker) produces language-dependent surfaces for another agent (hearer), while nonlanguage action produces intentions for a nonlanguage purpose. In either case, the output is raw action data. As long as the procedural implementation of place holder values works properly, it is compatible with the DBS requirement of input-output equivalence between the natural prototype and the artificial reconstruction.
Themenfeld: Computerlinguistik
Wissenschaftsfach: Kognitionswissenschaft ; Sprachwissenschaft
-
8 Xiang, R. ; Chersoni, E. ; Lu, Q. ; Huang, C.-R. ; Li, W. ; Long, Y.: Lexical data augmentation for sentiment analysis.
In: Journal of the Association for Information Science and Technology. 72(2021) no.11, S.1432-1447.
Abstract: Machine learning methods, especially deep learning models, have achieved impressive performance in various natural language processing tasks including sentiment analysis. However, deep learning models are more demanding in terms of training data. Data augmentation techniques are widely used to generate new instances based on modifications to existing data or relying on external knowledge bases to address annotated data scarcity, which hinders the full potential of machine learning techniques. This paper presents our work using part-of-speech (POS) focused lexical substitution for data augmentation (PLSDA) to enhance the performance of machine learning algorithms in sentiment analysis. We exploit POS information to identify words to be replaced and investigate different augmentation strategies to find semantically related substitutions when generating new instances. The choice of POS tags as well as a variety of strategies such as semantic-based substitution methods and sampling methods are discussed in detail. Performance evaluation focuses on the comparison between PLSDA and two previous lexical substitution-based data augmentation methods, one of which is thesaurus-based and the other lexicon-manipulation based. Our approach is tested on five English sentiment analysis benchmarks: SST-2, MR, IMDB, Twitter, and AirRecord. Hyperparameters such as the candidate similarity threshold and number of newly generated instances are optimized. Results show that six classifiers (SVM, LSTM, BiLSTM-AT, bidirectional encoder representations from transformers [BERT], XLNet, and RoBERTa) trained with PLSDA achieve an accuracy improvement of more than 0.6% compared to the two previous lexical substitution methods, averaged over the five benchmarks. Introducing POS constraints and well-designed augmentation strategies can improve the reliability of lexical data augmentation methods. Consequently, PLSDA significantly improves the performance of sentiment analysis algorithms.
Inhalt: Vgl.: https://asistdl.onlinelibrary.wiley.com/doi/10.1002/asi.24493.
Themenfeld: Computerlinguistik
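A minimal sketch of POS-constrained lexical substitution in the spirit of PLSDA, using WordNet through NLTK as the semantic resource; the paper evaluates several substitution and sampling strategies, of which this synonym-based variant is only one illustrative assumption.

```python
# Sketch: data augmentation by POS-constrained lexical substitution -
# replace adjectives and nouns in a labelled sentence with WordNet synonyms.
# Assumes: pip install nltk, plus nltk.download('punkt'),
# nltk.download('averaged_perceptron_tagger') and nltk.download('wordnet').
import random
import nltk
from nltk.corpus import wordnet as wn

def synonyms(word, wn_pos):
    """WordNet lemmas sharing the word's part of speech, minus the word itself."""
    lemmas = {l.name().replace("_", " ")
              for syn in wn.synsets(word, pos=wn_pos) for l in syn.lemmas()}
    lemmas.discard(word)
    return sorted(lemmas)

def augment(sentence, n_new=2):
    """Generate n_new variants by substituting adjectives and nouns with synonyms."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    variants = []
    for _ in range(n_new):
        tokens = []
        for tok, tag in tagged:
            wn_pos = {"JJ": wn.ADJ, "NN": wn.NOUN}.get(tag)   # POS constraint
            subs = synonyms(tok.lower(), wn_pos) if wn_pos else []
            tokens.append(random.choice(subs) if subs else tok)
        variants.append(" ".join(tokens))
    return variants

# The original sentiment label is kept for every generated variant.
print(augment("the plot was dull and the acting was terrible"))
```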
-
9 Morris, V.: Automated language identification of bibliographic resources.
In: Cataloging and classification quarterly. 58(2020) no.1, S.1-27.
Abstract: This article describes experiments in the use of machine learning techniques at the British Library to assign language codes to catalog records, in order to provide information about the language of content of the resources described. In the first phase of the project, language codes were assigned to 1.15 million records with 99.7% confidence. The automated language identification tools developed will be used to contribute to future enhancement of over 4 million legacy records.
Inhalt: Vgl.: https://doi.org/10.1080/01639374.2019.1700201.
Themenfeld: Formalerschließung ; Computerlinguistik
Land/Ort: GB
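A small sketch of the task described above, assigning a language code to a catalog record only when the detector's confidence clears a threshold; it uses the generic langdetect package rather than the British Library's own tools, and the records and threshold are illustrative.

```python
# Sketch: assign an ISO 639-1 language code to each title, keeping only
# assignments above a confidence threshold (cf. the 99.7% reported above).
# Assumes: pip install langdetect - a generic detector, not the BL's own tool.
from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0   # make the detector deterministic

records = [
    {"id": 1, "title": "Geschichte der deutschen Literatur im Mittelalter"},
    {"id": 2, "title": "A history of the English-speaking peoples"},
]

THRESHOLD = 0.99
for rec in records:
    best = detect_langs(rec["title"])[0]          # most probable language
    rec["lang"] = best.lang if best.prob >= THRESHOLD else None
    print(rec["id"], rec["lang"], round(best.prob, 3))
```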
-
10 Zhang, Y. ; Zhang, C. ; Li, J.: Joint modeling of characters, words, and conversation contexts for microblog keyphrase extraction.
In: Journal of the Association for Information Science and Technology. 71(2020) no.5, S.553-567.
Abstract: Millions of messages are produced on microblog platforms every day, leading to the pressing need for automatic identification of key points from the massive texts. To absorb salient content from the vast bulk of microblog posts, this article focuses on the task of microblog keyphrase extraction. In previous work, most efforts treat messages as independent documents and might suffer from the data sparsity problem exhibited in short and informal microblog posts. On the contrary, we propose to enrich contexts via exploiting conversations initialized by target posts and formed by their replies, which are generally centered around relevant topics to the target posts and therefore helpful for keyphrase identification. Concretely, we present a neural keyphrase extraction framework, which has 2 modules: a conversation context encoder and a keyphrase tagger. The conversation context encoder captures indicative representation from their conversation contexts and feeds the representation into the keyphrase tagger, and the keyphrase tagger extracts salient words from target posts. The 2 modules were trained jointly to optimize the conversation context encoding and keyphrase extraction processes. In the conversation context encoder, we leverage hierarchical structures to capture the word-level indicative representation and message-level indicative representation hierarchically. In both of the modules, we apply character-level representations, which enables the model to explore morphological features and deal with the out-of-vocabulary problem caused by the informal language style of microblog messages. Extensive comparison results on real-life data sets indicate that our model outperforms state-of-the-art models from previous studies.
Inhalt: https://asistdl.onlinelibrary.wiley.com/doi/10.1002/asi.24279.
Themenfeld: Automatisches Indexieren ; Computerlinguistik
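A compressed sketch of the two-module design described above: a conversation-context encoder whose output is fed into a keyphrase tagger over the target post, with one loss training both modules jointly. The sketch is word-level only and uses arbitrary sizes and random tensors; the paper's model additionally uses character-level representations and a hierarchical context encoder.

```python
# Sketch: two jointly trained modules - a conversation-context encoder and a
# keyphrase tagger over the target post.
import torch
import torch.nn as nn

vocab, d = 5000, 64

class KeyphraseExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.ctx_enc = nn.GRU(d, d, batch_first=True)        # conversation context encoder
        self.tagger = nn.GRU(2 * d, d, batch_first=True)     # keyphrase tagger
        self.out = nn.Linear(d, 2)                            # keyphrase / not keyphrase
    def forward(self, post, conversation):
        _, ctx = self.ctx_enc(self.emb(conversation))         # (1, batch, d) context vector
        ctx = ctx.squeeze(0).unsqueeze(1).expand(-1, post.size(1), -1)
        feats = torch.cat([self.emb(post), ctx], dim=-1)      # feed context into the tagger
        h, _ = self.tagger(feats)
        return self.out(h)                                    # per-token tag scores

model = KeyphraseExtractor()
post = torch.randint(0, vocab, (4, 10))          # target posts (batch, tokens)
conv = torch.randint(0, vocab, (4, 40))          # flattened reply threads
tags = torch.randint(0, 2, (4, 10))              # gold keyphrase tags

logits = model(post, conv)
loss = nn.functional.cross_entropy(logits.reshape(-1, 2), tags.reshape(-1))
loss.backward()                                  # one loss trains both modules jointly
```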
-
11 Azpiazu, I.M. ; Soledad Pera, M.: Is cross-lingual readability assessment possible?.
In: Journal of the Association for Information Science and Technology. 71(2020) no.6, S.644-656.
Abstract: Most research efforts related to automatic readability assessment focus on the design of strategies that apply to a specific language. These state-of-the-art strategies are highly dependent on linguistic features that best suit the language for which they were intended, constraining their adaptability and making it difficult to determine whether they would remain effective if they were applied to estimate the level of difficulty of texts in other languages. In this article, we present the results of a study designed to determine the feasibility of a cross-lingual readability assessment strategy. To do so, we first analyzed the most common features used for readability assessment and determined their influence on the readability prediction process of 6 different languages: English, Spanish, Basque, Italian, French, and Catalan. In addition, we developed a cross-lingual readability assessment strategy that serves as a means to empirically explore the potential advantages of employing a single strategy (and set of features) for readability assessment in different languages, including interlanguage prediction agreement and prediction accuracy improvement for low-resource languages.
Inhalt: https://asistdl.onlinelibrary.wiley.com/toc/23301643/current.
Themenfeld: Computerlinguistik
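A sketch of the kind of surface features such readability studies typically feed into a classifier, computed in a language-agnostic way; the concrete feature set and models of the article are richer than this illustration.

```python
# Sketch: a few language-agnostic surface features commonly used as input
# to readability classifiers (the study evaluates a richer feature set).
import re

def readability_features(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower(), flags=re.UNICODE)
    return {
        "avg_sentence_length": len(words) / len(sentences),      # words per sentence
        "avg_word_length": sum(map(len, words)) / len(words),    # characters per word
        "type_token_ratio": len(set(words)) / len(words),        # lexical diversity
    }

print(readability_features(
    "The cat sat on the mat. It was warm. The cat slept."
))
# A cross-lingual model would feed such features, computed the same way for
# every language, into a single classifier of reading difficulty.
```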
-
12 Geißler, S.: Natürliche Sprachverarbeitung und Künstliche Intelligenz : ein wachsender Markt mit vielen Chancen. Das Beispiel Kairntech.
In: Information - Wissenschaft und Praxis. 71(2020) H.2/3, S.95-106.
Abstract: About a year ago we described in this journal the exciting dynamics in the fields of natural language processing (NLP) and artificial intelligence (AI): for several years, advances in the algorithmic foundations, in the available computing power, and in the availability of large amounts of data have been producing ever more powerful systems. NLP applications, we argued, are therefore more ready than ever for practical use. At Kairntech we follow this development not merely as interested observers; it forms the basis of our work, in which we develop and deploy NLP and AI approaches for concrete business-critical processes. Experts expect the worldwide NLP market to keep growing in the coming years: with an average growth of more than 20 percent per year, the market is projected to reach an estimated 6.24 billion US dollars by 2025. In research the growth is even more rapid: the number of submissions to the ACL conference, perhaps the most important annual event in the field, rose by a full 75 percent from 2018 to 2019. In this article we describe the direction we took at Kairntech when the company was founded a year ago and report on first successes along this path.
Inhalt: Vgl.: https://doi.org/10.1515/iwp-2020-2079.
Themenfeld: Computerlinguistik
Wissenschaftsfach: Informatik
-
13 Hausser, R.: Grammatical disambiguation : the linear complexity hypothesis for natural language.
In: https://www.researchgate.net/publication/344883384_Grammatical_Disambiguation_The_Linear_Complexity_Hypothesis_for_Natural_Language.
Abstract: DBS uses a strictly time-linear derivation order. Therefore the basic computational complexity degree of DBS is linear time. The only way to increase DBS complexity above linear is repeating ambiguity. In natural language, however, repeating ambiguity is prevented by grammatical disambiguation. A classic example of a grammatical ambiguity is the 'garden path' sentence The horse raced by the barn fell. The continuation horse+raced introduces an ambiguity between horse which raced and horse which was raced, leading to two parallel derivation strands up to The horse raced by the barn. Depending on whether the continuation is punctuation or a verb, the two strands are grammatically disambiguated, resulting in unambiguous output. A repeated ambiguity occurs in The man who loves the woman who feeds Lucy who Peter loves., with who serving as subject or as object. These readings are grammatically disambiguated by continuing after who with a verb or a noun.
Themenfeld: Computerlinguistik
Wissenschaftsfach: Sprachwissenschaft
-
14 Pepper, S. ; Arnaud, P.J.L.: Absolutely PHAB : toward a general model of associative relations.
In: ¬The Mental Lexicon. 15(2020) no.1, S.101-122.
Abstract: There have been many attempts at classifying the semantic modification relations (R) of N + N compounds but this work has not led to the acceptance of a definitive scheme, so that devising a reusable classification is a worthwhile aim. The scope of this undertaking is extended to other binominal lexemes, i.e. units that contain two thing-morphemes without explicitly stating R, like prepositional units, N + relational adjective units, etc. The 25-relation taxonomy of Bourque (2014) was tested against over 15,000 binominal lexemes from 106 languages and extended to a 29-relation scheme ("Bourque2") through the introduction of two new reversible relations. Bourque2 is then mapped onto Hatcher's (1960) four-relation scheme (extended by the addition of a fifth relation, similarity, as "Hatcher2"). This results in a two-tier system usable at different degrees of granularity. On account of its semantic proximity to compounding, metonymy is then taken into account, following Janda's (2011) suggestion that it plays a role in word formation; Peirsman and Geeraerts' (2006) inventory of 23 metonymic patterns is mapped onto Bourque2, confirming the identity of metonymic and binominal modification relations. Finally, Blank's (2003) and Koch's (2001) work on lexical semantics justifies the addition to the scheme of a third, superordinate level which comprises the three Aristotelian principles of similarity, contiguity and contrast.
Inhalt: Vgl.: https://www.researchgate.net/publication/346023398_Absolutely_PHAB_Toward_a_general_model_of_associative_relations. DOI: 10.1075/ml.00016.pep.
Themenfeld: Computerlinguistik ; Wissensrepräsentation
-
15 Pepper, S.: ¬The typology and semantics of binominal lexemes : noun-noun compounds and their functional equivalents.
Oslo : University of Oslo / Faculty of Humanities / Department of Linguistics and Scandinavian Studies, 2020. IX, 515 S.
Abstract: The dissertation establishes 'binominal lexeme' as a comparative concept and discusses its cross-linguistic typology and semantics. Informally, a binominal lexeme is a noun-noun compound or functional equivalent; more precisely, it is a lexical item that consists primarily of two thing-morphs between which there exists an unstated semantic relation. Examples of binominals include Mandarin Chinese 铁路 (tielù) [iron road], French chemin de fer [way of iron] and Russian железная дорога (zeleznaja doroga) [iron:adjz road]. All of these combine a word denoting 'iron' and a word denoting 'road' or 'way' to denote the meaning railway. In each case, the unstated semantic relation is one of composition: a railway is conceptualized as a road that is composed (or made) of iron. However, three different morphosyntactic strategies are employed: compounding, prepositional phrase and relational adjective. This study explores the range of such strategies used by a worldwide sample of 106 languages to express a set of 100 meanings from various semantic domains, resulting in a classification consisting of nine different morphosyntactic types. The semantic relations found in the data are also explored and a classification called the Hatcher-Bourque system is developed that operates at two levels of granularity, together with a tool for classifying binominals, the Bourquifier. The classification is extended to other subfields of language, including metonymy and lexical semantics, and beyond language to the domain of knowledge representation, resulting in a proposal for a general model of associative relations called the PHAB model. The many findings of the research include universals concerning the recruitment of anchoring nominal modification strategies, a method for comparing non-binary typologies, the non-universality (despite its predominance) of compounding, and a scale of frequencies for semantic relations which may provide insights into the associative nature of human thought.
Inhalt: Vgl.: https://www.researchgate.net/publication/345312044_The_typology_and_semantics_of_binominal_lexemes_Noun-noun_compounds_and_their_functional_equivalents. DOI: 10.13140/RG.2.2.24009.36967.
Anmerkung: PhD thesis.
Themenfeld: Computerlinguistik
-
16 Geißler, S.: Maschinelles Lernen und NLP : Reif für die industrielle Anwendung!.
In: Information - Wissenschaft und Praxis. 70(2019) H.2/3, S.134-140.
Abstract: Applications of machine learning (ML) have recently achieved spectacular breakthroughs in a whole range of tasks in natural language processing (NLP). The focus of much of this work lies in developing ever better models, while the share of tasks in practical projects that concerns not model building but topics such as data provisioning as well as evaluation, maintenance, and deployment of models often does not yet receive sufficient attention. As a result, especially companies that do not have the capacity to design their own platforms for deploying ML and NLP often lack suitable tools and best practices. It is becoming apparent that over the coming months an engineering perspective on ML and its use in the enterprise, devoted precisely to these practical questions, will gain in importance.
Inhalt: Vgl.: https://doi.org/10.1515/iwp-2019-2007.
Themenfeld: Computerlinguistik
Wissenschaftsfach: Sprachwissenschaft
-
17 Rötzer, F.: Kann KI mit KI generierte Texte erkennen? [13. März 2019].
In: https://www.heise.de/tp/features/Kann-KI-mit-KI-generierte-Texte-erkennen-4332657.html?view=print.
(Telepolis)
Abstract: OpenAI has reportedly not fully released a text-generation algorithm because it is said to be so good that it enables misuse and deception. The AI company OpenAI, founded by Elon Musk and Peter Thiel among others, had declared in February that it had developed what is allegedly the most advanced language-processing algorithm. The algorithm was trained merely on 40 gigabytes of text, or about 8 million web pages, to predict the next word in a given passage of text. With this, it is said, coherent, meaningful texts can be generated that satisfy many requirements; in addition, rudimentary reading comprehension, question answering, summarization, and translation can be produced without having been trained for these tasks.
Inhalt: Vgl.: http://www.heise.de/-4332657.
Themenfeld: Computerlinguistik
Wissenschaftsfach: Informatik
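The model discussed above is GPT-2, which OpenAI later released in full; a minimal sketch of its next-word-prediction objective used for generation via the Hugging Face transformers library (the "gpt2" checkpoint below is the small released variant, not the withheld large model described in the article).

```python
# Sketch: the released (smaller) GPT-2 model generating text by repeatedly
# predicting the next token, as described above.
# Assumes: pip install transformers torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("Language models can be misused to", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=50,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```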
-
18 Doval, Y. ; Gómez-Rodríguez, C.: Comparing neural- and N-gram-based language models for word segmentation.
In: Journal of the Association for Information Science and Technology. 70(2019) no.2, S.187-197.
Abstract: Word segmentation is the task of inserting or deleting word boundary characters in order to separate character sequences that correspond to words in some language. In this article we propose an approach based on a beam search algorithm and a language model working at the byte/character level, the latter component implemented either as an n-gram model or a recurrent neural network. The resulting system analyzes the text input with no word boundaries one token at a time, which can be a character or a byte, and uses the information gathered by the language model to determine if a boundary must be placed in the current position or not. Our aim is to use this system in a preprocessing step for a microtext normalization system. This means that it needs to effectively cope with the data sparsity present in this kind of text. We also strove to surpass the performance of two readily available word segmentation systems: the well-known and accessible Word Breaker by Microsoft, and the Python module WordSegment by Grant Jenks. The results show that we have met our objectives, and we hope to continue to improve both the precision and the efficiency of our system in the future.
Inhalt: Vgl.: https://onlinelibrary.wiley.com/doi/10.1002/asi.24082.
Themenfeld: Computerlinguistik
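For orientation, the WordSegment baseline named in the abstract can be called as follows; the authors' own system replaces such fixed unigram/bigram counts with a character-level n-gram or recurrent language model inside a beam search over candidate boundaries.

```python
# Sketch: the WordSegment baseline mentioned above restores word boundaries
# from unsegmented text.  Assumes: pip install wordsegment
from wordsegment import load, segment

load()                                    # load the unigram/bigram counts
print(segment("thisisatestofwordsegmentation"))
# e.g. ['this', 'is', 'a', 'test', 'of', 'word', 'segmentation']
```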
-
19 Voss, O.: Übersetzer überflüssig? : Sprachsoftware DeepL und Acrolinx. [07.02.2019].
In: https://www.tagesspiegel.de/wirtschaft/sprachsoftware-deepl-und-acrolinx-uebersetzer-ueberfluessig/23884348.html.
Abstract: German language software is better than Google's. Even professional translators are already discussing whether they are becoming obsolete.
Themenfeld: Computerlinguistik
Objekt: DeepL ; Acrolinx
-
20 Lu, C. ; Bu, Y. ; Wang, J. ; Ding, Y. ; Torvik, V. ; Schnaars, M. ; Zhang, C.: Examining scientific writing styles from the perspective of linguistic complexity : a cross-level moderation model.
In: Journal of the Association for Information Science and Technology. 70(2019) no.5, S.462-475.
Abstract: Publishing articles in high-impact English journals is difficult for scholars around the world, especially for non-native English-speaking scholars (NNESs), most of whom struggle with proficiency in English. To uncover the differences in English scientific writing between native English-speaking scholars (NESs) and NNESs, we collected a large-scale data set containing more than 150,000 full-text articles published in PLoS between 2006 and 2015. We divided these articles into three groups according to the ethnic backgrounds of the first and corresponding authors, obtained by Ethnea, and examined the scientific writing styles in English from a two-fold perspective of linguistic complexity: (a) syntactic complexity, including measurements of sentence length and sentence complexity; and (b) lexical complexity, including measurements of lexical diversity, lexical density, and lexical sophistication. The observations suggest marginal differences between groups in syntactical and lexical complexity.
Inhalt: Vgl.: https://onlinelibrary.wiley.com/doi/10.1002/asi.24126.
Themenfeld: Computerlinguistik
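Two of the complexity measures named above, mean sentence length as a syntactic measure and lexical density as a lexical one, can be sketched with spaCy as follows; the operationalizations used in the study itself are more detailed.

```python
# Sketch: mean sentence length (syntactic complexity) and lexical density
# (lexical complexity) computed with spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
CONTENT = {"NOUN", "VERB", "ADJ", "ADV"}   # content-word tags for lexical density

def complexity(text):
    doc = nlp(text)
    words = [t for t in doc if t.is_alpha]
    sents = list(doc.sents)
    return {
        "mean_sentence_length": len(words) / len(sents),
        "lexical_density": sum(t.pos_ in CONTENT for t in words) / len(words),
    }

print(complexity("We collected a large corpus. The articles were analyzed carefully."))
```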