Background

JMIR Med Inform

medinform

JMIR Medical Informatics

JMIR Med Inform

2291-9694

JMIR Publications

Toronto, Canada

v14i1e86965

10.2196/86965

Original Paper

Advancing Alzheimer Disease Prediction With Large Language Model–Based Linguistic Feature Analysis: Development and Validation Study

Hsu

Ming-Hsia

MS12Hwang

San-Yih

PhD1Tsai

Yi-Hang

MS1Chang

Yun-Chi

MS1Liang

Chih-Kuang

PhD345Chang

Chiung-Yun

MS2

Department of Information Management, National Sun Yat-sen University

No. 70, Lienhai Rd

Kaohsiung

TaiwanDepartment of Information Systems, Kaohsiung Municipal United Hospital

Kaohsiung

TaiwanDivision of Neurology, Kaohsiung Veterans General Hospital

Kaohsiung

TaiwanCenter for Geriatrics and Gerontology, Kaohsiung Veterans General Hospital

Kaohsiung

TaiwanCenter for Healthy Longevity and Aging Sciences, National Yang Ming Chiao Tung University

Taipei

Taiwan

Sen

Anando

Baladhandapani

Arunadevi

Matos

David Manuel Martins de

Shah

Namra Bhadreshkumar

Correspondence to San-Yih Hwang, PhD, Department of Information Management, National Sun Yat-sen University, No. 70, Lienhai Rd, Kaohsiung, 804201, Taiwan, +886-7-5252000 ext 4723; syhwang@mis.nsysu.edu.tw

2026

2852026

e86965

021120252004202621042026

2026

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.

Background

Alzheimer disease (AD) is a progressive neurodegenerative disorder with rapidly growing global prevalence. Early detection is critical for timely intervention; yet, conventional diagnostic methods remain costly and invasive. Speech-based assessment has emerged as a noninvasive alternative, as AD characteristically impairs linguistic abilities including fluency, coherence, and informational content. Recent advances in large language models (LLMs) offer new opportunities to extract structured linguistic features from transcribed speech for automated AD classification. However, existing LLM-based approaches often lack transparency and clinical interpretability, limiting their adoption in clinical workflows.

Objective

This study aims to investigate the influence of linguistic features extracted from transcribed speech, as analyzed by LLMs, on the accuracy and interpretability of AD prediction.

Methods

We propose a framework that leverages LLMs to analyze linguistic features extracted from transcribed speech for AD classification. Our approach focuses on 4 key aspects, including readability, fluency, richness of detail, and keyword relevance. To enhance classification accuracy, the framework integrates transcript embeddings with feature explanation embeddings, forming a comprehensive linguistic representation. We conducted extensive ablation studies to evaluate the contributions of individual features and benchmarked our framework against existing LLM-driven methodologies through pairwise explainability evaluations. Output stability was assessed across 3 independent pipeline runs. A fully local configuration (Llama 3 8B + nomic-embed-text) was tested to evaluate privacy-preserving deployment feasibility. Explainability was assessed via LLM-based pairwise comparison (Gemini-3.1-flash-lite) against the method of Bang et al across 54 correctly classified cases and by blinded evaluation from 2 neurologists.

Results

The proposed framework achieved a mean precision of 91.52%, a sensitivity of 91.08%, a specificity of 96.29%, and F₁-score of 91.05% across 3 independent runs on the ADReSSo 2021 dataset, outperforming existing LLM-based approaches. A fully-local configuration (Llama 3 8B+nomic-embed-text, requiring no cloud application programming interface access) achieved an F₁-score of 81.58%, demonstrating framework transferability to privacy-preserving deployment environments. Keyword relevance was the most influential feature (F₁-score drop of 13.22 pp when removed). Explainability evaluations showed our method was preferred in 49 out of 54 cases via Gemini-3.1-flash-lite, with human experts preferring our method in 89 of 108 blinded assessments.

Conclusions

These findings highlight that a structured linguistic feature analysis using LLMs provides a robust and interpretable framework for preliminary AD detection. Our approach offers a scalable and accessible solution that bridges artificial intelligence–driven text analysis with clinical applications, supporting early detection of cognitive decline through noninvasive assessment methods.

alzheimer diseaselarge language modelslinguistic featuresprompt engineeringearly detection

IntroductionBackground

Alzheimer disease (AD) is a progressive neurodegenerative disorder that primarily affects older adults. The global prevalence of AD is significant, as the number of people living with dementia is projected to increase from 55 million in 2019 to 139 million by 2050, according to the 2023 World Health Organization (WHO) report [1]. In the United States alone, as of 2024, approximately 6.9 million individuals aged 65 years and older are affected by Alzheimer dementia [2,3]. These alarming figures highlight the critical need for early detection and intervention strategies to improve patient outcomes, optimize treatment strategies, and reduce the burden on health care systems [4].

Early diagnosis of dementia offers substantial benefits, enabling individuals to make informed decisions, access essential services and treatments, and implement preventive measures to maintain safety and quality of life [5,6]. Patients, caregivers, and researchers highly value early detection due to its role in supporting life planning, improving end-of-life care, and facilitating research on disease pathology before advanced neuronal damage occurs [5,7]. Furthermore, early diagnosis may promote lifestyle modifications that could potentially delay or prevent the onset of AD, offering a proactive approach to health management [7].

Although neuroimaging and cerebrospinal fluid biomarkers are considered the gold standard for early detection, their limitations—including high costs, limited availability, and invasive procedures—create accessibility challenges, particularly in resource-limited settings [8,9]. Plasma biomarkers are emerging as promising, noninvasive alternatives that could improve accessibility and reduce wait times for specialized care [8,9]. However, they also face limitations, including variability in results, lower specificity compared to traditional biomarkers, and standardization challenges [10,11]. Additionally, dementia risk scores, such as Cardiovascular Risk Factors, Aging, and Dementia (CAIDE) and the Australian National University Alzheimer Disease Risk Index (ANU-ADRI), provide valuable support for early identification, though ongoing refinement is required for broader clinical application [12].

Integrating routine cognitive assessments in primary care can further enhance early detection rates, promoting proactive dementia management and improved patient outcomes [13-15]. Recent technological developments, including computerized adaptive testing (CAT), enhance the flexibility of digital assessments by tailoring testing experiences based on individual cognitive capabilities [16,17]. Additionally, digital cognitive assessments enable the discovery and longitudinal monitoring of novel digital biomarkers outside traditional clinical settings [18,19]. These tools offer significant advantages over traditional paper-based methods in both clinical practice and research, demonstrating increased sensitivity in detecting subtle cognitive changes that conventional assessments might overlook [18-20]. Digital assessments also provide greater reliability and validity through repeated measurements across multiple days, yielding more consistent intra- and interparticipant data compared to single-time-point traditional assessments [20,21].

The accessibility of digital cognitive assessments is another key advantage, as they allow remote self-administration, reducing the need for in-person clinic visits and improving access for underserved populations [18,22]. Additionally, these digital methods offer high ecological validity by enabling frequent, brief assessments in real-world settings, supporting a patient-centered approach, and generating data that better reflects daily cognitive functioning [20]. Their cost-effectiveness and efficiency make them particularly suitable for large-scale studies and clinical trials requiring scalable solutions [18,22].

AD is increasingly recognized for its impact on cognitive function, particularly memory impairment as an early hallmark. However, language impairment is also a notable early symptom that significantly hinders communication [23]. Studies indicate that integrating language assessments into conventional cognitive evaluations enhances the precision of AD progression prediction [24]. As cognitive decline advances, linguistic tasks become more challenging, highlighting the importance of understanding language dynamics in this patient population. Moreover, the distinct relationship between language deficits and cognitive deterioration underscores the potential of linguistic measures as critical markers for assessing the progression from mild cognitive impairment (MCI) to AD [25]. Recent findings suggest that language assessments can not only identify individuals at greater risk for developing AD but also aid in monitoring the severity of language impairments as the disease progresses [26]. Incorporating language assessment into routine cognitive evaluations could improve early detection strategies, enabling targeted interventions designed to help maintain communication abilities. Thus, recognizing and addressing language impairments in early AD stages could play a pivotal role in patient care and management.

In recent years, deep learning-based approaches have gained traction for automating feature extraction by learning complex representations from speech data. Various speech embeddings, such as VGGish, X-vectors, and Wav2Vec [27], along with language embeddings like Bidirectional Encoder Representations from Transformers (BERT), Robustly Optimized BERT Pretraining Approach (RoBERTa), and GPT, have been used to capture rich acoustic and linguistic information for AD detection [28-31]. Moreover, studies demonstrate that combining acoustic and linguistic features enhances AD detection performance by integrating multiple aspects of speech and language affected by cognitive decline [32-34].

Despite the strong performance of deep learning-based approaches, they face significant challenges regarding explainability and interpretability. A systematic review conducted by Shi et al [31] analyzed 72 studies and found that most only provided selective examples for their deep learning models, with very few explicitly addressing explainability. However, in clinical settings, there is a growing need for simpler yet interpretable methods for preliminary AD screening. Recent developments in large language models (LLMs) have demonstrated promise in analyzing speech transcripts from the Cookie Theft picture description task, with studies reporting accuracies of 80.3% using GPT-3 [35] (OpenAI). Notably, Bang et al [36] developed a method incorporating LLM-driven explainability, achieving 85.92% accuracy using GPT-4 (OpenAI).

In this research, we expand on the explainability capabilities of LLMs and systematically refine their prompt design. We guide the LLM to evaluate 4 key linguistic dimensions—readability, fluency, richness of detail, and keyword relevance—before making a diagnosis. This approach enhances both accuracy and interpretability. Our proposed framework also introduces a structured prompt template that directs LLMs in systematic linguistic analysis. Our design includes (1) expert role definition as a speech therapist, (2) specific task instructions for feature analysis, (3) standardized feature definitions, and (4) exemplar demonstrations.

Using Whisper (OpenAI) for speech-to-text conversion and GPT-4 with our specialized prompt template, our framework addresses a binary classification task—distinguishing individuals with AD from cognitively control normal (CN)—and achieved around 91% precision and recall, with 96% specificity on the ADReSSo 2021 dataset, demonstrating that structured linguistic analysis alone can support reliable preliminary AD screening. Key contributions of this work include:

Proposing a novel framework integrating LLM-based linguistic analysis with structured feature evaluation and prompt engineering for generating explainable AD diagnosis.

Developing an AD classifier that incorporates both transcript content and LLM-explained features to enhance prediction accuracy.

Providing empirical evidence of stable and competitive diagnostic accuracy (F₁-score=91.05%; sensitivity=91.08%; specificity=96.29%) on the ADReSSo 2021 Challenge dataset using speech transcripts from the Cookie Theft picture description task [36].

Demonstrating superior explainability through a structured multifeature framework, significantly outperforming an existing approach in diagnostic reasoning, evidence support, and clinical insight, winning 49 out of 54 pairwise evaluations via Gemini-3.1-flash-lite.

Related Work

Before the advent of ChatGPT (OpenAI) in late 2022, most deep learning-based research for AD detection predominantly used BERT as the underlying machine learning framework. In these studies, linguistic features and language embeddings served as inputs to BERT. Subsequent studies adopted more advanced artificial intelligence (AI) models, such as GPT-3 and GPT-4, for AD detection. Below, we review these studies and highlight their strengths and weaknesses.

AD Detection With BERT

BERT is a transformer-based pretrained language model introduced by Google researchers in 2018 [37]. It represents text as a sequence of vectors, which can then be used to train a classification model. The application of BERT for direct transcript embedding has shown promising results in several studies. For instance, Padhee et al [38] applied BERT to raw transcriptions, achieving an F₁-score of 80% for classifying patients with AD, MCI, and, while Rohanian et al [39] highlighted BERT’s adaptability to variations in transcription quality.

For feature-enriched embedding approaches, researchers have enhanced BERT’s capabilities by incorporating additional linguistic features. Mahajan and Baths [40] improved multimodal classification by integrating lexical diversity and syntactic complexity features with BERT embeddings. Qiao et al [41] focused on model explainability by combining fluency and disfluency features, while Yuan et al [42] examined semantic similarity and information density as complementary inputs to increase diagnostic accuracy.

Despite these successes, BERT-based methods face key challenges; direct embedding approaches lack direct clinical explainability, while feature-enriched methods introduce additional complexity and require domain expertise.

AD Detection With Advanced LLM

The emergence of more advanced AI models, such as GPT-3 and GPT-4, has transformed AD detection through innovative approaches to linguistic feature extraction and analysis. Compared to earlier models like BERT, these LLMs excel at classification tasks, even with minimal or no additional training data, while also providing textual explanations.

Agbavor and Liang [35] pioneered the use of GPT-3 for dementia prediction from spontaneous speech, leveraging text embeddings to capture semantic meaning and achieving 80.3% accuracy in distinguishing between patients with AD and CNs. Wang et al [43] expanded this research by exploring GPT-4’s capabilities for MCI screening, analyzing linguistic indicators via standardized prompts, and achieving 77.3% sensitivity and 83.3% specificity.

Bang et al [36] introduced a novel methodology by using GPT-4 for speech fluency evaluation, integrating AI-generated opinions with original text, and achieving 85.92% accuracy and 94.44% specificity. Additionally, Balamurali and Chen [44] conducted a comparative analysis of multiple LLMs in a zero-shot learning context, highlighting both the potential and the limitations of LLM technology in clinical settings at the time of this writing.

These developments illustrate the progression from basic text analysis to sophisticated diagnostic tools, while also emphasizing the need for standardized prompt engineering and clinical validation. This growing body of research suggests that well-structured LLM approaches could provide valuable support for preliminary AD screening while maintaining interpretability for health care professionals.

However, these LLM-based approaches face two major limitations:

Lack of transparency and interpretability: many LLM-driven approaches do not clearly specify the evaluation process. For instance, while Bang et al [36] used GPT-4 to assess speech fluency, they did not define which fluency aspects were being measured. Similarly, Agbavor and Liang’s [35] approach using GPT-3 embeddings lacked clarity regarding the specific semantic features captured.

Absence of a structured analysis framework: previous studies varied significantly in their methodological approaches, lacking a standardized framework for LLM-based analysis. Although Balamurali and Chen [44] specified multiple linguistic aspects for evaluation, their LLM-based approach remained exploratory, without establishing scoring criteria or standardization. This lack of methodological structure limits the clinical applicability of these models.

These limitations underscore the need for a systematic and interpretable approach to LLM-based AD detection, one that integrates structured feature evaluation with explainability. In this work, we address these challenges by developing a framework that enhances transparency, interpretability, and clinical relevance.

MethodsOverview

Before introducing our methodology, we first describe the dataset used in this research. Our approach begins with extracting AD-relevant linguistic features using an LLM. We meticulously design prompts to guide the LLM in analyzing linguistic patterns associated with AD. The resulting feature explanations, along with their corresponding transcripts, are then used to construct an AD classifier. Detailed descriptions are provided in the following subsections.

Dataset

The primary dataset used in this study is the ADReSSo 2021 Challenge corpus [45], derived from DementiaBank [46]. This dataset contains audio recordings of participants describing the Cookie Theft picture from the Boston Diagnostic Aphasia Examination (BDAE) [47], a standardized task widely used in cognitive assessments. The dataset has been carefully balanced to address common demographic biases in medical datasets, ensuring matched distributions of age and gender between groups.

The dataset comprises 166 training samples, including 87 AD cases and 79 CNs, and 71 test samples, including 35 AD cases. All audio recordings underwent preprocessing to ensure consistency. The dataset’s balanced structure mitigates the common issue of group imbalance in clinical datasets, making it particularly suitable for developing and evaluating AD classification models.

The Framework

The overall framework of our AD prediction method is illustrated in Figure 1.

Figure 1.

The framework of the proposed method that incorporates both transcript and its feature explanations for Alzheimer disease (AD) prediction. AD: Alzheimer disease; CN: control normal; LLM: large language model; MLP: multilayer perceptron.

The process consists of the following steps.

Speech-to-text conversion performed by Whisper.

Speaker attribution to extract the participant’s description from the transcript using an LLM-based approach.

The first 2 preprocessing steps are described in the Preprocessing subsection. The examiner-free transcript is then analyzed by an LLM, which generates feature-based explanations across 4 linguistic categories. These features are detailed in the AD Related Linguistic Feature section. Both the transcript and its corresponding explanations are converted into n-dimensional vector representations. These embeddings are then concatenated into a 2n-dimensional vector, integrating both semantic content and feature-based assessment. A multilayer perceptron (MLP) classifier processes the final representation to classify participants into AD or CN groups.

Preprocessing

Our preprocessing pipeline consists of 2 essential steps to derive transcripts from the dataset.

Speech-to-text conversion: all audio recordings were transcribed using the Whisper automatic speech recognition (ASR) system [48].

Speaker attribution: since the raw transcripts contain dialogues between examiners and participants, this step aims to isolate the participant’s descriptions by removing examiner interventions (eg, “What’s happening in that picture?”). We compared 2 methods: an LLM-based speaker attribution method and a conventional speaker diarization method using PyAnnote (pyannoteAI). Experimental results on the ADReSSo 2021 dataset show that the LLM-based speaker attribution method yields better performance (F₁-score=0.82) compared to the conventional speaker diarization method (F₁-score=0.92). We observe that the speaker diarization method using PyAnnote [49] was prone to failure or speaker misidentification in low-volume recordings and cases where examiner and participant voices were acoustically similar, resulting in incomplete or contaminated transcripts. The LLM-based method, by contrast, leverages semantic content to reliably identify and remove examiner turns regardless of audio quality, preserving richer participant descriptions. Based on these results, the LLM-based approach was adopted for subsequent analyses. Readers are referred to Multimedia Appendix 1 for the detailed prompt design.

AD-Related Linguistic Feature Analysis With LLM

This study identifies 4 key linguistic features—readability, fluency, richness of detail, and keyword relevance—as critical for detecting AD from speech. These features were derived from a comprehensive review of linguistic studies and correspond to established categories in speech and language processing, including syntactic, semantic, lexical, disfluency, and pragmatic features. Each feature captures a distinct aspect of language impairment observed in AD, offering a structured framework for assessing cognitive decline. Table 1 summarizes these features, their relevant linguistic categories, primary focus, quantification metrics, and key references.

Table 1.

Linguistic features for Alzheimer disease (AD) detection.

Feature	Relevant linguistic category	Core focus	Metrics	Key references
Readability	Syntactic and lexical	Evaluating syntactic complexity, lexical diversity, and discourse coherence to reflect the organization of speech.	Lexical diversity (Type-Token Ratio; TTR) Syntactic complexity (dependency parsing) Discourse coherence (Coh-Metrix indices)	[38,40-42,50,51]
Fluency	Disfluencies	Measuring smoothness and flow of language, including hesitations, filled pauses, and repetitions.	Pause frequency and duration Speech rate Self-repair rates	[39,40,52-54]
Richness of detail	Semantic and pragmatic	Assessing the density and specificity of meaningful content in descriptions.	Proportion of information-bearing nouns and verbs Content density (Latent Semantic Analysis) Ratio of semantically empty words	[42,55,56]
Keywordrelevance	Semantic	Evaluating alignment of spoken content with predefined key elements in a given context (eg, the Cookie Theft image).	Frequency of target keywords Semantic similarity (cosine similarity and topic modeling) Neural attention weights for key terms.	[54,57,58]

Readability encompasses syntactic complexity, lexical diversity, and discourse coherence, all of which are essential for evaluating the organization and comprehensibility of speech. Studies indicate that patients with AD often produce grammatically simplified sentences, shorter sentence segments, and less coherent discourse, reflecting a decline in their ability to construct complex and information-dense narratives [40,41,50]. Traditional readability assessments rely on dependency parsers and Coh-Metrix indices, but LLMs provide a holistic alternative by integrating syntactic and lexical patterns into a unified framework [42,51].

Fluency captures the smoothness and temporal flow of speech, focusing on pauses, hesitations, and repetitions. Patients with AD frequently exhibit disfluencies due to word retrieval challenges and sentence formulation difficulties, which signal cognitive decline [55,56]. Common metrics include speech rate, pause frequency, and self-repair rates. LLMs enhance fluency analysis by automatically detecting patterns within transcripts, offering a scalable and nuanced approach to fluency evaluation [40,52].

Richness of detail refers to the density of meaningful and specific information within a description. Studies indicate that patients with AD tend to provide fewer information-bearing propositions, relying on vague or semantically empty words [55,56]. This reflects impairments in semantic memory and access to stored knowledge. Metrics such as latent semantic analysis and content density analysis quantify these deficits. LLMs enable dynamic evaluations by assessing the use of descriptive language and narrative coherence, complementing traditional feature extraction methods [42].

Keyword relevance evaluates the degree to which spoken content aligns with predefined key elements of a context or scenario, such as the Cookie Theft picture used in our study. The patients with AD often omit critical objects or actions, reflecting impairments in lexical retrieval and semantic memory [55,56]. Neural attention models have shown that diagnostically relevant keywords (eg, “sink,” “water,” and “cookie”) appear significantly less frequently in the descriptions of patients with AD[57]. LLMs improve keyword analysis by leveraging attention mechanisms to quantify the inclusion and contextual relevance of key terms [58].

These 4 linguistic features collectively address syntactic, lexical, semantic, and pragmatic aspects of AD-related language impairments. Traditional methods rely heavily on manual feature engineering and domain-specific tools, whereas LLMs provide an automated and holistic approach by integrating structured, semantic, and pragmatic analyses via carefully designed prompts. This LLM-driven approach provides a scalable and interpretable method for analyzing speech, aligning with recent advancements in biomedical informatics for dementia diagnosis [55,56,59].

Prompt-Template Design

We developed a structured prompt template to facilitate consistent and comprehensive LLM-based analysis of Cookie Theft picture descriptions. The complete template is provided in Multimedia Appendix 1. The template is designed hierarchically, consisting of 4 key components (illustrated in Figure 2).

Figure 2.

The hierarchical structure of the prompt template, illustrating the relationships between components and their respective subelements. AD: Alzheimer disease.

The 4 key components are:

Role definition: establishes the expert context to ensure that the LLM provides clinically relevant responses [60]. This component positions the LLM as a speech and language therapist with expertise in identifying language dysfunctions in individuals with cognitive impairment.

Task specification: provides analysis instructions and scoring guidelines to ensure structured evaluations. The analysis instructions explain that language dysfunctions in AD often arise from compromised semantic and pragmatic processing abilities and that the Cookie Theft picture description task is specifically designed to assess cognitive function and memory. The scoring guidelines direct the LLM to evaluate each feature on a standardized 1‐7 scale, where higher scores indicate better cognitive function. These guidelines emphasize the importance of providing detailed explanations and specific evidence from the transcript to support each feature assessment.

Feature definition: features to be evaluated include readability (syntactic complexity and comprehensibility), fluency (speech smoothness and coherence), richness of detail (information density and specificity), and keyword relevance (inclusion of essential elements from the Cookie Theft picture). Each feature definition includes brief descriptions of how these linguistic elements typically manifest in patients with AD compared to healthy individuals.

Exemplar demonstrations: the template includes 2 contrastive examples to calibrate the analysis, including a control example, generated from a participant with high cognitive function, and an AD example, produced by a participant with AD. The control example consists of a detailed, well-structured description demonstrating high scores across all features, whereas the AD example involves a description showing typical patterns of cognitive decline, characterized by fragmented expression and limited detail. Refer to Multimedia Appendix 1 for the full prompt and an example of AD-related linguistic feature analysis generated by the LLM.

Ethical Considerations

This study uses the publicly available ADReSSo 2021 dataset from the DementiaBank corpus, which does not involve direct patient contact or new clinical data collection. Institutional Review Board (IRB) approval (25-CT1-02[241015-2]) was obtained through an expedited review at Kaohsiung Veterans General Hospital, covering the methodological framework development and serving as a prerequisite for the associated clinical pilot study. No additional ethics approval was required for use of the publicly available dataset.

ResultsConfiguration

We developed a multicomponent framework for AD detection, with each component specifically configured. We adopted GPT-4 [61] as the large language model in our experiments. For the embedding process, we used OpenAI’s text-embedding-ada-002 [61] model to generate embeddings from both transcript content and feature explanations. Each component produced a 1536-dimensional vector, resulting in a concatenated 3072-dimensional vector for the subsequent classification task. For the classification model, we implemented a 2-layer MLP architecture. The first layer transformed the 3072-dimensional input into 512 hidden units with rectified linear unit activation and dropout (rate=0.1), followed by a second layer outputting binary classification probabilities. The model was trained using the Adam optimizer [62], a batch size of 8, a learning rate of 0.01, and ran for 50 epochs. We used 5-fold cross-validation for model evaluation, with each fold maintaining balanced AD/Control ratios, using the 176 training samples of the ADReSSo 2021 dataset. Cross-entropy loss was used as the optimization criterion.

Performance Results

The proposed framework was benchmarked against 3 comparative approaches that used LLMs for AD detection from speech transcripts using the ADReSSo 2021 dataset.

We compared against 2 published methods, including Agbavor and Liang [35], who derived semantic representations using GPT-3 embeddings, and Bang et al [36], who combined GPT-4-based fluency assessments with transcript embeddings.

To assess output stability under the inherent stochasticity of LLM-based feature generation, we conducted 3 independent runs of the full pipeline using identical configurations and report the mean results, which are shown in Table 2. The mean F₁-score of 91.05% and mean specificity of 96.29% demonstrate consistent performance across runs. Importantly, the mean F₁-score exceeds that reported in Bang et al [36] (85.80%) by 5.25 percentage points, confirming that the framework maintains its advantage over existing methods even under stochastic variation.

Table 2.

Performance comparison of different LLM^a-based approaches for AD^b detection.

Research	LLM	Approach	F₁-score%	Accuracy%	Precision%	Sensitivity%	Specificity%
Agbavor.and Liang(2022) [35]	GPT3	Semantic embeddings	—^c	80.3	80.6	80.6	—
Bang et al(2024) [36]	GPT4	Fluency assessment+ embeddings	85.80	85.92	86.94	85.92	94.44
Our proposed framework	GPT4	Multifeature embeddings + MLP^d	91.05	91.08	91.52	91.08	96.29
Our proposed frameworkLlama 3 8B+nomic-embed (fully local)	Llama 3	Multifeature embeddings + MLP(fully local, no cloud API^e)	81.58	81.69	82.30	81.69	88.89

^aLLM: large language model.

^bAD: Alzheimer disease.

^cNot applicable.

^dMLP: multilayer perceptron.

^eAPI: application programming interface.

To evaluate feasibility for privacy-sensitive clinical environments where cloud-based application programming interfaces may not be permissible, we tested a fully local configuration using Llama 3 (8B) [63] for feature extraction and nomic-embed-text [64] for embedding generation, requiring no cloud application programming interface access. This configuration achieved an F₁-score of 81.58% and a specificity of 88.89%, as shown in the last row in Table 2. The performance gap relative to the proposed framework’s mean (9.47 pp=91.05% mean F₁-score) highlights the capability limitations of the fully local configuration, specifically the LLM and embedding model used, namely, Llama 3 (8B) and nomic-embed-text. Note that in the future, as more capable open-source LLMs and embedding models mature, the performance of fully local deployments is expected to improve substantially.

Compared to previous LLM-based methods, our framework demonstrates substantial improvements. The enhancement over Bang et al [36], which also uses GPT-4, is particularly notable, with improvements of approximately 5 percentage points in accuracy and F₁-score and 1.9 percentage points in specificity. This improvement can be attributed to our systematic integration of 4 distinct linguistic features (readability, fluency, richness of detail, and keyword relevance) and the combination of transcript and feature explanation embeddings.

The mean specificity (96.29%) achieved by our framework is particularly significant in clinical contexts, as it indicates strong capability to correctly identify non-AD cases, thereby reducing false positives in preliminary screening scenarios. This specificity exceeds Bang et al [36] (94.44%), highlighting the clinical utility of our structured multifeature approach for reducing unnecessary follow-up.

Ablation Study

To assess the contribution of individual components in our framework, we conducted comprehensive ablation experiments across 3 key aspects, namely the impact of individual features, the effect of embedding combinations, and the consequence of prompt design. All ablation experiments were conducted using a single fixed pipeline run to ensure that observed performance differences reflect the contribution of each component rather than run-to-run stochastic variation. The results are shown in Table 3.

Table 3.

Ablation study results: Impact of individual linguistic features on model performance.

Configuration	F₁-score	Accuracy	Precision	Sensitivity	Specificity
Full model	91.52	91.55	92.07	91.55	97.22
Without readability	85.90	85.92	86.02	85.92	88.89
Without fluency	88.73	88.73	88.73	88.73	88.89
Without richness of detail	84.50	84.51	84.53	84.51	86.11
Without keyword relevance	78.3	78.87	81.87	78.87	94.44

Individual Feature Impact

The full model (F₁-score=91.52%) in Table 3 represents the result of the same fixed run used across all ablation conditions, providing a controlled baseline for comparing the effect of each component removal. We evaluated the importance of each linguistic feature by removing them one at a time, with the results shown in Table 3. Our findings show that all features contributed greatly to the model’s performance, with keyword relevance being particularly crucial (dropping from 91.52% to 78.30% F₁-score when removed). The other features showed smaller yet still notable impacts: readability (F₁-score=85.90%), fluency (F₁-score=88.73%), and richness of detail (F₁-score=84.50%).

Embedding Combination Effect

Our approach uses both transcript and feature embeddings. We intend to investigate the impact of each embedding type separately. Our findings, shown in Table 4, reveal that the use of both text and feature embeddings significantly outperformed those using either type of embeddings alone. The full model achieved a 91.52% F₁-score, while using only text embeddings or feature embeddings yielded 77.44% and 57.70%, respectively. This demonstrates the complementary nature of these representations.

Table 4.

Impact of different embedding configurations and prompt structures on model performance.

Configuration	F₁-score	Accuracy	Precision	Sensitivity	Specificity
Full model	91.52	91.55	92.07	91.55	97.22
Embedding configuration
Text embedding only (1536 dim)	77.44	77.46	77.53	77.46	80.56
Feature embedding only (1536 dim)	57.70	57.75	57.74	57.75	61.11
Prompt structure
Without few-shot examples	85.86	85.92	86.36	85.92	91.67
Without feature definitions	85.80	85.92	86.94	85.92	94.44
Without AD^a relation descriptions	85.90	85.92	86.02	85.92	88.89

^aAD: Alzheimer disease.

Prompt Design Impact

Our approach involves a structured prompt with several key elements. In this experiment, we analyzed performance changes after removing specific prompt components. The results, as detailed in Table 4, show that the full structured prompt template significantly enhanced performance (F₁-score 91.52%), while removing few-shot examples (85.86%), feature definitions (85.80%), or AD relation descriptions (85.90%) led to performance declines.

DiscussionPrincipal Findings

To assess the explainability of our structured prompt design, we conducted a pairwise comparison with Bang et al [36], which also provides diagnostic explanations for each transcript. Pairwise comparison is a well-established evaluation method in which 2 alternatives are directly compared based on specific criteria. The comparison focused on 54 cases where both methods correctly classified the samples, ensuring a fair assessment that isolates the evaluation of explanation quality. The feature explanations used in this comparison were drawn from a single randomly selected pipeline run, ensuring that the evaluation reflects a representative output rather than a curated best-case result.

We used Gemini-3.1-flash-lite [65] as an independent judge, instead of GPT-4 used in our method, to avoid self-preference, following recent natural language processing practices. This approach enables a cost-efficient, systematic, and consistent evaluation across multiple samples, making it particularly well-suited for comparing natural language explanations [66]. We evaluated the explanations using 5 key criteria derived from explainable AI literature [67-71], namely diagnostic connection, which assesses how well linguistic features are linked to AD diagnosis [67]; evidence support, which examines the concrete evidence provided from the transcript [68]; clinical insight, which evaluates the value of insights for clinical assessment [69]; feature coverage, which measures the comprehensiveness of linguistic feature analysis [70]; and actionable information, which assesses the usefulness for health care professionals [71].

Table 5 reports the numbers of wins for the method of Bang et al [36], wins for our method, and ties, as evaluated by Gemini-3.1-flash-lite. It shows that our method significantly outperformed the approach of Bang et al [36] across all criteria, with our method preferred in 49 out of 54 cases. The most notable difference appeared in feature coverage, where our approach received 52 out of 54 preferences. These results highlight the strength of our structured prompt design, which systematically addresses multiple linguistic dimensions (readability, fluency, richness of detail, and keyword relevance), providing a more comprehensive and clinically relevant explanation compared to fluency-focused approaches.

Table 5.

Pairwise comparison results via Gemini-3.1-flash-lite.

Criteria	Wins for the method by Bang et al [36]^a	Wins for our method^a	Ties
Diagnostic connection	3	31	20
Evidence supports	16	38	0
Clinical insight	9	45	0
Feature coverage	2	52	0
Actionable information	5	49	0
Overall winner	5	49	0
Confidence level
High	9.26	90.74	0.00
Medium	0.00	0.00	0.00
Low	0.00	0.00	0.00

^aValues represent counts out of 54 pairwise comparisons.

Figure 3 illustrates a representative case comparison between the 2 explanation methods. In this example, the method of Bang et al [36] provides a general assessment of fluency, lacking structured analysis or explicit links to AD symptomatology. By contrast, our method offers a systematic evaluation across multiple linguistic features, with numerical scoring and specific observations linked to language dysfunction in AD.

Figure 3.

Comparison of explanation methods. AD: Alzheimer disease.

Human Expert Evaluation

To address potential concerns regarding LLM-based evaluation bias, we conducted an independent human evaluation involving 2 neurologists from Taiwan medical centers. The experts were presented with randomized explanations from both methods for the same 54 cases in a blinded fashion, without knowledge of which method generated each explanation.

Evaluator A assigned preference to our method in 53 (98.1%) cases and to the method of Bang et al [36] in 1 case, with no tied evaluations. Evaluator B assigned preference to our method in 36 (66.7%) cases and to the method of Bang et al [36] in 9 cases, with 9 tied evaluations. Combined results showed 89 total preferences for our method (53+36) versus 10 for the method of Bang et al [36] (1+9), with 9 ties. Such a difference between human annotators is not a surprise, as several research studies have demonstrated that human judgments are more diverse and variable [72-74]. Nevertheless, these results demonstrate strong interevaluator agreement favoring our structured multifeature approach, with Evaluator A showing near-unanimous preference and Evaluator B displaying greater variability while maintaining overall preference for our method.

The convergence between automated and human expert evaluations strengthens the validity of our explainability assessment and demonstrates that our structured multifeature approach provides substantially more interpretable and clinically relevant explanations for AD detection from speech.

Analysis of representative cases reveals that our structured approach explicitly connects linguistic observations to cognitive impairment patterns typical of AD. For example, in case “Adrsdt19,” our method identified specific language disruptions such as hesitations, repetitions, and fragmented expressions, directly linking these to potential cognitive decline. In contrast, the approach of Bang et al [36] often described language performance without establishing clear connections to AD-related impairments.

Note that LLM-based evaluation metrics should be interpreted with appropriate care; the convergence between Gemini-3.1-flash-lite and human expert assessments collectively strengthens the validity of our explainability findings.

To provide empirical support for feature selection, we conducted 2-tailed independent-samples t tests comparing LLM-derived feature scores between AD (n=122) and CN (n=115) groups (RStudio; Posit Software and R v4.5.0; R Foundation for Statistical Computing [75]). The feature scores were obtained from a single randomly selected pipeline run, consistent with the approach used for the pairwise explainability evaluation. As shown in Table 6, all 4 features showed highly significant group differences (all P<.001; df=235) with large effect sizes (Cohen d range: 1.11‐1.19). The largest mean difference was observed for richness of detail (AD: 3.42 vs CN: 5.04; Δ=1.62), consistent with reduced informational specificity in AD speech. These findings independently validate the selection of the 4 features for the classification framework.

Table 6.

T-test results comparing linguistic feature scores between Alzheimer disease (AD) and control normal (CN) groups.

Linguistic feature	AD^a group, mean (SD)	AD, median (IQR)	CN^b group, mean (SD)	CN, median (IQR)	P value
Readability	4.00 (1.47)	4 (3-5)	5.49 (1.12)	6 (5-6)	<.001^c
Fluency	3.68 (1.37)	4 (2-5)	5.23 (1.24)	6 (5-6)	<.001^c
Richness of detail	3.42 (1.48)	3 (2-4)	5.04 (1.44)	5 (4-6)	<.001^c
Keyword relevance	3.84 (1.61)	4 (3-5)	5.54 (1.43)	6 (5-7)	<.001^c

^aAD: Alzheimer disease.

^bCN: control normal.

^cDegrees of freedom (df)=235.

We acknowledge that certain non-LLM state-of-the-art methods have reported higher diagnostic precision. For instance, Liu et al [76] achieved an accuracy of 97.18% and an F₁-score of 97.09% on the same dataset by using a Mozilla Deep Speech ASR and BERT-based pipeline. While such deep-learning models excel in raw performance, they often function as ’black boxes’ with limited clinical interpretability. In contrast, our framework prioritizes explainable AI (XAI) by generating structured linguistic evidence across 4 dimensions. Our results demonstrate that while achieving an accuracy of 91.08%, the primary value of this work lies in providing transparent, actionable insights that are essential for clinical trust and diagnostic reasoning.

These findings suggest that our structured multifeature prompt design provides substantially more interpretable explanations for AD detection from speech. Enhanced explainability can improve clinical trust in AI-assisted diagnosis, potentially facilitating adoption in health care settings where transparency is essential. Furthermore, the detailed linguistic breakdowns generated by our method could support more targeted intervention strategies, enabling clinicians to focus on specific linguistic deficits observed in patients.

Conclusion

Our structured LLM-based framework, which leverages 4 key linguistic features (readability, fluency, richness of detail, and keyword relevance), achieved 92% precision and 97% specificity in detecting AD from speech transcripts. While further clinical validation is needed, this work demonstrates that well-structured linguistic analysis using LLMs can provide a reliable and explainable method for preliminary AD screening. Our framework offers an accessible tool for early detection of cognitive decline, potentially reducing barriers to timely diagnosis. The high performance and interpretable outputs of our model suggest promising directions for integrating AI-assisted cognitive assessment into clinical practice. Particularly in early screening stages, where accessible, noninvasive assessment tools are most valuable, our method could serve as an effective complement to traditional diagnostic approaches.

Despite presenting key innovations in AD detection through structured linguistic analysis, our study has several limitations. The ADReSSo 2021 dataset, while balanced, may not fully capture the linguistic variability found across the patient with AD population. Additionally, our method is currently designed for English-specific transcripts, requiring multilingual adaptation for broader clinical applicability. Although the ADReSSo 2021 dataset is demographically balanced with age and gender matched between groups, broader clinical deployment may require demographic-adjusted scoring baselines to account for variables such as education level, native language, and cultural background; this remains a direction for future work. Further studies are needed to assess how AI-based assessments align with real-world medical diagnoses and clinical outcomes. Furthermore, our reliance on transcript-based analysis introduces potential ASR errors, suggesting future work should incorporate direct audio processing techniques.

Despite these challenges, our findings demonstrate that structured LLM-based linguistic assessment provides a scalable, interpretable tool for early AD detection. This approach effectively bridges the gap between AI-based text analysis and real-world clinical applications, offering a step forward in leveraging AI for cognitive health monitoring.

During the preparation of this work, the authors used Claude AI and ChatGPT in order to improve the language and flow of the manuscript. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Funding

This work was supported by research grants from Kaohsiung Municipal United Hospital, Taiwan (KMUH11305) and the National Science andTechnology Council, Taiwan (NSTC 112-2410-H-110-016-MY3).

None declared.

Abbreviations

Alzheimer disease

artificial intelligence

ANU-ADRI

Australian National University Alzheimer Disease Risk Index

ASR

automatic speech recognition

BDAE

Boston Diagnostic Aphasia Examination

BERT

Bidirectional Encoder Representations from Transformers

CAIDE

Cardiovascular Risk Factors, Aging, and Dementia

CAT

computerized adaptive testing

control normal

IRB

Institutional Review Board

LLM

large language model

MCI

mild cognitive impairment

MLP

multilayer perceptron

RoBERTa

Robustly Optimized BERT Pretraining Approach

WHO

World Health Organization

XAI

explainable artificial intelligence

References1

Long

Benoist

Weidner

World alzheimer report 2023

2023

2026-05-06

Alzheimer’s Disease International

https://www.alzint.org/u/World-Alzheimer-Report-2023.pdf

2022 Alzheimer’s disease facts and figures

Alzheimers Dement202204184700789

10.1002/alz.12638

2024 Alzheimer’s disease facts and figures

Alzheimers Dement20240520537083821

10.1002/alz.13809

Hurd

Martorell

Delavande

Mullen

Langa

Monetary costs of dementia in the United States

N Engl J Med20130443681413261334

10.1056/NEJMsa1204629

23550670

Banerjee

Rahman‐Amin

Eul‐Barker

New evidence on the value of timely diagnosis for people with dementia

Int J Geriat Psychiatry202206376

10.1002/gps.5733

Couch

Albertyn

Prina

Lawrence

A qualitative study of informal caregiver perceptions of the benefits of an early dementia diagnosis

BMC Health Serv Res20240424241508

10.1186/s12913-024-10957-6

38658907

Incorvaia

Early detection of Alzheimer’s disease benefits research, quality of life, and end-of-life planning

AJOB Neurosci2021124243244

10.1080/21507740.2021.1941404

34704912

Barkhof

Pressman

Early detection of incipient Alzheimer pathology

Neurology (ECronicon)20210469614647648

10.1212/WNL.0000000000011739

Nasreddine

Garibotto

Kyaga

Padovani

The early diagnosis of Alzheimer’s disease: a patient-centred conversation with the care team

Neurol Ther2023021211123

10.1007/s40120-022-00428-7

36528836

Altomare

Stampacchia

Ribaldi

Plasma biomarkers for Alzheimer’s disease: a field-test in a memory clinic

J Neurol Neurosurg Psychiatry202306946420427

10.1136/jnnp-2022-330619

37012066

Blennow

Galasko

Perneczky

The potential clinical value of plasma biomarkers in Alzheimer’s disease

Alzheimer’s & Dementia202312191258055816

10.1002/alz.13455

Anstey

Zheng

Peters

Dementia risk scores and their role in the implementation of risk reduction guidelines

Front Neurol202112765454

10.3389/fneur.2021.765454

35058873

Galvin

Aisen

Langbaum

Early stages of Alzheimer’s disease: evolving the care team for optimal patient management

Front Neurol202011592302

10.3389/fneur.2020.592302

33551954

Bernstein Sideman

Chalmer

Ayers

Lessons from detecting cognitive impairment including dementia (DetectCID) in primary care

J Alzheimers Dis2022862655665

10.3233/JAD-215106

35124639

Reiss

de Levante Raphael

Chin

Sinha

The physician’s Alzheimer’s disease management guide: early detection and diagnosis of cognitive impairment, Alzheimer’s disease and related dementia

AIMS Public Health202294661689

10.3934/publichealth.2022047

36636148

McGlohen

Chang

Combining computer adaptive testing technology with cognitively diagnostic assessment

Behav Res Methods200808403808821

10.3758/brm.40.3.808

18697677

Van Patten

Introduction to the special issue - neuropsychology from a distance: psychometric properties and clinical utility of remote neurocognitive tests

J Clin Exp Neuropsychol202110438767773

10.1080/13803395.2021.2021645

35133240

Öhman

Hassenstab

Berron

Schöll

Papp

Current advances in digital cognitive assessment for preclinical Alzheimer’s disease

Alzheimers Dement (Amst)202101

2026-05-06

131e12217

https://alz-journals.onlinelibrary.wiley.com/toc/23528729/13/1

10.1002/dad2.12217

Roque

Sliwinski

Katz

Anderson-Hanley

Revolutionizing cognitive assessment: the power of digital biomarkers and ecologically valid methods

Innov Aging202312217Supplement_1380380

10.1093/geroni/igad104.1260

Moore

Swendsen

Depp

Applications for self-administered mobile cognitive assessments in clinical research: a systematic review

Int J Methods Psychiatr Res201712264e1562

10.1002/mpr.1562

28370881

Björngrim

van den Hurk

Betancort

Machado

Lindau

Comparing traditional and digitized cognitive tests used in standard clinical evaluation - a study of the digital application minnemera

Front Psychol2019102327

10.3389/fpsyg.2019.02327

31681117

Staffaroni

Tsoy

Taylor

Boxer

Possin

Digital cognitive assessments for dementia: digital assessments may enhance the efficiency of evaluations in neurology and other clinics

Pract Neurol (Fort Wash Pa)2020

2026-05-23

20202445

https://pubmed.ncbi.nlm.nih.gov/33927583/

33927583

Laske

Sohrabi

Frost

Innovative diagnostic tools for early detection of Alzheimer’s disease

Alzheimers Dement201505115561578

10.1016/j.jalz.2014.06.004

Chou

Chang

Screening for early Alzheimer’s disease: enhancing diagnosis with linguistic features and biomarkers

Front Aging Neurosci2024161451326

10.3389/fnagi.2024.1451326

39376506

Mao

Rasmussen

AD-BERT: using pre-trained language model to predict the progression from mild cognitive impairment to Alzheimer’s disease

J Biomed Inform202308144104442

10.1016/j.jbi.2023.104442

37429512

Gagliardi

Natural language processing techniques for studying language in pathological ageing: a scoping review

Int J Lang Commun Disord2024591110122

10.1111/1460-6984.12870

36960885

Balagopalan

Novikova

Comparing acoustic-based approaches for alzheimer’s disease detection

Interspeech 20212026-05-06Preprint posted online on 2021

https://www.isca-archive.org/interspeech_2021

10.21437/Interspeech.2021-759

Ortiz-Perez

Ruiz-Ponce

Tomás

Garcia-Rodriguez

García Bringas

Deep learning-based dementia prediction using multimodal data

17th International Workshop on Soft Computing Models in Industrial and Environmental Applications (SOCO 2022)2023531

Springer, Cham

260269Lecture Notes in Networks and Systems

10.1007/978-3-031-18050-7_25

Soroski

da Cunha Vasco

Newton-Mason

Evaluating web-based automatic transcription for Alzheimer speech data: transcript comparison and machine learning analysis

JMIR Aging2022092153e33460

10.2196/33460

36129754

Wang

Exploring linguistic feature and model combination for speech recognition based automatic AD detection

2022

2026-05-06

Interspeech 2022

Sep 18-22, 2022

Incheon, South Korea

33283332

https://www.isca-archive.org/interspeech_2022

10.21437/Interspeech.2022-723

Shi

Cheung

Shahamiri

Speech and language processing with deep learning for dementia diagnosis: a systematic review

Psychiatry Res202311329115538

10.1016/j.psychres.2023.115538

37864994

Graham

Lee

Jeste

Artificial intelligence approaches to predicting and detecting cognitive decline in older adults: a conceptual review

Psychiatry Res202002284112732

10.1016/j.psychres.2019.112732

31978628

Palliya Guruge

Oviatt

Delir Haghighi

Pritchard

Advances in multimodal behavioral analytics for early dementia diagnosis: a review

20211018

2026-05-06

ICMI ’21

Oct 18, 2021

Montréal QC Canada

328340

https://dl.acm.org/doi/proceedings/10.1145/3462244

10.1145/3462244.3479933

Pandey

Shekhawat

Bhasin

Jasuja

Prasanna

Kim

Singh

Khan

Tiwary

Sur

Singh

Alzheimer’s dementia recognition using multimodal fusion of speech and text embeddings

Intelligent Human Computer Interaction (IHCI)202113184

Springer, Cham

718728Lecture Notes in Computer Science

10.1007/978-3-030-98404-5_64

Agbavor

Liang

Predicting dementia from spontaneous speech using large language models

PLOS Digit Health202212112e0000168

10.1371/journal.pdig.0000168

36812634

Bang

Han

Kang

Alzheimer’s disease recognition from spontaneous speech using large language models

ETRI Journal20240246196105

10.4218/etrij.2023-0356

Devlin

Chang

Lee

Toutanova

Bert: pre-training of deep bidirectional transformers for language understanding

2019

2026-05-23

Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Jun 2-7, 2019

Minneapolis, Minnesota

Association for Computational Linguistics

41714186

https://aclanthology.org/N19-1423/

10.18653/v1/N19-1423

Padhee

Illendula

Sadler

Predicting early indicators of cognitive decline from verbal utterances

2026-05-06

2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

2020

Seoul, Korea (South

477480

https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=9312958

10.1109/BIBM49941.2020.9313106

Rohanian

Hough

Purver

Alzheimer’s dementia recognition using acoustic, lexical, disfluency and speech pause features robust to noisy inputs

2026-05-06

Interspeech 2021

2021

https://www.isca-archive.org/interspeech_2021

10.21437/Interspeech.2021-1633

Mahajan

Baths

Acoustic and language based deep learning approaches for Alzheimer’s dementia detection from spontaneous speech

Front Aging Neurosci202113623607

10.3389/fnagi.2021.623607

33613269

Qiao

Yin

Wiechmann

Kerz

Alzheimer’s disease detection from spontaneous speech through combining linguistic complexity and (dis)fluency features with pretrained language models

2026-05-06

Interspeech 2021

2021

https://www.isca-archive.org/interspeech_2021

10.21437/Interspeech.2021-1415

Yuan

Bian

Cai

Huang

Church

Disfluencies and fine-tuning pre-trained language models for detection of Alzheimer’s disease

2026-05-06

Interspeech 2020

2020

21622166

https://www.isca-archive.org/interspeech_2020

10.21437/Interspeech.2020-2516

Wang

Liu

Text dialogue analysis for primary screening of mild cognitive impairment: development and validation study

J Med Internet Res2023122925e51501

10.2196/51501

38157230

Balamurali

Chen

Performance assessment of ChatGPT versus Bard in detecting Alzheimer’s dementia

Diagnostics (Basel)20240415148817

10.3390/diagnostics14080817

38667463

Luz

Haider

Fuente

S de la

Fromm

MacWhinney

Detecting cognitive decline using speech only: the adresso challenge

2026-05-06

Interspeech 2021

2021

37803784

https://www.isca-archive.org/interspeech_2021

10.21437/Interspeech.2021-1220

Becker

Boller

Lopez

Saxton

McGonigle

The natural history of Alzheimer’s disease. Description of study cohort and accuracy of diagnosis

Arch Neurol199406516585594

10.1001/archneur.1994.00540180063015

8198470

Goodglass

Kaplan

Barresi

BDAE: The Boston Diagnostic Aphasia Examination2001

Lippincott Williams & Wilkins

185

9780683305593

Radford

Kim

Brockman

McLeavey

Sutskever

Robust speech recognition via large-scale weak supervision

Proceedings of the International Conference on Machine Learning (ICML)2026-05-23202

Proceedings of Machine Learning Research

2849228518

https://proceedings.mlr.press/v202/radford23a.html

Bredin

Laurent

End-to-end speaker segmentation for overlap-aware resegmentation

2026-05-06

Interspeech 2021

2021

37073711

https://www.isca-archive.org/interspeech_2021

10.21437/Interspeech.2021-560

Liu

Yuan

Tang

Improving Alzheimer’s disease detection for speech based on feature purification network

Front Public Health20219835960

10.3389/fpubh.2021.835960

35310782

Alkenani

Zhang

Predicting Alzheimer’s disease from spoken and written language using fusion-based stacked generalization

J Biomed Inform202106118103803

10.1016/j.jbi.2021.103803

33965639

Sarawgi

Zulfikar

Soliman

Maes

Multimodal inductive transfer learning for detection of Alzheimer’s dementia and its severity

2026-05-06

Interspeech 2020

Oct 25-29, 2020

https://www.isca-archive.org/interspeech_2020

10.21437/Interspeech.2020-3137

Berisha

Wang

LaCross

Liss

Tracking discourse complexity preceding Alzheimer’s disease diagnosis: a case study comparing the press conferences of Presidents Ronald Reagan and George Herbert Walker Bush

J Alzheimers Dis2015453959963

10.3233/JAD-142763

25633673

Nasreen

Rohanian

Hough

Purver

Alzheimer’s dementia recognition from spontaneous speech using disfluency and interactional features

Front Comput Sci20213640669

10.3389/fcomp.2021.640669

Fraser

Meltzer

Rudzicz

Linguistic features identify Alzheimer’s disease in narrative speech

J Alzheimers Dis2016492407422

10.3233/JAD-150520

26484921

Toledo

Aluísio

dos Santos

Analysis of macrolinguistic aspects of narratives from individuals with Alzheimer’s disease, mild cognitive impairment, and no cognitive impairment

Alzheimers Dement (Amst)2018011013140

10.1016/j.dadm.2017.08.005

Lindsay

Tröger

König

Language impairment in Alzheimer’s disease-robust and explainable evidence for AD-related deterioration of spontaneous speech through multilingual machine learning

Front Aging Neurosci202113642033

10.3389/fnagi.2021.642033

34093165

Karlekar

Niu

Bansal

Detecting linguistic characteristics of alzheimer’s dementia by interpreting neural models

2026-05-06

Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Jun 1-6, 2018

http://aclweb.org/anthology/N18-2

10.18653/v1/N18-2110

Kong

Exploring neural models for predicting dementia from language

2019

2026-05-23

University of British Columbia

https://open.library.ubc.ca/soa/cIRcle/collections/ubctheses/24/items/1.0380363

10.14288/1.0380363

Wei

Tay

Bommasani

Emergent abilities of large language models

Transactions on Machine Learning Research2022

2026-05-06

https://openreview.net/forum?id=yzkSU5zdwD

OpenAI

GPT-4 technical report

arXiv2026-05-06Preprint posted online on Mar 4, 2024

https://arxiv.org/abs/2303.08774

Kingma

Adam: a method for stochastic optimization

arXivPreprint posted online on Jan 30, 2017

10.48550/arXiv.1412.6980

Grattafiori

Dubey

Jauhri

Pandey

Kadian

The llama 3 herd of models

arXiv2026-05-25Preprint posted online on Nov 23, 2024

https://arxiv.org/abs/2407.21783

Nussbaum

Morris

Duderstadt

Mulyar

Nomic embed: training a reproducible long context text embedder

arXiv2026-05-06Preprint posted online on Feb 3, 2025

https://arxiv.org/abs/2402.01613

Gemini Team

Google deepmind, gemini: a family of highly capable multimodal models

arXiv2026-05-06Preprint posted online on May 5, 2025

https://arxiv.org/abs/2312.11805

Liu

Iter

Wang

Zhu

Bouamor

Pino

Bali

G-eval: NLG evaluation using gpt-4 with better human alignment

2026-05-06

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Dec 6-10, 2023

Singapore

25112522

https://aclanthology.org/2023.emnlp-main

10.18653/v1/2023.emnlp-main.153

Barredo Arrieta

Díaz-Rodríguez

Del Ser

Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI

Information Fusion2020065882115

10.1016/j.inffus.2019.12.012

Ribeiro

Singh

Guestrin

“Why should i trust you?”: explaining the predictions of any classifier

2026-05-07

Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics

Aug 13, 2016

San Diego, California

11351144

http://aclweb.org/anthology/N16-3

10.18653/v1/N16-3020

Lundberg

Lee

Guyon

von Luxburg

Bengio

Wallach

Fergus

Vishwanathan

SVN

A unified approach to interpreting model predictions

Adv Neural Inf Process Syst2017

2026-05-25

3047654774

https://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions

Holzinger

Langs

Denk

Zatloukal

Müller

Causability and explainability of artificial intelligence in medicine

Wiley Interdiscip Rev Data Min Knowl Discov201994e1312

10.1002/widm.1312

32089788

Tonekaboni

Joshi

McCradden

Goldenberg

Doshi-Velez

Fackler

Jung

What clinicians want: contextualizing explainable machine learning for clinical end use

2026-05-25

Proceedings of the 4th Machine Learning for Healthcare Conference

Aug 9-10, 2019

Ann Arbor, Michigan

359380

https://proceedings.mlr.press/v106/tonekaboni19a.html

Kahneman

Sibony

Sunstein

Noise: A Flaw in Human Judgment2021

2026-05-25

Little, Brown Spark

1454

https://www.hachettebookgroup.com/titles/daniel-kahneman/noise/9780316451406/?

9780316451406

Bavaresco

Bernardi

Bertolazzi

LLMs instead of human judges? a large scale empirical study across 20 nlp evaluation tasks

2026-05-06

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2)

Jul 27 to Aug 1, 2025

https://aclanthology.org/2025.acl-short

10.18653/v1/2025.acl-short.20

Van Den Burg

G j.

Suzuki

Liu

Sensoy

Aligning black-box language models with human judgments

2026-05-06

Findings of the Association for Computational Linguistics

Apr 29 to May 4, 2025

https://aclanthology.org/2025.findings-naacl

10.18653/v1/2025.findings-naacl.376

R Core Team

R: a language and environment for statistical computing

R foundation for statistical computing2024

2026-05-07

https://www.R-project.org/

Liu

Proctor

Collier

Zhao

Automatic diagnosis and prediction of cognitive decline associated with Alzheimer’s dementia through spontaneous speech

2021 IEEE International Conference on Signal and Image Processing Applications (ICSIPA)

Sep 13-15, 2021

Kuala Terengganu, Malaysia

3943

10.1109/ICSIPA52582.2021.9576784

Multimedia Appendix 1

Large language model prompt templates for speaker attribution and linguistic feature analysis.