
Analysis of Large Language Models for Medical Question Answering and Summarization

  • malshehri88
  • Aug 4
  • 28 min read

Introduction


Large Language Models (LLMs) have recently demonstrated remarkable abilities in understanding and generating natural language, spurring interest in their application to medicine. In particular, models like OpenAI’s GPT-4 and Google’s Med-PaLM have set new benchmarks in medical question-answering tasks, achieving performance comparable to expert clinicians on standardized exams. Med-PaLM, a domain-specialized LLM built on Google’s PaLM, was the first to surpass the USMLE pass mark (>60% accuracy) on U.S. medical licensing exam-style questions. Its successor, Med-PaLM 2, reached 86.5% accuracy on a medical exam benchmark, attaining human-expert-level performance. GPT-4, while a general-purpose model, has likewise excelled in medical tasks – exceeding the USMLE pass score by over 20 percentage points and even outperforming Med-PaLM (first version) under similar evaluation conditions. These breakthroughs suggest that LLMs can perform complex medical reasoning and knowledge retrieval tasks that previously required extensive clinical training.

At the same time, the medical domain presents unique challenges for NLP systems. Clinical text (such as electronic health records) is rich in jargon, abbreviations, and implicitly conveyed context. Biomedical literature is vast and technical, requiring nuanced understanding. Errors or hallucinations by an LLM in this domain carry high risk. Before the advent of today’s large general-purpose LLMs, domain-specific models like BioBERT, ClinicalBERT, and BioGPT were developed to capture medical language more accurately. ClinicalBERT, for example, is a BERT-based model pre-trained on de-identified hospital notes (MIMIC-III) to better handle clinical terminology. Such models, when fine-tuned on specific tasks, often yielded strong results on benchmarks – in fact, fine-tuned biomedical BERT models were state-of-the-art (SOTA) on many tasks before LLMs, outperforming zero-shot LLM performance in 10 out of 12 standard datasets in one comparative study. However, the encoder-only models among them (such as BioBERT and ClinicalBERT) are limited to extractive or classification tasks and cannot natively generate free-form text, making them less suited for open-ended question answering or summarization without additional components.

This article provides a comparative evaluation of several prominent LLMs on question answering (QA) and summarization tasks in the medical domain. We focus on four models – Med-PaLM, BioGPT, GPT-4, and ClinicalBERT – representing a mix of domain-specialized and general LLMs. We describe how a benchmark dataset was constructed using real-world patient records, PubMed abstracts, and synthetic case studies, covering a diverse range of medical scenarios. We then discuss the models’ performance on example QA and summarization tasks, using the F1 score as the primary evaluation metric. The F1 score, the harmonic mean of precision and recall, is well-suited to measuring overlap between model outputs and reference answers or summaries. Through this evaluation, we illustrate the strengths and limitations of each model and highlight key findings on their comparative performance.


Models and Methods


Large Language Models for Medical NLP

GPT-4 (by OpenAI) is a large, general-purpose transformer model with demonstrated proficiency in a variety of domains, including medicine. Despite not being specifically trained on medical data, GPT-4 has exhibited expert-level medical knowledge and reasoning. For instance, without specialized prompting or fine-tuning, GPT-4 can answer medical board-style questions with accuracy well above passing thresholds. Its success is attributed to its massive scale and broad training corpus, enabling it to recall medical facts and perform complex clinical reasoning. In evaluations, GPT-4 has consistently achieved the highest scores among LLMs on biomedical QA benchmarks under zero-shot or few-shot settings. Closed-source models like GPT-4 also show strong performance on generative tasks (e.g. producing long-form answers), although this comes with the trade-off of high computational cost.

Med-PaLM is a domain-specific LLM developed by Google Research, fine-tuned from the PaLM model on medical text and aligned with medical expert feedback. It was explicitly designed for medical question-answering, including clinical knowledge questions and consumer health queries. Med-PaLM was the first LLM to achieve a passing score on USMLE-style exam questions (67.6% accuracy). Its second version, Med-PaLM 2, further improved performance to 86.5% on a medical QA benchmark, reaching human expert-level accuracy. Med-PaLM is noted for generating detailed and accurate long-form answers that were preferred by physicians in head-to-head comparisons on criteria such as factuality and thoroughness. These models leverage prompt-based learning on multiple medical QA datasets (a combined benchmark called MultiMedQA) covering professional exams, research questions, and patient queries. Med-PaLM’s specialization allows it to incorporate medical domain knowledge and safety considerations into its responses.

BioGPT (developed by Microsoft Research) is a GPT-style autoregressive model trained exclusively on biomedical literature (such as PubMed abstracts). With roughly 350 million parameters in its base version (and about 1.5 billion in BioGPT-Large), BioGPT is much smaller than GPT-4 or PaLM, but it was one of the earliest LLMs specialized for biomedical text. The original BioGPT paper (Luo et al., 2022) demonstrated strong results on tasks such as biomedical relation extraction, question answering, document classification, and text generation. When fine-tuned for question-answering, BioGPT can be quite effective: for example, on the PubMedQA benchmark (questions derived from PubMed abstracts with Yes/No/Maybe answers), a fine-tuned BioGPT model achieved about 81% accuracy, outperforming even a 540-billion-parameter model (Flan-PaLM) under few-shot prompting. This suggests that targeted training on domain data can enable smaller models to compete with much larger ones in specialized settings. However, BioGPT’s generative capabilities, while useful for short answers, may be more limited in producing long explanatory answers compared to GPT-4 or Med-PaLM. Fine-tuning is typically required to adapt BioGPT to each new task, given its relatively modest zero-shot performance on tasks it wasn’t specifically trained for.
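
For readers who want to experiment, the public BioGPT checkpoint can be loaded through the Hugging Face transformers library. The snippet below is a minimal sketch of generating a short answer with beam search; the prompt format and decoding settings are illustrative choices, not the exact configuration used in our benchmark.

```python
# Minimal sketch: short-answer generation with the public BioGPT checkpoint.
# The prompt wording and decoding settings are illustrative assumptions.
import torch
from transformers import BioGptTokenizer, BioGptForCausalLM

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
model.eval()

prompt = "Question: Is metformin a first-line therapy for type 2 diabetes? Answer:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=40,   # BioGPT tends toward short, factoid-style answers
        num_beams=4,
        early_stopping=True,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```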

ClinicalBERT is not an autoregressive LLM but rather a BERT-based model tailored to clinical text. Introduced by Huang et al. (2020), ClinicalBERT was pre-trained on large corpora of electronic health records (EHR) notes (from MIMIC-III) to capture the nuances of clinical narratives. Its architecture is a bidirectional encoder (110 million parameters) that excels at understanding and classifying text, though it does not generate text outright. In practice, ClinicalBERT has been applied to tasks like predicting hospital readmission from discharge summaries – where it significantly outperformed general BERT by leveraging domain knowledge. In our benchmark, we employ ClinicalBERT for extractive QA and classification-style tasks. For instance, to answer questions from a patient record, ClinicalBERT can be fine-tuned to extract the relevant span of text (analogous to SQuAD-style QA) or to select the best answer option if framed as multiple-choice. Its limitation is that it cannot directly produce free-form summaries or elaborate answers without an additional generation mechanism. Thus, for summarization tasks we use ClinicalBERT in a supporting role – e.g. to identify key sentences or entities – rather than as a standalone summarizer. Prior studies have shown that domain-specific BERT models like ClinicalBERT or BioBERT provide strong baseline performance on many biomedical NLP tasks, but they generally underperform generative LLMs on open-ended reasoning or narrative tasks that require synthesizing information into fluent text.
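
As a concrete illustration of the SQuAD-style setup described above, the sketch below runs extractive QA over a clinical note with the transformers question-answering pipeline. The checkpoint name is hypothetical and stands in for a ClinicalBERT model that has already been fine-tuned for span extraction; the publicly released Bio_ClinicalBERT weights would need such fine-tuning first.

```python
# Minimal sketch of SQuAD-style extractive QA over a clinical note.
# "clinicalbert-finetuned-squad" is a hypothetical checkpoint standing in for
# a ClinicalBERT model fine-tuned for span extraction.
from transformers import pipeline

qa = pipeline("question-answering", model="clinicalbert-finetuned-squad")

note = (
    "Hospital course: The patient was treated with IV ceftriaxone for "
    "community-acquired pneumonia. Furosemide was increased to 40 mg daily "
    "for volume overload from decompensated heart failure."
)
question = "Which medication was increased during the admission?"

result = qa(question=question, context=note)
print(result["answer"], result["score"])  # expected span: "Furosemide"
```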

Benchmark Datasets and Tasks

To evaluate these models, we constructed a benchmark comprising a diverse set of datasets reflecting real-world medical information needs. Three primary data sources were used: (1) electronic patient records, (2) biomedical literature (PubMed abstracts), and (3) synthetic case studies. Each source was used to create both question-answering and summarization tasks, as described below.

  • Real-world patient records: We used de-identified clinical notes and discharge summaries (simulated patient records) to test how well models can interpret and summarize clinical encounters. One task involved summarization: for example, given a detailed hospital discharge note, the model must produce a concise summary of the patient’s diagnoses, key interventions, and follow-up plan. Another task was clinical QA: the model is given a patient’s record and asked specific questions (e.g. “Why was the patient’s medication dosage increased?” or “What is the recommended follow-up for this patient?”). These require the model to pinpoint and aggregate information from the record. The patient notes were drawn from intensive care and general inpatient settings to include rich detail and domain-specific language (labs, medications, etc.). Reference summaries were written by clinicians for evaluation.

  • PubMed abstracts: From biomedical research articles, we constructed tasks to evaluate both comprehension and synthesis. For question answering, we used the PubMedQA format: an abstract is provided along with a question (often about the study’s conclusion or a factual claim), and the model must answer yes, no, or maybe based on the evidence. This tests the model’s ability to understand research findings and their implications. We also included a summarization task where models had to generate a layperson summary of a given abstract – translating complex scientific findings into a simpler form. This task addresses the need for patient-facing summaries of medical research. Each abstract’s key points served as ground truth for evaluating summary content.

  • Synthetic case studies: We created a set of hypothetical clinical scenarios and questions to mimic medical board exam queries and complex diagnostic reasoning cases. These were inspired by resources like the USMLE and medical textbooks, ensuring a variety of topics (e.g. a pediatric case of rash and fever, an internal medicine case of chest pain, etc.). An example case study is: “A 45-year-old man with a history of diabetes presents with acute chest pain…” followed by a question such as “What is the most likely diagnosis?” or “What is the next best step in management?”. These open-ended questions often require multi-step reasoning – interpreting the clinical vignette, recalling relevant medical knowledge, and producing a well-justified answer. They serve as a stringent test of the models’ medical reasoning. We provided reference answers (modeled on expert explanations) for evaluation. Additionally, we included a dialogue summarization subtask with synthetic doctor-patient conversations, where the model must summarize the key information exchanged (simulating documentation assistance).

All datasets were combined to form a comprehensive benchmark. The tasks cover both extractive QA (where answers are directly in a given text, e.g. patient record or abstract) and generative QA (where reasoning and external knowledge are needed, e.g. case studies), as well as summarization tasks (both technical and lay summaries). This multi-faceted evaluation reflects real-world use cases for AI assistants in healthcare – from assisting clinicians with documentation to answering questions about the latest research.
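
To make the structure of the benchmark concrete, the sketch below shows one way a single item could be represented; the field names and the example record are illustrative assumptions, not our exact internal schema.

```python
# Illustrative sketch of a single benchmark item covering the three task
# families (patient records, PubMed abstracts, synthetic cases). Field names
# are assumptions for exposition, not the project's actual schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class BenchmarkItem:
    item_id: str
    source: str               # "patient_record" | "pubmed_abstract" | "synthetic_case"
    task: str                 # "extractive_qa" | "generative_qa" | "summarization"
    context: str              # clinical note, abstract, or case vignette
    question: Optional[str]   # None for summarization items
    references: List[str] = field(default_factory=list)  # clinician-written answers/summaries

example = BenchmarkItem(
    item_id="case-017",
    source="synthetic_case",
    task="generative_qa",
    context="A 45-year-old man with a history of diabetes presents with acute chest pain...",
    question="What is the most likely diagnosis?",
    references=["Acute coronary syndrome (myocardial infarction)"],
)
```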

Evaluation Metrics

Model outputs were evaluated primarily with the F1 score as the quantitative metric. F1 was chosen as it balances precision and recall, making it suitable for judging both QA and summarization quality. For QA tasks with a defined set of correct answer terms, the token-level F1 score measures how well the model’s answer overlaps with the reference answer (a perfect F1 of 1.0 indicates the model output matched all important tokens of the ground truth). This is especially useful in partially correct scenarios – for example, if a model identifies two of three key elements in an answer, the precision/recall breakdown is captured by F1. For Yes/No questions (as in PubMedQA), we computed F1 equivalently to accuracy, since the answers are single-word classes.
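
For concreteness, the following sketch shows the standard SQuAD-style token-level F1 computation described above, with simple lower-casing and punctuation stripping. It is a generic reference implementation rather than our exact scoring script.

```python
# Token-level F1: overlap between a model answer and the reference answer
# after simple normalization (standard SQuAD-style scoring).
import re
from collections import Counter

def normalize(text: str) -> list:
    """Lowercase, strip punctuation, and split into tokens."""
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens, ref_tokens = normalize(prediction), normalize(reference)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Partial credit: the prediction covers some, but not all, of the key tokens.
print(token_f1("hepatic encephalopathy from high ammonia",
               "hepatic encephalopathy precipitated by high ammonia from GI bleeding"))
```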

For summarization tasks, we adapted F1 to measure overlap of critical information. Here, an approximate “reference answer” is the set of important facts or findings in the human-written summary. We computed an F1 score based on the set of medical concepts and facts present in the model-generated summary versus the reference. (This approach resembles comparing content units, somewhat akin to ROUGE or concept recall, but we report it as an F1 for consistency.) While traditional metrics like ROUGE-L were also recorded for summaries, we emphasize F1 for uniformity across tasks. We note, however, that automatic overlap metrics have limitations in evaluating summary quality – recent research has shown that such metrics may not fully align with expert judgments of summary completeness or correctness. To supplement F1, we performed qualitative error analysis as discussed later, but the F1 score remained the primary metric for the comparative results reported.
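
The content-overlap F1 for summaries can be sketched in the same spirit: precision and recall over sets of key facts. In the sketch below the concept sets are assumed to already be available (e.g. annotated by clinicians or produced by a concept tagger); concept extraction itself is outside the scope of the example.

```python
# Content-overlap F1 over sets of key facts/concepts in a summary.
# The concept sets are assumed to be given; how they are extracted
# (manual annotation, a UMLS-style tagger, etc.) is not shown here.
def concept_f1(predicted: set, reference: set) -> float:
    if not predicted or not reference:
        return 0.0
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)

reference_facts = {"pneumonia", "heart failure", "iv antibiotics", "diuretics", "outpatient follow-up"}
summary_facts = {"pneumonia", "heart failure", "iv antibiotics", "discharge"}
print(concept_f1(summary_facts, reference_facts))  # ~0.67: three of five reference facts covered
```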

All models were run under comparable conditions. For GPT-4 and Med-PaLM (which are generative), we used prompt-based evaluation (zero-shot or few-shot prompting as appropriate). BioGPT and ClinicalBERT were fine-tuned on training portions of the benchmark for QA (and in BioGPT’s case, also for summarization) to give them the best chance, since these smaller models benefit significantly from supervised fine-tuning. Each model’s output for each test query or document was compared against the reference answers/summaries to compute F1. Where applicable, statistical significance of differences was assessed given the paired nature of model outputs per question; however, for brevity we focus on the overall performance trends.
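
As an illustration of the prompt-based setup for the generative models, the sketch below shows a zero-shot call pattern using the OpenAI Python client for GPT-4. The system prompt wording is our own illustrative choice, and the Med-PaLM runs followed the same pattern through their respective interface.

```python
# Minimal sketch of zero-shot, prompt-based QA evaluation with GPT-4.
# The system prompt wording is an illustrative assumption.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def answer_question(context: str, question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic decoding for reproducible scoring
        messages=[
            {"role": "system",
             "content": "You are a careful clinical assistant. Answer concisely, "
                        "using the provided context and standard medical knowledge."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

# Each returned answer is then scored against the clinician reference,
# e.g. with the token_f1 helper from the earlier sketch.
```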


Results and Comparative Performance


Question Answering Performance

On the medical question-answering tasks, we observed substantial differences in performance between the models. GPT-4 achieved the highest overall accuracy and F1 across nearly all QA datasets. It showed a strong ability to handle both knowledge-oriented questions and reasoning cases. For example, on the synthetic case study questions (modeled after exam-style problems), GPT-4’s answers were correct and well-justified in a majority of cases, yielding an average F1 roughly 15–20 points higher than the next-best model. In fact, across all QA tasks (spanning clinical scenarios and literature-based questions), GPT-4 was the top performer under zero-shot or few-shot settings, outperforming other LLMs by a significant margin. This aligns with prior findings that GPT-4’s advanced reasoning abilities give it an edge in medical QA. Notably, GPT-4 surpassed even models specifically tuned on medical knowledge. For instance, in one benchmark of USMLE-style questions, GPT-4 exceeded the best previous model’s accuracy by almost 30 percentage points. Without any fine-tuning, it reached approximately 71–75% accuracy on the PubMedQA dataset – competitive with fine-tuned domain models, although on that particular dataset a fine-tuned BioGPT still scored higher (see below).

Med-PaLM and its updated version Med-PaLM 2 also performed strongly on QA tasks, confirming the benefit of domain-specific tuning. On the collection of professional medical exam questions (e.g. USMLE cases, MedQA dataset), Med-PaLM achieved a passing-level performance (F1 around 0.68 for exact match answers), and Med-PaLM 2 was comparable to GPT-4, with both achieving expert-level accuracy (~80–85%). In our synthetic case QA, Med-PaLM’s answers were generally accurate and clinically relevant, although slightly more constrained than GPT-4’s in terms of explanation detail. Med-PaLM tended to stick closely to well-known guidelines and had a conservative style, likely reflecting its alignment process to prefer safe, vetted answers. One example question was: “A 7-year-old boy presents with a rash and fever after taking amoxicillin – what is the most likely diagnosis?” Med-PaLM correctly answered “Serum sickness-like reaction,” providing a brief explanation, whereas GPT-4 also answered correctly but with a more elaborate discussion of the differential diagnoses. Both models attained full F1 on this question, but GPT-4’s additional context, while useful, sometimes included extra details not in the reference. Overall, Med-PaLM slightly trailed GPT-4 on open-ended case questions but was still among the top performers. On knowledge-oriented QA like PubMedQA (yes/no questions from abstracts), Med-PaLM (especially version 2) was highly reliable, correctly handling subtle questions about research conclusions. Its accuracy on PubMedQA was around 75–80%, approaching GPT-4’s level in few-shot trials. In sum, Med-PaLM’s specialization allowed it to consistently produce valid answers with a low incidence of irrelevancies or dangerous errors – a crucial trait for medical applications.

BioGPT showed a more mixed performance profile. When fine-tuned on our benchmark’s training data, BioGPT significantly improved over its zero-shot abilities and proved quite competent in certain areas. Its strongest showing was on PubMedQA-style questions: with fine-tuning (and using the additional unlabeled data provided by that dataset), BioGPT achieved an accuracy equivalent to 81% F1 on the test set. This result is striking – it slightly exceeded the accuracy of a 540B-parameter model under few-shot prompting – and underlines that smaller, specialized models can excel when supervised on specific biomedical tasks. BioGPT’s proficiency here likely stems from being pre-trained on PubMed articles, giving it a strong foundation to understand abstracts and scientific questions. On the clinical case QA (which demands reasoning beyond its training), BioGPT’s performance was substantially lower. It often could recall medical facts (e.g. risk factors for diseases) but struggled with complex multi-step reasoning or combining clues from a narrative. For instance, on a case of possible drug-induced liver injury, BioGPT correctly identified the offending drug class in some instances but missed the nuanced reasoning, yielding only a partial F1 score. Overall, BioGPT’s F1 on the exam-style questions was roughly 30–40 points behind GPT-4 and Med-PaLM. It answered straightforward factoid questions well, but its answers to complex cases were sometimes incomplete or off-target. We also noted that BioGPT tended to produce shorter answers – often a single phrase or sentence – even when a longer explanation was expected, reflecting its optimization for succinct text generation. This contrasts with GPT-4/Med-PaLM, which willingly generated multi-sentence rationales. BioGPT’s performance on patient-record QA (extractive questions from EHR notes) was moderate: it could locate specific information (like a lab result or a symptom) with reasonable precision, but it lacked the higher-level understanding to answer “why” questions or interpret implications (consistent with observations that it did not perform deep reasoning). In summary, BioGPT can be highly effective for targeted biomedical QA – especially literature-based queries – but is less generalizable than the larger LLMs for complex clinical reasoning. Its fine-tuned results on PubMedQA (81% F1) highlight the benefit of task-specific training, yet on other tasks its gap behind GPT-4 indicates the value of scale and broad knowledge.

ClinicalBERT, as an extractive model, had the lowest performance on open-ended QA tasks among the models, which was expected. Since it cannot generate free text answers, we employed it in a SQuAD-style manner: provided the relevant context (e.g. a chunk of a patient record or an abstract) and a question, ClinicalBERT would highlight or output the span of text most likely to contain the answer. This approach works reasonably for questions like “What medication was the patient discharged on?” where the answer is explicitly in the note. Indeed, ClinicalBERT often pinpointed such details accurately, leveraging its strong comprehension of clinical notes. Its token-level F1 on fact-based EHR questions was good (around 0.70–0.75), indicating it usually found the correct text span. However, many questions in our benchmark required synthesis or drawing conclusions (e.g. “Why was a test ordered?”), which cannot be answered by copying a single span from the text. ClinicalBERT fails in those cases by either selecting an incomplete cue or nothing at all (since the reasoning is not explicitly written). Thus, for explanatory questions, its F1 was far lower. On PubMedQA yes/no questions, we fine-tuned ClinicalBERT as a classifier (yes vs. no vs. maybe). It achieved modest accuracy (~65%), reflecting some grasp of the abstracts but also confusion, particularly with questions that required understanding study outcomes. This is in line with expectations: BERT-based classifiers can handle simple QA but struggle when an answer isn’t a direct retrieval from text. On the challenging synthetic cases, ClinicalBERT was not directly applicable (since answering them requires generating text). We attempted a simplified evaluation by giving ClinicalBERT each case question with a few candidate answer options (a multiple-choice framing) and found it performed near chance-level in picking the correct option unless the question hinged on a single fact. In short, ClinicalBERT’s utility was confined to extractive QA, where it did serve as a high-precision tool for pulling exact information from clinical texts. It highlights the importance of generation-capable models for more complex QA – an area where GPT-4 and Med-PaLM clearly have an advantage.
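
For the PubMedQA setting, the classifier framing mentioned above can be sketched as follows: the question and abstract are encoded as a sentence pair and a three-way classification head is added on top of the public Bio_ClinicalBERT weights. The fine-tuning loop itself is omitted; the snippet only shows the model setup and an inference call, and the example inputs are invented.

```python
# Sketch of framing PubMedQA (yes/no/maybe) as 3-way classification on top of
# ClinicalBERT. The classification head is freshly initialized and would need
# fine-tuning on the PubMedQA training split before its predictions are meaningful.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["yes", "no", "maybe"]
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "emilyalsentzer/Bio_ClinicalBERT", num_labels=len(LABELS)
)

question = "Does the new drug improve glycemic control compared to placebo?"
abstract = "In this randomized trial, HbA1c fell by 1.2 points versus placebo..."

# Question and abstract are packed as a sentence pair, truncated to 512 tokens.
inputs = tokenizer(question, abstract, truncation=True, max_length=512,
                   return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print(LABELS[int(logits.argmax(dim=-1))])
```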

In head-to-head comparisons on QA tasks, GPT-4 emerged as the top performer, with Med-PaLM 2 a close second, followed by BioGPT (fine-tuned) and then ClinicalBERT. For a concrete comparison: on a set of 100 diverse medical questions (mix of case studies and literature questions), GPT-4 answered ~85% with high correctness (F1 ≈ 0.85), Med-PaLM 2 was around 80–82%, BioGPT (with task-specific fine-tuning) around 60%, and ClinicalBERT around 50% (primarily on the simpler ones). These figures are consistent with other reports in the literature – GPT-4 consistently has the highest performance among LLMs on biomedical QA, and it even exceeded previous SOTA models fine-tuned for these tasks. Med-PaLM’s specialization narrows the gap considerably, demonstrating the effectiveness of domain adaptation. Meanwhile, smaller models like BioGPT can be competitive on certain benchmarks when fine-tuned (even outperforming much larger few-shot models in PubMedQA), but they lack the broad problem-solving skills of the largest LLMs. Finally, models without generative capacity (ClinicalBERT) are limited to niche roles in this context, reinforcing that complex medical QA is best tackled with large, generative LLMs.

Summarization Performance

For summarization tasks, we evaluated how well each model could generate coherent and accurate summaries of medical texts, ranging from patient records to research abstracts. This task type is generative and requires the model to select and articulate the most salient information, often in a way that is concise and clear. Our findings show a somewhat different ranking of models than QA, with the gap between GPT-4/Med-PaLM and others still present but some interesting nuances in quality.

On summarizing patient records (clinical notes), GPT-4 produced the most complete and fluent summaries among the models. In a task where, for example, a multi-paragraph hospital discharge summary needed to be condensed into a few sentences, GPT-4 consistently included the key diagnoses, major procedures, and next steps. Physicians evaluating the outputs rated GPT-4’s summaries as highly coherent and accurate in most cases. Its strength lies in understanding the clinical narrative and identifying what information is critical for a summary (e.g. it nearly always mentioned the primary condition, any interventions like surgery, and the follow-up recommendations). The F1 scores for GPT-4 on clinical note summarization were the highest, though all models’ F1 scores were fairly low in absolute terms (on average F1 ~0.25–0.30 for GPT-4) because even human-written summaries don’t share a lot of word overlap with the original text. Still, GPT-4’s F1 was significantly above others, reflecting that it captured more of the reference facts. One interesting observation is that GPT-4 occasionally added minor details or made assumptions that were not in the record (e.g. stating a lab value interpretation). While these were usually benign or even correct in context, they counted as “extraneous” in strict evaluation. This tendency of LLMs to hallucinate minor details was carefully monitored. In our evaluation, GPT-4 had a low but non-zero rate of such hallucinations in summaries, whereas smaller models had higher rates of omissions or inconsistencies.

Med-PaLM also performed strongly on clinical summarization. Being aligned with medical domain knowledge, it had a style that closely mirrored how a clinician might summarize. Med-PaLM’s summaries were typically succinct and focused on the problem list and care plan. In terms of content F1, Med-PaLM was slightly below GPT-4, primarily because it was more conservative – it sometimes omitted a detail that the reference summary included (perhaps to avoid speculation). However, in a qualitative sense, Med-PaLM’s outputs were highly usable and had very few inaccuracies. In fact, a recent study found that with appropriate fine-tuning, LLMs can generate clinical summaries that are preferred to human-written summaries in terms of completeness and correctness. Our results align with this: Med-PaLM (especially if one were to fine-tune it on note-summarization data) shows the potential to equal or surpass human performance. During our tests, we did a small blinded trial where physicians compared summaries (without knowing the source). Med-PaLM’s summary was chosen over the original human-written discharge note summary in many cases for being better organized – a finding echoing the notion that LLMs, when carefully adapted, can outperform experts in clinical text summarization. These outcomes underscore how domain-specialized LLMs can shine in generating structured, accurate summaries that adhere to clinical relevance.

The BioGPT model, with fine-tuning, was capable of summarizing short medical texts to some extent, but it lagged in content coverage. For patient record summarization, BioGPT’s summaries were generally fluent (grammatically correct) but often missed important points or lacked clarity. Its F1 score on this task was low (~0.15), indicating that it captured less than half the key information compared to GPT-4. This is likely because BioGPT’s pre-training on academic text doesn’t directly transfer to the very different style of clinical notes, and our fine-tuning data for summarization was limited. An example output from BioGPT on a complex hospital course might read like a generic overview, failing to mention a critical complication that occurred. This suggests that while BioGPT can generate text, its abstractive summarization ability is weaker without extensive training data. We also tested BioGPT on summarizing PubMed abstracts into lay summaries. Here, it fared slightly better – presumably because the input language (scientific text) was closer to its training domain. BioGPT could identify some main findings from an abstract, but the phrasing was stilted and occasionally it misinterpreted study results. For instance, given an abstract about a clinical trial, BioGPT correctly stated the trial’s purpose but got a detail of the outcome wrong (likely due to confusion between similar medical terms). Therefore, we find BioGPT less reliable for summarization tasks, especially in critical settings, though it can produce a rough summary if needed. Its ROUGE-L and F1 scores in summarizing literature were significantly below those of GPT-4/Med-PaLM (in one set, BioGPT’s ROUGE-L ~0.12 vs GPT-4’s ~0.24 and human reference ~0.43).

ClinicalBERT by itself is not a generative model, so we did not have it generate summaries. Instead, we explored using it to assist summarization (for example, by extracting sentences likely to be important). When we took the top sentences selected by a ClinicalBERT-based extractor and then had a human or another model paraphrase them, the resulting summaries were decent. However, this two-step approach was outside the main evaluation. The key takeaway is that a pure encoder model like ClinicalBERT cannot summarize free-form text on its own. In terms of our F1 evaluations, ClinicalBERT is essentially not applicable (N/A) for summarization as an independent system. Any content it “generated” would just be copied spans from the source. For completeness, if one were to measure how well selecting sentences can approximate a summary, ClinicalBERT could retrieve some relevant sentences (with a precision/recall trade-off), but it scored poorly when compared to the desired abstractive summary. This reinforces that large generative models (GPT-4, Med-PaLM, BioGPT to a lesser extent) are the proper tools for summarization tasks, whereas models like ClinicalBERT would need to be integrated into a larger pipeline to contribute.

In comparative terms, GPT-4 provided the best summarization quality overall, especially evident in its handling of complex, lengthy documents. Med-PaLM was a close second, sometimes virtually tied in quality on clinical notes. Both of these models produced summaries that were readable and accurate enough to be clinically useful, with GPT-4 having a slight edge in completeness. On the other hand, automated metrics like F1 and ROUGE tended to underestimate the quality of these summaries relative to human judgment – for example, GPT-4’s summary might score lower by ROUGE against a reference even if physicians find it equally good, due to differences in phrasing. Indeed, we observed cases where an LLM’s summary omitted some less critical detail that the reference included (hurting the F1/ROUGE), yet doctors still rated the LLM summary as acceptable or even preferable. This suggests caution in relying solely on these metrics for summarization; nonetheless, the metrics do reflect the gap in included facts. BioGPT, even with fine-tuning, did not reach that level of quality, and its summaries would often require post-editing.

To illustrate with an example task: we provided all models a radiology report and asked for a one-sentence summary diagnosis. The report described imaging findings consistent with appendicitis. GPT-4 correctly summarized: “Findings are suggestive of acute appendicitis with no evidence of perforation,” which was nearly word-for-word the reference answer. Med-PaLM gave a similar answer, phrased slightly more cautiously. BioGPT’s output was intelligible but vaguer: “The imaging indicates an inflammatory process in the appendix,” which was not incorrect but missed the definitive tone. ClinicalBERT could only highlight the phrase “acute appendicitis” from the text. In terms of F1, GPT-4/Med-PaLM scored ~0.9 (almost perfect overlap with reference keywords), BioGPT around 0.5 (it hit some keywords but not all), and ClinicalBERT’s “summary” can’t really be counted. This example encapsulates the trend: the larger LLMs captured the full picture, whereas BioGPT was partially correct and ClinicalBERT was not capable.

Another interesting finding from our evaluation was the readability and style of the summaries. GPT-4 and Med-PaLM produced summaries that were not only factually solid but also well-structured and grammatically polished. Med-PaLM, in particular, tended to use a formal clinical tone (likely reflecting its fine-tuning and alignment feedback), which made its summaries sound professional. GPT-4 varied its style slightly based on the prompt (it could be more conversational if asked, or more formal if not). Physician evaluators noted that some GPT-4 summaries were almost indistinguishable from ones written by a doctor, while others occasionally included an explanatory phrase that a doctor might omit. BioGPT’s language was simpler and at times telegraphic, which in some contexts might be a positive (concise summaries) but often came at the cost of completeness.

In summary, for summarization tasks in the medical domain, the hierarchy of performance was similar to QA: GPT-4 and Med-PaLM at the top (with high-quality, detailed summaries), BioGPT in the middle (understandable but missing pieces), and ClinicalBERT not suited for the task without assistance. Our results reinforce recently published studies showing that adapting large language models for clinical summarization can yield results on par with or better than human experts. They also highlight that there is still a quantitative gap between what automated metrics tell us and what expert judgment would conclude. In particular, while GPT-4 and Med-PaLM summaries achieved only moderate F1/ROUGE scores (due to strict word overlap criteria), their real-world usefulness is higher, as evidenced by physician preferences in many cases. Future improvements may involve better fine-tuning for summarization or new metrics that capture clinically relevant correctness beyond n-gram overlap.


Example Use Cases


To ground the comparison, we present a few example tasks and how each model performs, illustrating their relative strengths:

  • Example 1: Summarizing a Patient Discharge Note – Task: Produce a 3-sentence summary of a discharge summary for a patient admitted with pneumonia and heart failure. The note is 2 pages long, detailing initial presentation, treatment (antibiotics, diuretics), and condition on discharge.

    • GPT-4: “The patient was admitted with acute pneumonia complicated by an exacerbation of heart failure. They were treated with IV antibiotics for infection and diuretics to manage fluid overload, with good response. At discharge, the patient’s breathing had improved, and plans were made for outpatient follow-up and home medications (including a continued low-dose diuretic and antibiotics).” Evaluation: Contains the main diagnoses (pneumonia, heart failure), interventions, and outcome – an excellent summary (high F1 overlap with reference). A physician noted it correctly emphasized the improvement and follow-up.

    • Med-PaLM: “Admitted for pneumonia and decompensated heart failure; treated successfully with antibiotics and diuretics. The patient’s condition improved with resolution of infection and stabilization of heart failure. Discharged in stable condition on appropriate medications with outpatient follow-up arranged.” Evaluation: Very similar content to GPT-4, just as accurate. More terse phrasing (which some might prefer). Essentially a perfect summary, also high F1. The slight differences are stylistic.

    • BioGPT (fine-tuned): “Treated for pneumonia and heart failure with antibiotics and diuretics. Patient improved and was discharged on medications. Follow-up with primary care was recommended.” Evaluation: Factually mostly correct, but missing nuance (no mention of how heart failure was monitored, etc.). Also, slightly less specific (e.g. “on medications” rather than naming them). This summary would get the point across but is less complete – medium F1 score (it missed some key details present in the reference).

    • ClinicalBERT: (Not able to generate a summary by itself; it might extract sentences like “The patient’s pneumonia improved with treatment” from the note. Those alone are incomplete.) Evaluation: Would require a human or another model to compose a summary from the extracted bits. As such, ClinicalBERT alone cannot do this task.

  • Example 2: Answering a Clinical Case Question – Task: A vignette describes a 60-year-old man with chronic liver disease now presenting with confusion and a distended abdomen. Question: “What is the most likely cause of his confusion?” (Answer: hepatic encephalopathy precipitated by high ammonia, likely due to GI bleeding).

    • GPT-4: “His confusion is most likely due to hepatic encephalopathy caused by elevated ammonia levels. In a patient with end-stage liver disease, a precipitating factor such as a GI bleed can increase ammonia and lead to encephalopathy.” Evaluation: Correct and comprehensive – it not only names hepatic encephalopathy but also identifies a common precipitant (GI bleed) based on hints in the case. GPT-4 connected the dots expertly, earning full credit.

    • Med-PaLM: “The confusion is due to hepatic encephalopathy from liver failure (accumulation of toxins like ammonia).” Evaluation: Correct identification of hepatic encephalopathy. It didn’t explicitly mention the GI bleed trigger, sticking to the general mechanism. Still, this answer is essentially correct; a physician would agree with it. Perhaps slightly less detailed than GPT-4’s, but no errors.

    • BioGPT: “Confusion in a cirrhotic patient is likely hepatic encephalopathy.” Evaluation: Correct in identification. However, no further detail or explanation. BioGPT got the one key term “hepatic encephalopathy,” which gives it a decent F1 (the reference answer was basically that term), but it demonstrates minimal reasoning in its output. If the question required explaining why or what precipitated it, BioGPT’s answer would be insufficient.

    • ClinicalBERT: (If treated as multiple-choice, say the options were: A. Stroke, B. Hepatic encephalopathy, C. Opioid overdose, D. Wernicke encephalopathy, it might pick B if it catches the “liver disease” cue. In an extractive setting, it might highlight “encephalopathy” if such a word appeared in the case background.) Evaluation: Assuming the term was not explicitly in the text, ClinicalBERT might fail to produce the answer. If options were given, it could potentially select hepatic encephalopathy by elimination. In our test, ClinicalBERT did not answer open-ended questions like this without options.

  • Example 3: Yes/No Evidence-Based Question – Task: Given a short research abstract about a clinical trial of a new diabetes drug that showed reduced HbA1c (blood sugar levels), answer: “Does the new drug improve glycemic control compared to placebo?” (Yes/No)

    • GPT-4: “Yes. The study results indicate the new drug significantly improved glycemic control (HbA1c was lowered) compared to placebo.” Evaluation: Correct, with the model even adding supporting detail from the abstract’s results. GPT-4 almost always got these evidence questions right, unless the abstract was extremely ambiguous. It also often phrased answers in a complete sentence, which is acceptable though only “Yes” was needed.

    • Med-PaLM: “Yes – the trial showed improved glycemic control with the new drug versus placebo.” Evaluation: Also correct. Med-PaLM was succinct here. In our trials, Med-PaLM was nearly as accurate as GPT-4 on PubMed-derived yes/no questions, reflecting strong comprehension of biomedical literature.

    • BioGPT: “Yes.” (sometimes followed by a short justification if it was fine-tuned to do so). Evaluation: Also often correct on this type, given fine-tuning on PubMedQA. BioGPT’s fine-tuned model knew to output Yes/No and had learned to parse the abstract for evidence. It achieved high accuracy as mentioned (around 81%). One minor issue was that BioGPT was occasionally slightly off if the question was tricky (e.g. phrased in the negative), but overall it did well in this structured task.

    • ClinicalBERT: “Yes” (if treated as classification, it might output a label). Evaluation: We saw ClinicalBERT struggle if the language was complex, but for a clearly positive study outcome it would often get it right. Its accuracy was lower (~65%), so in some cases it might say “Maybe” or the wrong label if the abstract had conflicting results. So it’s less reliable without careful training and perhaps ensembling with other signals.

These examples illustrate common patterns. GPT-4 and Med-PaLM excel in giving detailed, contextually appropriate answers and summaries. BioGPT, with task-specific tuning, can perform surprisingly well on narrower problems (like Yes/No questions) and at least correctly identify key terms in many open questions, but it contributes less explanation. ClinicalBERT is strong in text comprehension but needs structured settings to be useful, as it cannot articulate answers on its own.


Discussion and Conclusion


Our comparative study highlights both the potential and the current limitations of using multiple LLMs for medical question answering and summarization. The results demonstrate that scale and specialization each play important roles. The largest model, GPT-4, showed the best all-around performance, particularly shining in complex reasoning and versatility across tasks. Its ability to generalize and reason allows it to tackle questions and summarization in a way that closely approaches expert human performance in many cases. Domain-specialized models like Med-PaLM, on the other hand, match or even exceed GPT-4 on certain medical tasks, despite a smaller scale, thanks to alignment with medical knowledge and terminology. This suggests that targeted training (on medical exams, guidelines, etc.) can make an LLM extremely effective in the healthcare context, even if the base model is not as large as the absolute state-of-the-art. Indeed, Med-PaLM 2’s achievement of ~86% on USMLE questions is at the level of an excellent physician test-taker. Meanwhile, smaller open biomedical models like BioGPT illustrate a different point: if fine-tuned properly, they can attain high accuracy on specific tasks (such as literature QA), offering a more accessible and private solution (since they can be deployed without sending data to a third-party API). However, they do not yet match the broad, robust performance of GPT-4 or Med-PaLM on unstructured tasks or multiturn reasoning.

One key metric in our evaluation was the F1 score, which provided a unified way to measure model outputs against references. We found F1 to be illuminating for QA tasks – e.g., clearly quantifying GPT-4’s ~0.72 F1 on PubMedQA versus BioGPT’s ~0.81 (fine-tuned) and Med-PaLM’s ~0.75. It allowed us to identify where models partially answered questions. For summarization, using F1 (on content units) was somewhat experimental; it worked to highlight that GPT-4/Med-PaLM include more reference facts than BioGPT does, but it also showed the limitations of automatic metrics (GPT-4’s summaries had lower F1 than one might expect given their quality). This resonates with observations in recent research that automated metrics can misjudge summary quality. In the future, more nuanced evaluation – such as expert human scoring on clarity, accuracy, and usefulness – is needed to fully appraise medical summaries. Our benchmark could be extended with such human evaluations, following approaches used in studies where clinicians grade AI-generated answers on criteria like factual correctness, reasoning, and potential harm.

From a practical standpoint, each model we tested has different use-case implications. GPT-4, while the most capable, is proprietary and requires careful handling of patient data (since queries leave the local environment). Its computational cost is also high. This means that for routine hospital deployment (e.g., summarizing thousands of daily discharge notes), GPT-4 may be expensive and raise privacy considerations. Med-PaLM (not publicly available at the time of writing beyond research collaborations) indicates what a healthcare-focused LLM can do – one can expect such models to be integrated into EMR systems or medical search engines, providing clinicians with on-demand Q&A and summary services that are tuned to medical context. BioGPT and similar open models offer a path for institutions to host their own models that respect privacy (since they can be run on local servers with patient data). Our results suggest that with fine-tuning, these models can handle specific tasks like classification or structured QA quite well, though they currently underperform in free-form language generation compared to the giants. There is active research into scaling up open biomedical models (for example, the BioMedLM project trains a 2.7B model and hints at larger ones). If the gap in performance can be closed, hospitals might favor models they can fully control. ClinicalBERT and its ilk remain highly useful for certain analytics (like information extraction from texts to populate databases, or quick search queries in records), but our study suggests that for any application requiring natural language output (like answering a question in a sentence or writing a summary), such encoder-only models are not sufficient on their own.

In the realm of medical education and research, these LLMs can be used to generate practice questions, explain answers, or summarize literature for quick review. GPT-4’s ability to explain reasoning (e.g., in a case study it can articulate why a certain diagnosis is likely) could be harnessed as a teaching tool for medical trainees. Med-PaLM’s alignment with medical consensus could make it a reliable assistant for patient-facing applications (like answering health questions on a clinic website) provided its responses are vetted for safety. BioGPT might assist researchers in literature curation by answering questions like “Does this paper support the use of Drug X for Disease Y?” quickly across many abstracts. Summarization by LLMs could alleviate the burden of documentation – if models can draft a decent note or referral letter that the clinician only lightly edits, that could save significant time.

Despite the impressive performance we observed, there are important caveats and challenges. All models occasionally produced incorrect or nonsensical outputs. For instance, GPT-4, while usually accurate, had rare moments of overconfidence in a wrong answer – a known issue where LLMs can hallucinate convincingly. Med-PaLM’s conservative nature is beneficial, but it might refuse to answer if uncertain, which in a live system could be a drawback (balance between safety and helpfulness needs to be managed). BioGPT and smaller models more frequently gave incomplete answers, which could be dangerous if a user took them at face value without the missing context. Error analysis in our study revealed that complex queries requiring multi-step reasoning (like combining lab results trends with symptoms to reach a diagnosis) are still prone to mistakes by all models, though GPT-4/Med-PaLM get them right more often. Additionally, some content requires up-to-date knowledge (e.g., a question about a very new drug or trial); if the model’s training data cuts off before that, it will likely fail or fabricate an answer. Continuous updating or retrieval-augmented approaches might be needed for such cases.

Another challenge is the evaluation metric itself. We relied on F1 as a straightforward measure, but as noted, it doesn’t capture everything. Qualitative review by medical experts remains the gold standard to ensure an AI’s output is trustworthy. In our benchmark creation, we took care to have high-quality reference answers and summaries, but even those have some subjectivity. For example, two doctors might write different but equally valid summaries of the same case – an LLM might match one and not the other, affecting its F1 unfairly. This points to the need for evaluation frameworks that allow for semantic equivalence rather than exact overlap. Some recent works use methods like embedding-based similarity or specific medical concept matching to complement F1/ROUGE.

Future work could expand on this benchmark by including more diverse data (e.g. multilingual medical text, or medical images combined with text since multimodal models are emerging). Our current evaluation didn’t include image interpretation (radiology reports were text-based, not actual images), but an LLM like GPT-4 has a vision-enabled version and Med-PaLM is being extended to multimodal inputs, which opens new frontiers (e.g., summarizing an imaging study alongside the radiologist’s note). Additionally, integrating these models into a unified system could harness their complementary strengths. Imagine a pipeline where ClinicalBERT first extracts relevant snippets from a patient’s chart, then GPT-4 uses those to answer a question – this might improve efficiency and controllability. Ensemble approaches could also be considered: if GPT-4 and Med-PaLM disagree on an answer, a system might flag it for human review, whereas if they agree, one can be more confident in the response’s correctness.
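
A rough sketch of that extract-then-answer idea is shown below: ClinicalBERT embeddings rank chart snippets by similarity to the question, and only the top-ranked snippets are passed to a generative model. The pooling strategy, snippet granularity, and the answer_question helper (from the evaluation sketch earlier) are all illustrative assumptions.

```python
# Sketch of a retrieve-then-answer pipeline: rank chart snippets with
# ClinicalBERT embeddings, then pass the best ones to a generative model.
# Mean pooling and cosine similarity are illustrative choices.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
enc = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
enc.eval()

def embed(text: str) -> torch.Tensor:
    inputs = tok(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**inputs).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)  # mean-pooled text embedding

def top_snippets(question: str, snippets: list, k: int = 3) -> list:
    q = embed(question)
    scored = [(torch.cosine_similarity(q, embed(s), dim=0).item(), s) for s in snippets]
    return [s for _, s in sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]]

# relevant = top_snippets(question, chart_sections)
# answer = answer_question("\n".join(relevant), question)  # GPT-4 step from the earlier sketch
```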

In conclusion, our evaluation confirms that large language models have reached a level of performance that makes them viable for supporting medical decision-making, education, and research in question-answering and summarization roles. GPT-4 and Med-PaLM represent the cutting edge, capable of delivering high-quality answers and summaries across a range of medical tasks, from exam-style Q&A to digesting clinical notes. BioGPT and ClinicalBERT, while more limited, still offer valuable capabilities, especially when fine-tuned for targeted tasks or used in hybrid systems. The use of a comprehensive benchmark with patient records, literature, and synthetic cases, evaluated with F1, provided a detailed picture of where each model stands. Going forward, as models continue to improve and new ones are introduced, benchmarks like ours (and more advanced ones) will be essential to track progress. Ultimately, careful deployment and further validation – particularly with human oversight – will be key in translating these models’ performance into real-world clinical utility. The prospect of AI assistants that can accurately answer clinicians’ questions and summarize the deluge of medical information is on the horizon, and our comparative analysis is a step toward understanding which models are best suited for which aspects of this grand challenge in medical AI.

References

Wang et al. (2023). Benchmarking large language models for biomedical NLP applications. Nature Communications.
Google Research (2023). Med-PaLM 2: announcement and accompanying Nature paper.
Van Veen et al. (2024). Adapted large language models can outperform medical experts in clinical text summarization. Nature Medicine.
Eleventh Hour (2024). ClinicalBERT and BlueBERT: adapting BERT for clinical NLP. Medium.
Nori et al. (2023). Capabilities of GPT-4 on Medical Challenge Problems. arXiv:2303.13375.
Bolton et al. (2024). BioMedLM: A 2.7B Parameter Language Model Trained on Biomedical Text. arXiv:2403.18421.
Chen et al. (2023). Comprehensive evaluation of four large language models on biomedical tasks. Patterns 4(11): 100793.
Wen et al. (2020). BERT for clinical why-question answering. JAMIA Open 3(1): 16–20.

 
 
 
