Evaluating large language models for WAO/EAACI guideline compliance in hereditary angioedema management

Mehmet Emin Gerek
Tuğba Önalan
Fatih Çölkesen
Şevket Arslan

Keywords

artificial intelligence, hereditary angioedema, large language models, medical guidelines compliance, WAO/EAACI guidelines

Abstract

Introduction: Hereditary angioedema (HAE) is a rare but potentially life-threatening disorder characterized by recurrent swelling episodes. Adherence to clinical guidelines, such as the World Allergy Organization/European Academy of Allergy and Clinical Immunology (WAO/EAACI) guidelines, is crucial for effective management. With the increasing role of artificial intelligence in medicine, large language models (LLMs) offer potential for clinical decision support. This study evaluates the performance of ChatGPT, Gemini, Perplexity, and Copilot in providing guideline-adherent responses for HAE management.


Methods: Twenty-eight key recommendations from the WAO/EAACI HAE guidelines were reformulated into interrogative formats and posed to the selected LLMs. Two independent clinicians assessed responses based on accuracy, adequacy, clarity, and citation reliability using a five-point Likert scale. References were categorized as guideline-based, trustworthy, or untrustworthy. A reevaluation with explicit citation instructions was conducted, with discrepancies resolved by a third reviewer.
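
The scoring workflow described above can be illustrated with a minimal sketch. The abstract does not state how the two clinicians' ratings were combined, so averaging them per response is an assumption made here for illustration. The function names (median_score_by_model, untrustworthy_rate) and all numbers are hypothetical and are not the study's data.

```python
from collections import Counter
from statistics import median

# The three reference categories named in the Methods.
REFERENCE_CATEGORIES = ("guideline-based", "trustworthy", "untrustworthy")

# Hypothetical ratings: (model, question_id, rater_1, rater_2), each on a
# five-point Likert scale. Illustrative values only, NOT the study's data.
ratings = [
    ("ChatGPT", 1, 5, 5),
    ("ChatGPT", 2, 5, 4),
    ("Copilot", 1, 3, 3),
    ("Copilot", 2, 4, 2),
]


def median_score_by_model(rows):
    """Average the two raters per response (an assumption), then take
    the median of those averages for each model."""
    per_model = {}
    for model, _question_id, rater_1, rater_2 in rows:
        for score in (rater_1, rater_2):
            if not 1 <= score <= 5:
                raise ValueError(f"Likert score out of range: {score}")
        per_model.setdefault(model, []).append((rater_1 + rater_2) / 2)
    return {model: median(scores) for model, scores in per_model.items()}


def untrustworthy_rate(labels):
    """Fraction of cited references labeled 'untrustworthy'."""
    counts = Counter(labels)
    unknown = set(counts) - set(REFERENCE_CATEGORIES)
    if unknown:
        raise ValueError(f"Unknown reference categories: {sorted(unknown)}")
    total = sum(counts.values())
    return counts["untrustworthy"] / total if total else 0.0


print(median_score_by_model(ratings))
print(untrustworthy_rate(["guideline-based", "untrustworthy", "trustworthy"]))
```

A between-model comparison would need a significance test on top of these summaries. The abstract does not name the test used, so none is assumed in this sketch.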


Results: ChatGPT and Gemini outperformed Perplexity and Copilot, with median accuracy and adequacy scores of 5.0 for ChatGPT and Gemini versus 3.0 for Perplexity and Copilot. ChatGPT had the lowest rate of unreliable references, whereas Gemini showed inconsistency in citation behavior. Significant differences in response quality were observed among models (p < 0.001). Providing explicit sourcing instructions improved performance consistency, particularly for Gemini.


Conclusion: ChatGPT and Gemini demonstrated superior adherence to WAO/EAACI guidelines, suggesting that LLMs can support clinical decision-making in rare diseases. However, inconsistencies in citation practices highlight the need for further validation and optimization to enhance reliability in medical applications.

Full Question List

Should all patients suspected of having HAE be assessed for blood levels of C1-INH function, C1-INH protein, and C4?

Should testing for C1-INH function, C1-INH protein, and C4 be repeated in patients who test positive to confirm the diagnosis of HAE-1/2?

Should patients suspected of having HAE and exhibiting normal C1-INH levels and function be assessed for known mutations underlying HAE-nC1-INH?

Should on-demand treatment be considered for all HAE attacks?

Should any HAE attack affecting or potentially affecting the upper airway be treated?

Is it recommended to treat HAE attacks as early as possible?

Should HAE attacks be treated with either intravenous C1 inhibitor, ecallantide, or icatibant?

Is it recommended to consider early intubation or surgical airway intervention in cases of progressive upper airway edema for HAE patients?

Is it necessary that all patients with HAE carry a sufficient number of on-demand medications at all times?

Is it recommended for HAE patients to consider short-term prophylaxis before medical, surgical, or dental procedures as well as exposure to other angioedema attack-inducing events?

Is the use of intravenous plasma-derived C1 inhibitor recommended as the first-line short-term prophylaxis for HAE patients?

Is it suggested for HAE patients to consider prophylaxis prior to exposure to patient-specific angioedema-inducing situations?

Are the goals of treatment recommended to be achieving total control of the disease and normalizing patients’ lives for HAE patients?

Is it recommended to evaluate HAE patients for long-term prophylaxis at every visit, considering disease activity, burden, and control as well as patient preference?

Is the use of plasma-derived C1 inhibitor recommended as the first-line long-term prophylaxis for HAE patients?

Is the use of lanadelumab recommended as the first-line long-term prophylaxis for HAE patients?

Is the use of berotralstat recommended as the first-line long-term prophylaxis for HAE patients?

Is the use of androgens recommended only as second-line long-term prophylaxis for HAE patients?

Is it suggested that all HAE patients using long-term prophylaxis be routinely monitored for disease activity, impact, and control to inform the optimization of treatment dosages and outcomes?

Is it recommended to carry out testing for children from HAE-affected families as soon as possible and to test all offspring of an affected parent?

Is it recommended to use C1 inhibitor or icatibant for the treatment of attacks in HAE diagnosed children under the age of 12?

Is it recommended to use plasma-derived C1 inhibitors as the preferred therapy during pregnancy and lactation for HAE patients?

Is it recommended for all HAE patients to have an action plan and treatment plan?

Is it recommended to have HAE-specific comprehensive, integrated care available for all HAE patients?

Is it recommended for HAE patients to be treated by a specialist with specific expertise in managing HAE?

Is it recommended that all HAE patients provided with on-demand treatment licensed for self-administration be taught to self-administer?

Is it recommended that all HAE patients be educated about triggers that may induce attacks?

Is it recommended to screen family members of HAE patients for HAE?