Assessing Artificial Intelligence–Generated Responses to Urology Patient In-Basket Messages
Abstract
Introduction:
Electronic patient messaging utilization has increased in recent years and has been associated with physician burnout. ChatGPT is a language model that has shown the ability to generate near-human level text responses. This study evaluated the quality of ChatGPT responses to real-world urology patient messages.
Methods:
One hundred electronic patient messages were collected from a practicing urologist’s inbox and categorized based on the question content. Individual responses were generated by entering each message into ChatGPT. The questions and responses were independently evaluated by 5 urologists and graded on a 5-point Likert scale. Questions were graded based on difficulty, and responses were graded based on accuracy, completeness, harmfulness, helpfulness, and intelligibleness. Whether or not the response could be sent to a patient was also assessed.
Results:
Overall, 47% of responses were deemed acceptable to send to patients. ChatGPT performed better on easy questions, with 56% of responses to easy questions deemed acceptable to send compared with 34% of responses to difficult questions (P = .03). Responses to easy questions were more accurate, complete, helpful, and intelligible than responses to difficult questions. There was no difference in response quality based on question content.
Conclusions:
ChatGPT generated acceptable responses to nearly 50% of patient messages, performing better on easy questions than on difficult ones. Using ChatGPT to help respond to patient messages could decrease the time burden on the care team and improve wellness. Artificial intelligence performance will likely continue to improve with advances in generative artificial intelligence technology.
Artificial intelligence (AI)–driven large language models such as ChatGPT utilize machine learning to process complex data and generate near-human quality text responses, with growing applications in sectors such as health care. ChatGPT has previously shown promise in the generation of medical knowledge, with prior studies showing that it performed at or near the passing threshold on the United States Medical Licensing Examination Step 3. Moreover, ChatGPT was able to make accurate clinical decisions in response to clinical vignettes.1-3 ChatGPT’s responses to general medical questions obtained from a public online forum were also rated as more empathetic and of higher quality than those from physicians.4
Despite these promising initial findings, the performance of language models has been less robust in urology. ChatGPT performed poorly on the urology Self-Assessment Study Program used for in-service examination preparation, correctly answering only 30% of questions.5 When generating responses to clinical vignettes within urology, ChatGPT outputs were found to be overall inappropriate and of poor quality.6
Improvements in electronic health record (EHR) technology, meaningful use requirements and incentives, and societal changes following the COVID-19 pandemic have led to increased utilization of virtual health care delivery. This shift has been associated with an increased time burden of EHR use for physicians, with a large portion of that time attributed to patient in-basket messages. Patient in-basket messages have increased by 157%, with each message adding over 2 minutes of additional time spent in the EHR.7 This time is often uncompensated and frequently occurs after hours, consequences that contribute to physician burnout.8
While patient messages pertaining to generic issues such as appointment scheduling and medication refills can be handled quickly by nonphysician staff, those pertaining to specific medical questions often require more advanced knowledge and complex thinking. Utilization of generative AI technology such as ChatGPT represents a novel tool to decrease the administrative burden afflicting physicians and other members of the health care team. This study seeks to assess the ability of ChatGPT to generate responses to urology patient questions.
Methods
After receiving Institutional Review Board approval (IRB No. 70909), 100 unique patient messages requesting medical advice were collected from the in-basket of a practicing academic urologist specializing in andrology. Questions were selected consecutively with exclusion of nonclinical questions such as those relating to scheduling or medication refills. The questions were categorized based on content into groups of clinical decision-making, health and treatment plan, postoperative concerns, symptoms, or test results. Identifying information was removed to ensure patient confidentiality. Each question was entered into ChatGPT 3.5 using a new session for each question, and the generated responses were recorded.
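Response generation as described above (one message per fresh session) can also be scripted rather than entered manually. The following is a minimal sketch using the OpenAI Python client; the model name ("gpt-3.5-turbo" as a stand-in for ChatGPT 3.5), the absence of a system prompt, and the helper names are illustrative assumptions and are not details reported in the study.

```python
# Minimal sketch only: the study entered each de-identified message into a fresh
# ChatGPT 3.5 session and recorded the output. This shows how the same
# one-message-per-session workflow could be scripted with the OpenAI Python client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_response(patient_message: str) -> str:
    """Send one de-identified message with no prior conversation history,
    mirroring the new-session-per-question design."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed stand-in for ChatGPT 3.5
        messages=[{"role": "user", "content": patient_message}],
    )
    return completion.choices[0].message.content


# Iterate over the collected messages and record each generated response.
messages = ["De-identified patient question 1...", "De-identified patient question 2..."]
responses = [generate_response(m) for m in messages]
```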
The questions and generated responses were independently reviewed by 5 urologists. The questions were graded for difficulty on a 5-point Likert scale (5 = difficult, 4 = somewhat difficult, 3 = neutral, 2 = somewhat easy, 1 = easy), considering the complexity of the question, the level of knowledge needed to answer it appropriately, and the number of components required for a complete answer. Questions with a mean difficulty score ≥ 3 were considered difficult. Responses were graded on a 5-point Likert scale (1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, 5 = strongly agree) in the areas of accuracy, completeness, harmfulness, helpfulness, and intelligibleness. Reviewers also graded whether they would send the generated response to a patient. A response was deemed acceptable to send to a patient if the majority of reviewers answered yes.
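For clarity, the two aggregation rules above (mean difficulty ≥ 3 defines a difficult question; a majority of the 5 reviewers voting yes defines an acceptable response) can be expressed in a few lines. This is an illustrative sketch only; the data shapes and example ratings are hypothetical.

```python
# Minimal sketch of the two aggregation rules described above; the example ratings
# are hypothetical.
from statistics import mean


def is_difficult(difficulty_scores: list[int]) -> bool:
    """A question is difficult if its mean 5-point difficulty rating is >= 3."""
    return mean(difficulty_scores) >= 3


def is_acceptable_to_send(send_votes: list[bool]) -> bool:
    """A response is acceptable to send if a majority of reviewers (3 of 5) voted yes."""
    return sum(send_votes) > len(send_votes) / 2


# Example ratings from the 5 reviewers for a single question/response pair:
print(is_difficult([4, 3, 2, 5, 3]))                            # True (mean 3.4)
print(is_acceptable_to_send([True, True, False, True, False]))  # True (3 of 5 yes)
```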
Descriptive statistics (mean, median, SD, and range) were reported for each content group and for all questions overall in the areas of accuracy, completeness, harmfulness, helpfulness, and intelligibleness. Additional analysis was stratified by question difficulty.
Ordinal logistic regression was applied and odds ratios were calculated for each area to assess interrater reliability. The χ2 test was used to compare acceptance rates between easy and difficult questions. All tests were 2-sided, and P < .05 was considered significant. All analyses were performed using SAS version 9.4.
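The analysis itself was performed in SAS 9.4. As an illustrative analogue only, the sketch below shows how an odds ratio for easy vs difficult questions could be obtained from a proportional-odds (ordinal logistic) model in Python; the simulated data and variable names are assumptions, not the study data.

```python
# Illustrative Python analogue only; the study's analysis was performed in SAS 9.4.
# A proportional-odds (ordinal logistic) model of a 5-point rating on question
# difficulty yields an odds ratio comparable in spirit to those reported in Table 2.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "rating": rng.integers(1, 6, size=200),  # 5-point Likert rating per reviewer-response pair
    "easy": rng.integers(0, 2, size=200),    # 1 = easy question, 0 = difficult
})

# Fit the proportional-odds model; the exponentiated slope is the odds ratio for
# receiving a higher rating on easy vs difficult questions.
ratings = df["rating"].astype(pd.CategoricalDtype(ordered=True))
model = OrderedModel(ratings, df[["easy"]], distr="logit")
result = model.fit(method="bfgs", disp=False)
odds_ratio = np.exp(result.params["easy"])
print(f"OR (easy vs difficult): {odds_ratio:.2f}")
```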
Results
Overall, ChatGPT was judged to give accurate (mean 4.0) and intelligible (mean 4.7) answers to questions (Table 1). Ratings of completeness (mean 3.9) and helpfulness (mean 3.5) were lower. Importantly, little harm was detected across all question types (mean 1.4). ChatGPT performed significantly better on easy (n = 59) than on difficult (n = 41) questions on all parameters except harm, which was similarly low at both difficulty levels (Table 2). Comparisons between question categories showed some differences (Table 3). Although questions involving patient symptoms generally scored higher than others, no consistent ranking could be discerned. Additionally, there was no relationship between question category and the rating of easy or difficult.
Table 1. Ratings of ChatGPT Responses by Question Category

| | Clinical decision-making (n = 25) | Health and treatment plan (n = 30) | Postoperative (n = 18) | Symptoms (n = 18) | Test results (n = 9) | Overall (n = 100) |
|---|---|---|---|---|---|---|
| Accuracy, mean (SD) | 4.25 (0.89) | 3.95 (1.22) | 3.84 (1.12) | 4.26 (0.95) | 4.07 (1.07) | 4.07 (1.08) |
| Accuracy, median (min, max) | 4 (1, 5) | 4 (1, 5) | 4 (1, 5) | 5 (1, 5) | 4 (1, 5) | 4 (1, 5) |
| Completeness, mean (SD) | 3.85 (1.08) | 4.05 (1.18) | 3.83 (1.21) | 4.14 (1.02) | 3.58 (1.12) | 3.94 (1.14) |
| Completeness, median (min, max) | 4 (1, 5) | 5 (1, 5) | 4 (1, 5) | 5 (1, 5) | 4 (1, 5) | 4 (1, 5) |
| Harmfulness, mean (SD) | 1.28 (0.71) | 1.26 (0.66) | 1.36 (0.84) | 1.63 (1.15) | 1.42 (0.87) | 1.36 (0.84) |
| Harmfulness, median (min, max) | 1 (1, 5) | 1 (1, 4) | 1 (1, 5) | 1 (1, 5) | 1 (1, 4) | 1 (1, 5) |
| Helpfulness, mean (SD) | 3.46 (1.23) | 3.55 (1.34) | 3.42 (1.32) | 3.5 (1.18) | 3.09 (1.35) | 3.45 (1.28) |
| Helpfulness, median (min, max) | 4 (1, 5) | 4 (1, 5) | 3 (1, 5) | 4 (1, 5) | 3 (1, 5) | 4 (1, 5) |
| Intelligibleness, mean (SD) | 4.69 (0.66) | 4.8 (0.52) | 4.72 (0.65) | 4.77 (0.62) | 4.47 (1.04) | 4.72 (0.66) |
| Intelligibleness, median (min, max) | 5 (2, 5) | 5 (2, 5) | 5 (1, 5) | 5 (1, 5) | 5 (1, 5) | 5 (1, 5) |
| Difficulty, mean (SD) | 3.03 (1.44) | 2.37 (1.31) | 2.51 (1.27) | 2.77 (1.39) | 2.98 (1.39) | 2.69 (1.38) |
| Difficulty, median (min, max) | 3 (1, 5) | 2 (1, 5) | 2 (1, 5) | 3 (1, 5) | 3 (1, 5) | 3 (1, 5) |
| Send to patient?, mean (SD) | 0.46 (0.5) | 0.49 (0.5) | 0.47 (0.5) | 0.53 (0.5) | 0.33 (0.48) | 0.47 (0.5) |
| Send to patient?, median (min, max) | 0 (0, 1) | 0 (0, 1) | 0 (0, 1) | 1 (0, 1) | 0 (0, 1) | 0 (0, 1) |
Table 2. Ratings of ChatGPT Responses by Question Difficulty (Easy vs Difficult)

| Category | Clinical decision-making | Health and treatment plan | Postoperative | Symptoms | Test results | All responses | OR (95% CI) | P value |
|---|---|---|---|---|---|---|---|---|
| Accuracy, mean (SD) | | | | | | | 1.41 (1.02-1.97) | .0396 |
| Difficult | 4.03 (0.85) | 3.88 (1.04) | 3.75 (1.1) | 4.57 (0.68) | 4.07 (0.88) | 4.02 (0.96) | | |
| Easy | 4.53 (0.86) | 3.99 (1.31) | 3.92 (1.14) | 4.1 (1.04) | 4.07 (1.17) | 4.11 (1.15) | | |
| Completeness, mean (SD) | | | | | | | 2.57 (1.84-3.58) | < .0001 |
| Difficult | 3.51 (1.02) | 3.9 (1.11) | 3.43 (1.28) | 4 (1.11) | 2.93 (0.96) | 3.62 (1.13) | | |
| Easy | 4.27 (1.01) | 4.13 (1.21) | 4.16 (1.06) | 4.22 (0.98) | 3.9 (1.06) | 4.16 (1.09) | | |
| Harmfulness, mean (SD) | | | | | | | 0.85 (0.55-1.32) | .4691 |
| Difficult | 1.3 (0.73) | 1.42 (0.84) | 1.35 (0.8) | 1.37 (0.76) | 1.67 (1.05) | 1.38 (0.8) | | |
| Easy | 1.25 (0.7) | 1.18 (0.54) | 1.36 (0.88) | 1.77 (1.28) | 1.3 (0.75) | 1.36 (0.86) | | |
| Helpfulness, mean (SD) | | | | | | | 2.30 (1.66-3.18) | < .0001 |
| Difficult | 2.99 (1.06) | 3.28 (1.31) | 3 (1.4) | 3.63 (1.1) | 2.6 (1.12) | 3.13 (1.22) | | |
| Easy | 4.05 (1.18) | 3.68 (1.34) | 3.76 (1.15) | 3.43 (1.23) | 3.33 (1.4) | 3.68 (1.28) | | |
| Intelligibleness, mean (SD) | | | | | | | 1.92 (1.23-2.99) | .004 |
| Difficult | 4.64 (0.7) | 4.72 (0.54) | 4.53 (0.85) | 4.77 (0.5) | 4.53 (0.83) | 4.65 (0.68) | | |
| Easy | 4.75 (0.62) | 4.84 (0.51) | 4.88 (0.39) | 4.77 (0.67) | 4.43 (1.14) | 4.77 (0.64) | | |
| Send to patient?, mean (SD) | | | | | | | 2.08 (1.45-3.00) | < .0001 |
| Difficult | 0.31 (0.47) | 0.36 (0.48) | 0.4 (0.5) | 0.57 (0.5) | 0.13 (0.35) | 0.37 (0.48) | | |
| Easy | 0.65 (0.48) | 0.55 (0.5) | 0.52 (0.5) | 0.52 (0.5) | 0.43 (0.5) | 0.55 (0.5) | | |
Table 3. Pairwise Comparisons Between Question Categories, OR (95% CI)

| Comparison | Accurate | Complete | Harmful | Helpful | Intelligible | Send to patient |
|---|---|---|---|---|---|---|
| Symptoms vs postoperative | 2.06 (1.19, 3.55) | 1.60 (0.93, 2.74) | 2.00 (1.00, 3.98) | 1.09 (0.65, 1.83) | 1.25 (0.59, 2.67) | 1.31 (0.73, 2.35) |
| Clinical decision vs postoperative | 1.91 (1.16, 3.16) | 0.95 (0.58, 1.55) | 0.89 (0.44, 1.81) | 1.03 (0.64, 1.67) | 0.91 (0.47, 1.77) | 0.99 (0.58, 1.70) |
| Plan vs postoperative | 1.34 (0.84, 2.16) | 1.49 (0.92, 2.41) | 0.87 (0.44, 1.73) | 1.24 (0.78, 1.98) | 1.31 (0.67, 2.58) | 1.08 (0.64, 1.83) |
| Test results vs postoperative | 1.48 (0.77, 2.85) | 0.63 (0.33, 1.19) | 1.31 (0.55, 3.14) | 0.62 (0.33, 1.18) | 0.55 (0.25, 1.25) | 0.57 (0.27, 1.20) |
| Symptoms vs test results | 1.39 (0.71, 2.71) | 2.55 (1.33, 4.91) | 1.52 (0.67, 3.46) | 1.74 (0.92, 3.30) | 2.26 (0.98, 5.23) | 2.29 (1.09, 4.82) |
| Clinical decision vs test results | 1.29 (0.69, 2.44) | 1.51 (0.82, 2.80) | 0.68 (0.29, 1.57) | 1.66 (0.90, 3.04) | 1.64 (0.77, 3.51) | 1.73 (0.85, 3.53) |
| Plan vs test results | 0.91 (0.49, 1.68) | 2.39 (1.30, 4.37) | 0.66 (0.29, 1.50) | 1.99 (1.10, 3.61) | 2.37 (1.10, 5.11) | 1.90 (0.94, 3.81) |
| Symptoms vs plan | 1.53 (0.94, 2.50) | 1.07 (0.66, 1.75) | 2.29 (1.24, 4.25) | 0.88 (0.55, 1.40) | 0.95 (0.47, 1.93) | 1.21 (0.71, 2.04) |
| Clinical decision vs plan | 1.42 (0.91, 2.22) | 0.64 (0.41, 0.98) | 1.02 (0.54, 1.93) | 0.83 (0.55, 1.27) | 0.69 (0.38, 1.27) | 0.91 (0.57, 1.47) |
| Symptoms vs clinical decision | 1.08 (0.64, 1.80) | 1.69 (1.02, 2.78) | 2.25 (1.18, 4.27) | 1.05 (0.65, 1.71) | 1.38 (0.69, 2.77) | 1.32 (0.77, 2.27) |
Overall, 47% of responses were deemed acceptable to send to patients. ChatGPT performance varied by question difficulty, with 56% of responses to easy questions deemed acceptable to send compared with 34% of responses to difficult questions (P = .03). Although questions with more reviewer ratings of acceptable to send were more likely to be easy, this association was not statistically significant (P = .07; Table 4).
Table 4. Number of Reviewers Rating the Response Acceptable to Send, by Question Difficulty

| No. yes answers | Easy, No. (%) | Difficult, No. (%) | P value |
|---|---|---|---|
| 5 | 9 (15.25) | 1 (2.44) | .0746 |
| 4 | 12 (20.34) | 7 (17.07) | |
| 3 | 12 (20.34) | 6 (14.63) | |
| 2 | 10 (16.95) | 7 (17.07) | |
| 1 | 12 (20.34) | 10 (24.39) | |
| 0 | 4 (6.78) | 10 (24.39) | |
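As a sanity check, the easy vs difficult comparison can be recovered from Table 4: counting a response as acceptable when at least 3 of 5 reviewers answered yes gives 9 + 12 + 12 = 33 of 59 easy questions (56%) and 1 + 7 + 6 = 14 of 41 difficult questions (34%). The minimal SciPy sketch below reproduces P ≈ .03; the published analysis used SAS, and whether a continuity correction was applied is not stated, so none is used here.

```python
# Sanity check of the easy vs difficult comparison using the Table 4 counts.
# Acceptable = at least 3 of 5 reviewers answered yes: easy 9 + 12 + 12 = 33 of 59,
# difficult 1 + 7 + 6 = 14 of 41.
from scipy.stats import chi2_contingency

#             acceptable  not acceptable
table = [[33, 59 - 33],   # easy questions
         [14, 41 - 14]]   # difficult questions
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"acceptable: easy {33/59:.0%}, difficult {14/41:.0%}, P = {p:.2f}")
# -> acceptable: easy 56%, difficult 34%, P = 0.03
```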
Discussion
In this study, nearly 50% of the responses generated by ChatGPT were felt to be acceptable to send to patients, with a trend toward better performance on easy clinical questions. These results show promise for using generative AI to improve clinical efficiency. One likely application is integrating this technology into the electronic medical record to automatically generate draft responses to patient messages. Our results suggest the feasibility of integrating this technology into clinical care to improve efficiency while maintaining the quality of patient communication.
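One possible shape for such an integration, not described in this study, is a human-in-the-loop pipeline in which the model only drafts replies and a clinician approves or edits each draft before anything is sent. The sketch below illustrates this pattern; the class and function names are hypothetical and do not correspond to any EHR vendor API.

```python
# Hypothetical human-in-the-loop sketch: the model drafts a reply to each in-basket
# message, and nothing is sent until a clinician approves (or edits) the draft.
from dataclasses import dataclass
from typing import Optional


@dataclass
class DraftReply:
    patient_message: str
    ai_draft: str
    approved: bool = False
    final_text: str = ""


def clinician_review(draft: DraftReply, edited_text: Optional[str] = None) -> DraftReply:
    """Approve the AI draft as written, or replace it with the clinician's edited version."""
    draft.final_text = edited_text if edited_text is not None else draft.ai_draft
    draft.approved = True
    return draft


# Only an approved final_text would ever be released to the patient portal.
draft = DraftReply(patient_message="De-identified question...", ai_draft="Generated draft...")
reviewed = clinician_review(draft)
assert reviewed.approved and reviewed.final_text == reviewed.ai_draft
```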
The ability of AI to generate high-quality responses has potential not only to improve clinician wellness by decreasing the time burden of responding to patient messages, but also to help deliver higher-quality patient care. In comparison to human responses, Ayers et al found that AI-generated responses to patient questions were more empathetic and overall higher quality.4 Prompt, empathetic, and high-quality responses to patient questions could help to alleviate patient concerns and limit unnecessary clinic and emergency room visits, freeing up clinic time and resources to address other complex tasks.
Though utilization of AI has promise to improve efficiency and quality of health care delivery, there are potential drawbacks that must be considered. As large language models are only able to generate responses based on data to which they have already been exposed, these models are often unable to respond adequately to new information. This has resulted in patient frustration as well as exacerbations of existing health care biases and inequity in previous studies.9,10 These findings underscore the importance of ensuring that AI in health care is trained on large, diverse datasets.
Patient privacy is an important consideration when utilizing language models for patient care. As new health care–specific generative AI models are developed, steps must be taken to ensure that sensitive patient data are protected.11 Use of patient data within health care is strictly regulated in the United States by HIPAA (Health Insurance Portability and Accountability Act); however, much of the development of generative AI models is being performed by private companies that fall outside of the scope of regulation provided by HIPAA.12 Further regulation and consideration are necessary to ensure protection of patient privacy if these models are to be safely integrated into patient care.
Additionally, ChatGPT’s model is trained on information from the internet at large rather than on validated medical sources. This introduces a risk of generating inaccurate or misleading responses, as has been seen in prior studies.13,14 While responses in our study were generally found to be accurate, especially for easy questions, awareness of this potential is important when using this technology for patient communication. Future medical-specific iterations of generative AI technology are planned and should reduce the risk of misinformation by training on validated medical literature.
Our study does have limitations. At the time of our study, ChatGPT 3.5 was the most current version available. Although we would expect newer versions to perform better, further research is needed to assess the performance of updated models. Additionally, responses in our study were generated using an independent ChatGPT session for each question, which did not allow the model to learn from prior inputs and potentially improve its performance over time. Further improvements in response quality might be achieved by incorporating patient-specific data, such as progress notes and prior responses, into the model. Importantly, our study is also limited by the lack of a direct comparison with physician-generated responses. Although other studies have shown AI and physician responses to be of comparable quality, our study cannot directly compare AI and physician performance.4 The questions posed in our study were taken from a single urologist in a men’s health practice, which may limit generalizability to other areas of urology. Future studies evaluating the performance of large language models in other areas of urology are essential to guide development and improve the performance of these models across disciplines.
Conclusions
ChatGPT was able to generate acceptable responses to nearly 50% of patient messages sent to a practicing urologist, performing better on easy questions than on difficult ones. Using ChatGPT to help respond to patient messages could decrease the time burden on the care team and improve wellness. AI performance will likely continue to improve with advances in generative AI technology.
References
1. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. doi:10.1371/journal.pdig.0000198
2. ChatGPT-4: an assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination. Med Teach. 2024;46(3):366-372. doi:10.1080/0142159X.2023.2249588
3. Assessing the utility of ChatGPT throughout the entire clinical workflow. medRxiv. Preprint posted online February 26, 2023. doi:10.1101/2023.02.21.23285886
4. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023;183(6):589-596. doi:10.1001/jamainternmed.2023.1838
5. New artificial intelligence ChatGPT performs poorly on the 2022 self-assessment study program for urology. Urol Pract. 2023;10(4):409-415. doi:10.1097/upj.0000000000000406
6. Quality of information and appropriateness of ChatGPT outputs for urology patients. Prostate Cancer Prostatic Dis. 2023;27:103-108. doi:10.1038/s41391-023-00705-y
7. Erratum: Assessing the impact of the COVID-19 pandemic on clinician ambulatory electronic health record use. J Am Med Inform Assoc. 2022;29(4):749. doi:10.1093/jamia/ocab288
8. Re: Physicians’ well-being linked to in-basket messages generated by algorithms in electronic health records. J Urol. 2019;202(5):859. doi:10.1097/01.JU.0000579848.28213.b5
9. Self-diagnosis through AI-enabled chatbot-based symptom checkers: user experiences and design considerations. AMIA Annu Symp Proc. 2020;2020:1354-1363.
10. Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proc Natl Acad Sci U S A. 2020;117(23):12592-12594. doi:10.1073/pnas.1919012117
11. Artificial intelligence and urology: ethical considerations for urologists and patients. Nat Rev Urol. 2024;21(1):50-59. doi:10.1038/s41585-023-00796-1
12. HIPAA and protecting health information in the 21st century. JAMA. 2018;320(3):231-232. doi:10.1001/jama.2018.5630
13. ChatGPT and other large language models are double-edged swords. Radiology. 2023;307(2):e230163. doi:10.1148/radiol.230163
14. Answering head and neck cancer questions: an assessment of ChatGPT responses. Am J Otolaryngol. 2024;45(1):104085. doi:10.1016/j.amjoto.2023.104085
Recusal: Dr Mehta, editorial committee member of Urology Practice®, was recused from the editorial and peer review processes due to affiliation with Emory University.
Funding/Support: None.
Conflict of Interest Disclosures: The Authors have no conflicts of interest to disclose.
Ethics Statement: This study received Institutional Review Board approval (IRB No. 70909).
Author Contributions:
Data analysis and interpretation: Ha, Zhang, Belladelli, Del Giudice, Eisenberg, Scott, Seranio, Li.
Critical revision of the manuscript for scientific and factual content: Ha, Zhang, Belladelli, Del Giudice, Glover, Eisenberg, Li, Muncey.
Conception and design: Glover, Eisenberg, Seranio, Muncey.
Drafting the manuscript: Scott, Seranio, Muncey.
Statistical analysis: Del Giudice, Li.
Supervision: Ha, Glover, Eisenberg, Seranio, Muncey.
Other: Zhang, Belladelli.
Data Availability: Data will be made available upon reasonable request.