Studies show that AI chatbots provide inconsistent accuracy for musculoskeletal health information

  • Researchers agree: orthopedic surgeons remain the most reliable source of information
  • All chatbots exhibited significant limitations, omitting critical steps in clinical management
  • Researchers summarize: ChatGPT is not yet an adequate resource for answering patient questions; further work is needed to develop an accurate chatbot focused on orthopedics

SAN FRANCISCO, February 12, 2024 /PRNewswire/ — With the growing popularity of large language model (LLM) chatbots, a type of artificial intelligence (AI) used by ChatGPT, Google Bard, and BingAI, it is important to assess the accuracy of the musculoskeletal health information they provide. Three new studies presented at the 2024 Annual Meeting of the American Academy of Orthopaedic Surgeons (AAOS) analyzed the validity of the information chatbots gave patients about several orthopedic procedures, assessing how accurately chatbots present research advances and support clinical decision making.

While the studies found that some chatbots provide concise summaries across a broad spectrum of orthopedic conditions, each showed limited accuracy depending on the category. Researchers agree that orthopedic surgeons remain the most reliable source of information. The findings will help those in the field understand the efficacy of these AI tools, whether use by patients or non-specialist colleagues may introduce bias or misunderstanding, and how future improvements could make chatbots a valuable resource for patients and doctors.

Potential misinformation and risks associated with the clinical use of LLM chatbots
This study, led by Branden Sosa, a fourth-year medical student at Weill Cornell Medicine, evaluated how accurately the OpenAI ChatGPT 4.0, Google Bard, and BingAI chatbots explain basic orthopedic concepts, integrate clinical information, and address patient questions. Each chatbot was prompted with 45 orthopedic questions spanning the categories “Bone Physiology,” “Referring Physician,” and “Patient Question,” and its responses were scored for accuracy. Two independent, blinded reviewers scored responses on a scale of 0–4, rating accuracy, completeness, and usability. Responses were analyzed for strengths and limitations within categories and across chatbots. The research team found the following trends:

  • When prompted with orthopedic questions, OpenAI ChatGPT, Google Bard, and BingAI provided correct answers covering the most salient points in 76.7%, 33%, and 16.7% of questions, respectively.
  • When providing clinical management suggestions, all chatbots exhibited significant limitations, deviating from the standard of care and omitting critical steps, such as ordering antibiotics before obtaining cultures or leaving key studies out of the diagnostic workup.
  • When asked less complex patient questions, ChatGPT and Google Bard were able to provide mostly accurate answers, but often failed to elicit the critical medical history needed to fully address the question.
  • A careful analysis of the citations provided by the chatbots revealed an oversampling of a small number of references and 10 erroneous links that were nonfunctional or led to incorrect articles.

Is ChatGPT ready for prime time? Assessing the accuracy of AI in answering common questions of arthroplasty patients
Researchers led by Jenna A. Bernstein, MD, an orthopedic surgeon at Connecticut Orthopaedics, set out to investigate how accurately ChatGPT 4.0 answered patient questions by developing a list of 80 common patient questions about knee and hip replacements. Each question was posed to ChatGPT twice: first as written, and then with a prompt asking ChatGPT to answer “like an orthopedic surgeon.” Two surgeons on the team independently rated the accuracy of each set of answers on a scale of one to four. Agreement between the two surgeons’ ratings was assessed with Cohen’s kappa, and the association between prompting and response accuracy with the Wilcoxon signed-rank test. The findings included:

  • When assessing the quality of ChatGPT responses, 26% (21 of 80 responses) had an average rating of three (partially correct but incomplete) or less when the question was asked without a prompt, and 8% (six of 80 responses) had an average rating of less than three when the question was preceded by the prompt. As such, the researchers concluded that ChatGPT is not yet an adequate resource for answering patient questions, and further work is needed to develop an accurate orthopedics-focused chatbot.
  • ChatGPT performed significantly better when prompted to answer patient questions “like an orthopedic surgeon,” providing accurate answers 92% of the time.

Can ChatGPT 4.0 be used to answer patient questions about the Latarjet procedure for anterior shoulder instability?
Researchers at the Hospital for Special Surgery in New York, led by Kyle Kunze, MD, assessed the ability of ChatGPT 4.0 to provide medical information about the Latarjet procedure for patients with anterior shoulder instability. The overall aim of this study was to understand whether the chatbot could serve as a clinical adjunct, assisting patients and providers by providing accurate medical information.

To answer this question, the team first conducted a Google search using the query “Latarjet” to identify the ten most frequently asked questions (FAQs) and related resources about the procedure. They then asked ChatGPT to perform the same FAQ search to identify the questions and resources provided by the chatbot. Key findings include:

  • ChatGPT demonstrated the ability to provide a wide range of clinically relevant questions and answers, drawing on information from academic sources 100% of the time. This contrasts with Google, which drew on a small percentage of academic sources, combined with information found on surgeons’ personal websites and the sites of larger medical practices.
  • The most common category of questions for both ChatGPT and Google was technical details (40%); however, ChatGPT also presented information about risks/complications (30%), recovery time frame (20%), and evaluation of surgery (10%).

# # #

AAOS 2024 Annual Meeting Disclosure Statement

About AAOS
With more than 39,000 members, the American Academy of Orthopaedic Surgeons is the world’s largest medical association of musculoskeletal specialists. The AAOS is the trusted leader in the advancement of musculoskeletal health. It provides the highest quality, most comprehensive education to help orthopaedic surgeons and allied health professionals at every career level best treat patients in their daily practices. The AAOS is the authoritative source of information on bone and joint conditions, treatments and issues related to musculoskeletal health care, and it leads the health care discussion on quality improvement.

Follow the AAOS on Facebook, X, LinkedIn, and Instagram.

SOURCE American Academy of Orthopaedic Surgeons
