The U.S. Department of Defense (DoD) has concluded a pilot program evaluating large language models (LLMs) for military medical applications. The initiative, known as the Crowdsourced Artificial Intelligence Red-Teaming Assurance Program (CAIRT), engaged more than 200 clinical providers and healthcare analysts, who identified over 800 potential vulnerabilities in LLMs used for clinical note summarization and medical advisory chatbots. The findings will shape policies and best practices for the responsible use of generative AI in defense healthcare.
The CAIRT program's recent red-team test assessed three different LLMs in two key areas: summarizing clinical notes and serving as medical advisory chatbots. By engaging a broad spectrum of healthcare professionals, the program uncovered numerous issues that could compromise patient care, insights that are vital for refining these systems to meet the stringent risk-management standards the DoD requires.
Participants tested the LLMs across a wide range of clinical scenarios, probing for hidden risks, vulnerabilities, and biases that could affect the accuracy and reliability of AI-driven tools in military medical settings. Collaboration with the Defense Health Agency and the Program Executive Office added depth and breadth to these assessments, and in 2024 the program offered a financial AI bias bounty to encourage deeper exploration of unknown risks in open-source chatbots. The data collected from these efforts will be instrumental in shaping future policies and best practices for the ethical and effective use of AI in military healthcare.
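The article does not describe CAIRT's internal tooling, so the following is purely an illustrative sketch: a crowdsourced exercise of this kind might log each reported issue as a structured record so that hundreds of findings can be aggregated and triaged. Every name here (the `Finding` schema, `triage`, the category labels, the severity scale) is a hypothetical assumption, not the program's actual data model.

```python
from dataclasses import dataclass
from collections import Counter

@dataclass
class Finding:
    """One red-team finding reported by a tester (illustrative schema)."""
    model_id: str   # which of the LLMs under test produced the output
    use_case: str   # "note_summarization" or "advisory_chatbot"
    prompt: str     # input that triggered the issue
    response: str   # model output containing the issue
    category: str   # e.g., "hallucination", "bias", "omission"
    severity: int   # 1 (minor) .. 5 (potential patient-safety risk)

def triage(findings):
    """Aggregate findings by category and pull out the high-severity
    ones so clinical reviewers can prioritize patient-safety risks."""
    by_category = Counter(f.category for f in findings)
    critical = [f for f in findings if f.severity >= 4]
    return by_category, critical
```

A structured record like this would make it straightforward to rank issue categories by frequency across 800-plus findings and route the highest-severity reports to clinical reviewers first.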
Trust is paramount if clinicians are to embrace AI technologies fully. LLMs must meet critical performance expectations: they need to be useful, transparent, explainable, and secure. The CAIRT program's findings underscore the importance of ongoing testing and of collaboration between clinicians and developers to identify and mitigate potential biases in AI algorithms.
Dr. Sonya Makhni, a medical director at Mayo Clinic Platform, highlighted the challenges of unlocking AI's full potential in healthcare delivery. Assumptions made during the AI development lifecycle, she emphasized, can introduce systematic errors that produce biased outcomes and threaten healthcare equity. Combating this requires active engagement from both clinicians and developers throughout development: anticipating where an algorithm is likely to be biased or to underperform clarifies which contexts suit it and which demand closer monitoring and oversight.

Dr. Matthew Johnson, the CAIRT program lead, noted that the initiative serves as a pathfinder for generating valuable testing data, surfacing areas for consideration, and validating mitigation strategies. Ultimately, the CAIRT program aims to accelerate AI capabilities and build confidence in their application across DoD genAI use cases.
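The article does not specify how such monitoring would be implemented. As a minimal sketch of the kind of check Dr. Makhni describes, one could compare an algorithm's error rates across patient subgroups and flag any group that falls too far behind the best-performing one. The record format, function names, and the 0.05 tolerance below are all assumptions chosen for illustration.

```python
from collections import defaultdict

def subgroup_error_rates(records):
    """Compute the error rate per patient subgroup.

    `records` is an iterable of (subgroup, was_error) pairs; this
    layout is illustrative, not any program's actual schema.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for subgroup, was_error in records:
        totals[subgroup] += 1
        errors[subgroup] += int(was_error)
    return {g: errors[g] / totals[g] for g in totals}

def flag_disparities(rates, tolerance=0.05):
    """Flag subgroups whose error rate exceeds the best-performing
    subgroup's by more than `tolerance` (an assumed threshold)."""
    best = min(rates.values())
    return {g: r for g, r in rates.items() if r - best > tolerance}

# Example: subgroup "A" errs on half its cases while "B" errs on none,
# so "A" is flagged under the assumed 5% tolerance.
rates = subgroup_error_rates([("A", True), ("A", False),
                              ("B", False), ("B", False)])
print(flag_disparities(rates))  # {'A': 0.5}
```

A check like this only surfaces a disparity; deciding whether the gap reflects genuine bias, data gaps, or acceptable clinical variation is exactly the judgment call that requires the clinician-developer collaboration described above.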