Google’s Med-PaLM AI answers medical questions with a high level of performance. But, to date, the model is not viable for clinical use.

It’s not just ChatGPT. That chatbot, built on OpenAI’s GPT-3.5 model, has made headlines in recent weeks: its performance is impressive, even if its concrete applications remain to be determined. Other projects have a more targeted aim. Google and DeepMind have developed Med-PaLM, a model they detail in a paper posted on the arXiv preprint server at the end of 2022 (it has not, at this time, been published in a peer-reviewed journal).

The algorithm is designed as a chatbot, drawing on datasets that contain many common questions and answers written by professionals or patients (within a controlled medical framework). The principle is quite simple: the user asks a question, for example by describing several symptoms, and Med-PaLM is expected to respond with a diagnosis and treatment options.
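To make this interaction concrete, here is a purely illustrative Python sketch. Med-PaLM has no public API, so `ask_med_palm`, the sample question, and the canned reply are all invented placeholders, not actual model behavior:

```python
# Purely illustrative: Med-PaLM has no public API, so ask_med_palm
# is an invented placeholder standing in for a real model call.
def ask_med_palm(question: str) -> str:
    # A real system would query the model here; this returns a canned
    # example of the *kind* of answer the article describes.
    return (
        "Possible diagnosis: influenza. Options: rest, hydration, "
        "antipyretics; consult a physician if symptoms persist or worsen."
    )

question = (
    "I have had a 39 °C fever, a dry cough and muscle aches for "
    "three days. What could this be, and what should I do?"
)
print(ask_med_palm(question))
```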

Examples of answers provided by Med-PaLM // Source: Google

Med-PaLM generates impressive scores

To test Med-PaLM, Google and DeepMind put the same set of questions to the AI and to (human) healthcare professionals, then had the answers evaluated by another group of human healthcare professionals.

The result is quite amazing:

  • 92.6% of the answers provided by Med-PaLM were considered correct;
  • 92.9% of the answers provided by human professionals were considered correct.

On paper, this is very impressive, because the scores are almost identical. And indeed, the progress is dazzling: a previous model, Flan-PaLM, had just over 60% of its answers judged correct.

The progress is also notable on a critical point in the medical field: the danger that answers can pose to patients. For Med-PaLM:

  • 5.8% of answers were assessed as potentially harmful;
  • 6.5% of answers provided by human physicians were assessed as potentially harmful.

With the older Flan-PaLM model, the rate of potentially harmful answers was 29.7%. With Med-PaLM, the performance is once again equivalent to that of humans, and on this criterion even slightly better, although this point must be qualified against other evaluation criteria.
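As a back-of-the-envelope illustration of where such percentages come from, here is a minimal Python sketch assuming each answer simply receives a verdict from the rating panel; the labels and the tiny data sample are hypothetical placeholders, not Google’s evaluation pipeline:

```python
# Minimal sketch (not Google's code): tally panel verdicts into the
# kind of percentages quoted above. Labels and data are hypothetical.
def rate(verdicts: list[str], label: str) -> float:
    """Percentage of verdicts carrying the given label."""
    return 100 * sum(v == label for v in verdicts) / len(verdicts)

# Hypothetical verdicts from the rating panel for four answers.
correctness = ["correct", "correct", "incorrect", "correct"]
harm = ["safe", "safe", "safe", "harmful"]

print(f"Judged correct: {rate(correctness, 'correct'):.1f}%")   # 75.0%
print(f"Potentially harmful: {rate(harm, 'harmful'):.1f}%")     # 25.0%
```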

“Med-PaLM showed promising performance in several aspects, including scientific and clinical accuracy, reading comprehension, medical knowledge recall, medical reasoning, and usefulness, compared to Flan-PaLM,” says one of the engineers, Shek Azizi, on Twitter.

Such an AI model is not yet viable in medicine

The practice of medicine can in no way be reduced to such percentages or question-and-answer questionnaires. As the Google team points out in its study: “While these results are promising, the medical field is complex. Further assessments are needed, especially with regard to aspects related to fairness, equity and bias.”

There are other criteria than an answer that merely appears “correct”. When the Google engineers assess the quality of Med-PaLM’s answers more factually and more precisely, the model remains better than its predecessors, but systematically falls short of human doctors. Clearly, human answers remain better:

Flan-PaLM, Med-PaLM and human physician scores. // Source: Google

The conclusion of the Med-PaLM team, in the preprint posted online, is therefore also one of limitations, which “must be overcome before such models become viable for use in clinical applications.”
