Seeking a diagnosis online: is AI up to the task this time?

  ◎ Reporter Zhang Mengran

  Have you ever searched the Internet for "Why does this hurt? Is something wrong with me?" The answers are often unsatisfying. With the emergence of large language models (LLMs) such as ChatGPT, however, people have begun trying to use them to answer medical questions and look up medical knowledge.

  But are they reliable?

  Taken on their own terms, the answers artificial intelligence (AI) gives can be accurate. But James Davenport, a professor at the University of Bath, pointed to the gap between answering medical questions and actual medical practice: "Medical practice is not just about answering medical questions. If it were only about answering medical questions, we would not need teaching hospitals, and doctors would not need years of training after their academic courses."

  Against this backdrop of doubt, leading artificial intelligence researchers presented, in a recent paper in Nature, a benchmark for evaluating how well large language models can answer people's medical questions.

  Existing models are not perfect.

  The latest assessment comes from Google Research and DeepMind. The researchers believe AI models hold great potential in medicine, including for knowledge retrieval and clinical decision support. But existing models are imperfect: they may, for example, fabricate convincing medical misinformation, or reflect biases that worsen health inequities. Hence the need to evaluate their clinical knowledge.

  Such assessments are not new. In the past, however, they typically relied on automated evaluation against limited benchmarks, such as scores on individual medical tests, which translate poorly to the real world in both reliability and value.

  Moreover, when people turn to the Internet for medical information, they run into "information overload" and tend to fixate on the worst of ten possible diagnoses, taking on a great deal of unnecessary stress.

  The research team hopes language models can offer concise expert answers that are unbiased, cite their sources, and express uncertainty appropriately.

  How does a 540-billion-parameter LLM perform?

  To evaluate how well LLMs encode clinical knowledge, Shekoofeh Azizi of Google Research and colleagues examined their ability to answer medical questions. The team proposed a benchmark called "MultiMedQA": it combines six existing question-answering datasets spanning professional medicine, research, and consumer queries with "HealthSearchQA", a new dataset of 3,173 medical questions commonly searched online.
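
  To make the benchmark's structure concrete, here is a minimal Python sketch of how several QA datasets might be pooled into a single MultiMedQA-style collection. The file paths and record fields are illustrative assumptions; only the dataset names come from the paper.

```python
# Sketch: pooling several QA datasets into one MultiMedQA-style benchmark.
# The dataset names mirror those in the paper; the file paths and the
# JSON-lines record fields are illustrative assumptions.
import json
from dataclasses import dataclass, field

@dataclass
class MedicalQuestion:
    source: str    # which dataset the question came from
    question: str  # the question text
    options: list = field(default_factory=list)  # choices; empty if open-ended
    answer: str = ""                             # reference answer, if any

def load_jsonl(path: str, source: str) -> list:
    """Load one QA dataset from a JSON-lines file (assumed format)."""
    items = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            items.append(MedicalQuestion(
                source=source,
                question=record["question"],
                options=record.get("options", []),
                answer=record.get("answer", ""),
            ))
    return items

# Professional, research, and consumer question sets combined in one pool.
benchmark = []
for path, name in [
    ("medqa.jsonl", "MedQA"),                    # US medical licensing exam
    ("pubmedqa.jsonl", "PubMedQA"),              # research-literature questions
    ("healthsearchqa.jsonl", "HealthSearchQA"),  # consumer search questions
]:
    benchmark.extend(load_jsonl(path, name))
```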

  The team then evaluated PaLM (a 540-billion-parameter LLM) and its instruction-tuned variant Flan-PaLM, and found that Flan-PaLM reached state-of-the-art performance on several datasets. On MedQA, a dataset built from US Medical Licensing Examination questions, Flan-PaLM surpassed the previous best LLM by 17%.
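
  The multiple-choice scores above boil down to simple accuracy: the fraction of questions on which the model selects the reference answer. A minimal sketch, with `ask_model` as a hypothetical stand-in for the actual inference call:

```python
# Sketch: multiple-choice accuracy of the kind behind the MedQA figures.
# `ask_model` is a hypothetical placeholder for the real inference call.
def ask_model(question: str, options: list) -> str:
    """Return the model's chosen option label, e.g. 'B' (placeholder)."""
    raise NotImplementedError("plug in the actual model call here")

def multiple_choice_accuracy(items: list) -> float:
    """Fraction of questions where the model's pick matches the answer key."""
    correct = 0
    for item in items:
        prediction = ask_model(item["question"], item["options"])
        if prediction == item["answer"]:
            correct += 1
    return correct / len(items)
```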

  Yet although Flan-PaLM scored well on multiple-choice questions, further evaluation revealed gaps in its answers to consumers' medical questions.

  A medically specialized LLM shows promise.

  To close this gap, the researchers used a technique called instruction prompt tuning to further adapt Flan-PaLM to the medical domain, producing a new LLM: Med-PaLM.

  Instruction prompt tuning is an effective way to adapt a general-purpose LLM to a new specialist domain, and the resulting model, Med-PaLM, performed encouragingly in pilot evaluations. For example, a panel of doctors judged only 61.9% of Flan-PaLM's answers to be consistent with scientific consensus, versus 92.6% for Med-PaLM, on par with answers written by doctors themselves (92.9%). Similarly, 29.7% of Flan-PaLM's answers were rated as potentially leading to harmful outcomes, compared with just 5.8% for Med-PaLM, comparable to doctor-written answers (6.5%).
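
  As a rough illustration of the idea behind instruction prompt tuning: a small set of continuous "soft prompt" vectors is learned and prepended to the model's input embeddings while the base model's weights stay frozen. The PyTorch sketch below is a toy version of that mechanism, not the paper's implementation.

```python
# Toy PyTorch sketch of soft-prompt ("instruction prompt tuning") mechanics:
# a short block of trainable vectors is prepended to the input embeddings
# while the base model's weights stay frozen. Illustrative only; this is
# not the paper's implementation.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, embed_dim: int, prompt_len: int = 20):
        super().__init__()
        # The only trainable parameters in the whole setup.
        self.prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, embed_dim)
        batch = token_embeddings.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the learned prompt to every sequence in the batch.
        return torch.cat([prompt, token_embeddings], dim=1)
```

  Because only the prompt vectors receive gradient updates, specializing a very large frozen model this way is far cheaper than full fine-tuning, which is part of what makes the approach attractive.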

  The research team noted that although the results are promising, further evaluation is needed, especially with regard to safety, fairness, and bias.

  In other words, many limitations must still be overcome before clinical application of LLMs becomes feasible.