In this post, we share our motivation for building custom small LLMs for medical document understanding in the Indian context. We show that by training on high-quality data specific to the Indian healthcare ecosystem, we've outperformed giant state-of-the-art (SOTA) models like GPT-4o and Claude Sonnet 3.5.
Healthcare in India is at a crucial turning point, with technology holding great potential for transformation. However, several challenges currently hinder this progress:
The challenges mentioned above create ideal ground for Large Language Models (LLMs) to shine. Advanced LLMs can understand, process, and interpret unstructured data, making them uniquely suited to tackling India’s healthcare challenges. Until medical record creation is truly digital and data exchange happens over formats such as FHIR, LLMs can transform unstructured data into structured data with machine-understandable linkages to ontologies such as SNOMED CT and LOINC. This would pave the way for seamless interoperability and better clinical decision-making. With the right high-quality data and models tuned to specific needs, LLMs present a unique opportunity to streamline processes and improve outcomes.
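As a hedged illustration of this transformation, the sketch below shows the kind of structured, ontology-linked record a single lab-report line might yield. The field names are hypothetical, and the ontology codes are deliberately left as placeholders rather than real SNOMED CT or LOINC identifiers:

```python
import json

# Illustrative sketch only: field names are hypothetical, and the ontology
# codes are placeholders, not verified SNOMED CT / LOINC identifiers.
unstructured_line = "Haemoglobin : 13.5 g/dL (Ref: 13.0 - 17.0)"

structured = {
    "test_name": "Haemoglobin",
    "value": 13.5,
    "unit": "g/dL",
    "reference_range": {"low": 13.0, "high": 17.0},
    "loinc_code": "<placeholder>",     # would hold the LOINC code for the test
    "specimen_code": "<placeholder>",  # would hold a SNOMED CT specimen code
}

print(json.dumps(structured, indent=2))
```

Once records carry such codes, downstream systems can exchange them over FHIR resources without re-parsing free text.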
The unique challenges of India’s healthcare ecosystem demand GenAI solutions specifically designed to address its needs. While general-purpose LLMs offer impressive capabilities, they often fall short when applied to the specific requirements of Indian medical data. This is where customized LLMs come into play, offering precision, efficiency, and scalability to bridge the gap.
General-purpose LLMs are typically trained on broad datasets, but healthcare data in India has its own complexities:
Consider the example of the Zin 10 mg tablet, whose main component is cetirizine. The screenshot below shows that even the best SOTA models misidentify it.
Training LLMs on high-quality, India-specific datasets ensures that these models understand and process the unique nuances present in Indian healthcare documents with high accuracy.
Training and deploying massive, resource-intensive models at scale isn't always practical due to cost constraints. Smaller, task-specific models are more computationally efficient, allowing them to be trained and deployed even in resource-limited environments. Customized small models can enable scalable adoption without compromising performance.
General-purpose LLMs might be more prone to generating hallucinated outputs, especially when dealing with unfamiliar formats and contexts. In healthcare, such errors can have serious consequences. Custom LLMs trained on task-specific data reduce the likelihood of inaccuracies by narrowing their operational scope.
While large-scale models like Sonnet 3.5 and GPT-4o are renowned for their capabilities, they often struggle with the specific challenges of Indian healthcare data. We perform extensive evaluations to understand their shortcomings and subsequently develop small and specialized LLMs. Our small LLM (we name it Parrotlet-V) excels in tasks like lab report parsing, prescription extraction, PII redaction and document classification, consistently outperforming these industry giants.
Our benchmarking dataset includes thousands of meticulously annotated medical documents spanning lab reports, digital prescriptions, discharge summaries, health insurance policies, and radiology scans. We employ standardized evaluation methodologies, comparing structured outputs entity by entity for precise assessment. To maximize the performance of SOTA LLMs for a fair comparison, we also experimented with prompt tuning, carefully guiding the models to produce outputs in the desired format.
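To make the entity-by-entity comparison concrete, here is a minimal sketch of entity-level fuzzy matching aggregated over a corpus. This is an illustrative reimplementation, not our internal evaluation code; `entity_level_score` and `corpus_score` are names introduced here for the example:

```python
from difflib import SequenceMatcher

def fuzzy(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def entity_level_score(pred: dict, gold: dict) -> float:
    """Average fuzzy score over the union of fields of one entity.
    A field missing from either side contributes 0 for that field."""
    fields = set(pred) | set(gold)
    if not fields:
        return 1.0
    return sum(fuzzy(pred.get(f, ""), gold.get(f, "")) for f in fields) / len(fields)

def corpus_score(pairs: list) -> float:
    """Aggregate entity-level scores over (prediction, gold) pairs."""
    return sum(entity_level_score(p, g) for p, g in pairs) / len(pairs)
```

An exact prediction scores 1.0, while dropped or hallucinated fields pull the score down proportionally.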
Below, we present the results of our benchmarking experiments. The score represents the fuzzy matching score at the entity level, aggregated over the corpus.
Detailed entity-level results of Parrotlet-V vs SOTA models for different documents are presented below.
The results highlight the Parrotlet-V model's superior performance compared to state-of-the-art (SOTA) models. Parrotlet-V not only outperforms models with a comparable number of parameters, such as Qwen2-VL 7B, Llama 3.2 Vision 11B, and Phi-3.5 Vision 4.2B, but also surpasses the largest SOTA models in performance.
On further investigation, we find that one of the major reasons for the lower accuracy of the models mentioned above is hallucination. Models often invent a value in the structured output even when that field is empty in the medical document. Examples of these hallucinations are given below. We overcome them by carefully balancing the dataset so that it contains ample instances of blank fields (sparse tables), and by applying techniques such as DPO on top of SFT. At times, some models also failed to follow the provided schema and ended up generating results as free text.
Other errors can be attributed to challenges in contextual inference of fields such as specimen, panel, and method. These fields must be inferred carefully from the headings and subheadings of the lab report. Even though our prompts include this hint, SOTA models fail to consistently infer these fields from the surrounding context.
Hallucinations: In the following example, a SOTA model generated tests that are not present in the document at all.
Contextual understanding: In the following document, our model accurately understands and extracts details like specimen and method, along with all the tests. Meanwhile, GPT-4 and Sonnet models struggle to grasp these visual and medical nuances.
Our experiments show encouraging results; however, there is still a long way to go to fully mitigate hallucinations and ensure high recall of extracted entities.