AI Labels

Language model (LM) technology

Our advanced AI process, known as the OPA language model (LM), efficiently creates labels that identify the central theme for each topic box by analyzing the content of the assigned records. To ensure clarity and precision, our team of subject matter experts (SMEs) reviews these AI-generated labels, refining them to eliminate redundancy and enhance their descriptiveness. The resulting labels consist of the most relevant terms, providing you with a clear understanding of each topic box's themes.

Note: As new records are ingested and labels become more specific, periodic reevaluation of AI labels ensures comprehensive coverage of evolving topics. While labels may be updated, the core theme of each topic remains stable.

Topic boxes are categorized into different display levels, each with distinct AI labels and granularity. The broad display level offers the most encompassing topic boxes with five labels. The intermediate display level provides more specific topic boxes with four labels, and the narrow display level delivers the most detailed topic boxes, featuring three targeted labels.

In the example table below, a particular grant record (Title: Multi-Omics Core C, Grant number: P01HL158500, appl ID 11010885) has been assigned the following topic boxes and AI labels for each display level:

Display level	AI topic box labels*	Example grant attributions for the entire topic box (heat mapping): # of docs | % Human (MeSH) | % Research | % Training
Broad	Algorithms, Statistical modeling, Image analysis, Informatics, Heuristics	189,413 | 42.50 | 86.27 | 3.65
Intermediate	Informatics, Software, Omics, Computational approaches	207,948 | 45.36 | 82.61 | 7.68
Narrow	Proteogenomics, Metabolomics, High-throughput profiling	35,220 | 30.29 | 91.30 | 3.36

* As mentioned above, the AI labels are made up of the most relevant terms to help you understand the content of the topic box. The data in the third column above can be found using the Accessible Table; MeSH stands for Medical Subject Headings. Note the above data were recorded on 12-11-2025 and are subject to change as records are added to the database and assigned to these same topic boxes. Learn more about the heat mapping metrics here.

Below is an example of the iSearch Analytics visualization if you were viewing the intermediate display level for the aforementioned grant record. Notice the AI labels in the Topic Explorer visualization are identical to the intermediate AI labels in the second column of the table above:

What is the technical process for deriving the AI labels?

The semantic meaning of all documents in a dataset is converted into a collection of mathematical vectors. In this case, the initial training set includes: the title, abstract, and specific aims of all grant records; the title and abstract of publications; the title (brief and official) and brief summary of clinical trials; and the title and abstract of patents, covering the entire corpus up to March 2021. The vectors are then used to determine the number of topics and to assign each document to a topic based on the location of its vector in multi-dimensional space. The AI labels are the terms whose vectors lie closest to the centroid vector of each topic; these capture the central tendency of the corresponding documents. Results are then quality checked for duplication and descriptiveness across display levels.
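The label-selection step can be illustrated with a minimal sketch. It is not the OPA implementation: the two-dimensional vectors, the term list, and the function names are all hypothetical, and cosine similarity is assumed as the measure of "closeness" because the description above does not name one.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_labels(doc_vectors, term_vectors, k):
    """Return the k terms whose vectors are closest to the topic centroid."""
    center = centroid(doc_vectors)
    ranked = sorted(term_vectors,
                    key=lambda term: cosine(term_vectors[term], center),
                    reverse=True)
    return ranked[:k]

# Toy data: three documents already assigned to one topic (hypothetical values).
topic_docs = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0]]
# Toy term vectors living in the same space as the documents (hypothetical values).
terms = {
    "metabolomics": [0.9, 0.1],
    "proteogenomics": [0.8, 0.3],
    "software": [0.1, 0.9],
    "imaging": [0.5, 0.5],
}

labels = top_labels(topic_docs, terms, k=2)
print(labels)
```

In this toy case the two terms pointing in nearly the same direction as the topic centroid are selected; a real topic box would use three to five labels depending on the display level.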

Is this the same LM technology that powers ChatGPT, SciBERT, and GPT-4?

Not exactly. OPA topics use a well-established type of language model technology called Word2Vec. It is still part of the language model family, but it is not generative AI. Instead of creating new text or images, it captures the meaning of words and documents by turning them into numerical representations, called embeddings, which are then used to cluster documents in vector space.

Generative AI, such as ChatGPT and Claude, creates content based on user prompts and is often trained on very large datasets, sometimes including much of the internet. In contrast, the OPA language model was trained specifically on major scientific data in the NIH corpus, including grant records, publications, preprints, and clinical trial data.

Additionally, our embedding process is deterministic, meaning it gives consistent results for the same document. It relies only on the document’s meaning, which helps avoid many common problems seen in generative AI, such as hallucinations, misinformation, and memorizing training data. It also helps the model better tell apart different types of data.

How are new documents/records allocated to topics?

The algorithm generates word vectors for each document using:

Grant records: title and abstract
Literature (publications and preprints): title and abstract
Clinical trials: title (brief and official), brief summary, detailed description, keywords and intervention (name and description)

The word vectors are then used to allocate the document to the appropriate cluster at each level.
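As a rough illustration of this allocation step, the sketch below assigns a new document to the nearest topic centroid at each display level. Several things here are assumptions rather than details of the OPA pipeline: the document vector is taken to be the average of its word vectors, closeness is measured by cosine similarity, and all vectors, topic names, and function names are made up for the example.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def embed(text, word_vectors):
    """Average the vectors of known words (an assumed way to build a document vector)."""
    known = [word_vectors[t] for t in text.lower().split() if t in word_vectors]
    if not known:
        raise ValueError("no known words in document")
    return mean_vector(known)

def allocate(doc_vector, centroids_by_level):
    """Pick the most similar topic centroid at every display level."""
    return {
        level: max(topics, key=lambda name: cosine(doc_vector, topics[name]))
        for level, topics in centroids_by_level.items()
    }

# Hypothetical word vectors and topic centroids.
word_vectors = {"omics": [0.9, 0.1], "core": [0.7, 0.3], "imaging": [0.1, 0.9]}
centroids_by_level = {
    "broad": {"B1": [1.0, 0.0], "B2": [0.0, 1.0]},
    "narrow": {"N1": [0.8, 0.2], "N2": [0.2, 0.8]},
}

doc_vector = embed("omics core", word_vectors)
assignment = allocate(doc_vector, centroids_by_level)
print(assignment)
```

Because nothing in this procedure is random, running it twice on the same text yields the same vector and the same topic assignment.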
