Background/overview
Technical process
Language model (LM) technology
Document allocation
Background/overview
Our advanced AI process, known as the OPA language model (LM), efficiently creates labels that identify the central theme for each topic box by analyzing the content of the assigned records. To ensure clarity and precision, our team of subject matter experts (SMEs) reviews these AI-generated labels, refining them to eliminate redundancy and enhance their descriptiveness. The resulting labels consist of the most relevant terms, providing you with a clear understanding of each topic box's themes. Note: As new records are ingested and labels become more specific, periodic reevaluation of AI labels ensures comprehensive coverage of evolving topics. While labels may be updated, the core theme of each topic remains stable.
Topic boxes are categorized into different display levels, each with distinct AI labels and granularity. The broad display level offers the most encompassing topic boxes with five labels. The intermediate display level provides more specific topic boxes with four labels, and the narrow display level delivers the most detailed topic boxes, featuring three targeted labels.
In the example table below, a particular grant record (Title: Multi-Omics Core C, Grant number: P01HL158500, appl ID 11010885) has been assigned the following topic boxes and AI labels for each display level:
| Display level | AI topic box labels* | Example grant attributions for the entire topic box (heat mapping): # of docs | % Human (MeSH) | % Research | % Training |
| --- | --- | --- | --- | --- | --- |
| Broad | Algorithms | 189,413 | 42.50 | 86.27 | 3.65 |
| Intermediate | Informatics | 207,948 | 45.36 | 82.61 | 7.68 |
| Narrow | Proteogenomics | 35,220 | 30.29 | 91.30 | 3.36 |
* As mentioned above, the AI labels are made up of the most relevant terms to help you understand the content of the topic box. The heat mapping data in the table above can be found using the Accessible Table; MeSH stands for Medical Subject Headings. Note that the above data were recorded on 12-11-2025 and are subject to change as records are added to the database and assigned to these same topic boxes. Learn more about the heat mapping metrics here.
Below is an example of the iSearch Analytics visualization if you were viewing the intermediate display level for the aforementioned grant record. Notice the AI labels in the Topic Explorer visualization are identical to the intermediate AI labels in the second column of the table above:
What is the technical process for deriving the AI labels?
The semantic meaning of every document in a dataset is converted into a mathematical vector. The initial training set includes the title, abstract, and specific aims of all grant records; the title and abstract of publications; the title (brief and official) and brief summary of clinical trials; and the title and abstract of patents, covering the corpus up to March 2021. The vectors are then used to determine the number of topics and to assign each document to a topic based on the location of its vector in multi-dimensional space. The AI labels are the terms whose vectors lie closest to the centroid vector of each topic; these terms capture the central tendency of the corresponding documents. Results are then quality checked for duplication and descriptiveness across display levels.
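The labeling step described above can be illustrated with a minimal sketch. It assumes cosine similarity as the distance measure and uses made-up three-dimensional vectors and terms; the function names, the terms, and the numbers are hypothetical and are not taken from the OPA LM itself.

```python
import numpy as np

def unit(v):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def label_topic(doc_vectors, term_vectors, terms, n_labels):
    """Return the n_labels terms whose vectors lie closest to the topic centroid."""
    # The centroid is the mean of all document vectors assigned to the topic.
    centroid = unit(doc_vectors.mean(axis=0))
    # Cosine similarity between every term vector and the centroid.
    similarity = unit(term_vectors) @ centroid
    # Indices of the most similar terms, best first.
    best = np.argsort(-similarity)[:n_labels]
    return [terms[i] for i in best]

# Hypothetical vocabulary and vectors (assumption: real embeddings
# have far more dimensions and a far larger vocabulary).
terms = ["proteomics", "genomics", "cardiology", "surgery"]
term_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.2, 0.9],
])

# Hypothetical vectors for three documents assigned to one topic.
doc_vectors = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.1],
    [0.8, 0.2, 0.0],
])

labels = label_topic(doc_vectors, term_vectors, terms, n_labels=2)
print(labels)
```

In this toy case the two terms nearest the centroid become the topic's labels; changing `n_labels` to 5, 4, or 3 would mirror the broad, intermediate, and narrow display levels.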
Is this the same LM technology that powers ChatGPT, SciBERT, and GPT-4?
Yes, although the OPA LM does not share the issues of current generative LMs (e.g., hallucinations, misinformation) because it is discriminative AI, not generative AI. Discriminative AI is trained on labeled data, which allows it to distinguish between data classes. The OPA LM has been trained on major scientific categories such as grant records, publications, preprints, and clinical trials data, which are updated regularly. Generative AI, on the other hand, produces text or images in response to user-provided prompts and draws on the entirety of the internet to produce its results.
How are new documents/records allocated to topics?
The algorithm generates word vectors for each document using:
- Grant records: title and abstract
- Literature (publications and preprints): title and abstract
- Clinical trials: title (brief and official), brief summary, detailed description, keywords, and intervention (name and description)
The word vectors are then used to allocate the document to the appropriate cluster at each level.
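As a rough illustration of the allocation step, the sketch below assigns a new document vector to the nearest cluster centroid at each display level. The nearest-centroid rule, the cosine similarity measure, the cluster names (`B1`, `I1`, and so on), and all vectors are assumptions made for the example; the source does not specify how the OPA LM performs the allocation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_cluster(doc_vector, centroids):
    """Return the name of the centroid most similar to doc_vector."""
    return max(centroids, key=lambda name: cosine(doc_vector, centroids[name]))

# Hypothetical centroids for two clusters at each display level.
levels = {
    "broad": {
        "B1": np.array([1.0, 0.0, 0.0]),
        "B2": np.array([0.0, 1.0, 0.0]),
    },
    "intermediate": {
        "I1": np.array([0.9, 0.0, 0.4]),
        "I2": np.array([0.9, 0.4, 0.0]),
    },
    "narrow": {
        "N1": np.array([0.6, 0.0, 0.8]),
        "N2": np.array([1.0, 0.0, 0.1]),
    },
}

# Hypothetical vector for a newly ingested record.
new_doc = np.array([0.7, 0.1, 0.6])

# Allocate the record independently at every display level.
allocation = {
    level: nearest_cluster(new_doc, centroids)
    for level, centroids in levels.items()
}
print(allocation)
```

Each record ends up with one cluster per display level, which matches the example table where a single grant appears in a broad, an intermediate, and a narrow topic box.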