Background

Genetic sequencing data is very complex.

In a typical sequencing process:

  1. A sample goes through the sequencer
  2. The sequencer outputs a file of ACGT (or ACGU, for RNA sequencing data).

  3. Aligners and other bioinformatics tools map that long string of four letters (the nucleotide bases) to a reference genome, for example a human one, and produce a readout file that looks like this:

[Screenshot: aligner readout file listing ReferenceIDs]

All of the ReferenceIDs in this picture represent gene expression, which is the fundamental way our bodies react to the environment. RNA represents your DNA “going to work”, and it plays a part in pretty much every process in our bodies.

  4. These readouts are snapshots of our internal health, but nobody can understand what’s going on.
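To make the readout step more concrete, here is a minimal sketch of ranking ReferenceIDs by how many reads mapped to them. It assumes the readout is a tab-separated table with `ReferenceID` and `ReadCount` columns. Those column names, the sample rows, and the `top_references` helper are all hypothetical and may not match the real file format.

```python
import csv
import io

# Hypothetical readout: column names, IDs, and counts are made up for illustration.
SAMPLE_READOUT = (
    "ReferenceID\tReadCount\n"
    "REF_A\t1520\n"
    "REF_C\t45\n"
    "REF_B\t980\n"
)

def top_references(readout_text, n=2):
    """Return the n (ReferenceID, count) pairs with the highest read counts."""
    reader = csv.DictReader(io.StringIO(readout_text), delimiter="\t")
    counts = {row["ReferenceID"]: int(row["ReadCount"]) for row in reader}
    # Sort by count, highest first, and keep the first n entries.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]

print(top_references(SAMPLE_READOUT))
```

A ranking like this says which molecules are most abundant in the sample, but not what they do, which is the gap described above.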

There is plenty of published research explaining the function(s) of these different RNA molecules. Most of it lives on PubMed, which is freely accessible and crawlable.

This is where LLMs have an opportunity to bridge the gap. Techniques like RAG (retrieval-augmented generation) present a good near-term opportunity, but the issue with combining retrieval and sequencing data is that both the amount of data and its function can change very quickly.
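As a rough illustration of what the retrieval step in RAG involves, the sketch below ranks a small in-memory set of abstracts by how often they mention a ReferenceID, then assembles a prompt for an LLM. The abstracts, the IDs, and the `retrieve` and `build_prompt` helpers are all hypothetical stand-ins. A real system would query an index built from crawled literature and would likely use embeddings rather than string counts.

```python
# Hypothetical abstracts standing in for a crawled literature index.
ABSTRACTS = {
    "doc1": "REF_A is upregulated under stress conditions.",
    "doc2": "REF_B regulates cell growth. REF_B also interacts with REF_A.",
    "doc3": "An unrelated study of protein folding.",
}

def retrieve(reference_id, abstracts, k=2):
    """Return up to k document ids, ranked by how often they mention reference_id."""
    scored = [(text.count(reference_id), doc_id) for doc_id, text in abstracts.items()]
    matches = [pair for pair in scored if pair[0] > 0]
    # Highest mention count first; ties broken by document id for determinism.
    matches.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in matches[:k]]

def build_prompt(reference_id, abstracts, k=2):
    """Assemble retrieved abstracts and a question into a single prompt string."""
    context = "\n".join(abstracts[d] for d in retrieve(reference_id, abstracts, k))
    return f"Context:\n{context}\n\nQuestion: What is the known function of {reference_id}?"

print(build_prompt("REF_B", ABSTRACTS))
```

The index in this sketch is a static dictionary, so it goes stale as soon as new data or new findings appear. That is the weakness of retrieval noted above.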

Presenting miROR

miROR is trained on a corpus of Homo sapiens RNA to help you see the truth in complex data. Specs: