The ChatGPT of DNA

Understanding Large Language Models in Genomic Research

Image generated by Gemini

LLMs have displayed a remarkable ability to master the syntax and semantics of human language, powering applications like ChatGPT. The transformer architecture is the driving force behind these models. And the languages LLMs can decipher are not limited to programming and human languages; they extend to the complex sequences of DNA, RNA, and protein.
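To give a concrete flavour of how a transformer-style model might ingest DNA: several genomic language models tokenize a sequence into overlapping k-mers rather than single characters. A minimal sketch in plain Python (the function name and the choice of k = 6 are illustrative assumptions, not taken from any specific library):

```python
def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers, the token
    unit used by several genomic language models."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmer_tokenize("ATGGCTTTT", k=6))
# ['ATGGCT', 'TGGCTT', 'GGCTTT', 'GCTTTT']
```

Each k-mer plays the role a word plays in an English sentence: it becomes one token in the model's vocabulary.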

These biological sequences adhere to the rules of a biological grammar that has been dictated by evolution. DNA is a linear sequence composed of four nucleotides (A, T, C, G), RNA is composed of four nucleotides (A, U, C, G), and a protein is a chain built from 20 amino acids. Each of these biomolecules has a vocabulary and grammatical rules of its own. For example, proteins typically begin with the amino acid methionine, which is specified by the start codon AUG in mRNA (ATG in DNA); translation of a protein starts at this codon.
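As a small illustration of these vocabularies and the start-codon rule, here is a toy sketch in plain Python. The biology (transcription, the AUG start, the codons shown) is standard textbook fact, but the codon table is deliberately abridged to the few codons the example needs:

```python
# The three biological "vocabularies"
DNA_ALPHABET = set("ATCG")
RNA_ALPHABET = set("AUCG")
# (the protein alphabet is the 20 standard amino acids)

# Abridged codon table -- only the codons used in this example
CODON_TABLE = {"AUG": "Met", "GCU": "Ala", "UUU": "Phe", "UAA": "STOP"}

def transcribe(dna: str) -> str:
    """DNA -> RNA: thymine (T) is replaced by uracil (U)."""
    return dna.replace("T", "U")

def translate(rna: str) -> list[str]:
    """Translate from the start codon AUG until a stop codon."""
    start = rna.find("AUG")          # translation begins at AUG
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino_acid = CODON_TABLE.get(rna[i:i + 3], "?")
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

rna = transcribe("ATGGCTTTTTAA")
print(translate(rna))  # ['Met', 'Ala', 'Phe'] -- methionine comes first
```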

The next amino acid in a protein, or the next nucleotide in DNA or RNA, can be predicted much as computational language models predict the next word in a sentence. This marks the transition from merely reading life's code to understanding its deep, underlying syntax.
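The prediction analogy can be sketched with a toy context-counting model. To be clear, this Markov-style sketch is a stand-in for a real transformer gLM, not how such models work internally: it simply counts which base follows each short context, whereas a real model learns these statistics with attention layers over far longer contexts:

```python
from collections import Counter, defaultdict

def train_next_base(sequences: list[str], k: int = 3) -> dict:
    """Count which base follows each k-length context --
    a toy stand-in for a learned language model."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - k):
            counts[seq[i:i + k]][seq[i + k]] += 1
    return counts

def predict_next_base(counts: dict, context: str) -> str:
    """Return the most likely next nucleotide for a context,
    just as an LLM returns the most likely next word."""
    return counts[context].most_common(1)[0][0]

model = train_next_base(["ATGGCGATGGCTATGGCG"])
print(predict_next_base(model, "ATG"))  # 'G' in this toy corpus
```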

This paradigm shift, centred on Genomic Language Models (gLMs) and Protein Language Models (pLMs), is bringing the power of LLMs to the study of biological sequences.
