AI News
21 Oct 2025
How to convert scRNA-seq to cell sentences for LLMs
Convert scRNA-seq to cell sentences so LLMs can interpret single-cell profiles and speed discovery
Why turning gene expression into text changes the game
Single-cell RNA sequencing gives us rich snapshots of which genes are turned on in each cell. But raw expression matrices are hard to share and reason about. Language models shine when input looks like text. By turning each cell into a compact “sentence” of gene symbols ordered by activity, we get a format that LLMs can parse, compare, and explain. Google Research, Google DeepMind, and Yale built C2S-Scale-Gemma-2-27B on this idea. The team shows that a decoder-only model can learn cell types, tissues, perturbations, and biological facts when expression and text live in one token stream. They also use it to propose a drug-context insight that lab tests later support.
How to convert scRNA-seq to cell sentences
This section is a simple guide you can use to convert scRNA-seq to cell sentences for your own LLM workflows. You do not need to change your lab pipeline. You shape the output so a model can read it.
Step 1: Start with a clean expression matrix
Your input is a cell-by-gene matrix. Each value reflects the expression level for a gene in a single cell.
– Remove low-quality cells using standard filters.
– Use a consistent gene symbol set (for example, HGNC for human, MGI for mouse).
– Align species labels so the model knows if a cell is human or mouse.
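A minimal sketch of this step, assuming scanpy and an AnnData object read from a hypothetical cells.h5ad file; the file name and filter thresholds are illustrative placeholders, not recommendations.

```python
# Minimal QC sketch (assumes scanpy; file name and thresholds are illustrative).
import scanpy as sc

adata = sc.read_h5ad("cells.h5ad")        # cell-by-gene AnnData with official symbols in var_names
sc.pp.filter_cells(adata, min_genes=200)  # drop near-empty, low-quality cells
sc.pp.filter_genes(adata, min_cells=3)    # drop genes detected in almost no cells
adata.obs["species"] = "human"            # record species so later prompts can state it
```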
Step 2: Normalize and stabilize values
You want to capture the rank order of genes, not small noise.
– Apply a sensible normalization (for example, library-size scaling).
– Use a variance-stabilizing step if needed to reduce noise across cells.
– Avoid complex transforms that distort rank order.
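Continuing the scanpy sketch above, one rank-preserving option is library-size scaling followed by log1p; both are monotonic per cell, so gene ranks survive.

```python
# Rank-preserving normalization sketch (continues the adata object from Step 1).
import scanpy as sc

sc.pp.normalize_total(adata, target_sum=1e4)  # per-cell library-size scaling
sc.pp.log1p(adata)                            # monotonic transform, so per-cell gene ranks are unchanged
```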
Step 3: Rank genes per cell and choose top signals
For each cell, sort genes from highest to lowest expression. You will not use all genes. You will emit the top K genes that best describe the cell state.
– Pick a K that fits your model’s context window.
– Keep K consistent across your dataset to reduce drift.
– Consider ties by using secondary rules (for example, average expression across neighbors).
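A per-cell ranking sketch under the same assumptions (the normalized AnnData from the steps above); K = 100 is only a placeholder, and ties here are broken deterministically rather than by a secondary rule.

```python
# Top-K ranking sketch: take the K most expressed gene symbols per cell.
import numpy as np

K = 100                                           # placeholder; size K to your model's context window
X = adata.X.toarray() if hasattr(adata.X, "toarray") else np.asarray(adata.X)
genes = np.asarray(adata.var_names)

def top_k_genes(row, k=K):
    order = np.argsort(row, kind="stable")[::-1]  # highest expression first; ties resolved deterministically
    order = order[row[order] > 0][:k]             # keep only expressed genes, up to K
    return genes[order].tolist()

top_genes_per_cell = [top_k_genes(X[i]) for i in range(X.shape[0])]
```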
Step 4: Tokenize with gene symbols
The key is to keep the encoding simple and readable.
– Emit a sequence of gene symbols as tokens: GENE1, GENE2, GENE3, …
– Preserve order. The order carries the cell’s “voice.”
– Use separators that your tokenizer can handle. Commas or spaces both work.
For example, a cell sentence might look like: “EPCAM KRT8 KRT18 KRT19 MUC1 ERBB2 …” This is not a random list. It is a ranked list that hints at an epithelial-like identity.
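Turning the ranked lists into sentences is then a simple join; this sketch reuses top_genes_per_cell from the previous step.

```python
# Cell-sentence sketch: join ranked symbols with a tokenizer-friendly separator.
def to_cell_sentence(gene_list, sep=" "):
    return sep.join(gene_list)                # spaces or commas both work; keep one choice per dataset

cell_sentences = [to_cell_sentence(g) for g in top_genes_per_cell]
print(cell_sentences[0][:80])                 # e.g. "EPCAM KRT8 KRT18 KRT19 ..."
```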
Step 5: Add small, structured context
Short labels help the model reason about biology and reduce confusion.
– Prefix with species: “species: human”
– Add tissue if known: “tissue: lung”
– Add condition if relevant: “IFN: low” or “IFN: present”
– Add sample metadata that is safe and useful: “disease: neuroendocrine tumor”
A full prompt block might read: “species: human; tissue: lung; condition: IFN-low; cell: EPCAM KRT8 KRT18 …”
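A small helper can assemble that prompt block; the metadata defaults below are placeholders you would pull from your own sample sheet.

```python
# Prompt-block sketch: prepend compact metadata tags to each cell sentence.
def make_prompt_block(sentence, species="human", tissue="lung", condition="IFN-low"):
    return f"species: {species}; tissue: {tissue}; condition: {condition}; cell: {sentence}"

prompt_blocks = [make_prompt_block(s) for s in cell_sentences]
```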
Step 6: Define task templates for the LLM
Your sentence is now an input token stream. You can ask for:
– Cell type prediction: “Label this cell: [cell sentence]”
– Tissue assignment: “Which tissue fits best?”
– Cluster captioning: “Describe this cluster with 1–2 sentences.”
– Perturbation effect: “What increases MHC-I expression in this context?”
– Biological Q&A: “Which pathways are likely active here?”
Give the model a few examples (few-shot) to set the tone. Keep instructions clear and short.
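One way to wire this up is a shared template with a brief few-shot example; the example label below is purely illustrative, not a real annotation.

```python
# Task-template sketch: the same instruction and few-shot framing for every cell.
TEMPLATE = (
    "You label single cells from ranked gene lists.\n"
    "Example: species: human; tissue: lung; cell: EPCAM KRT8 KRT18 -> epithelial cell\n"
    "Label this cell: {block}\n"
    "Answer with one cell type and a short reason."
)

prompts = [TEMPLATE.format(block=b) for b in prompt_blocks]
```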
Step 7: Check outputs with biological controls
LLMs can sound confident, so you must verify.
– Compare predicted labels with known annotations.
– Use marker genes as sanity checks.
– Run small wet-lab tests for key claims whenever possible.
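Where you already have annotations, a quick agreement check is easy to script; this sketch assumes a cell_type column in adata.obs and a predicted list of model labels, both of which are placeholders here.

```python
# Sanity-check sketch: agreement between model labels and existing annotations.
import pandas as pd

predicted = ["epithelial cell"] * adata.n_obs             # placeholder for real model outputs
reference = adata.obs["cell_type"].astype(str)            # assumed annotation column
agreement = pd.Series(predicted, index=reference.index).str.lower() == reference.str.lower()
print(f"agreement with annotations: {agreement.mean():.1%}")
```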
Inside C2S-Scale 27B: a brief look
Google’s team trained C2S-Scale-Gemma-2-27B on a combined corpus that includes both expression-derived tokens and biological text. The core model is Gemma-2 27B, a decoder-only Transformer. Training used Google TPU v5. The release uses a permissive CC-BY-4.0 license and provides open weights on Hugging Face.
Training data at scale
– Over 800 public single-cell RNA-seq datasets
– More than 57 million cells across human and mouse
– Linked metadata and textual context for each cell or sample
This large, diverse corpus lets the model learn common and rare cell identities, tissue signatures, and responses to stimuli. Because the input format is plain tokens, the model learns across both “languages”: biology text and gene-ranked sentences.
Why it matters for everyday analysis
– It snaps into standard LLM toolchains.
– It makes programmatic queries simple. You can loop over cells, ask the same question, and score outputs.
– It supports multi-task prompts without custom heads or retraining for each task.
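The programmatic-query point is literally a loop; the ask_llm function below is a stand-in for whatever client you use (a local model, an API, or the open weights shown later), and the prompts list comes from the Step 6 sketch.

```python
# Batch-query sketch: identical instructions over many cells, collected for scoring.
def ask_llm(prompt: str) -> str:
    # Stub: swap in a call to your model or API client of choice.
    return "unknown | replace this stub with a real model call"

answers = {i: ask_llm(p) for i, p in enumerate(prompts)}
```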
What the model found: an interferon-conditional amplifier
The team ran a virtual screen across more than 4,000 compounds. They asked a targeted question: What increases MHC-I antigen presentation in samples that have some interferon tone, but not in neutral cell-line settings? The model suggested a split behavior for silmitasertib, a CK2 inhibitor. It predicted a strong boost to the MHC-I program with low-dose interferon, but little effect without interferon.
Bench validation in neuroendocrine models
They tested this idea in human neuroendocrine systems not seen in training. The combination of silmitasertib plus low-dose interferon led to a marked increase in antigen presentation. Flow cytometry showed higher HLA-A, HLA-B, and HLA-C signals with the combination. In one model, the gains were roughly 13.6% at 10 nM and around 34.9% at 1000 nM silmitasertib. Across assays, the team reported about a 50% improvement with the combo versus either agent alone. The interpretation is clear: CK2 inhibition appears to lower the threshold for interferon signaling. It does not start the MHC-I program by itself. But with a small interferon push, the pathway turns on stronger. This could help “cold” tumors show more antigens and become more visible to immune attack. This is still preclinical and in vitro, so it is a hypothesis to test, not a treatment claim.
How to apply this workflow in your lab or team
The method to convert scRNA-seq to cell sentences is simple and portable. You can start with a small dataset and scale up.
Set up a minimal pipeline
– Preprocess and normalize your cells.
– Rank genes per cell and emit a fixed top K.
– Add compact metadata tags.
– Feed into an LLM prompt with clear instructions.
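Condensed into one helper, the whole pipeline for a single cell can look like this sketch (same assumptions as above: a normalized AnnData, numpy, and illustrative metadata defaults).

```python
# End-to-end sketch: one normalized cell in, one metadata-tagged prompt block out.
import numpy as np

def cell_to_prompt(adata, i, k=100, species="human", tissue="unknown", condition="none"):
    row = adata.X[i]
    row = row.toarray().ravel() if hasattr(row, "toarray") else np.ravel(row)
    order = np.argsort(row, kind="stable")[::-1]   # rank genes for this cell, highest first
    order = order[row[order] > 0][:k]              # keep only expressed genes, up to K
    sentence = " ".join(np.asarray(adata.var_names)[order])
    return f"species: {species}; tissue: {tissue}; condition: {condition}; cell: {sentence}"
```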
Choose your tasks
– Cell identity: Map unknown cells to known types.
– Tissue context: Assign likely tissue of origin.
– Perturbation reasoning: Ask which drugs or cytokines move a pathway up or down.
– Cluster captions: Summarize clusters with natural-language highlights.
– Cross-species mapping: Ask for human-mouse correspondences using unified symbol logic.
Interpret outputs with care
– Compare to marker gene panels.
– Look for consistency across replicates.
– Use external references for tricky cases.
– For drug or cytokine suggestions, confirm in the wet lab.
Best practices and common pitfalls
Keep symbol hygiene tight
– Use current, official gene symbols.
– Avoid mixing aliases and symbols in the same dataset.
– Flag missing genes or low-coverage cells before ranking.
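A quick hygiene check is to flag symbols missing from an approved list; the tiny set below is illustrative only, and in practice you would load the full HGNC or MGI export you trust.

```python
# Symbol-hygiene sketch: flag gene names that are not in your approved symbol set.
approved_symbols = {"EPCAM", "KRT8", "KRT18", "KRT19", "MUC1", "ERBB2"}  # illustrative subset only
unknown = [g for g in adata.var_names if g not in approved_symbols]
print(f"{len(unknown)} symbols are not in the approved set; review before ranking")
```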
Balance K with model context
– Too short: you lose signal.
– Too long: you add noise and risk truncation.
– Keep K stable across experiments to aid comparability.
Provide honest context, not leading text
– Share species, tissue, and condition facts you know.
– Do not bake the answer into your prompt.
– Use the same prompt for all cells in a batch to reduce bias.
Evaluate with both text and numbers
– Ask the model to return a label and a short reason.
– Score confidence with simple scales or probabilities if supported.
– Track agreement with ground truth and marker panels.
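If you ask for a label plus a reason in one reply, a simple delimiter keeps parsing easy; the reply string below is a made-up example of that format.

```python
# Output-parsing sketch: request "label | reason" replies and split them for scoring.
raw_reply = "epithelial cell | high EPCAM and keratin expression"  # made-up example reply
label, reason = [part.strip() for part in raw_reply.split("|", 1)]
print(label, "->", reason)
```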
C2S-Scale-Gemma-2-27B at a glance
– Model: Gemma-2, 27B parameters, decoder-only Transformer
– Training: Google TPU v5
– Corpus: >800 datasets, >57M cells, human and mouse, with text context
– License: CC-BY-4.0
– Availability: Open weights on Hugging Face (vandijklab); also a 2B variant for light workloads
This stack means you can run research tasks without building a custom model from scratch. You can prototype on the 2B model, then switch to 27B for stronger reasoning.
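Because the weights are standard Hugging Face checkpoints, loading them looks like any other causal LM; the repository id below is an assumption based on the vandijklab organization, so check the model card for the exact name, and expect the 27B variant to need multi-GPU or sharded loading.

```python
# Loading sketch with Hugging Face transformers; the repo id is assumed, verify it on the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vandijklab/C2S-Scale-Gemma-2-27B"   # assumed name; a lighter 2B variant is also mentioned
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tok(prompts[0], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```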
Real-world pattern: context matters
The model’s drug insight highlights a key lesson. Biological effects can depend on immune context. The virtual screen asked a context-aware question and returned a context-conditional hit. Then lab work supported the prediction. This loop of asking with context, getting a hypothesis, and testing in vitro is a strong pattern you can adopt.
– Use context tags like “IFN-low” vs. “IFN-neutral.”
– Compare outputs across contexts, not just compounds.
– Focus on pathway-level readouts (for example, MHC-I program) to reduce noise.
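In code, comparing across contexts simply means varying the condition tag while holding the cell and question fixed; this sketch reuses the make_prompt_block and ask_llm helpers from the earlier sketches.

```python
# Context-comparison sketch: same cell, same question, two immune-context tags.
question = "What increases MHC-I expression in this context?"
for condition in ("IFN-low", "IFN-neutral"):
    block = make_prompt_block(cell_sentences[0], condition=condition)
    print(condition, "->", ask_llm(f"{question} {block}"))
```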
Ethics and safety notes
– Keep all claims at the right level: these are research outputs.
– Avoid clinical language unless you have clinical data.
– Share prompts, data splits, and checks so others can replicate.
From workflow to discovery
The value here is not only the headline result. It is the simple data representation that lets you reuse the entire LLM ecosystem. When you convert scRNA-seq to cell sentences, you gain:
– A compact, comparable cell format
– Clear prompts for repeatable queries
– Easy integration with toolchains for search, summarization, and scoring
– A path from data to testable hypotheses
C2S-Scale-27B shows that this is more than a neat trick. It can surface context-dependent ideas that match biology and hold up in the lab, at least at bench scale.
Conclusion
Turning expression vectors into ordered gene lists is a practical bridge between single-cell data and language models. With a few steps (ranking, tokenizing, and adding light metadata) you can convert scRNA-seq to cell sentences and run powerful, text-native analyses. Use this method to explore cell identity, pathway shifts, and context-aware perturbations, and to turn LLM outputs into testable, lab-ready ideas.