
AI News

21 Oct 2025

Read 15 min

How to convert scRNA-seq to cell sentences for LLMs

Convert scRNA-seq to cell sentences so LLMs can interpret single-cell profiles and speed discovery

Google’s latest research shows a simple, repeatable way to convert scRNA-seq to cell sentences so large language models can read and analyze single-cell data like text. The method ranks genes by expression and turns the top signals into a short, ordered gene list. It unlocks fast cell-type labeling, perturbation reasoning, and context-aware discovery.

Why turning gene expression into text changes the game

Single-cell RNA sequencing gives us rich snapshots of which genes are turned on in each cell. But raw expression matrices are hard to share and reason about. Language models shine when input looks like text. By turning each cell into a compact “sentence” of gene symbols ordered by activity, we get a format that LLMs can parse, compare, and explain. Google Research, Google DeepMind, and Yale built C2S-Scale-Gemma-2-27B on this idea. The team shows that a decoder-only model can learn cell types, tissues, perturbations, and biological facts when expression and text live in one token stream. They also use it to propose a drug-context insight that lab tests later support.

How to convert scRNA-seq to cell sentences

This section is a simple guide you can use to convert scRNA-seq to cell sentences for your own LLM workflows. You do not need to change your lab pipeline. You shape the output so a model can read it.

Step 1: Start with a clean expression matrix

Your input is a cell-by-gene matrix. Each value reflects the expression level for a gene in a single cell.
– Remove low-quality cells using standard filters.
– Use a consistent gene symbol set (for example, HGNC for human, MGI for mouse).
– Align species labels so the model knows whether a cell is human or mouse.
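If you preprocess with Scanpy (an assumption; the article does not name a toolkit), a minimal quality-control sketch might look like this. The file name and thresholds are placeholders to adapt to your data.

```python
# Minimal QC sketch using Scanpy (an assumption; the article prescribes no toolkit).
import scanpy as sc

adata = sc.read_h5ad("my_dataset.h5ad")  # hypothetical input file

# Drop cells with too few detected genes and genes seen in too few cells.
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

# Flag mitochondrial genes and compute basic QC metrics.
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None, inplace=True)

# Remove cells with a high mitochondrial fraction (a common low-quality signal).
adata = adata[adata.obs["pct_counts_mt"] < 20].copy()

# Record species so downstream prompts can state it explicitly.
adata.obs["species"] = "human"
```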

Step 2: Normalize and stabilize values

You want to capture the rank order of genes, not small noise.
– Apply a sensible normalization (for example, library-size scaling).
– Use a variance-stabilizing step if needed to reduce noise across cells.
– Avoid complex transforms that distort rank order.
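Continuing the Scanpy-based sketch: library-size scaling followed by log1p is monotonic within each cell, so per-cell gene ranks are preserved, while heavier per-gene transforms (such as z-scoring) can reorder them.

```python
# Rank-preserving normalization sketch (assumes the `adata` object from Step 1).
import scanpy as sc

# Library-size scaling: each cell sums to the same total, so ranks reflect
# relative expression rather than sequencing depth.
sc.pp.normalize_total(adata, target_sum=1e4)

# log1p gently stabilizes variance and is monotonic, so per-cell gene ranks
# do not change.
sc.pp.log1p(adata)
```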

Step 3: Rank genes per cell and choose top signals

For each cell, sort genes from highest to lowest expression. You will not use all genes. You will emit the top K genes that best describe the cell state.
– Pick a K that fits your model’s context window.
– Keep K consistent across your dataset to reduce drift.
– Handle ties with secondary rules (for example, average expression across neighbors).
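A minimal ranking sketch with NumPy, assuming the AnnData object from the previous steps; K = 100 is an illustrative value, not one taken from the article.

```python
# Per-cell top-K ranking sketch; K is illustrative.
import numpy as np
import scipy.sparse as sp

K = 100  # choose to fit your model's context window

X = adata.X
if sp.issparse(X):
    X = X.toarray()  # fine for small data; process in chunks for large matrices

genes = np.asarray(adata.var_names)

# Sort each row (cell) by descending expression; stable sort keeps ties deterministic.
order = np.argsort(-X, axis=1, kind="stable")
top_k_genes = [genes[order[i, :K]] for i in range(X.shape[0])]
```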

Step 4: Tokenize with gene symbols

The key is to keep the encoding simple and readable.
– Emit a sequence of gene symbols as tokens: GENE1, GENE2, GENE3, …
– Preserve order. The order carries the cell’s “voice.”
– Use separators that your tokenizer can handle. Commas or spaces both work.
For example, a cell sentence might look like: “EPCAM KRT8 KRT18 KRT19 MUC1 ERBB2 …” This is not a random list. It is a ranked list that hints at an epithelial-like identity.
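A small helper, continuing the sketch above, that joins the ranked symbols into a space-separated sentence.

```python
# Cell-sentence construction sketch: join ranked gene symbols with spaces.
def cell_sentence(ranked_genes, k=100, sep=" "):
    """Turn a ranked list of gene symbols into a separator-joined sentence."""
    return sep.join(str(g) for g in ranked_genes[:k])

sentences = [cell_sentence(g, k=K) for g in top_k_genes]
print(sentences[0])  # e.g. "EPCAM KRT8 KRT18 KRT19 MUC1 ERBB2 ..."
```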

Step 5: Add small, structured context

Short labels help the model reason about biology and reduce confusion.
– Prefix with species: “species: human”
– Add tissue if known: “tissue: lung”
– Add condition if relevant: “IFN: low” or “IFN: present”
– Add sample metadata that is safe and useful: “disease: neuroendocrine tumor”
A full prompt block might read: “species: human; tissue: lung; condition: IFN-low; cell: EPCAM KRT8 KRT18 …”
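A prompt-assembly sketch; the field names mirror the example above, but the exact template is a formatting choice, not a published specification.

```python
# Prompt-block assembly sketch; field names and separators are a design choice.
def prompt_block(sentence, species="human", tissue=None, condition=None, disease=None):
    parts = [f"species: {species}"]
    if tissue:
        parts.append(f"tissue: {tissue}")
    if condition:
        parts.append(f"condition: {condition}")
    if disease:
        parts.append(f"disease: {disease}")
    parts.append(f"cell: {sentence}")
    return "; ".join(parts)

block = prompt_block(sentences[0], species="human", tissue="lung", condition="IFN-low")
# -> "species: human; tissue: lung; condition: IFN-low; cell: EPCAM KRT8 KRT18 ..."
```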

Step 6: Define task templates for the LLM

Your sentence is now an input token stream. You can ask for:
– Cell type prediction: “Label this cell: [cell sentence]”
– Tissue assignment: “Which tissue fits best?”
– Cluster captioning: “Describe this cluster with 1–2 sentences.”
– Perturbation effect: “What increases MHC-I expression in this context?”
– Biological Q&A: “Which pathways are likely active here?”
Give the model a few examples (few-shot) to set the tone. Keep instructions clear and short.
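A few-shot template sketch for the cell-type task; the example cells, labels, and wording are illustrative, not drawn from the paper.

```python
# Few-shot cell-type template sketch; examples and phrasing are illustrative.
FEW_SHOT = (
    "Label this cell: species: human; tissue: lung; cell: EPCAM KRT8 KRT18 KRT19 MUC1\n"
    "Answer: epithelial cell\n\n"
    "Label this cell: species: human; tissue: blood; cell: CD3D CD3E TRAC IL7R CCR7\n"
    "Answer: T cell\n\n"
)

def cell_type_prompt(block):
    """Compose few-shot examples plus the query cell, ending at the answer slot."""
    return FEW_SHOT + f"Label this cell: {block}\nAnswer:"

query = cell_type_prompt(block)
```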

Step 7: Check outputs with biological controls

LLMs can sound confident, so you must verify.
– Compare predicted labels with known annotations.
– Use marker genes as sanity checks.
– Run small wet-lab tests for key claims whenever possible.
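A marker-based sanity check sketch; the marker panels here are illustrative and should be replaced with panels you trust for your tissue.

```python
# Marker-gene sanity check sketch: does the predicted label's marker panel
# actually appear near the top of the cell sentence? Panels are illustrative.
MARKERS = {
    "epithelial cell": {"EPCAM", "KRT8", "KRT18", "KRT19"},
    "T cell": {"CD3D", "CD3E", "CD2", "TRAC"},
}

def marker_support(sentence, predicted_label, top_n=50):
    """Fraction of the label's markers found among the first top_n genes."""
    top_genes = set(sentence.split()[:top_n])
    panel = MARKERS.get(predicted_label, set())
    return len(panel & top_genes) / len(panel) if panel else 0.0

support = marker_support(sentences[0], "epithelial cell")
# Low support flags a prediction for manual review or wet-lab follow-up.
```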

Inside C2S-Scale 27B: a brief look

Google’s team trained C2S-Scale-Gemma-2-27B on a combined corpus that includes both expression-derived tokens and biological text. The core model is Gemma-2 27B, a decoder-only Transformer. Training used Google TPU v5. The release uses a permissive CC-BY-4.0 license and provides open weights on Hugging Face.

Training data at scale

– Over 800 public single-cell RNA-seq datasets
– More than 57 million cells across human and mouse
– Linked metadata and textual context for each cell or sample
This large, diverse corpus lets the model learn common and rare cell identities, tissue signatures, and responses to stimuli. Because the input format is plain tokens, the model learns across both “languages”: biology text and gene-ranked sentences.

Why it matters for everyday analysis

– It snaps into standard LLM toolchains.
– It makes programmatic queries simple. You can loop over cells, ask the same question, and score outputs.
– It supports multi-task prompts without custom heads or retraining for each task.

What the model found: an interferon-conditional amplifier

The team ran a virtual screen across more than 4,000 compounds. They asked a targeted question: What increases MHC-I antigen presentation in samples that have some interferon tone, but not in neutral cell-line settings? The model suggested a split behavior for silmitasertib, a CK2 inhibitor. It predicted a strong boost to the MHC-I program with low-dose interferon, but little effect without interferon.

Bench validation in neuroendocrine models

They tested this idea in human neuroendocrine systems not seen in training. The combination of silmitasertib plus low-dose interferon led to a marked increase in antigen presentation. Flow cytometry showed higher HLA-A, HLA-B, and HLA-C signals with the combination. In one model, the gains were roughly 13.6% at 10 nM and around 34.9% at 1000 nM silmitasertib. Across assays, the team reported about a 50% improvement with the combo versus either agent alone.

The interpretation is clear: CK2 inhibition appears to lower the threshold for interferon signaling. It does not start the MHC-I program by itself. But with a small interferon push, the pathway turns on stronger. This could help “cold” tumors show more antigens and become more visible to immune attack. This is still preclinical and in vitro, so it is a hypothesis to test, not a treatment claim.

How to apply this workflow in your lab or team

The method to convert scRNA-seq to cell sentences is simple and portable. You can start with a small dataset and scale up.

Set up a minimal pipeline

– Preprocess and normalize your cells.
– Rank genes per cell and emit a fixed top K.
– Add compact metadata tags.
– Feed into an LLM prompt with clear instructions.
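A rough end-to-end loop that reuses the helpers sketched in Steps 4 to 6; `llm_complete` stands in for whatever client you use to call the model and is hypothetical.

```python
# End-to-end labeling loop sketch; reuses cell_sentence, prompt_block, and
# cell_type_prompt from the earlier sketches. llm_complete is a hypothetical
# callable that sends a prompt to your model and returns its text output.
def label_dataset(top_k_genes, metadata, llm_complete, k=100):
    """metadata: one dict per cell with optional species/tissue/condition keys."""
    labels = []
    for genes, meta in zip(top_k_genes, metadata):
        sentence = cell_sentence(genes, k=k)
        block = prompt_block(sentence, **meta)
        labels.append(llm_complete(cell_type_prompt(block)))
    return labels
```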

Choose your tasks

– Cell identity: Map unknown cells to known types.
– Tissue context: Assign likely tissue of origin.
– Perturbation reasoning: Ask which drugs or cytokines move a pathway up or down.
– Cluster captions: Summarize clusters with natural-language highlights.
– Cross-species mapping: Ask for human-mouse correspondences using unified symbol logic.

Interpret outputs with care

– Compare to marker gene panels.
– Look for consistency across replicates.
– Use external references for tricky cases.
– For drug or cytokine suggestions, confirm in the wet lab.

Best practices and common pitfalls

Keep symbol hygiene tight

– Use current, official gene symbols.
– Avoid mixing aliases and symbols in the same dataset.
– Flag missing genes or low-coverage cells before ranking.

Balance K with model context

– Too short: you lose signal.
– Too long: you add noise and risk truncation.
– Keep K stable across experiments to aid comparability.

Provide honest context, not leading text

– Share species, tissue, and condition facts you know.
– Do not bake the answer into your prompt.
– Use the same prompt for all cells in a batch to reduce bias.

Evaluate with both text and numbers

– Ask the model to return a label and a short reason.
– Score confidence with simple scales or probabilities if supported.
– Track agreement with ground truth and marker panels.
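A simple agreement score, assuming your annotations live in an obs column such as `cell_type` (an assumption about your data layout) and that `labels` comes from the loop sketched earlier.

```python
# Agreement scoring sketch: compare model labels with known annotations.
def agreement(predicted_labels, known_labels):
    """Fraction of cells where the model label matches the annotation."""
    matches = sum(p.strip().lower() == k.strip().lower()
                  for p, k in zip(predicted_labels, known_labels))
    return matches / max(len(known_labels), 1)

acc = agreement(labels, list(adata.obs["cell_type"]))  # column name is an assumption
print(f"label agreement: {acc:.1%}")
```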

C2S-Scale-Gemma-2-27B at a glance

– Model: Gemma-2, 27B parameters, decoder-only Transformer
– Training: Google TPU v5
– Corpus: >800 datasets, >57M cells, human and mouse, with text context
– License: CC-BY-4.0
– Availability: Open weights on Hugging Face (vandijklab); also a 2B variant for light workloads
This stack means you can run research tasks without building a custom model from scratch. You can prototype on the 2B model, then switch to 27B for stronger reasoning.
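If you load the open weights through Hugging Face Transformers, a sketch might look like the following. The exact repository id under vandijklab is an assumption here, so check the organization page for the current name, and start with the 2B variant if GPU memory is tight.

```python
# Loading sketch via Hugging Face Transformers; the repo id is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "vandijklab/C2S-Scale-Gemma-2-27B"  # assumed id; verify on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)

# `query` is the few-shot prompt built in Step 6.
inputs = tokenizer(query, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```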

Real-world pattern: context matters

The model’s drug insight highlights a key lesson. Biological effects can depend on immune context. The virtual screen asked a context-aware question and returned a context-conditional hit. Then lab work supported the prediction. This loop—ask with context, get a hypothesis, test in vitro—is a strong pattern you can adopt.
– Use context tags like “IFN-low” vs. “IFN-neutral.”
– Compare outputs across contexts, not just compounds.
– Focus on pathway-level readouts (for example, MHC-I program) to reduce noise.

Ethics and safety notes

– Keep all claims at the right level: these are research outputs.
– Avoid clinical language unless you have clinical data.
– Share prompts, data splits, and checks so others can replicate.

From workflow to discovery

The value here is not only the headline result. It is the simple data representation that lets you reuse the entire LLM ecosystem. When you convert scRNA-seq to cell sentences, you gain:
– A compact, comparable cell format
– Clear prompts for repeatable queries
– Easy integration with toolchains for search, summarization, and scoring
– A path from data to testable hypotheses
C2S-Scale-27B shows that this is more than a neat trick. It can surface context-dependent ideas that match biology and hold up in the lab, at least at bench scale.

Conclusion

Turning expression vectors into ordered gene lists is a practical bridge between single-cell data and language models. With a few steps—ranking, tokenizing, and adding light metadata—you can convert scRNA-seq to cell sentences and run powerful, text-native analyses. Use this method to explore cell identity, pathway shifts, and context-aware perturbations, and to turn LLM outputs into testable, lab-ready ideas.

(Source: https://www.marktechpost.com/2025/10/17/google-ai-releases-c2s-scale-27b-model-that-translate-complex-single-cell-gene-expression-data-into-cell-sentences-that-llms-can-understand/)


FAQ

Q: What is a “cell sentence” in single-cell analysis?
A: A “cell sentence” is an ordered list of gene symbols ranked by expression that represents a single cell in a compact, textual form. The approach is exactly how researchers convert scRNA-seq to cell sentences so language models can parse and reason about cellular states.

Q: Why should I convert scRNA-seq to cell sentences before using an LLM?
A: Converting creates a compact, text-native representation that LLMs can parse, compare, and explain, unlocking tasks like cell-type labeling, perturbation reasoning, and context-aware discovery. It aligns expression tokens with biological text so a single model can learn and apply both formats without custom retraining.

Q: What are the basic preprocessing steps needed to convert scRNA-seq to cell sentences?
A: Start with a cleaned cell-by-gene matrix, apply normalization and variance stabilization to preserve rank order, then rank genes per cell and emit a fixed top-K list of gene symbols as tokens. Add simple structured metadata tags (species, tissue, condition) and use consistent gene symbol sets to keep tokenization reliable.

Q: How should I choose the top-K genes when building cell sentences?
A: Pick a K that fits your model’s context window and keep it consistent across the dataset to reduce drift. Too-short lists lose signal while too-long lists add noise and risk token truncation, and you should use secondary tie-breaking rules to handle equal-expression genes.

Q: What metadata should I include with each cell sentence to improve reasoning?
A: Include short, structured tags such as species (for example, “species: human”), tissue, and condition labels like “IFN-low” to give clear context that reduces confusion. Adding safe sample metadata such as disease type is useful, but avoid leading text that bakes the answer into the prompt.

Q: What analysis tasks become practical after you convert scRNA-seq to cell sentences?
A: You can ask LLMs to predict cell type, assign tissue, caption clusters, reason about perturbation effects (for example, MHC-I program changes), and run biological Q&A using few-shot templates. Because the input is text-native, the same model can execute multi-task prompts and programmatic queries without custom heads.

Q: How should I validate and interpret LLM outputs generated from cell sentences?
A: Verify model outputs against known annotations and marker-gene panels, check consistency across replicates, and score agreement or confidence instead of accepting single answers. Treat drug or cytokine suggestions as hypothesis-generating and confirm key claims with targeted wet-lab tests.

Q: Are the C2S-Scale-Gemma-2-27B model and its data available for research use?
A: The article reports C2S-Scale-Gemma-2-27B is built on Gemma-2 27B, trained on Google TPU v5 and released under CC-BY-4.0, with open weights and usage docs on Hugging Face and a 2B variant for lighter workloads. The published training corpus aggregates over 800 public scRNA-seq datasets totaling more than 57 million cells across human and mouse.
