Notes about Statistical Models in Single-cell RNA Sequencing Data

Da (Derek) Kuang
3 min readOct 26, 2020

--

The Negative Binomial distribution is a powerful model because it captures the over-dispersion of the nature of the count data generated by RNA-Seq and scRNA-Seq experiments. In this post, I will explain how negative binomial distribution associates the sequencing experiment to the integers in the gene-by-cell matrix.

How 10X Genomics Works

Before discussing the statistical models, let’s first go over the 10X Genomics workflow for single-cell RNA sequencing.

  1. Gel bead in EMulsion (GEM) encapsulates each cell.
  2. Add UMI and Barcodes
  3. Library preparation
  4. Cluster amplification
  5. Sequencing
  6. Alignment

Why Negative Binomial Distribution

Gene is a piece of a coding sequence to have a certain function. There could be many mRNA molecules carrying the same gene in one cell.

The sequencing experiment's output is a gene-by-cell matrix, where each number in the matrix represents the number of molecules detected by the sequencer.

The expression level X of a gene G in a cell is a random variable, and we are interested in its distribution. The distribution would be the foundation for us to correct potential noise in the dataset (normalization), and to identify anomalies (differentially expression).

During the scRNA-seq, in step 4, Cluster amplification, the numbers of replication to each mRNA molecule can be described as exponential random variables. Moreover, the expression level X of the gene G, is actually the sum of the numbers of all the corresponding replicated mRNA molecules. Note that the sum of exponential random variables is from Gamma distribution. (The prove can be found here.) Therefore, the random variable sampled from the Gamma distribution represents the “expected” expression level X of the gene G in a cell.

However, it is not the end of the story. For a gene on the fragmented cDNA to be counted as the count in the gene-by-cell matrix, the segment encoding the gene (Exton) must be within ~75 bp (read length) from the adapter. Otherwise, it will not be recognized by the Sequencer in step 5. The distance between Exton and the adaptor is determined by how the RNA is fragmented. This process can be described by a Poisson distribution and the mean should be the number of molecules during the replication, which is a Gamma distribution.

Given that the Negative Binomial model can be seen as a hieratical model of Poisson with the rate of Gamma distribution, we assume that the random variable X is sampled from Negative Binomial to represent the expression level of gene G.

Expansion

Here we only consider steps 4 and 5 to set up the distribution model. It is possible to introduce zeros as drop-outs in steps 2,3 and 5. So that sometimes people use Zero-inflated Negative binomial to describe X. But recently, some scholars disagree with it and argue that Negative Binomial is enough. I will compose another note about it in detail.

Moreover, a batch effect has been observed among data from different experiments. The origin of batch effects is not fully understood but results at least in part from differences in average capture efficiencies across experiments. bayNorm introduces a Binomial distribution to descript the stochastic capture efficiencies.

--

--

Da (Derek) Kuang
Da (Derek) Kuang

Written by Da (Derek) Kuang

Computer Science Ph.D. Student @ UPenn

No responses yet