Bioinformatics pipeline for single-cell 3' UTR isoform quantification
A bioinformatics pipeline for single-cell 3’ UTR isoform quantification.
The scUTRquant pipeline builds on kallisto bus
to provide a reusable tool for
quantifying 3’ UTR isoforms from 3’-end tag-based scRNA-seq datasets. The pipeline
is based on Snakemake and provides both Conda and Docker images to satisfy software
requirements. It includes a prebuilt reference UTRome for mm10, which ensures a
consistent set of features (3’ UTR isoforms) across different runs. In total, this provides
a rapid pipeline for recovering 3’ UTR isoform counts from common scRNA-seq datasets.
The pipeline takes as input:
Note on UTRome Index
The pipeline includes code to download a prebuilt mm10 UTRome GTF and kallisto index.
This prebuilt index was generated by augmenting the protein coding transcripts that have
verified 3’ ends in the GENCODE vM21 annotation with high-confidence cleavage sites called
from the Mouse Cell Atlas dataset. This augmented transcriptome was then truncated to include
only the last 500 nts of each transcript and then deduplicated. Finally, the merge file contains
information on transcripts whose cleavage sites differ by fewer than 200 nts, which corresponds
to the empirical resolution limit for kallisto
quantification as determined by simulations.
Please see our accompanying manuscript for more details.
The primary output of the pipeline is a Bioconductor SingleCellExperiment
object.
The counts
in the object is a sparse Matrix
of 3’ UTR isoform counts; the rowRanges
is a GenomicRanges
of the 3’ UTR isoforms; the rowData
is a DataFrame
with additional
information about 3’ UTR isoforms; and the colData
is a DataFrame
populated with sample
metadata and optional user-provide cell annotations.
To assist users in quality control, the pipeline additionally generates HTML reports for each sample.
The pipeline is configured to retain intermediate files, such as BUS and MTX files. Advanced users can readily customize the pipeline to only generate the files they require. For example, users who prefer to work with alternative scRNA-seq data structures, such as those used in Scanpy or Seurat, may wish to terminate the pipeline at MTX generation.
The pipeline can use either Conda/Mamba or Singularity to provide the required software.
Snakemake can use Conda to install the needed software. This configuration requires:
If Conda is not already installed, we strongly recommend installing Mambaforge. If Conda was previously installed, strongly recommend installing Mamba:
conda install -n base -c conda-forge mamba
Snakemake can use the pre-built scUTRsquant Docker image to provide all additional software. This configuration requires installing:
git clone git@github.com:Mayrlab/scUTRquant.git
cd scUTRquant/extdata/targets/utrome_mm10_v1
. download_utrome.sh
Reuse Tip: For use across multiple projects, it is recommended to centralize
these files and change the entries in the configfile
for utrome_gtf
,
utrome_kdx
, and utrome_merge
to point to the central location. In that
case, one does not need to redownload the files.
cd scUTRquant/extdata/bxs
. download_10X_whitelists.sh
Reuse Tip: Similar to the UTRome files, these can also be centralized
and referenced by the bx_whitelist
variable in the configfile
.
Examples are provided in the scUTRquant/examples
folder. Each includes a script
for downloading the raw data, a sample_sheet.csv
formatted for use in the pipeline,
and a config.yaml
file for running the pipeline.
Look for output files in the qc/
and data/
folders.
Note that the config.yaml
uses paths relative to the scUTRquant
folder.
cd scUTRquant/examples/neuron_1k_v3_bam/
. download.sh
Run the pipeline.
Conda Mode
cd scUTRquant
snakemake --use-conda --configfile examples/neuron_1k_v3_bam/config.yaml
Singularity Mode
cd scutr-quant
snakemake --use-singularity --configfile examples/neuron_1k_v3_bam/config.yaml
cd scUTRquant/examples/heart_1k_v3_fastq/
. download.sh
Run the pipeline.
Conda Mode
cd scUTRquant
snakemake --use-conda --configfile examples/heart_1k_v3_fastq/config.yaml
Singularity Mode
cd scUTRquant
snakemake --use-singularity --configfile examples/heart_1k_v3_fastq/config.yaml
The Snakemake configfile
specifies the parameters used to run the
pipeline. The following keys are expected:
dataset_name
: name used in the final SingleCellExperiment
objecttmp_dir
: path to use for temporary filessample_file
: CSV-formatted file listing the samples to be processedsample_regex
: regular expression used to match sample IDs; including a specific
regex helps to constrain Snakemake’s DAG-generationoutput_type
: a list of outputs, including "txs"
and/or "genes"
target
: name of target(s) to which to pseudoalign; valid targets are defined in
the extdata/targets/targets.yaml
; multiple targets can be specified in list formattech
: argument to kallisto bus
indicating the scRNA-seq technology; see
the documentation for supported valuesstrand
: argument to kallisto bus
indicating the orientation of sequence reads
with respect to transcripts; all 10X 3’-end libraries use --fr-stranded
;
omitting this argument eliminates the ability to correctly assign reads to
transcripts when opposing stranded genes overlapbx_whitelist
: file of valid barcodes used in bustools correct
min_umis
: minimum number of UMIs per cell; cells below this threshold are excludedcell_annots
: (optional) CSV file with a cell_id
entry that matches thee <sample_id>_<cell_bx>
formatSnakemake can draw values for config
in three ways:
scUTRquant/config.yaml
: This file is listed as the configfile
in the Snakefile.--configfile config.yaml
: The file provided at the commandline.--config argument=value
: A specific value for an argumentThis list runs from lowest to highest precedence. Configuration values that do not differ from those in scUTRquant/config.yaml
can be left unspecfied in the YAML given by the --configfile
argument. That is, one can use the scUTRquant/config.yaml
to define shared settings, and only list dataset-specific config values in the dataset’s YAML.
The sample_file
provided in the Snakemake configuration is expected to be a CSV
with at least the following columns:
sample_id
: a unique identifier for the sample; used in file names and paths
of intermediate files derived from the samplefile_type
: indicates whether sample input is 'bam'
or 'fastq'
formatfiles
: a semicolon-separated list of files; for multi-run (e.g., multi-lane)
samples, the files must have the order:
lane1_R1.fastq;lane1_R2.fastq;lane2_R1.fastq;lane2_R2.fastq;...
The extdata/targets.yaml
defines the targets available to pseudoalign to. The default configuration provides utrome_mm10_v1
, but additional entries can be added. A target has the following fields:
path
: location where files are relative togenome
: genome identifier (e.g., mm10
)gtf
: GTF annotation of UTRome; used in annotating rowskdx
: Kallisto index for UTRomemerge
: TSV for merging features (isoforms)tx_annots
: (optional) RDS file containing Bioconductor DataFrame object with annotations for transcriptsgene_annots
: (optional) RDS file containing Bioconductor DataFrame object with annotations for genesThe Bioconductor package txcutr
provides methods for generating truncated transcriptome
annotations. The txcutr-db repository provides an example Snakemake
pipeline for using txcutr
to generate the files needed for the custom target, starting from Ensembl or
GENCODE annotations.
Recommendations: For 10X Chromium 3’ Single Cell libraries, we use a 500 nt truncation length and a 200 nt merge distance (details in the sqUTRquant manuscript). We recommend filtering transcripts to only protein-coding transcripts with validated 3’ ends. For GENCODE, this means requiring
transcript_type "protein_coding"
and excluding transcripts with the tag mRNA_end_NF.
Once the GTF, KDX, and TSV files are generated, we recommend placing them in a new folder under extdata/targets
.
Then, edit the extdata/targets/targets.yaml
to include a new entry. This will look something like:
extdata/targets/targets.yaml
custom_target_name:
path: "extdata/targets/custom_target_name/"
genome: "mm10"
gtf: "custom_target.gtf"
kdx: "custom_target.kdx"
merge_tsv: "custom_target.merge.tsv"
tx_annots: null
gene_annots: null
Note that the pipeline supports running on multiple targets in parallel. This can be done in the
configuration file, like so,
config.yaml
target:
- target1
- target2
# ...
Or, if specifying targets at Snakemake invocation, one would use the syntax:
snakemake --config target='[target1,target2]' # ...
The rules in the Snakefile
include threads
and resources
arguments per rule. These values are compatible for use with Snakemake profiles for cluster deployment. The current defaults will attempt to use up to 16 threads and 16GB of memory in the kallisto bus
step. Please adjust to fit the resources available on the deployment cluster. We recommend that cluster profiles include both --use-singularity
and --use-conda
flags by default. Following this recommendation, an example run, for instance on neuron_1k_v3_fastq, with profile name profile_name
, would take the form:
snakemake --profile profile_name --configfile examples/neuron_1k_v3_fastq/config.yaml
Fansler, M. M., Zhen, G., & Mayr, C. (2021). Quantification of alternative 3′UTR isoforms from single cell RNA-seq data with scUTRquant. BioRxiv, 2021.11.22.469635. https://doi.org/10.1101/2021.11.22.469635