So I’ve spent months trying to converge on a solid method of calling tumors and their genotypes from HiSeq DNA sequencing files. At this point, I feel comfortable enough with the approach to proselytizing it. Here goes…
The tumors look something like this in the mouse:
Reads from the Illumina HiSeq FASTQ file look as follows:
There are 25 nucleotides between the beginning of the forward primer and the 1st (sgRNA) barcode, which is 8 nt long, and then two spacers before the 2nd (random) barcode (arranged in 5 nt batteries of degeneracy repeated in triplicate for 15 nts of degeneracy). These reads are processed as follows:
- Bad reads are discarded (10-15% of reads),
- Reads are pooled by sgID barcode (a single mismatch or indel is permitted),
- sgID pools are partitioned into distinct tumors via DADA2,
- Barcode abundance is then scaled to absolute cell #,
- tumors with <500 cells or extreme GC content are discarded.
sgIDs are partitioned before tumors because (a) knowledge of their expected sequence is useful, and (b) DADA2 is computationally-intesive (partitioning benefits memory usage and parallelism).
“Bad reads” are reads that either (a) have >2 Expected Errors (based on QC scores) (<1% of reads), (b) have truncated barcodes (4% of reads), (c) have poor alignment to the lentiviral regions (~5% of reads), or (d) have an unknown sgID barcode (2-10% of reads, depending upon sample).
Identifying tumors using DADA2, after pooling based on sgIDs, proved to be fundamentally difficult. After pilling-up barcode abundances based on exact-matches, the data that looks essentially like this:
Big tumors are so abundant that repeat sequencing errors are expected. Jamie Blundell encountered this same problem in the yeast experiments and directed me to DADA2. However, Jamie had time-series data where he observed single- and double-nucleotide sequencing errors creep-up into the FASTQ files over time as the fittest barcoded clones grew to dominate the population; we lacked time-series data.
Dealing with this problem carefully was necessary for two reasons:
- We, unexpectedly, observed tumors sizes varying across >5 decades of sizes, yet HiSeq has an intrinsic error rate of 2 decades (1%). The actual error rate is then augmented by PCR errors, artifacts from the lentiviral library construction, and the peculiarities of our DNA library construction (single-read, 100 bp, low-diversity rapid-runs are new to both us and our vendors) and who knew by how much?
- Knowing the true # of distinct legions (not simply the # of large tumors) is critical. Mean tumor size (in actuality, a MLE of the mean) proved most effective at identifying known tumor suppressors. Knowing the mean requires knowing the number of tumors.
Anyways, DADA2 operates by combining all reads into 1 giant tumor. Each unique pileup of reads i is then assigned a probability pi that it differs from this giant tumor based on an error model. Each pi is calculated via a NW alignment [a ubiquitous algorithm in Comp Bio that considers FASTQ quality scores and indels (in DADA2’s implementation). The relative cost of these errors is parameterized (in my implementation) from the non-degenerate regions of the lentivirus]. If pi is less than a threshold Ω, the read pileup is split from the giant tumor and the remaining pileups are placed within the closer group. The process is repeated until no more pileups exceed the probability threshold.
There is no good way to set Ω. I’ve tinkered with it enough to get somewhere useful: the tallies are robust to variation in reads per cell, reproducible across replicate Illumina lanes, and effective at identifying known tumor suppressors. However, my approach is ad hoc.
Being very conservative in calling tumors leads to over-bundling:
The very small read pileups never split from the largest tumor, which then stays categorically larger than the others. This approach also fails to estimate the true # of lesions, which is necessary for estimating mean tumor size.
So you have to throw caution to the wind. I settled on a very permissive probability threshold, Ω = 10-10 (10-40 is default). This minimizes the number of false positive sgIDs within Cas9 negative mice among the values I tested Ω = [10-10, 10-15, 10-20, 10-30, 10-40, 10-60, 10-80, 10-120]. It is hard to imagine how any biological phenomenon could lead to variation in tumor sizes between sgIDs within Cas9 negative mice.
Lastly, I tried to identify a resolution-limit to tumor calling that was robust to variation in read-depth. I used the mini-pool experiment as sequencing happened to be on 3 different HiSeq machines with different quantities of reads. Truncating at 500 cells, alongside a permissive partitioning of read pileups, minimized variation between HiSeq runs of the same samples:
||Ω = 1e-40
||Ω = 1e-10
||48 M Reads
||26 M Reads
||29 M Reads
||Coef of Variation:
Tying it all together, I obtain good agreement between barcode sizes of the same mouse sequenced on two different machines: