As someone who works on DNA sequencing software the dirty secret is qscores are ...

_ihaque · on June 25, 2016

Range isn't an issue (any reasonable scheme won't assign codes to scores that never occur), but you're right that qscores are precise but not necessarily accurate. Quantization of quality scores is a common solution (it is optional in CRAM, mentioned above). There's also recent work on better methods for doing it [1,2].

The problem is that if you have to maintain BAMs for regulatory reasons, lossy q-scores may not be sufficient for compliance, because you 1) have lost part of the original data and 2) may not be able to exactly reconstruct your analysis results (unless, of course, you did the analysis on the quantized scores).
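To make the trade-off concrete, here is a minimal sketch of q-score quantization in Python. The bin boundaries, the function names, and the example quality string are all assumptions made for illustration (loosely modelled on 8-level binning schemes); they are not taken from CRAM or from the papers cited below. It shows the two effects at issue: the per-symbol entropy drops, so the scores should compress better, and the original values can no longer be recovered.

```python
import math
from collections import Counter

# Hypothetical bins: (low, high, representative), inclusive Phred ranges.
# These boundaries are illustrative assumptions, not a standard.
BINS = [
    (0, 1, 0), (2, 9, 6), (10, 19, 15), (20, 24, 22),
    (25, 29, 27), (30, 34, 33), (35, 39, 37), (40, 93, 40),
]

def decode_phred33(qual_string):
    """Convert a Phred+33 ASCII quality string to integer scores."""
    return [ord(c) - 33 for c in qual_string]

def quantize(q):
    """Map one Phred score to the representative value of its bin."""
    for low, high, rep in BINS:
        if low <= q <= high:
            return rep
    raise ValueError(f"score out of range: {q}")

def entropy_bits(values):
    """Empirical Shannon entropy in bits per symbol."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Made-up quality string for demonstration only.
quals = decode_phred33("IIIHGF@@<;5+#")
binned = [quantize(q) for q in quals]

print("original:", quals, f"{entropy_bits(quals):.2f} bits/score")
print("binned:  ", binned, f"{entropy_bits(binned):.2f} bits/score")
```

Several distinct scores (39, 38 and 37 here) collapse onto one representative, which is exactly why an analysis run on the original scores may not be reproducible from the binned file.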

Thus, it would still be interesting to see better lossless compression methods.

[1] http://bioinformatics.oxfordjournals.org/content/early/2014/...

[2] http://web.stanford.edu/~iochoa/publishedPublications/2015_q...

danieltillett · on June 25, 2016

The qscores do occur; it's just that the error bars on them are quite large in practice. It certainly would be interesting to see whether the analysis can be reconstructed from quantized qscores.