
As someone who works on DNA sequencing software, the dirty secret is that qscores are excessively precise. They technically span the range 0 to 99, but because qscores can't really be predicted to better than about +-5, the real range is more like (0 to 7)*10, i.e. roughly eight distinguishable levels. With better bit-packing it should be possible to compress them much better.
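
To make the bit-packing idea concrete, here is a minimal sketch, assuming scores are binned into eight levels of width 10 (everything at 70 or above lands in the top bin) and each bin index is stored in 3 bits. The function names `quantize`, `pack3` and `unpack3` and the exact binning are illustrative assumptions, not any sequencer's or file format's actual scheme, and the binning step is lossy.

```python
def quantize(q: int) -> int:
    """Map a quality score (nominally 0..99) to a bin index 0..7.

    Assumed binning: width-10 bins, with everything >= 70 capped into bin 7.
    """
    return min(max(q, 0) // 10, 7)


def pack3(bins) -> bytes:
    """Pack a sequence of 3-bit values into bytes, most significant bits first."""
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently held in acc
    out = bytearray()
    for b in bins:
        acc = (acc << 3) | (b & 0b111)
        nbits += 3
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1  # drop the bits already written
    if nbits:
        # Flush the remainder, zero-padded on the right.
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)


def unpack3(data: bytes, count: int) -> list:
    """Recover `count` 3-bit values from bytes produced by pack3."""
    acc = 0
    nbits = 0
    out = []
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 3 and len(out) < count:
            nbits -= 3
            out.append((acc >> nbits) & 0b111)
        acc &= (1 << nbits) - 1
    return out


scores = [2, 15, 38, 40, 41, 73, 99, 12]
bins = [quantize(q) for q in scores]
packed = pack3(bins)
restored = unpack3(packed, len(bins))
```

Eight scores fit in three bytes here instead of eight, before any entropy coding is applied. The original scores cannot be recovered, only their bins.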


Range isn't an issue (any reasonable scheme won't assign codes to scores that never occur), but you're right that qscores are precise without necessarily being accurate. Quantization of quality scores is a common solution (it is optional in CRAM, mentioned above). There's also recent work on better methods for doing it [1,2].
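
As a rough illustration of why the nominal range doesn't matter to an entropy coder, the sketch below computes the empirical (order-0) entropy of a list of scores. The function name `entropy_bits_per_symbol` and the sample data are made up for illustration. The result depends only on how often each distinct value occurs, not on how wide the allowed range is.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(scores) -> float:
    """Empirical order-0 entropy in bits per score.

    This is the lower bound for a coder that models each score
    independently; values that never occur contribute nothing.
    """
    counts = Counter(scores)
    n = len(scores)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Two hypothetical streams with the same frequency pattern:
# one uses the extremes of 0..99, the other uses only 0 and 1.
wide = [0, 99, 0, 99, 0, 99, 0, 99]
narrow = [0, 1, 0, 1, 0, 1, 0, 1]

h_wide = entropy_bits_per_symbol(wide)
h_narrow = entropy_bits_per_symbol(narrow)
```

Both streams cost the same number of bits per score under this model. Quantization helps because it merges distinct values and so changes the frequency distribution, not because it narrows the range.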

The problem is that if you have to maintain BAMs for regulatory reasons, lossy q-scores may not be sufficient for compliance, because you 1) have lost part of the original data and 2) may not be able to exactly reconstruct your analysis results (unless, of course, you did the analysis on the quantized scores).

Thus, it would still be interesting to see better lossless compression methods.

[1] http://bioinformatics.oxfordjournals.org/content/early/2014/...

[2] http://web.stanford.edu/~iochoa/publishedPublications/2015_q...


The qscores do occur; it's just that the error bars on them are quite large in practice. It would certainly be interesting to see whether the analysis can be constructed from quantized qscores.



