Indexed on: 17 Feb '17. Published on: 17 Feb '17. Published in: IEEE Transactions on Nanobioscience.
In mass spectrometry-based de novo protein sequencing, it is hard to complete the sequence of the whole protein. Motivated by this, we study the (one-sided) problem of filling a protein scaffold S with some missing amino acids, given a sequence of contigs none of which is allowed to be altered, with respect to a complete reference protein P of length n, such that the BLOSUM62 score between P and the filled sequence S' is maximized. We show that this problem is solvable in polynomial time, namely O(n^26). We also consider the case in which the contigs are not of high quality and are concatenated into an (incomplete) sequence I, where the missing amino acids can be inserted anywhere in I to obtain I', such that the BLOSUM62 score between P and I' is maximized. We show that this problem is also solvable in polynomial time, namely O(n^22). Because of this high time complexity, both algorithms are impractical; we therefore present several algorithms based on greedy and local search that aim to solve the problems in practice. The empirical results, based on some antibody and mammalian proteins, show that the algorithms can fill protein scaffolds with high quality, provided that a good pair of scaffold and reference is given.
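To make the second problem concrete, the following is a minimal sketch of one possible greedy strategy for inserting missing amino acids into an incomplete sequence I. It is not the authors' algorithm. Several things here are assumptions made only for illustration: the toy substitution score (+2 match, -1 mismatch) stands in for BLOSUM62, the linear gap penalty `GAP` is hypothetical, the score between P and a candidate is taken to be a global alignment score, and the names `align_score` and `greedy_fill` are invented.

```python
from collections import Counter

GAP = -2  # hypothetical linear gap penalty (assumption)


def score(a, b):
    # Toy substitution score standing in for BLOSUM62 (assumption).
    return 2 if a == b else -1


def align_score(p, s):
    # Global (Needleman-Wunsch) alignment score, computed row by row.
    prev = [j * GAP for j in range(len(s) + 1)]
    for i in range(1, len(p) + 1):
        cur = [i * GAP]
        for j in range(1, len(s) + 1):
            cur.append(max(
                prev[j - 1] + score(p[i - 1], s[j - 1]),  # align p[i-1] with s[j-1]
                prev[j] + GAP,                            # gap in s
                cur[j - 1] + GAP,                         # gap in p
            ))
        prev = cur
    return prev[-1]


def greedy_fill(reference, incomplete, missing):
    # Repeatedly insert the single (residue, position) pair that yields
    # the highest alignment score against the reference.
    current = incomplete
    remaining = Counter(missing)
    while sum(remaining.values()) > 0:
        best = None
        for aa in sorted(remaining):
            if remaining[aa] == 0:
                continue
            for pos in range(len(current) + 1):
                cand = current[:pos] + aa + current[pos:]
                sc = align_score(reference, cand)
                if best is None or sc > best[0]:
                    best = (sc, cand, aa)
        _, current, chosen = best
        remaining[chosen] -= 1
    return current


if __name__ == "__main__":
    print(greedy_fill("ACDEFG", "ACFG", "DE"))
```

A greedy pass like this evaluates every remaining residue at every position in each round, so it is far cheaper than an exact method but carries no optimality guarantee. A local search step could then try to improve the result by moving inserted residues.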