Package Bio :: Package SeqIO :: Module QualityIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.SeqIO.QualityIO

   1  # Copyright 2009-2010 by Peter Cock.  All rights reserved. 
   2  # This code is part of the Biopython distribution and governed by its 
   3  # license.  Please see the LICENSE file that should have been included 
   4  # as part of this package. 
   5  # 
   6  # This module is for reading and writing FASTQ and QUAL format files as 
   7  # SeqRecord objects, and is expected to be used via the Bio.SeqIO API. 
   8   
   9  """Bio.SeqIO support for the FASTQ and QUAL file formats. 
  10   
  11  Note that you are expected to use this code via the Bio.SeqIO interface, as 
  12  shown below. 
  13   
  14  The FASTQ file format is used frequently at the Wellcome Trust Sanger Institute 
  15  to bundle a FASTA sequence and its PHRED quality data (integers between 0 and 
  16  90).  Rather than using a single FASTQ file, often paired FASTA and QUAL files 
  17  are used containing the sequence and the quality information separately. 
  18   
  19  The PHRED software reads DNA sequencing trace files, calls bases, and 
  20  assigns a non-negative quality value to each called base using a logged 
  21  transformation of the error probability, Q = -10 log10( Pe ), for example:: 
  22   
  23      Pe = 1.0,         Q =  0 
  24      Pe = 0.1,         Q = 10 
  25      Pe = 0.01,        Q = 20 
  26      ... 
  27      Pe = 0.00000001,  Q = 80 
  28      Pe = 0.000000001, Q = 90 
  29   
  30  In typical raw sequence reads, the PHRED quality valuea will be from 0 to 40. 
  31  In the QUAL format these quality values are held as space separated text in 
  32  a FASTA like file format.  In the FASTQ format, each quality values is encoded 
  33  with a single ASCI character using chr(Q+33), meaning zero maps to the 
  34  character "!" and for example 80 maps to "q".  For the Sanger FASTQ standard 
  35  the allowed range of PHRED scores is 0 to 93 inclusive. The sequences and 
  36  quality are then stored in pairs in a FASTA like format. 
  37   
  38  Unfortunately there is no official document describing the FASTQ file format, 
  39  and worse, several related but different variants exist.  Reasonable 
  40  documentation exists at: http://maq.sourceforge.net/fastq.shtml 
  41   
  42  The good news is that Roche 454 sequencers can output files in the QUAL format, 
  43  and sensibly they use PHREP style scores like Sanger.  Converting a pair of 
  44  FASTA and QUAL files into a Sanger style FASTQ file is easy. To extract QUAL 
  45  files from a Roche 454 SFF binary file, use the Roche off instrument command 
  46  line tool "sffinfo" with the -q or -qual argument.  You can extract a matching 
  47  FASTA file using the -s or -seq argument instead. 
  48   
  49  The bad news is that Solexa/Illumina did things differently - they have their 
  50  own scoring system AND their own incompatible versions of the FASTQ format. 
  51  Solexa/Illumina quality scores use Q = - 10 log10 ( Pe / (1-Pe) ), which can 
  52  be negative.  PHRED scores and Solexa scores are NOT interchangeable (but a 
  53  reasonable mapping can be achieved between them, and they are approximately 
  54  equal for high quality reads). 
  55   
  56  Confusingly early Solexa pipelines produced a FASTQ like file but using their 
  57  own score mapping and an ASCII offset of 64. To make things worse, for the 
  58  Solexa/Illumina pipeline 1.3 onwards, they introduced a third variant of the 
  59  FASTQ file format, this time using PHRED scores (which is more consistent) but 
  60  with an ASCII offset of 64. 
  61   
  62  i.e. There are at least THREE different and INCOMPATIBLE variants of the FASTQ 
  63  file format: The original Sanger PHRED standard, and two from Solexa/Illumina. 
  64   
  65  You are expected to use this module via the Bio.SeqIO functions, with the 
  66  following format names: 
  67   
  68   - "qual" means simple quality files using PHRED scores (e.g. from Roche 454) 
  69   - "fastq" means Sanger style FASTQ files using PHRED scores and an ASCII 
  70      offset of 33 (e.g. from the NCBI Short Read Archive). These can hold PHRED 
  71      scores from 0 to 93. 
  72   - "fastq-sanger" is an alias for "fastq". 
  73   - "fastq-solexa" means old Solexa (and also very early Illumina) style FASTQ 
  74      files, using Solexa scores with an ASCII offset 64. These can hold Solexa 
  75      scores from -5 to 62. 
  76   - "fastq-illumina" means new Illumina 1.3+ style FASTQ files, using PHRED 
  77      scores but with an ASCII offset 64, allowing PHRED scores from 0 to 62. 
  78   
  79  We could potentially add support for "qual-solexa" meaning QUAL files which 
  80  contain Solexa scores, but thus far there isn't any reason to use such files. 
  81   
  82  For example, consider the following short FASTQ file:: 
  83   
  84      @EAS54_6_R1_2_1_413_324 
  85      CCCTTCTTGTCTTCAGCGTTTCTCC 
  86      + 
  87      ;;3;;;;;;;;;;;;7;;;;;;;88 
  88      @EAS54_6_R1_2_1_540_792 
  89      TTGGCAGGCCAAGGCCGATGGATCA 
  90      + 
  91      ;;;;;;;;;;;7;;;;;-;;;3;83 
  92      @EAS54_6_R1_2_1_443_348 
  93      GTTGCTTCTGGCGTGGGTGGGGGGG 
  94      + 
  95      ;;;;;;;;;;;9;7;;.7;393333 
  96   
  97  This contains three reads of length 25.  From the read length these were 
  98  probably originally from an early Solexa/Illumina sequencer but this file 
  99  follows the Sanger FASTQ convention (PHRED style qualities with an ASCII 
 100  offet of 33).  This means we can parse this file using Bio.SeqIO using 
 101  "fastq" as the format name: 
 102   
 103      >>> from Bio import SeqIO 
 104      >>> for record in SeqIO.parse(open("Quality/example.fastq"), "fastq"): 
 105      ...     print record.id, record.seq 
 106      EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 
 107      EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 
 108      EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 
 109   
 110  The qualities are held as a list of integers in each record's annotation: 
 111   
 112      >>> print record 
 113      ID: EAS54_6_R1_2_1_443_348 
 114      Name: EAS54_6_R1_2_1_443_348 
 115      Description: EAS54_6_R1_2_1_443_348 
 116      Number of features: 0 
 117      Per letter annotation for: phred_quality 
 118      Seq('GTTGCTTCTGGCGTGGGTGGGGGGG', SingleLetterAlphabet()) 
 119      >>> print record.letter_annotations["phred_quality"] 
 120      [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 
 121   
 122  You can use the SeqRecord format method to show this in the QUAL format: 
 123   
 124      >>> print record.format("qual") 
 125      >EAS54_6_R1_2_1_443_348 
 126      26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 
 127      24 18 18 18 18 
 128      <BLANKLINE> 
 129   
 130  Or go back to the FASTQ format, use "fastq" (or "fastq-sanger"): 
 131   
 132      >>> print record.format("fastq") 
 133      @EAS54_6_R1_2_1_443_348 
 134      GTTGCTTCTGGCGTGGGTGGGGGGG 
 135      + 
 136      ;;;;;;;;;;;9;7;;.7;393333 
 137      <BLANKLINE> 
 138   
 139  Or, using the Illumina 1.3+ FASTQ encoding (PHRED values with an ASCII offset 
 140  of 64): 
 141   
 142      >>> print record.format("fastq-illumina") 
 143      @EAS54_6_R1_2_1_443_348 
 144      GTTGCTTCTGGCGTGGGTGGGGGGG 
 145      + 
 146      ZZZZZZZZZZZXZVZZMVZRXRRRR 
 147      <BLANKLINE> 
 148   
 149  You can also get Biopython to convert the scores and show a Solexa style 
 150  FASTQ file: 
 151   
 152      >>> print record.format("fastq-solexa") 
 153      @EAS54_6_R1_2_1_443_348 
 154      GTTGCTTCTGGCGTGGGTGGGGGGG 
 155      + 
 156      ZZZZZZZZZZZXZVZZMVZRXRRRR 
 157      <BLANKLINE> 
 158   
 159  Notice that this is actually the same output as above using "fastq-illumina" 
 160  as the format! The reason for this is all these scores are high enough that 
 161  the PHRED and Solexa scores are almost equal. The differences become apparent 
 162  for poor quality reads. See the functions solexa_quality_from_phred and 
 163  phred_quality_from_solexa for more details. 
 164    
 165  If you wanted to trim your sequences (perhaps to remove low quality regions, 
 166  or to remove a primer sequence), try slicing the SeqRecord objects.  e.g. 
 167   
 168      >>> sub_rec = record[5:15] 
 169      >>> print sub_rec 
 170      ID: EAS54_6_R1_2_1_443_348 
 171      Name: EAS54_6_R1_2_1_443_348 
 172      Description: EAS54_6_R1_2_1_443_348 
 173      Number of features: 0 
 174      Per letter annotation for: phred_quality 
 175      Seq('TTCTGGCGTG', SingleLetterAlphabet()) 
 176      >>> print sub_rec.letter_annotations["phred_quality"] 
 177      [26, 26, 26, 26, 26, 26, 24, 26, 22, 26] 
 178      >>> print sub_rec.format("fastq") 
 179      @EAS54_6_R1_2_1_443_348 
 180      TTCTGGCGTG 
 181      + 
 182      ;;;;;;9;7; 
 183      <BLANKLINE> 
 184       
 185  If you wanted to, you could read in this FASTQ file, and save it as a QUAL file: 
 186   
 187      >>> from Bio import SeqIO 
 188      >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq") 
 189      >>> out_handle = open("Quality/temp.qual", "w") 
 190      >>> SeqIO.write(record_iterator, out_handle, "qual") 
 191      3 
 192      >>> out_handle.close() 
 193   
 194  You can of course read in a QUAL file, such as the one we just created: 
 195   
 196      >>> from Bio import SeqIO 
 197      >>> for record in SeqIO.parse(open("Quality/temp.qual"), "qual"): 
 198      ...     print record.id, record.seq 
 199      EAS54_6_R1_2_1_413_324 ????????????????????????? 
 200      EAS54_6_R1_2_1_540_792 ????????????????????????? 
 201      EAS54_6_R1_2_1_443_348 ????????????????????????? 
 202   
 203  Notice that QUAL files don't have a proper sequence present!  But the quality 
 204  information is there: 
 205   
 206      >>> print record 
 207      ID: EAS54_6_R1_2_1_443_348 
 208      Name: EAS54_6_R1_2_1_443_348 
 209      Description: EAS54_6_R1_2_1_443_348 
 210      Number of features: 0 
 211      Per letter annotation for: phred_quality 
 212      UnknownSeq(25, alphabet = SingleLetterAlphabet(), character = '?') 
 213      >>> print record.letter_annotations["phred_quality"] 
 214      [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 
 215   
 216  Just to keep things tidy, if you are following this example yourself, you can 
 217  delete this temporary file now: 
 218   
 219      >>> import os 
 220      >>> os.remove("Quality/temp.qual") 
 221   
 222  Sometimes you won't have a FASTQ file, but rather just a pair of FASTA and QUAL 
 223  files.  Because the Bio.SeqIO system is designed for reading single files, you 
 224  would have to read the two in separately and then combine the data.  However, 
 225  since this is such a common thing to want to do, there is a helper iterator 
 226  defined in this module that does this for you - PairedFastaQualIterator. 
 227   
 228  Alternatively, if you have enough RAM to hold all the records in memory at once, 
 229  then a simple dictionary approach would work: 
 230   
 231      >>> from Bio import SeqIO 
 232      >>> reads = SeqIO.to_dict(SeqIO.parse(open("Quality/example.fasta"), "fasta")) 
 233      >>> for rec in SeqIO.parse(open("Quality/example.qual"), "qual"): 
 234      ...     reads[rec.id].letter_annotations["phred_quality"]=rec.letter_annotations["phred_quality"] 
 235   
 236  You can then access any record by its key, and get both the sequence and the 
 237  quality scores. 
 238   
 239      >>> print reads["EAS54_6_R1_2_1_540_792"].format("fastq") 
 240      @EAS54_6_R1_2_1_540_792 
 241      TTGGCAGGCCAAGGCCGATGGATCA 
 242      + 
 243      ;;;;;;;;;;;7;;;;;-;;;3;83 
 244      <BLANKLINE> 
 245   
 246  It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are 
 247  using ("fastq" or "fastq-sanger" for the Sanger standard using PHRED values, 
 248  "fastq-solexa" for the original Solexa/Illumina variant, or "fastq-illumina" 
 249  for the more recent variant), as this cannot be detected reliably 
 250  automatically. 
 251   
 252  To illustrate this problem, let's consider an artifical example: 
 253   
 254      >>> from Bio.Seq import Seq 
 255      >>> from Bio.Alphabet import generic_dna 
 256      >>> from Bio.SeqRecord import SeqRecord 
 257      >>> test = SeqRecord(Seq("NACGTACGTA", generic_dna), id="Test", 
 258      ... description="Made up!") 
 259      >>> print test.format("fasta") 
 260      >Test Made up! 
 261      NACGTACGTA 
 262      <BLANKLINE> 
 263      >>> print test.format("fastq") 
 264      Traceback (most recent call last): 
 265       ... 
 266      ValueError: No suitable quality scores found in letter_annotations of SeqRecord (id=Test). 
 267   
 268  We created a sample SeqRecord, and can show it in FASTA format - but for QUAL 
 269  or FASTQ format we need to provide some quality scores. These are held as a 
 270  list of integers (one for each base) in the letter_annotations dictionary: 
 271   
 272      >>> test.letter_annotations["phred_quality"] = [0,1,2,3,4,5,10,20,30,40] 
 273      >>> print test.format("qual") 
 274      >Test Made up! 
 275      0 1 2 3 4 5 10 20 30 40 
 276      <BLANKLINE> 
 277      >>> print test.format("fastq") 
 278      @Test Made up! 
 279      NACGTACGTA 
 280      + 
 281      !"#$%&+5?I 
 282      <BLANKLINE> 
 283   
 284  We can check this FASTQ encoding - the first PHRED quality was zero, and this 
 285  mapped to a exclamation mark, while the final score was 40 and this mapped to 
 286  the letter "I": 
 287   
 288      >>> ord('!') - 33 
 289      0 
 290      >>> ord('I') - 33 
 291      40 
 292      >>> [ord(letter)-33 for letter in '!"#$%&+5?I'] 
 293      [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 294   
 295  Similarly, we could produce an Illumina 1.3+ style FASTQ file using PHRED 
 296  scores with an offset of 64: 
 297   
 298      >>> print test.format("fastq-illumina") 
 299      @Test Made up! 
 300      NACGTACGTA 
 301      + 
 302      @ABCDEJT^h 
 303      <BLANKLINE> 
 304   
 305  And we can check this too - the first PHRED score was zero, and this mapped to 
 306  "@", while the final score was 40 and this mapped to "h": 
 307   
 308      >>> ord("@") - 64 
 309      0 
 310      >>> ord("h") - 64 
 311      40 
 312      >>> [ord(letter)-64 for letter in "@ABCDEJT^h"] 
 313      [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 314   
 315  Notice how different the standard Sanger FASTQ and the Illumina 1.3+ style 
 316  FASTQ files look for the same data! Then we have the older Solexa/Illumina 
 317  format to consider which encodes Solexa scores instead of PHRED scores. 
 318   
 319  First let's see what Biopython says if we convert the PHRED scores into Solexa 
 320  scores (rounding to one decimal place): 
 321   
 322      >>> for q in [0,1,2,3,4,5,10,20,30,40]: 
 323      ...     print "PHRED %i maps to Solexa %0.1f" % (q, solexa_quality_from_phred(q)) 
 324      PHRED 0 maps to Solexa -5.0 
 325      PHRED 1 maps to Solexa -5.0 
 326      PHRED 2 maps to Solexa -2.3 
 327      PHRED 3 maps to Solexa -0.0 
 328      PHRED 4 maps to Solexa 1.8 
 329      PHRED 5 maps to Solexa 3.3 
 330      PHRED 10 maps to Solexa 9.5 
 331      PHRED 20 maps to Solexa 20.0 
 332      PHRED 30 maps to Solexa 30.0 
 333      PHRED 40 maps to Solexa 40.0 
 334   
 335  Now here is the record using the old Solexa style FASTQ file: 
 336   
 337      >>> print test.format("fastq-solexa") 
 338      @Test Made up! 
 339      NACGTACGTA 
 340      + 
 341      ;;>@BCJT^h 
 342      <BLANKLINE> 
 343   
 344  Again, this is using an ASCII offset of 64, so we can check the Solexa scores: 
 345   
 346      >>> [ord(letter)-64 for letter in ";;>@BCJT^h"] 
 347      [-5, -5, -2, 0, 2, 3, 10, 20, 30, 40] 
 348   
 349  This explains why the last few letters of this FASTQ output matched that using 
 350  the Illumina 1.3+ format - high quality PHRED scores and Solexa scores are 
 351  approximately equal. 
 352   
 353  """ 
 354  __docformat__ = "epytext en" #Don't just use plain text in epydoc API pages! 
 355   
 356  #See also http://blog.malde.org/index.php/2008/09/09/the-fastq-file-format-for-sequences/ 
 357   
 358  from Bio.Alphabet import single_letter_alphabet 
 359  from Bio.Seq import Seq, UnknownSeq 
 360  from Bio.SeqRecord import SeqRecord 
 361  from Interfaces import SequentialSequenceWriter 
 362  from math import log 
 363  import warnings 
 364  # define score offsets. See discussion for differences between Sanger and 
 365  # Solexa offsets. 
 366  SANGER_SCORE_OFFSET = 33 
 367  SOLEXA_SCORE_OFFSET = 64 
 368   
369 -def solexa_quality_from_phred(phred_quality):
370 """Covert a PHRED quality (range 0 to about 90) to a Solexa quality. 371 372 PHRED and Solexa quality scores are both log transformations of a 373 probality of error (high score = low probability of error). This function 374 takes a PHRED score, transforms it back to a probability of error, and 375 then re-expresses it as a Solexa score. This assumes the error estimates 376 are equivalent. 377 378 How does this work exactly? Well the PHRED quality is minus ten times the 379 base ten logarithm of the probability of error:: 380 381 phred_quality = -10*log(error,10) 382 383 Therefore, turning this round:: 384 385 error = 10 ** (- phred_quality / 10) 386 387 Now, Solexa qualities use a different log transformation:: 388 389 solexa_quality = -10*log(error/(1-error),10) 390 391 After substitution and a little manipulation we get:: 392 393 solexa_quality = 10*log(10**(phred_quality/10.0) - 1, 10) 394 395 However, real Solexa files use a minimum quality of -5. This does have a 396 good reason - a random a random base call would be correct 25% of the time, 397 and thus have a probability of error of 0.75, which gives 1.25 as the PHRED 398 quality, or -4.77 as the Solexa quality. Thus (after rounding), a random 399 nucleotide read would have a PHRED quality of 1, or a Solexa quality of -5. 400 401 Taken literally, this logarithic formula would map a PHRED quality of zero 402 to a Solexa quality of minus infinity. Of course, taken literally, a PHRED 403 score of zero means a probability of error of one (i.e. the base call is 404 definitely wrong), which is worse than random! In practice, a PHRED quality 405 of zero usually means a default value, or perhaps random - and therefore 406 mapping it to the minimum Solexa score of -5 is reasonable. 407 408 In conclusion, we follow EMBOSS, and take this logarithmic formula but also 409 apply a minimum value of -5.0 for the Solexa quality, and also map a PHRED 410 quality of zero to -5.0 as well. 411 412 Note this function will return a floating point number, it is up to you to 413 round this to the nearest integer if appropriate. e.g. 414 415 >>> print "%0.2f" % round(solexa_quality_from_phred(80),2) 416 80.00 417 >>> print "%0.2f" % round(solexa_quality_from_phred(50),2) 418 50.00 419 >>> print "%0.2f" % round(solexa_quality_from_phred(20),2) 420 19.96 421 >>> print "%0.2f" % round(solexa_quality_from_phred(10),2) 422 9.54 423 >>> print "%0.2f" % round(solexa_quality_from_phred(5),2) 424 3.35 425 >>> print "%0.2f" % round(solexa_quality_from_phred(4),2) 426 1.80 427 >>> print "%0.2f" % round(solexa_quality_from_phred(3),2) 428 -0.02 429 >>> print "%0.2f" % round(solexa_quality_from_phred(2),2) 430 -2.33 431 >>> print "%0.2f" % round(solexa_quality_from_phred(1),2) 432 -5.00 433 >>> print "%0.2f" % round(solexa_quality_from_phred(0),2) 434 -5.00 435 436 Notice that for high quality reads PHRED and Solexa scores are numerically 437 equal. The differences are important for poor quality reads, where PHRED 438 has a minimum of zero but Solexa scores can be negative. 439 440 Finally, as a special case where None is used for a "missing value", None 441 is returned: 442 443 >>> print solexa_quality_from_phred(None) 444 None 445 """ 446 if phred_quality > 0: 447 #Solexa uses a minimum value of -5, which after rounding matches a 448 #random nucleotide base call. 449 return max(-5.0, 10*log(10**(phred_quality/10.0) - 1, 10)) 450 elif phred_quality == 0: 451 #Special case, map to -5 as discussed in the docstring 452 return -5.0 453 elif phred_quality is None: 454 #Assume None is used as some kind of NULL or NA value; return None 455 #e.g. Bio.SeqIO gives Ace contig gaps a quality of None. 456 return None 457 else: 458 raise ValueError("PHRED qualities must be positive (or zero), not %s" \ 459 % repr(phred_quality))
460
461 -def phred_quality_from_solexa(solexa_quality):
462 """Convert a Solexa quality (which can be negative) to a PHRED quality. 463 464 PHRED and Solexa quality scores are both log transformations of a 465 probality of error (high score = low probability of error). This function 466 takes a Solexa score, transforms it back to a probability of error, and 467 then re-expresses it as a PHRED score. This assumes the error estimates 468 are equivalent. 469 470 The underlying formulas are given in the documentation for the sister 471 function solexa_quality_from_phred, in this case the operation is:: 472 473 phred_quality = 10*log(10**(solexa_quality/10.0) + 1, 10) 474 475 This will return a floating point number, it is up to you to round this to 476 the nearest integer if appropriate. e.g. 477 478 >>> print "%0.2f" % round(phred_quality_from_solexa(80),2) 479 80.00 480 >>> print "%0.2f" % round(phred_quality_from_solexa(20),2) 481 20.04 482 >>> print "%0.2f" % round(phred_quality_from_solexa(10),2) 483 10.41 484 >>> print "%0.2f" % round(phred_quality_from_solexa(0),2) 485 3.01 486 >>> print "%0.2f" % round(phred_quality_from_solexa(-5),2) 487 1.19 488 489 Note that a solexa_quality less then -5 is not expected, will trigger a 490 warning, but will still be converted as per the logarithmic mapping 491 (giving a number between 0 and 1.19 back). 492 493 As a special case where None is used for a "missing value", None is 494 returned: 495 496 >>> print phred_quality_from_solexa(None) 497 None 498 """ 499 if solexa_quality is None: 500 #Assume None is used as some kind of NULL or NA value; return None 501 return None 502 if solexa_quality < -5: 503 import warnings 504 warnings.warn("Solexa quality less than -5 passed, %s" \ 505 % repr(solexa_quality)) 506 return 10*log(10**(solexa_quality/10.0) + 1, 10)
507
508 -def _get_phred_quality(record):
509 """Extract PHRED qualities from a SeqRecord's letter_annotations (PRIVATE). 510 511 If there are no PHRED qualities, but there are Solexa qualities, those are 512 used instead after conversion. 513 """ 514 try: 515 return record.letter_annotations["phred_quality"] 516 except KeyError: 517 pass 518 try: 519 return [phred_quality_from_solexa(q) for \ 520 q in record.letter_annotations["solexa_quality"]] 521 except KeyError: 522 raise ValueError("No suitable quality scores found in " 523 "letter_annotations of SeqRecord (id=%s)." \ 524 % record.id)
525 526 #Only map 0 to 93, we need to give a warning on truncating at 93 527 _phred_to_sanger_quality_str = dict((qp, chr(min(126, qp+SANGER_SCORE_OFFSET))) \ 528 for qp in range(0, 93+1)) 529 #Only map -5 to 93, we need to give a warning on truncating at 93 530 _solexa_to_sanger_quality_str = dict( \ 531 (qs, chr(min(126, int(round(phred_quality_from_solexa(qs)))+SANGER_SCORE_OFFSET))) \ 532 for qs in range(-5, 93+1))
533 -def _get_sanger_quality_str(record):
534 """Returns a Sanger FASTQ encoded quality string (PRIVATE). 535 536 >>> from Bio.Seq import Seq 537 >>> from Bio.SeqRecord import SeqRecord 538 >>> r = SeqRecord(Seq("ACGTAN"), id="Test", 539 ... letter_annotations = {"phred_quality":[50,40,30,20,10,0]}) 540 >>> _get_sanger_quality_str(r) 541 'SI?5+!' 542 543 If as in the above example (or indeed a SeqRecord parser with Bio.SeqIO), 544 the PHRED qualities are integers, this function is able to use a very fast 545 pre-cached mapping. However, if they are floats which differ slightly, then 546 it has to do the appropriate rounding - which is slower: 547 548 >>> r2 = SeqRecord(Seq("ACGTAN"), id="Test2", 549 ... letter_annotations = {"phred_quality":[50.0,40.05,29.99,20,9.55,0.01]}) 550 >>> _get_sanger_quality_str(r2) 551 'SI?5+!' 552 553 If your scores include a None value, this raises an exception: 554 555 >>> r3 = SeqRecord(Seq("ACGTAN"), id="Test3", 556 ... letter_annotations = {"phred_quality":[50,40,30,20,10,None]}) 557 >>> _get_sanger_quality_str(r3) 558 Traceback (most recent call last): 559 ... 560 TypeError: A quality value of None was found 561 562 If (strangely) your record has both PHRED and Solexa scores, then the PHRED 563 scores are used in preference: 564 565 >>> r4 = SeqRecord(Seq("ACGTAN"), id="Test4", 566 ... letter_annotations = {"phred_quality":[50,40,30,20,10,0], 567 ... "solexa_quality":[-5,-4,0,None,0,40]}) 568 >>> _get_sanger_quality_str(r4) 569 'SI?5+!' 570 571 If there are no PHRED scores, but there are Solexa scores, these are used 572 instead (after the approriate conversion): 573 574 >>> r5 = SeqRecord(Seq("ACGTAN"), id="Test5", 575 ... letter_annotations = {"solexa_quality":[40,30,20,10,0,-5]}) 576 >>> _get_sanger_quality_str(r5) 577 'I?5+$"' 578 579 Again, integer Solexa scores can be looked up in a pre-cached mapping making 580 this very fast. You can still use approximate floating point scores: 581 582 >>> r6 = SeqRecord(Seq("ACGTAN"), id="Test6", 583 ... letter_annotations = {"solexa_quality":[40.1,29.7,20.01,10,0.0,-4.9]}) 584 >>> _get_sanger_quality_str(r6) 585 'I?5+$"' 586 587 Notice that due to the limited range of printable ASCII characters, a 588 PHRED quality of 93 is the maximum that can be held in an Illumina FASTQ 589 file (using ASCII 126, the tilde). This function will issue a warning 590 in this situation. 591 """ 592 #TODO - This functions works and is fast, but it is also ugly 593 #and there is considerable repetition of code for the other 594 #two FASTQ variants. 595 try: 596 #These take priority (in case both Solexa and PHRED scores found) 597 qualities = record.letter_annotations["phred_quality"] 598 except KeyError: 599 #Fall back on solexa scores... 600 pass 601 else: 602 #Try and use the precomputed mapping: 603 try: 604 return "".join([_phred_to_sanger_quality_str[qp] \ 605 for qp in qualities]) 606 except KeyError: 607 #Could be a float, or a None in the list, or a high value. 608 pass 609 if None in qualities: 610 raise TypeError("A quality value of None was found") 611 if max(qualities) >= 93.5: 612 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ") 613 #This will apply the truncation at 93, giving max ASCII 126 614 return "".join([chr(min(126, int(round(qp))+SANGER_SCORE_OFFSET)) \ 615 for qp in qualities]) 616 #Fall back on the Solexa scores... 617 try: 618 qualities = record.letter_annotations["solexa_quality"] 619 except KeyError: 620 raise ValueError("No suitable quality scores found in " 621 "letter_annotations of SeqRecord (id=%s)." \ 622 % record.id) 623 #Try and use the precomputed mapping: 624 try: 625 return "".join([_solexa_to_sanger_quality_str[qs] \ 626 for qs in qualities]) 627 except KeyError: 628 #Either no PHRED scores, or something odd like a float or None 629 pass 630 if None in qualities: 631 raise TypeError("A quality value of None was found") 632 #Must do this the slow way, first converting the PHRED scores into 633 #Solexa scores: 634 if max(qualities) >= 93.5: 635 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ") 636 #This will apply the truncation at 93, giving max ASCII 126 637 return "".join([chr(min(126, int(round(phred_quality_from_solexa(qs)))+SANGER_SCORE_OFFSET)) \ 638 for qs in qualities])
639 640 #Only map 0 to 62, we need to give a warning on truncating at 62 641 assert 62+SOLEXA_SCORE_OFFSET == 126 642 _phred_to_illumina_quality_str = dict((qp, chr(qp+SOLEXA_SCORE_OFFSET)) \ 643 for qp in range(0, 62+1)) 644 #Only map -5 to 62, we need to give a warning on truncating at 62 645 _solexa_to_illumina_quality_str = dict( \ 646 (qs, chr(int(round(phred_quality_from_solexa(qs)))+SOLEXA_SCORE_OFFSET)) \ 647 for qs in range(-5, 62+1))
648 -def _get_illumina_quality_str(record):
649 """Returns an Illumina 1.3+ FASTQ encoded quality string (PRIVATE). 650 651 Notice that due to the limited range of printable ASCII characters, a 652 PHRED quality of 62 is the maximum that can be held in an Illumina FASTQ 653 file (using ASCII 126, the tilde). This function will issue a warning 654 in this situation. 655 """ 656 #TODO - This functions works and is fast, but it is also ugly 657 #and there is considerable repetition of code for the other 658 #two FASTQ variants. 659 try: 660 #These take priority (in case both Solexa and PHRED scores found) 661 qualities = record.letter_annotations["phred_quality"] 662 except KeyError: 663 #Fall back on solexa scores... 664 pass 665 else: 666 #Try and use the precomputed mapping: 667 try: 668 return "".join([_phred_to_illumina_quality_str[qp] \ 669 for qp in qualities]) 670 except KeyError: 671 #Could be a float, or a None in the list, or a high value. 672 pass 673 if None in qualities: 674 raise TypeError("A quality value of None was found") 675 if max(qualities) >= 62.5: 676 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ") 677 #This will apply the truncation at 62, giving max ASCII 126 678 return "".join([chr(min(126, int(round(qp))+SOLEXA_SCORE_OFFSET)) \ 679 for qp in qualities]) 680 #Fall back on the Solexa scores... 681 try: 682 qualities = record.letter_annotations["solexa_quality"] 683 except KeyError: 684 raise ValueError("No suitable quality scores found in " 685 "letter_annotations of SeqRecord (id=%s)." \ 686 % record.id) 687 #Try and use the precomputed mapping: 688 try: 689 return "".join([_solexa_to_illumina_quality_str[qs] \ 690 for qs in qualities]) 691 except KeyError: 692 #Either no PHRED scores, or something odd like a float or None 693 pass 694 if None in qualities: 695 raise TypeError("A quality value of None was found") 696 #Must do this the slow way, first converting the PHRED scores into 697 #Solexa scores: 698 if max(qualities) >= 62.5: 699 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ") 700 #This will apply the truncation at 62, giving max ASCII 126 701 return "".join([chr(min(126, int(round(phred_quality_from_solexa(qs)))+SOLEXA_SCORE_OFFSET)) \ 702 for qs in qualities])
703 704 #Only map 0 to 62, we need to give a warning on truncating at 62 705 assert 62+SOLEXA_SCORE_OFFSET == 126 706 _solexa_to_solexa_quality_str = dict((qs, chr(min(126, qs+SOLEXA_SCORE_OFFSET))) \ 707 for qs in range(-5, 62+1)) 708 #Only map -5 to 62, we need to give a warning on truncating at 62 709 _phred_to_solexa_quality_str = dict(\ 710 (qp, chr(min(126, int(round(solexa_quality_from_phred(qp)))+SOLEXA_SCORE_OFFSET))) \ 711 for qp in range(0, 62+1))
712 -def _get_solexa_quality_str(record):
713 """Returns a Solexa FASTQ encoded quality string (PRIVATE). 714 715 Notice that due to the limited range of printable ASCII characters, a 716 Solexa quality of 62 is the maximum that can be held in a Solexa FASTQ 717 file (using ASCII 126, the tilde). This function will issue a warning 718 in this situation. 719 """ 720 #TODO - This functions works and is fast, but it is also ugly 721 #and there is considerable repetition of code for the other 722 #two FASTQ variants. 723 try: 724 #These take priority (in case both Solexa and PHRED scores found) 725 qualities = record.letter_annotations["solexa_quality"] 726 except KeyError: 727 #Fall back on PHRED scores... 728 pass 729 else: 730 #Try and use the precomputed mapping: 731 try: 732 return "".join([_solexa_to_solexa_quality_str[qs] \ 733 for qs in qualities]) 734 except KeyError: 735 #Could be a float, or a None in the list, or a high value. 736 pass 737 if None in qualities: 738 raise TypeError("A quality value of None was found") 739 if max(qualities) >= 62.5: 740 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ") 741 #This will apply the truncation at 62, giving max ASCII 126 742 return "".join([chr(min(126, int(round(qs))+SOLEXA_SCORE_OFFSET)) \ 743 for qs in qualities]) 744 #Fall back on the PHRED scores... 745 try: 746 qualities = record.letter_annotations["phred_quality"] 747 except KeyError: 748 raise ValueError("No suitable quality scores found in " 749 "letter_annotations of SeqRecord (id=%s)." \ 750 % record.id) 751 #Try and use the precomputed mapping: 752 try: 753 return "".join([_phred_to_solexa_quality_str[qp] \ 754 for qp in qualities]) 755 except KeyError: 756 #Either no PHRED scores, or something odd like a float or None 757 #or too big to be in the cache 758 pass 759 if None in qualities: 760 raise TypeError("A quality value of None was found") 761 #Must do this the slow way, first converting the PHRED scores into 762 #Solexa scores: 763 if max(qualities) >= 62.5: 764 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ") 765 return "".join([chr(min(126, 766 int(round(solexa_quality_from_phred(qp))) + \ 767 SOLEXA_SCORE_OFFSET)) \ 768 for qp in qualities])
769 770 #TODO - Default to nucleotide or even DNA?
771 -def FastqGeneralIterator(handle):
772 """Iterate over Fastq records as string tuples (not as SeqRecord objects). 773 774 This code does not try to interpret the quality string numerically. It 775 just returns tuples of the title, sequence and quality as strings. For 776 the sequence and quality, any whitespace (such as new lines) is removed. 777 778 Our SeqRecord based FASTQ iterators call this function internally, and then 779 turn the strings into a SeqRecord objects, mapping the quality string into 780 a list of numerical scores. If you want to do a custom quality mapping, 781 then you might consider calling this function directly. 782 783 For parsing FASTQ files, the title string from the "@" line at the start 784 of each record can optionally be omitted on the "+" lines. If it is 785 repeated, it must be identical. 786 787 The sequence string and the quality string can optionally be split over 788 multiple lines, although several sources discourage this. In comparison, 789 for the FASTA file format line breaks between 60 and 80 characters are 790 the norm. 791 792 WARNING - Because the "@" character can appear in the quality string, 793 this can cause problems as this is also the marker for the start of 794 a new sequence. In fact, the "+" sign can also appear as well. Some 795 sources recommended having no line breaks in the quality to avoid this, 796 but even that is not enough, consider this example:: 797 798 @071113_EAS56_0053:1:1:998:236 799 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA 800 +071113_EAS56_0053:1:1:998:236 801 IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III 802 @071113_EAS56_0053:1:1:182:712 803 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG 804 + 805 @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ 806 @071113_EAS56_0053:1:1:153:10 807 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT 808 + 809 IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6 810 @071113_EAS56_0053:1:3:990:501 811 TGGGAGGTTTTATGTGGA 812 AAGCAGCAATGTACAAGA 813 + 814 IIIIIII.IIIIII1@44 815 @-7.%<&+/$/%4(++(% 816 817 This is four PHRED encoded FASTQ entries originally from an NCBI source 818 (given the read length of 36, these are probably Solexa Illumna reads where 819 the quality has been mapped onto the PHRED values). 820 821 This example has been edited to illustrate some of the nasty things allowed 822 in the FASTQ format. Firstly, on the "+" lines most but not all of the 823 (redundant) identifiers are ommited. In real files it is likely that all or 824 none of these extra identifiers will be present. 825 826 Secondly, while the first three sequences have been shown without line 827 breaks, the last has been split over multiple lines. In real files any line 828 breaks are likely to be consistent. 829 830 Thirdly, some of the quality string lines start with an "@" character. For 831 the second record this is unavoidable. However for the fourth sequence this 832 only happens because its quality string is split over two lines. A naive 833 parser could wrongly treat any line starting with an "@" as the beginning of 834 a new sequence! This code copes with this possible ambiguity by keeping 835 track of the length of the sequence which gives the expected length of the 836 quality string. 837 838 Using this tricky example file as input, this short bit of code demonstrates 839 what this parsing function would return: 840 841 >>> handle = open("Quality/tricky.fastq", "rU") 842 >>> for (title, sequence, quality) in FastqGeneralIterator(handle): 843 ... print title 844 ... print sequence, quality 845 071113_EAS56_0053:1:1:998:236 846 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III 847 071113_EAS56_0053:1:1:182:712 848 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ 849 071113_EAS56_0053:1:1:153:10 850 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6 851 071113_EAS56_0053:1:3:990:501 852 TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(% 853 >>> handle.close() 854 855 Finally we note that some sources state that the quality string should 856 start with "!" (which using the PHRED mapping means the first letter always 857 has a quality score of zero). This rather restrictive rule is not widely 858 observed, so is therefore ignored here. One plus point about this "!" rule 859 is that (provided there are no line breaks in the quality sequence) it 860 would prevent the above problem with the "@" character. 861 """ 862 #We need to call handle.readline() at least four times per record, 863 #so we'll save a property look up each time: 864 handle_readline = handle.readline 865 866 #Skip any text before the first record (e.g. blank lines, comments?) 867 while True: 868 line = handle_readline() 869 if line == "" : return #Premature end of file, or just empty? 870 if line[0] == "@": 871 break 872 873 while True: 874 if line[0] != "@": 875 raise ValueError("Records in Fastq files should start with '@' character") 876 title_line = line[1:].rstrip() 877 #Will now be at least one line of quality data - in most FASTQ files 878 #just one line! We therefore use string concatenation (if needed) 879 #rather using than the "".join(...) trick just in case it is multiline: 880 seq_string = handle_readline().rstrip() 881 #There may now be more sequence lines, or the "+" quality marker line: 882 while True: 883 line = handle_readline() 884 if not line: 885 raise ValueError("End of file without quality information.") 886 if line[0] == "+": 887 #The title here is optional, but if present must match! 888 second_title = line[1:].rstrip() 889 if second_title and second_title != title_line: 890 raise ValueError("Sequence and quality captions differ.") 891 break 892 seq_string += line.rstrip() #removes trailing newlines 893 #This is going to slow things down a little, but assuming 894 #this isn't allowed we should try and catch it here: 895 if " " in seq_string or "\t" in seq_string: 896 raise ValueError("Whitespace is not allowed in the sequence.") 897 seq_len = len(seq_string) 898 899 #Will now be at least one line of quality data... 900 quality_string = handle_readline().rstrip() 901 #There may now be more quality data, or another sequence, or EOF 902 while True: 903 line = handle_readline() 904 if not line : break #end of file 905 if line[0] == "@": 906 #This COULD be the start of a new sequence. However, it MAY just 907 #be a line of quality data which starts with a "@" character. We 908 #should be able to check this by looking at the sequence length 909 #and the amount of quality data found so far. 910 if len(quality_string) >= seq_len: 911 #We expect it to be equal if this is the start of a new record. 912 #If the quality data is longer, we'll raise an error below. 913 break 914 #Continue - its just some (more) quality data. 915 quality_string += line.rstrip() 916 917 if seq_len != len(quality_string): 918 raise ValueError("Lengths of sequence and quality values differs " 919 " for %s (%i and %i)." \ 920 % (title_line, seq_len, len(quality_string))) 921 922 #Return the record and then continue... 923 yield (title_line, seq_string, quality_string) 924 if not line : return #StopIteration at end of file 925 assert False, "Should not reach this line"
926 927 #This is a generator function!
928 -def FastqPhredIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
929 """Generator function to iterate over FASTQ records (as SeqRecord objects). 930 931 - handle - input file 932 - alphabet - optional alphabet 933 - title2ids - A function that, when given the title line from the FASTQ 934 file (without the beginning >), will return the id, name and 935 description (in that order) for the record as a tuple of 936 strings. If this is not given, then the entire title line 937 will be used as the description, and the first word as the 938 id and name. 939 940 Note that use of title2ids matches that of Bio.SeqIO.FastaIO. 941 942 For each sequence in a (Sanger style) FASTQ file there is a matching string 943 encoding the PHRED qualities (integers between 0 and about 90) using ASCII 944 values with an offset of 33. 945 946 For example, consider a file containing three short reads:: 947 948 @EAS54_6_R1_2_1_413_324 949 CCCTTCTTGTCTTCAGCGTTTCTCC 950 + 951 ;;3;;;;;;;;;;;;7;;;;;;;88 952 @EAS54_6_R1_2_1_540_792 953 TTGGCAGGCCAAGGCCGATGGATCA 954 + 955 ;;;;;;;;;;;7;;;;;-;;;3;83 956 @EAS54_6_R1_2_1_443_348 957 GTTGCTTCTGGCGTGGGTGGGGGGG 958 + 959 ;;;;;;;;;;;9;7;;.7;393333 960 961 For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching 962 string encoding the PHRED qualities using a ASCI values with an offset of 963 33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88"). 964 965 Using this module directly you might run: 966 967 >>> handle = open("Quality/example.fastq", "rU") 968 >>> for record in FastqPhredIterator(handle): 969 ... print record.id, record.seq 970 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 971 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 972 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 973 >>> handle.close() 974 975 Typically however, you would call this via Bio.SeqIO instead with "fastq" 976 (or "fastq-sanger") as the format: 977 978 >>> from Bio import SeqIO 979 >>> handle = open("Quality/example.fastq", "rU") 980 >>> for record in SeqIO.parse(handle, "fastq"): 981 ... print record.id, record.seq 982 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 983 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 984 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 985 >>> handle.close() 986 987 If you want to look at the qualities, they are record in each record's 988 per-letter-annotation dictionary as a simple list of integers: 989 990 >>> print record.letter_annotations["phred_quality"] 991 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 992 """ 993 assert SANGER_SCORE_OFFSET == ord("!") 994 #Originally, I used a list expression for each record: 995 # 996 # qualities = [ord(letter)-SANGER_SCORE_OFFSET for letter in quality_string] 997 # 998 #Precomputing is faster, perhaps partly by avoiding the subtractions. 999 q_mapping = dict() 1000 for letter in range(0, 255): 1001 q_mapping[chr(letter)] = letter-SANGER_SCORE_OFFSET 1002 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1003 if title2ids: 1004 id, name, descr = title2ids(title_line) 1005 else: 1006 descr = title_line 1007 id = descr.split()[0] 1008 name = id 1009 record = SeqRecord(Seq(seq_string, alphabet), 1010 id=id, name=name, description=descr) 1011 qualities = [q_mapping[letter] for letter in quality_string] 1012 if qualities and (min(qualities) < 0 or max(qualities) > 93): 1013 raise ValueError("Invalid character in quality string") 1014 #For speed, will now use a dirty trick to speed up assigning the 1015 #qualities. We do this to bypass the length check imposed by the 1016 #per-letter-annotations restricted dict (as this has already been 1017 #checked by FastqGeneralIterator). This is equivalent to: 1018 #record.letter_annotations["phred_quality"] = qualities 1019 dict.__setitem__(record._per_letter_annotations, 1020 "phred_quality", qualities) 1021 yield record
1022 1023 #This is a generator function!
1024 -def FastqSolexaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
1025 r"""Parsing old Solexa/Illumina FASTQ like files (which differ in the quality mapping). 1026 1027 The optional arguments are the same as those for the FastqPhredIterator. 1028 1029 For each sequence in Solexa/Illumina FASTQ files there is a matching string 1030 encoding the Solexa integer qualities using ASCII values with an offset 1031 of 64. Solexa scores are scaled differently to PHRED scores, and Biopython 1032 will NOT perform any automatic conversion when loading. 1033 1034 NOTE - This file format is used by the OLD versions of the Solexa/Illumina 1035 pipeline. See also the FastqIlluminaIterator function for the NEW version. 1036 1037 For example, consider a file containing these five records:: 1038 1039 @SLXA-B3_649_FC8437_R1_1_1_610_79 1040 GATGTGCAATACCTTTGTAGAGGAA 1041 +SLXA-B3_649_FC8437_R1_1_1_610_79 1042 YYYYYYYYYYYYYYYYYYWYWYYSU 1043 @SLXA-B3_649_FC8437_R1_1_1_397_389 1044 GGTTTGAGAAAGAGAAATGAGATAA 1045 +SLXA-B3_649_FC8437_R1_1_1_397_389 1046 YYYYYYYYYWYYYYWWYYYWYWYWW 1047 @SLXA-B3_649_FC8437_R1_1_1_850_123 1048 GAGGGTGTTGATCATGATGATGGCG 1049 +SLXA-B3_649_FC8437_R1_1_1_850_123 1050 YYYYYYYYYYYYYWYYWYYSYYYSY 1051 @SLXA-B3_649_FC8437_R1_1_1_362_549 1052 GGAAACAAAGTTTTTCTCAACATAG 1053 +SLXA-B3_649_FC8437_R1_1_1_362_549 1054 YYYYYYYYYYYYYYYYYYWWWWYWY 1055 @SLXA-B3_649_FC8437_R1_1_1_183_714 1056 GTATTATTTAATGGCATACACTCAA 1057 +SLXA-B3_649_FC8437_R1_1_1_183_714 1058 YYYYYYYYYYWYYYYWYWWUWWWQQ 1059 1060 Using this module directly you might run: 1061 1062 >>> handle = open("Quality/solexa_example.fastq", "rU") 1063 >>> for record in FastqSolexaIterator(handle): 1064 ... print record.id, record.seq 1065 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA 1066 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA 1067 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG 1068 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG 1069 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA 1070 >>> handle.close() 1071 1072 Typically however, you would call this via Bio.SeqIO instead with 1073 "fastq-solexa" as the format: 1074 1075 >>> from Bio import SeqIO 1076 >>> handle = open("Quality/solexa_example.fastq", "rU") 1077 >>> for record in SeqIO.parse(handle, "fastq-solexa"): 1078 ... print record.id, record.seq 1079 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA 1080 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA 1081 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG 1082 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG 1083 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA 1084 >>> handle.close() 1085 1086 If you want to look at the qualities, they are recorded in each record's 1087 per-letter-annotation dictionary as a simple list of integers: 1088 1089 >>> print record.letter_annotations["solexa_quality"] 1090 [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17] 1091 1092 These scores aren't very good, but they are high enough that they map 1093 almost exactly onto PHRED scores: 1094 1095 >>> print "%0.2f" % phred_quality_from_solexa(25) 1096 25.01 1097 1098 Let's look at faked example read which is even worse, where there are 1099 more noticeable differences between the Solexa and PHRED scores:: 1100 1101 @slxa_0001_1_0001_01 1102 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1103 +slxa_0001_1_0001_01 1104 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; 1105 1106 Again, you would typically use Bio.SeqIO to read this file in (rather than 1107 calling the Bio.SeqIO.QualtityIO module directly). Most FASTQ files will 1108 contain thousands of reads, so you would normally use Bio.SeqIO.parse() 1109 as shown above. This example has only as one entry, so instead we can 1110 use the Bio.SeqIO.read() function: 1111 1112 >>> from Bio import SeqIO 1113 >>> handle = open("Quality/solexa_faked.fastq", "rU") 1114 >>> record = SeqIO.read(handle, "fastq-solexa") 1115 >>> handle.close() 1116 >>> print record.id, record.seq 1117 slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1118 >>> print record.letter_annotations["solexa_quality"] 1119 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5] 1120 1121 These quality scores are so low that when converted from the Solexa scheme 1122 into PHRED scores they look quite different: 1123 1124 >>> print "%0.2f" % phred_quality_from_solexa(-1) 1125 2.54 1126 >>> print "%0.2f" % phred_quality_from_solexa(-5) 1127 1.19 1128 1129 Note you can use the Bio.SeqIO.write() function or the SeqRecord's format 1130 method to output the record(s): 1131 1132 >>> print record.format("fastq-solexa") 1133 @slxa_0001_1_0001_01 1134 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1135 + 1136 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; 1137 <BLANKLINE> 1138 1139 Note this output is slightly different from the input file as Biopython 1140 has left out the optional repetition of the sequence identifier on the "+" 1141 line. If you want the to use PHRED scores, use "fastq" or "qual" as the 1142 output format instead, and Biopython will do the conversion for you: 1143 1144 >>> print record.format("fastq") 1145 @slxa_0001_1_0001_01 1146 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1147 + 1148 IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##"" 1149 <BLANKLINE> 1150 1151 >>> print record.format("qual") 1152 >slxa_0001_1_0001_01 1153 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 1154 20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2 1155 1 1 1156 <BLANKLINE> 1157 1158 As shown above, the poor quality Solexa reads have been mapped to the 1159 equivalent PHRED score (e.g. -5 to 1 as shown earlier). 1160 """ 1161 q_mapping = dict() 1162 for letter in range(0, 255): 1163 q_mapping[chr(letter)] = letter-SOLEXA_SCORE_OFFSET 1164 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1165 if title2ids: 1166 id, name, descr = title_line 1167 else: 1168 descr = title_line 1169 id = descr.split()[0] 1170 name = id 1171 record = SeqRecord(Seq(seq_string, alphabet), 1172 id=id, name=name, description=descr) 1173 qualities = [q_mapping[letter] for letter in quality_string] 1174 #DO NOT convert these into PHRED qualities automatically! 1175 if qualities and (min(qualities) < -5 or max(qualities)>62): 1176 raise ValueError("Invalid character in quality string") 1177 #Dirty trick to speed up this line: 1178 #record.letter_annotations["solexa_quality"] = qualities 1179 dict.__setitem__(record._per_letter_annotations, 1180 "solexa_quality", qualities) 1181 yield record
1182 1183 #This is a generator function!
1184 -def FastqIlluminaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
1185 """Parse new Illumina 1.3+ FASTQ like files (which differ in the quality mapping). 1186 1187 The optional arguments are the same as those for the FastqPhredIterator. 1188 1189 For each sequence in Illumina 1.3+ FASTQ files there is a matching string 1190 encoding PHRED integer qualities using ASCII values with an offset of 64. 1191 1192 NOTE - Older versions of the Solexa/Illumina pipeline encoded Solexa scores 1193 with an ASCII offset of 64. They are approximately equal but only for high 1194 qaulity reads. If you have an old Solexa/Illumina file with negative 1195 Solexa scores, and try and read this as an Illumina 1.3+ file it will fail: 1196 1197 >>> from Bio import SeqIO 1198 >>> record = SeqIO.read(open("Quality/solexa_faked.fastq"), "fastq-solexa") 1199 >>> print record.id, record.seq 1200 slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1201 >>> print record.letter_annotations["solexa_quality"] 1202 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5] 1203 >>> record2 = SeqIO.read(open("Quality/solexa_faked.fastq"), "fastq-illumina") 1204 Traceback (most recent call last): 1205 ... 1206 ValueError: Invalid character in quality string 1207 1208 NOTE - True Sanger style FASTQ files use PHRED scores with an offset of 33. 1209 """ 1210 q_mapping = dict() 1211 for letter in range(0, 255): 1212 q_mapping[chr(letter)] = letter-SOLEXA_SCORE_OFFSET 1213 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1214 if title2ids: 1215 id, name, descr = title2ids(title_line) 1216 else: 1217 descr = title_line 1218 id = descr.split()[0] 1219 name = id 1220 record = SeqRecord(Seq(seq_string, alphabet), 1221 id=id, name=name, description=descr) 1222 qualities = [q_mapping[letter] for letter in quality_string] 1223 if qualities and (min(qualities) < 0 or max(qualities) > 62): 1224 raise ValueError("Invalid character in quality string") 1225 #Dirty trick to speed up this line: 1226 #record.letter_annotations["phred_quality"] = qualities 1227 dict.__setitem__(record._per_letter_annotations, 1228 "phred_quality", qualities) 1229 yield record
1230
1231 -def QualPhredIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
1232 """For QUAL files which include PHRED quality scores, but no sequence. 1233 1234 For example, consider this short QUAL file:: 1235 1236 >EAS54_6_R1_2_1_413_324 1237 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 1238 26 26 26 23 23 1239 >EAS54_6_R1_2_1_540_792 1240 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26 1241 26 18 26 23 18 1242 >EAS54_6_R1_2_1_443_348 1243 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 1244 24 18 18 18 18 1245 1246 Using this module directly you might run: 1247 1248 >>> handle = open("Quality/example.qual", "rU") 1249 >>> for record in QualPhredIterator(handle): 1250 ... print record.id, record.seq 1251 EAS54_6_R1_2_1_413_324 ????????????????????????? 1252 EAS54_6_R1_2_1_540_792 ????????????????????????? 1253 EAS54_6_R1_2_1_443_348 ????????????????????????? 1254 >>> handle.close() 1255 1256 Typically however, you would call this via Bio.SeqIO instead with "qual" 1257 as the format: 1258 1259 >>> from Bio import SeqIO 1260 >>> handle = open("Quality/example.qual", "rU") 1261 >>> for record in SeqIO.parse(handle, "qual"): 1262 ... print record.id, record.seq 1263 EAS54_6_R1_2_1_413_324 ????????????????????????? 1264 EAS54_6_R1_2_1_540_792 ????????????????????????? 1265 EAS54_6_R1_2_1_443_348 ????????????????????????? 1266 >>> handle.close() 1267 1268 Becase QUAL files don't contain the sequence string itself, the seq 1269 property is set to an UnknownSeq object. As no alphabet was given, this 1270 has defaulted to a generic single letter alphabet and the character "?" 1271 used. 1272 1273 By specifying a nucleotide alphabet, "N" is used instead: 1274 1275 >>> from Bio import SeqIO 1276 >>> from Bio.Alphabet import generic_dna 1277 >>> handle = open("Quality/example.qual", "rU") 1278 >>> for record in SeqIO.parse(handle, "qual", alphabet=generic_dna): 1279 ... print record.id, record.seq 1280 EAS54_6_R1_2_1_413_324 NNNNNNNNNNNNNNNNNNNNNNNNN 1281 EAS54_6_R1_2_1_540_792 NNNNNNNNNNNNNNNNNNNNNNNNN 1282 EAS54_6_R1_2_1_443_348 NNNNNNNNNNNNNNNNNNNNNNNNN 1283 >>> handle.close() 1284 1285 However, the quality scores themselves are available as a list of integers 1286 in each record's per-letter-annotation: 1287 1288 >>> print record.letter_annotations["phred_quality"] 1289 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1290 1291 You can still slice one of these SeqRecord objects with an UnknownSeq: 1292 1293 >>> sub_record = record[5:10] 1294 >>> print sub_record.id, sub_record.letter_annotations["phred_quality"] 1295 EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26] 1296 """ 1297 #Skip any text before the first record (e.g. blank lines, comments) 1298 while True: 1299 line = handle.readline() 1300 if line == "" : return #Premature end of file, or just empty? 1301 if line[0] == ">": 1302 break 1303 1304 while True: 1305 if line[0] != ">": 1306 raise ValueError("Records in Fasta files should start with '>' character") 1307 if title2ids: 1308 id, name, descr = title2ids(line[1:].rstrip()) 1309 else: 1310 descr = line[1:].rstrip() 1311 id = descr.split()[0] 1312 name = id 1313 1314 qualities = [] 1315 line = handle.readline() 1316 while True: 1317 if not line : break 1318 if line[0] == ">": break 1319 qualities.extend([int(word) for word in line.split()]) 1320 line = handle.readline() 1321 1322 if qualities and min(qualities) < 0: 1323 raise ValueError(("Negative quality score %i found in %s. " + \ 1324 "Are these Solexa scores, not PHRED scores?") \ 1325 % (min(qualities), id)) 1326 1327 #Return the record and then continue... 1328 record = SeqRecord(UnknownSeq(len(qualities), alphabet), 1329 id = id, name = name, description = descr) 1330 #Dirty trick to speed up this line: 1331 #record.letter_annotations["phred_quality"] = qualities 1332 dict.__setitem__(record._per_letter_annotations, 1333 "phred_quality", qualities) 1334 yield record 1335 1336 if not line : return #StopIteration 1337 assert False, "Should not reach this line"
1338
1339 -class FastqPhredWriter(SequentialSequenceWriter):
1340 """Class to write standard FASTQ format files (using PHRED quality scores). 1341 1342 Although you can use this class directly, you are strongly encouraged 1343 to use the Bio.SeqIO.write() function instead via the format name "fastq" 1344 or the alias "fastq-sanger". For example, this code reads in a standard 1345 Sanger style FASTQ file (using PHRED scores) and re-saves it as another 1346 Sanger style FASTQ file: 1347 1348 >>> from Bio import SeqIO 1349 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq") 1350 >>> out_handle = open("Quality/temp.fastq", "w") 1351 >>> SeqIO.write(record_iterator, out_handle, "fastq") 1352 3 1353 >>> out_handle.close() 1354 1355 You might want to do this if the original file included extra line breaks, 1356 which while valid may not be supported by all tools. The output file from 1357 Biopython will have each sequence on a single line, and each quality 1358 string on a single line (which is considered desirable for maximum 1359 compatibility). 1360 1361 In this next example, an old style Solexa/Illumina FASTQ file (using Solexa 1362 quality scores) is converted into a standard Sanger style FASTQ file using 1363 PHRED qualities: 1364 1365 >>> from Bio import SeqIO 1366 >>> record_iterator = SeqIO.parse(open("Quality/solexa_example.fastq"), "fastq-solexa") 1367 >>> out_handle = open("Quality/temp.fastq", "w") 1368 >>> SeqIO.write(record_iterator, out_handle, "fastq") 1369 5 1370 >>> out_handle.close() 1371 1372 This code is also called if you use the .format("fastq") method of a 1373 SeqRecord, or .format("fastq-sanger") if you prefer that alias. 1374 1375 Note that Sanger FASTQ files have an upper limit of PHRED quality 93, which is 1376 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a 1377 warning is issued. 1378 1379 P.S. To avoid cluttering up your working directory, you can delete this 1380 temporary file now: 1381 1382 >>> import os 1383 >>> os.remove("Quality/temp.fastq") 1384 """ 1385 assert SANGER_SCORE_OFFSET == ord("!") 1386
1387 - def write_record(self, record):
1388 """Write a single FASTQ record to the file.""" 1389 assert self._header_written 1390 assert not self._footer_written 1391 self._record_written = True 1392 #TODO - Is an empty sequence allowed in FASTQ format? 1393 if record.seq is None: 1394 raise ValueError("No sequence for record %s" % record.id) 1395 seq_str = str(record.seq) 1396 qualities_str = _get_sanger_quality_str(record) 1397 if len(qualities_str) != len(seq_str): 1398 raise ValueError("Record %s has sequence length %i but %i quality scores" \ 1399 % (record.id, len(seq_str), len(qualities_str))) 1400 1401 #FASTQ files can include a description, just like FASTA files 1402 #(at least, this is what the NCBI Short Read Archive does) 1403 id = self.clean(record.id) 1404 description = self.clean(record.description) 1405 if description and description.split(None, 1)[0]==id: 1406 #The description includes the id at the start 1407 title = description 1408 elif description: 1409 title = "%s %s" % (id, description) 1410 else: 1411 title = id 1412 1413 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1414
1415 -class QualPhredWriter(SequentialSequenceWriter):
1416 """Class to write QUAL format files (using PHRED quality scores). 1417 1418 Although you can use this class directly, you are strongly encouraged 1419 to use the Bio.SeqIO.write() function instead. For example, this code 1420 reads in a FASTQ file and saves the quality scores into a QUAL file: 1421 1422 >>> from Bio import SeqIO 1423 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq") 1424 >>> out_handle = open("Quality/temp.qual", "w") 1425 >>> SeqIO.write(record_iterator, out_handle, "qual") 1426 3 1427 >>> out_handle.close() 1428 1429 This code is also called if you use the .format("qual") method of a 1430 SeqRecord. 1431 1432 P.S. Don't forget to clean up the temp file if you don't need it anymore: 1433 1434 >>> import os 1435 >>> os.remove("Quality/temp.qual") 1436 """
1437 - def __init__(self, handle, wrap=60, record2title=None):
1438 """Create a QUAL writer. 1439 1440 Arguments: 1441 - handle - Handle to an output file, e.g. as returned 1442 by open(filename, "w") 1443 - wrap - Optional line length used to wrap sequence lines. 1444 Defaults to wrapping the sequence at 60 characters 1445 Use zero (or None) for no wrapping, giving a single 1446 long line for the sequence. 1447 - record2title - Optional function to return the text to be 1448 used for the title line of each record. By default 1449 a combination of the record.id and record.description 1450 is used. If the record.description starts with the 1451 record.id, then just the record.description is used. 1452 1453 The record2title argument is present for consistency with the 1454 Bio.SeqIO.FastaIO writer class. 1455 """ 1456 SequentialSequenceWriter.__init__(self, handle) 1457 #self.handle = handle 1458 self.wrap = None 1459 if wrap: 1460 if wrap < 1: 1461 raise ValueError 1462 self.wrap = wrap 1463 self.record2title = record2title
1464
1465 - def write_record(self, record):
1466 """Write a single QUAL record to the file.""" 1467 assert self._header_written 1468 assert not self._footer_written 1469 self._record_written = True 1470 1471 if self.record2title: 1472 title = self.clean(self.record2title(record)) 1473 else: 1474 id = self.clean(record.id) 1475 description = self.clean(record.description) 1476 if description and description.split(None, 1)[0]==id: 1477 #The description includes the id at the start 1478 title = description 1479 elif description: 1480 title = "%s %s" % (id, description) 1481 else: 1482 title = id 1483 self.handle.write(">%s\n" % title) 1484 1485 qualities = _get_phred_quality(record) 1486 try: 1487 #This rounds to the nearest integer. 1488 #TODO - can we record a float in a qual file? 1489 qualities_strs = [("%i" % round(q, 0)) for q in qualities] 1490 except TypeError, e: 1491 if None in qualities: 1492 raise TypeError("A quality value of None was found") 1493 else: 1494 raise e 1495 1496 if self.wrap: 1497 while qualities_strs: 1498 line = qualities_strs.pop(0) 1499 while qualities_strs \ 1500 and len(line) + 1 + len(qualities_strs[0]) < self.wrap: 1501 line += " " + qualities_strs.pop(0) 1502 self.handle.write(line + "\n") 1503 else: 1504 data = " ".join(qualities_strs) 1505 self.handle.write(data + "\n")
1506
1507 -class FastqSolexaWriter(SequentialSequenceWriter):
1508 r"""Write old style Solexa/Illumina FASTQ format files (with Solexa qualities). 1509 1510 This outputs FASTQ files like those from the early Solexa/Illumina 1511 pipeline, using Solexa scores and an ASCII offset of 64. These are 1512 NOT compatible with the standard Sanger style PHRED FASTQ files. 1513 1514 If your records contain a "solexa_quality" entry under letter_annotations, 1515 this is used, otherwise any "phred_quality" entry will be used after 1516 conversion using the solexa_quality_from_phred function. If neither style 1517 of quality scores are present, an exception is raised. 1518 1519 Although you can use this class directly, you are strongly encouraged 1520 to use the Bio.SeqIO.write() function instead. For example, this code 1521 reads in a FASTQ file and re-saves it as another FASTQ file: 1522 1523 >>> from Bio import SeqIO 1524 >>> record_iterator = SeqIO.parse(open("Quality/solexa_example.fastq"), "fastq-solexa") 1525 >>> out_handle = open("Quality/temp.fastq", "w") 1526 >>> SeqIO.write(record_iterator, out_handle, "fastq-solexa") 1527 5 1528 >>> out_handle.close() 1529 1530 You might want to do this if the original file included extra line breaks, 1531 which (while valid) may not be supported by all tools. The output file 1532 from Biopython will have each sequence on a single line, and each quality 1533 string on a single line (which is considered desirable for maximum 1534 compatibility). 1535 1536 This code is also called if you use the .format("fastq-solexa") method of 1537 a SeqRecord. For example, 1538 1539 >>> record = SeqIO.read(open("Quality/sanger_faked.fastq"), "fastq-sanger") 1540 >>> print record.format("fastq-solexa") 1541 @Test PHRED qualities from 40 to 0 inclusive 1542 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1543 + 1544 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFECB@>;; 1545 <BLANKLINE> 1546 1547 Note that Solexa FASTQ files have an upper limit of Solexa quality 62, which is 1548 encoded as ASCII 126, the tilde. If your quality scores must be truncated to fit, 1549 a warning is issued. 1550 1551 P.S. Don't forget to delete the temp file if you don't need it anymore: 1552 1553 >>> import os 1554 >>> os.remove("Quality/temp.fastq") 1555 """
1556 - def write_record(self, record):
1557 """Write a single FASTQ record to the file.""" 1558 assert self._header_written 1559 assert not self._footer_written 1560 self._record_written = True 1561 1562 #TODO - Is an empty sequence allowed in FASTQ format? 1563 if record.seq is None: 1564 raise ValueError("No sequence for record %s" % record.id) 1565 seq_str = str(record.seq) 1566 qualities_str = _get_solexa_quality_str(record) 1567 if len(qualities_str) != len(seq_str): 1568 raise ValueError("Record %s has sequence length %i but %i quality scores" \ 1569 % (record.id, len(seq_str), len(qualities_str))) 1570 1571 #FASTQ files can include a description, just like FASTA files 1572 #(at least, this is what the NCBI Short Read Archive does) 1573 id = self.clean(record.id) 1574 description = self.clean(record.description) 1575 if description and description.split(None, 1)[0]==id: 1576 #The description includes the id at the start 1577 title = description 1578 elif description: 1579 title = "%s %s" % (id, description) 1580 else: 1581 title = id 1582 1583 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1584
1585 -class FastqIlluminaWriter(SequentialSequenceWriter):
1586 r"""Write Illumina 1.3+ FASTQ format files (with PHRED quality scores). 1587 1588 This outputs FASTQ files like those from the Solexa/Illumina 1.3+ pipeline, 1589 using PHRED scores and an ASCII offset of 64. Note these files are NOT 1590 compatible with the standard Sanger style PHRED FASTQ files which use an 1591 ASCII offset of 32. 1592 1593 Although you can use this class directly, you are strongly encouraged to 1594 use the Bio.SeqIO.write() function with format name "fastq-illumina" 1595 instead. This code is also called if you use the .format("fastq-illumina") 1596 method of a SeqRecord. For example, 1597 1598 >>> from Bio import SeqIO 1599 >>> record = SeqIO.read(open("Quality/sanger_faked.fastq"), "fastq-sanger") 1600 >>> print record.format("fastq-illumina") 1601 @Test PHRED qualities from 40 to 0 inclusive 1602 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1603 + 1604 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@ 1605 <BLANKLINE> 1606 1607 Note that Illumina FASTQ files have an upper limit of PHRED quality 62, which is 1608 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a 1609 warning is issued. 1610 """
1611 - def write_record(self, record):
1612 """Write a single FASTQ record to the file.""" 1613 assert self._header_written 1614 assert not self._footer_written 1615 self._record_written = True 1616 1617 #TODO - Is an empty sequence allowed in FASTQ format? 1618 if record.seq is None: 1619 raise ValueError("No sequence for record %s" % record.id) 1620 seq_str = str(record.seq) 1621 qualities_str = _get_illumina_quality_str(record) 1622 if len(qualities_str) != len(seq_str): 1623 raise ValueError("Record %s has sequence length %i but %i quality scores" \ 1624 % (record.id, len(seq_str), len(qualities_str))) 1625 1626 #FASTQ files can include a description, just like FASTA files 1627 #(at least, this is what the NCBI Short Read Archive does) 1628 id = self.clean(record.id) 1629 description = self.clean(record.description) 1630 if description and description.split(None, 1)[0]==id: 1631 #The description includes the id at the start 1632 title = description 1633 elif description: 1634 title = "%s %s" % (id, description) 1635 else: 1636 title = id 1637 1638 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1639
1640 -def PairedFastaQualIterator(fasta_handle, qual_handle, alphabet = single_letter_alphabet, title2ids = None):
1641 """Iterate over matched FASTA and QUAL files as SeqRecord objects. 1642 1643 For example, consider this short QUAL file with PHRED quality scores:: 1644 1645 >EAS54_6_R1_2_1_413_324 1646 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 1647 26 26 26 23 23 1648 >EAS54_6_R1_2_1_540_792 1649 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26 1650 26 18 26 23 18 1651 >EAS54_6_R1_2_1_443_348 1652 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 1653 24 18 18 18 18 1654 1655 And a matching FASTA file:: 1656 1657 >EAS54_6_R1_2_1_413_324 1658 CCCTTCTTGTCTTCAGCGTTTCTCC 1659 >EAS54_6_R1_2_1_540_792 1660 TTGGCAGGCCAAGGCCGATGGATCA 1661 >EAS54_6_R1_2_1_443_348 1662 GTTGCTTCTGGCGTGGGTGGGGGGG 1663 1664 You can parse these separately using Bio.SeqIO with the "qual" and 1665 "fasta" formats, but then you'll get a group of SeqRecord objects with 1666 no sequence, and a matching group with the sequence but not the 1667 qualities. Because it only deals with one input file handle, Bio.SeqIO 1668 can't be used to read the two files together - but this function can! 1669 For example, 1670 1671 >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"), 1672 ... open("Quality/example.qual", "rU")) 1673 >>> for record in rec_iter: 1674 ... print record.id, record.seq 1675 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1676 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1677 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1678 1679 As with the FASTQ or QUAL parsers, if you want to look at the qualities, 1680 they are in each record's per-letter-annotation dictionary as a simple 1681 list of integers: 1682 1683 >>> print record.letter_annotations["phred_quality"] 1684 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1685 1686 If you have access to data as a FASTQ format file, using that directly 1687 would be simpler and more straight forward. Note that you can easily use 1688 this function to convert paired FASTA and QUAL files into FASTQ files: 1689 1690 >>> from Bio import SeqIO 1691 >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"), 1692 ... open("Quality/example.qual", "rU")) 1693 >>> out_handle = open("Quality/temp.fastq", "w") 1694 >>> SeqIO.write(rec_iter, out_handle, "fastq") 1695 3 1696 >>> out_handle.close() 1697 1698 And don't forget to clean up the temp file if you don't need it anymore: 1699 1700 >>> import os 1701 >>> os.remove("Quality/temp.fastq") 1702 """ 1703 from Bio.SeqIO.FastaIO import FastaIterator 1704 fasta_iter = FastaIterator(fasta_handle, alphabet=alphabet, \ 1705 title2ids=title2ids) 1706 qual_iter = QualPhredIterator(qual_handle, alphabet=alphabet, \ 1707 title2ids=title2ids) 1708 1709 #Using zip(...) would create a list loading everything into memory! 1710 #It would also not catch any extra records found in only one file. 1711 while True: 1712 try: 1713 f_rec = fasta_iter.next() 1714 except StopIteration: 1715 f_rec = None 1716 try: 1717 q_rec = qual_iter.next() 1718 except StopIteration: 1719 q_rec = None 1720 if f_rec is None and q_rec is None: 1721 #End of both files 1722 break 1723 if f_rec is None: 1724 raise ValueError("FASTA file has more entries than the QUAL file.") 1725 if q_rec is None: 1726 raise ValueError("QUAL file has more entries than the FASTA file.") 1727 if f_rec.id != q_rec.id: 1728 raise ValueError("FASTA and QUAL entries do not match (%s vs %s)." \ 1729 % (f_rec.id, q_rec.id)) 1730 if len(f_rec) != len(q_rec.letter_annotations["phred_quality"]): 1731 raise ValueError("Sequence length and number of quality scores disagree for %s" \ 1732 % f_rec.id) 1733 #Merge the data.... 1734 f_rec.letter_annotations["phred_quality"] = q_rec.letter_annotations["phred_quality"] 1735 yield f_rec
1736 #Done 1737 1738
1739 -def _test():
1740 """Run the Bio.SeqIO module's doctests. 1741 1742 This will try and locate the unit tests directory, and run the doctests 1743 from there in order that the relative paths used in the examples work. 1744 """ 1745 import doctest 1746 import os 1747 if os.path.isdir(os.path.join("..", "..", "Tests")): 1748 print "Runing doctests..." 1749 cur_dir = os.path.abspath(os.curdir) 1750 os.chdir(os.path.join("..", "..", "Tests")) 1751 assert os.path.isfile("Quality/example.fastq") 1752 assert os.path.isfile("Quality/example.fasta") 1753 assert os.path.isfile("Quality/example.qual") 1754 assert os.path.isfile("Quality/tricky.fastq") 1755 assert os.path.isfile("Quality/solexa_faked.fastq") 1756 doctest.testmod(verbose=0) 1757 os.chdir(cur_dir) 1758 del cur_dir 1759 print "Done"
1760 1761 if __name__ == "__main__": 1762 _test() 1763