Package Bio :: Package AlignIO :: Module StockholmIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.AlignIO.StockholmIO

  1  # Copyright 2006-2010 by Peter Cock.  All rights reserved. 
  2  # This code is part of the Biopython distribution and governed by its 
  3  # license.  Please see the LICENSE file that should have been included 
  4  # as part of this package. 
  5  """ 
  6  Bio.AlignIO support for the "stockholm" format (used in the PFAM database). 
  7   
  8  You are expected to use this module via the Bio.AlignIO functions (or the 
  9  Bio.SeqIO functions if you want to work directly with the gapped sequences). 
 10   
 11  For example, consider a Stockholm alignment file containing the following:: 
 12   
 13      # STOCKHOLM 1.0 
 14      #=GC SS_cons       .................<<<<<<<<...<<<<<<<........>>>>>>>.. 
 15      AP001509.1         UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGU 
 16      #=GR AP001509.1 SS -----------------<<<<<<<<---..<<-<<-------->>->>..-- 
 17      AE007476.1         AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGU 
 18      #=GR AE007476.1 SS -----------------<<<<<<<<-----<<.<<-------->>.>>---- 
 19   
 20      #=GC SS_cons       ......<<<<<<<.......>>>>>>>..>>>>>>>>............... 
 21      AP001509.1         CUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 22      #=GR AP001509.1 SS -------<<<<<--------->>>>>--->>>>>>>>--------------- 
 23      AE007476.1         UUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 24      #=GR AE007476.1 SS ------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
 25      // 
 26   
 27  This is a single multiple sequence alignment, so you would probably load this 
 28  using the Bio.AlignIO.read() function: 
 29   
 30      >>> from Bio import AlignIO 
 31      >>> handle = open("Stockholm/simple.sth", "rU") 
 32      >>> align = AlignIO.read(handle, "stockholm") 
 33      >>> handle.close() 
 34      >>> print align 
 35      SingleLetterAlphabet() alignment with 2 rows and 104 columns 
 36      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-G...UGU AP001509.1 
 37      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-C...GAU AE007476.1 
 38      >>> for record in align: 
 39      ...     print record.id, len(record) 
 40      AP001509.1 104 
 41      AE007476.1 104 
 42   
 43  This example file is clearly using RNA, so you might want the alignment object 
 44  (and the SeqRecord objects it holds) to reflect this, rather than simple using 
 45  the default single letter alphabet as shown above.  You can do this with an 
 46  optional argument to the Bio.AlignIO.read() function: 
 47   
 48      >>> from Bio import AlignIO 
 49      >>> from Bio.Alphabet import generic_rna 
 50      >>> handle = open("Stockholm/simple.sth", "rU") 
 51      >>> align = AlignIO.read(handle, "stockholm", alphabet=generic_rna) 
 52      >>> handle.close() 
 53      >>> print align 
 54      RNAAlphabet() alignment with 2 rows and 104 columns 
 55      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-G...UGU AP001509.1 
 56      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-C...GAU AE007476.1 
 57   
 58  In addition to the sequences themselves, this example alignment also includes 
 59  some GR lines for the secondary structure of the sequences.  These are strings, 
 60  with one character for each letter in the associated sequence: 
 61   
 62      >>> for record in align: 
 63      ...     print record.id 
 64      ...     print record.seq 
 65      ...     print record.letter_annotations['secondary_structure'] 
 66      AP001509.1 
 67      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 68      -----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>--------------- 
 69      AE007476.1 
 70      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 71      -----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
 72   
 73  Any general annotation for each row is recorded in the SeqRecord's annotations 
 74  dictionary.  You can output this alignment in many different file formats using 
 75  Bio.AlignIO.write(), or the MultipleSeqAlignment object's format method: 
 76   
 77      >>> print align.format("fasta") 
 78      >AP001509.1 
 79      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-A 
 80      GGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 81      >AE007476.1 
 82      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAA 
 83      GGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 84      <BLANKLINE> 
 85   
 86  Most output formats won't be able to hold the annotation possible in a Stockholm file: 
 87   
 88      >>> print align.format("stockholm") 
 89      # STOCKHOLM 1.0 
 90      #=GF SQ 2 
 91      AP001509.1 UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 92      #=GS AP001509.1 AC AP001509.1 
 93      #=GS AP001509.1 DE AP001509.1 
 94      #=GR AP001509.1 SS -----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>--------------- 
 95      AE007476.1 AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 96      #=GS AE007476.1 AC AE007476.1 
 97      #=GS AE007476.1 DE AE007476.1 
 98      #=GR AE007476.1 SS -----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
 99      // 
100      <BLANKLINE> 
101   
102  Note that when writing Stockholm files, Biopython does not break long sequences up and 
103  interleave them (as in the input file shown above).  The standard allows this simpler 
104  layout, and it is more likely to be understood by other tools.  
105   
106  Finally, as an aside, it can sometimes be useful to use Bio.SeqIO.parse() to iterate over 
107  the two rows as SeqRecord objects - rather than working with Alignnment objects. 
108  Again, if you want to you can specify this is RNA: 
109   
110      >>> from Bio import SeqIO 
111      >>> from Bio.Alphabet import generic_rna 
112      >>> handle = open("Stockholm/simple.sth", "rU") 
113      >>> for record in SeqIO.parse(handle, "stockholm", alphabet=generic_rna): 
114      ...     print record.id 
115      ...     print record.seq 
116      ...     print record.letter_annotations['secondary_structure'] 
117      AP001509.1 
118      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
119      -----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>--------------- 
120      AE007476.1 
121      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
122      -----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
123      >>> handle.close() 
124   
125  Remember that if you slice a SeqRecord, the per-letter-annotions like the 
126  secondary structure string here, are also sliced: 
127   
128      >>> sub_record = record[10:20] 
129      >>> print sub_record.seq 
130      AUCGUUUUAC 
131      >>> print sub_record.letter_annotations['secondary_structure'] 
132      -------<<< 
133  """ 
134  __docformat__ = "epytext en" #not just plaintext 
135  from Bio.Seq import Seq 
136  from Bio.SeqRecord import SeqRecord 
137  from Bio.Align import MultipleSeqAlignment 
138  from Interfaces import AlignmentIterator, SequentialAlignmentWriter 
139   
140 -class StockholmWriter(SequentialAlignmentWriter):
141 """Stockholm/PFAM alignment writer.""" 142 143 #These dictionaries should be kept in sync with those 144 #defined in the StockholmIterator class. 145 pfam_gr_mapping = {"secondary_structure" : "SS", 146 "surface_accessibility" : "SA", 147 "transmembrane" : "TM", 148 "posterior_probability" : "PP", 149 "ligand_binding" : "LI", 150 "active_site" : "AS", 151 "intron" : "IN"} 152 #Following dictionary deliberately does not cover AC, DE or DR 153 pfam_gs_mapping = {"organism" : "OS", 154 "organism_classification" : "OC", 155 "look" : "LO"} 156
157 - def write_alignment(self, alignment):
158 """Use this to write (another) single alignment to an open file. 159 160 Note that sequences and their annotation are recorded 161 together (rather than having a block of annotation followed 162 by a block of aligned sequences). 163 """ 164 count = len(alignment) 165 166 self._length_of_sequences = alignment.get_alignment_length() 167 self._ids_written = [] 168 169 #NOTE - For now, the alignment object does not hold any per column 170 #or per alignment annotation - only per sequence. 171 172 if count == 0: 173 raise ValueError("Must have at least one sequence") 174 if self._length_of_sequences == 0: 175 raise ValueError("Non-empty sequences are required") 176 177 self.handle.write("# STOCKHOLM 1.0\n") 178 self.handle.write("#=GF SQ %i\n" % count) 179 for record in alignment: 180 self._write_record(record) 181 self.handle.write("//\n")
182
183 - def _write_record(self, record):
184 """Write a single SeqRecord to the file""" 185 if self._length_of_sequences != len(record.seq): 186 raise ValueError("Sequences must all be the same length") 187 188 #For the case for stockholm to stockholm, try and use record.name 189 seq_name = record.id 190 if record.name is not None: 191 if "accession" in record.annotations: 192 if record.id == record.annotations["accession"]: 193 seq_name = record.name 194 195 #In the Stockholm file format, spaces are not allowed in the id 196 seq_name = seq_name.replace(" ","_") 197 198 if "start" in record.annotations \ 199 and "end" in record.annotations: 200 suffix = "/%s-%s" % (str(record.annotations["start"]), 201 str(record.annotations["end"])) 202 if seq_name[-len(suffix):] != suffix: 203 seq_name = "%s/%s-%s" % (seq_name, 204 str(record.annotations["start"]), 205 str(record.annotations["end"])) 206 207 if seq_name in self._ids_written: 208 raise ValueError("Duplicate record identifier: %s" % seq_name) 209 self._ids_written.append(seq_name) 210 self.handle.write("%s %s\n" % (seq_name, record.seq.tostring())) 211 212 #The recommended placement for GS lines (per sequence annotation) 213 #is above the alignment (as a header block) or just below the 214 #corresponding sequence. 215 # 216 #The recommended placement for GR lines (per sequence per column 217 #annotation such as secondary structure) is just below the 218 #corresponding sequence. 219 # 220 #We put both just below the corresponding sequence as this allows 221 #us to write the file using a single pass through the records. 222 223 #AC = Accession 224 if "accession" in record.annotations: 225 self.handle.write("#=GS %s AC %s\n" \ 226 % (seq_name, self.clean(record.annotations["accession"]))) 227 elif record.id: 228 self.handle.write("#=GS %s AC %s\n" \ 229 % (seq_name, self.clean(record.id))) 230 231 #DE = description 232 if record.description: 233 self.handle.write("#=GS %s DE %s\n" \ 234 % (seq_name, self.clean(record.description))) 235 236 #DE = database links 237 for xref in record.dbxrefs: 238 self.handle.write("#=GS %s DR %s\n" \ 239 % (seq_name, self.clean(xref))) 240 241 #GS = other per sequence annotation 242 for key, value in record.annotations.iteritems(): 243 if key in self.pfam_gs_mapping: 244 data = self.clean(str(value)) 245 if data: 246 self.handle.write("#=GS %s %s %s\n" \ 247 % (seq_name, 248 self.clean(self.pfam_gs_mapping[key]), 249 data)) 250 else: 251 #It doesn't follow the PFAM standards, but should we record 252 #this data anyway? 253 pass 254 255 #GR = per row per column sequence annotation 256 for key, value in record.letter_annotations.iteritems(): 257 if key in self.pfam_gr_mapping and len(str(value))==len(record.seq): 258 data = self.clean(str(value)) 259 if data: 260 self.handle.write("#=GR %s %s %s\n" \ 261 % (seq_name, 262 self.clean(self.pfam_gr_mapping[key]), 263 data)) 264 else: 265 #It doesn't follow the PFAM standards, but should we record 266 #this data anyway? 267 pass
268
269 -class StockholmIterator(AlignmentIterator):
270 """Loads a Stockholm file from PFAM into MultipleSeqAlignment objects. 271 272 The file may contain multiple concatenated alignments, which are loaded 273 and returned incrementally. 274 275 This parser will detect if the Stockholm file follows the PFAM 276 conventions for sequence specific meta-data (lines starting #=GS 277 and #=GR) and populates the SeqRecord fields accordingly. 278 279 Any annotation which does not follow the PFAM conventions is currently 280 ignored. 281 282 If an accession is provided for an entry in the meta data, IT WILL NOT 283 be used as the record.id (it will be recorded in the record's 284 annotations). This is because some files have (sub) sequences from 285 different parts of the same accession (differentiated by different 286 start-end positions). 287 288 Wrap-around alignments are not supported - each sequences must be on 289 a single line. However, interlaced sequences should work. 290 291 For more information on the file format, please see: 292 http://www.bioperl.org/wiki/Stockholm_multiple_alignment_format 293 http://www.cgb.ki.se/cgb/groups/sonnhammer/Stockholm.html 294 295 For consistency with BioPerl and EMBOSS we call this the "stockholm" 296 format. 297 """ 298 299 #These dictionaries should be kept in sync with those 300 #defined in the PfamStockholmWriter class. 301 pfam_gr_mapping = {"SS" : "secondary_structure", 302 "SA" : "surface_accessibility", 303 "TM" : "transmembrane", 304 "PP" : "posterior_probability", 305 "LI" : "ligand_binding", 306 "AS" : "active_site", 307 "IN" : "intron"} 308 #Following dictionary deliberately does not cover AC, DE or DR 309 pfam_gs_mapping = {"OS" : "organism", 310 "OC" : "organism_classification", 311 "LO" : "look"} 312
313 - def next(self):
314 try: 315 line = self._header 316 del self._header 317 except AttributeError: 318 line = self.handle.readline() 319 if not line: 320 #Empty file - just give up. 321 return 322 if not line.strip() == '# STOCKHOLM 1.0': 323 raise ValueError("Did not find STOCKHOLM header") 324 #import sys 325 #print >> sys.stderr, 'Warning file does not start with STOCKHOLM 1.0' 326 327 # Note: If this file follows the PFAM conventions, there should be 328 # a line containing the number of sequences, e.g. "#=GF SQ 67" 329 # We do not check for this - perhaps we should, and verify that 330 # if present it agrees with our parsing. 331 332 seqs = {} 333 ids = [] 334 gs = {} 335 gr = {} 336 gf = {} 337 passed_end_alignment = False 338 while 1: 339 line = self.handle.readline() 340 if not line: break #end of file 341 line = line.strip() #remove trailing \n 342 if line == '# STOCKHOLM 1.0': 343 self._header = line 344 break 345 elif line == "//": 346 #The "//" line indicates the end of the alignment. 347 #There may still be more meta-data 348 passed_end_alignment = True 349 elif line == "": 350 #blank line, ignore 351 pass 352 elif line[0] != "#": 353 #Sequence 354 #Format: "<seqname> <sequence>" 355 assert not passed_end_alignment 356 parts = [x.strip() for x in line.split(" ",1)] 357 if len(parts) != 2: 358 #This might be someone attempting to store a zero length sequence? 359 raise ValueError("Could not split line into identifier " \ 360 + "and sequence:\n" + line) 361 id, seq = parts 362 if id not in ids: 363 ids.append(id) 364 seqs.setdefault(id, '') 365 seqs[id] += seq.replace(".","-") 366 elif len(line) >= 5: 367 #Comment line or meta-data 368 if line[:5] == "#=GF ": 369 #Generic per-File annotation, free text 370 #Format: #=GF <feature> <free text> 371 feature, text = line[5:].strip().split(None,1) 372 #Each feature key could be used more than once, 373 #so store the entries as a list of strings. 374 if feature not in gf: 375 gf[feature] = [text] 376 else: 377 gf[feature].append(text) 378 elif line[:5] == '#=GC ': 379 #Generic per-Column annotation, exactly 1 char per column 380 #Format: "#=GC <feature> <exactly 1 char per column>" 381 pass 382 elif line[:5] == '#=GS ': 383 #Generic per-Sequence annotation, free text 384 #Format: "#=GS <seqname> <feature> <free text>" 385 id, feature, text = line[5:].strip().split(None,2) 386 #if id not in ids: 387 # ids.append(id) 388 if id not in gs: 389 gs[id] = {} 390 if feature not in gs[id]: 391 gs[id][feature] = [text] 392 else: 393 gs[id][feature].append(text) 394 elif line[:5] == "#=GR ": 395 #Generic per-Sequence AND per-Column markup 396 #Format: "#=GR <seqname> <feature> <exactly 1 char per column>" 397 id, feature, text = line[5:].strip().split(None,2) 398 #if id not in ids: 399 # ids.append(id) 400 if id not in gr: 401 gr[id] = {} 402 if feature not in gr[id]: 403 gr[id][feature] = "" 404 gr[id][feature] += text.strip() # append to any previous entry 405 #TODO - Should we check the length matches the alignment length? 406 # For iterlaced sequences the GR data can be split over 407 # multiple lines 408 #Next line... 409 410 411 assert len(seqs) <= len(ids) 412 #assert len(gs) <= len(ids) 413 #assert len(gr) <= len(ids) 414 415 self.ids = ids 416 self.sequences = seqs 417 self.seq_annotation = gs 418 self.seq_col_annotation = gr 419 420 if ids and seqs: 421 422 if self.records_per_alignment is not None \ 423 and self.records_per_alignment != len(ids): 424 raise ValueError("Found %i records in this alignment, told to expect %i" \ 425 % (len(ids), self.records_per_alignment)) 426 427 alignment = MultipleSeqAlignment(self.alphabet) 428 429 #TODO - Introduce an annotated alignment class? 430 #For now, store the annotation a new private property: 431 alignment._annotations = gr 432 433 alignment_length = len(seqs.values()[0]) 434 for id in ids: 435 seq = seqs[id] 436 if alignment_length != len(seq): 437 raise ValueError("Sequences have different lengths, or repeated identifier") 438 name, start, end = self._identifier_split(id) 439 record = SeqRecord(Seq(seq, self.alphabet), 440 id = id, name = name, description = id, 441 annotations = {"accession":name}) 442 #Accession will be overridden by _populate_meta_data if an explicit 443 #accession is provided: 444 record.annotations["accession"]=name 445 446 if start is not None: 447 record.annotations["start"] = start 448 if end is not None: 449 record.annotations["end"] = end 450 451 self._populate_meta_data(id, record) 452 alignment.append(record) 453 return alignment 454 else: 455 return None
456 457
458 - def _identifier_split(self, identifier):
459 """Returns (name,start,end) string tuple from an identier.""" 460 if identifier.find("/")!=-1: 461 start_end = identifier.split("/",1)[1] 462 if start_end.count("-")==1: 463 start, end = map(int, start_end.split("-")) 464 name = identifier.split("/",1)[0] 465 return (name, start, end) 466 return (identifier, None, None)
467
468 - def _get_meta_data(self, identifier, meta_dict):
469 """Takes an itentifier and returns dict of all meta-data matching it. 470 471 For example, given "Q9PN73_CAMJE/149-220" will return all matches to 472 this or "Q9PN73_CAMJE" which the identifier without its /start-end 473 suffix. 474 475 In the example below, the suffix is required to match the AC, but must 476 be removed to match the OS and OC meta-data:: 477 478 # STOCKHOLM 1.0 479 #=GS Q9PN73_CAMJE/149-220 AC Q9PN73 480 ... 481 Q9PN73_CAMJE/149-220 NKA... 482 ... 483 #=GS Q9PN73_CAMJE OS Campylobacter jejuni 484 #=GS Q9PN73_CAMJE OC Bacteria 485 486 This function will return an empty dictionary if no data is found.""" 487 name, start, end = self._identifier_split(identifier) 488 if name==identifier: 489 identifier_keys = [identifier] 490 else: 491 identifier_keys = [identifier, name] 492 answer = {} 493 for identifier_key in identifier_keys: 494 try: 495 for feature_key in meta_dict[identifier_key]: 496 answer[feature_key] = meta_dict[identifier_key][feature_key] 497 except KeyError: 498 pass 499 return answer
500
501 - def _populate_meta_data(self, identifier, record):
502 """Adds meta-date to a SecRecord's annotations dictionary. 503 504 This function applies the PFAM conventions.""" 505 506 seq_data = self._get_meta_data(identifier, self.seq_annotation) 507 for feature in seq_data: 508 #Note this dictionary contains lists! 509 if feature=="AC" : #ACcession number 510 assert len(seq_data[feature])==1 511 record.annotations["accession"]=seq_data[feature][0] 512 elif feature=="DE" : #DEscription 513 record.description = "\n".join(seq_data[feature]) 514 elif feature=="DR" : #Database Reference 515 #Should we try and parse the strings? 516 record.dbxrefs = seq_data[feature] 517 elif feature in self.pfam_gs_mapping: 518 record.annotations[self.pfam_gs_mapping[feature]] = ", ".join(seq_data[feature]) 519 else: 520 #Ignore it? 521 record.annotations["GS:" + feature] = ", ".join(seq_data[feature]) 522 523 #Now record the per-letter-annotations 524 seq_col_data = self._get_meta_data(identifier, self.seq_col_annotation) 525 for feature in seq_col_data: 526 #Note this dictionary contains strings! 527 if feature in self.pfam_gr_mapping: 528 record.letter_annotations[self.pfam_gr_mapping[feature]] = seq_col_data[feature] 529 else: 530 #Ignore it? 531 record.letter_annotations["GR:" + feature] = seq_col_data[feature]
532
533 -def _test():
534 """Run the Bio.SeqIO module's doctests. 535 536 This will try and locate the unit tests directory, and run the doctests 537 from there in order that the relative paths used in the examples work. 538 """ 539 import doctest 540 import os 541 if os.path.isdir(os.path.join("..","..","Tests")): 542 print "Runing doctests..." 543 cur_dir = os.path.abspath(os.curdir) 544 os.chdir(os.path.join("..","..","Tests")) 545 assert os.path.isfile("Stockholm/simple.sth") 546 doctest.testmod() 547 os.chdir(cur_dir) 548 del cur_dir 549 print "Done"
550 551 if __name__ == "__main__": 552 _test() 553