Mining prokaryotic genomes for unknown amino acids: a stop-codon-based approach

Abstract

Background

Selenocysteine and pyrrolysine are the 21st and 22nd amino acids, which are genetically encoded by stop codons.  Since a number of microbial genomes have been completely sequenced to date, it is tempting to ask whether the 23rd amino acid is left undiscovered in these genomes.  Recently, a computational study addressed this question and reported that no tRNA gene for unknown amino acid was found in genome sequences available.  However, performance of the tRNA prediction program on an unknown tRNA family, which may have atypical sequence and structure, is unclear, thereby rendering their result inconclusive.  A protein-level study will provide independent insight into the novel amino acid.

Results

Assuming that the 23rd amino acid is also encoded by a stop codon, we systematically predicted proteins that contain stop-codon-encoded amino acids from 191 prokaryotic genomes.  Since our prediction method relies only on the conservation patterns of primary sequences, it also provides an opportunity to search novel selenoproteins and other readthrough proteins.  It successfully recovered many of currently known selenoproteins and pyrrolysine proteins.  However, no promising candidate for the 23rd amino acid was detected, and only one novel selenoprotein was predicted.

Conclusion

Our result suggests that the unknown amino acid encoded by stop codons does not exist, or its phylogenetic distribution is rather limited, which is in agreement with the previous study on tRNA.  The method described here can be used in future studies to explore novel readthrough events from complete genomes, which are rapidly growing.

Background

Stop codon readthrough is a phenomenon in which the translation process does not terminate at a stop codon, and an amino acid is inserted there instead 12.  In some cases, the inserted amino acid is not one of the 20 amino acids but a noncanonical one.  Two such amino acids have been discovered to date: selenocysteine 34 and pyrrolysine 56.  Because each of them have specialized tRNA genes for decoding and can be considered extensions of the standard genetic code, they are called the 21st and 22nd amino acids, respectively.  Selenocysteine, the 21st amino acid, is encoded by stop codon UGA, and organisms that use selenocysteine have been found from all three domains of life.  Its insertion into UGA is directed by SECIS (selenocysteine insertion sequence) elements, a stem-loop structure on the selenoprotein mRNA.  Along with the progress of genome sequencing projects, computational prediction methods of selenocysteine-containing proteins (selenoproteins) have been developed by several research groups 78910, and the repertoire of selenoproteins has been greatly expanded 1112.  Pyrrolysine, the 22nd amino acid encoded by stop codon UAG, was recently discovered from a methanogenic archaea 56.  Currently, only methanogenic archaea of the order Methanosarcinales and one bacterium are considered to utilize pyrrolysine 13.  The limited phylogenetic distribution of pyrrolysine suggests that its incorporation into the genetic code of methanogen is relatively recent, and the insertion mechanism of a novel amino acid can evolve in a shorter period of time than anticipated.  This raises an interesting question: "Is there a 23rd amino acid?".  If such an amino acid is discovered, it will deepen our understanding of the evolution and diversity of the genetic code.  Because genome sequences of various prokaryotes are available today, there will be a chance to discover the novel amino acid via analysis of these genomes.  Since both the 21st and 22nd amino acids are encoded by stop codons, the prime suspect is other stop codons (e.g. stop codon UAA), although the possibility of sense codons certainly remains.  Using this clue, computational screening methods of the 23rd amino acid can be designed.  Recently, Lobanov et al. addressed this problem by searching tRNAs with anticodons corresponding to stop codons 14.  They analyzed 146 prokaryotic genomes, but no likely tRNA of the novel amino acid was detected.  They concluded that the 23rd amino acid would have a limited phylogenetic distribution, if it exists.  However, programs for tRNA identification are based on the features of known tRNAs and do not necessarily perform well on unknown ones.  Actually, tRNASec and tRNAPyl have unusual secondary structures 515 and often escape detection by programs without special consideration.  Lobanov et al. thus developed a sensitive search method to deal with this problem, but they also admitted that it would fail to identify highly unusual tRNAs.  There is another approach to searching for the 23rd amino acid.  By enumerating ORFs that have an inframe stop codon from genomes and examining their evolutionary conservation, candidate proteins can be predicted.  Because such an ORF-based study is independent from the tRNA analysis, it can either identify candidate organisms missed by the previous study or strengthen its negative conclusion.  Here we report a comprehensive analysis of prokaryotic ORFs that contain an inframe stop codon.  Through enumeration of theoretical ORFs and inspection of their evolutionary conservation, candidates of readthrough proteins were predicted.  They contained many of the known proteins with stop-codon-encoded amino acids, but almost no novel candidates were identified.  Therefore, the unknown amino acid, if it is encoded by a stop codon, is unlikely to exist in the current databases of microbial genomes.  The consequences for selenoproteins and other readthrough genes are also discussed.

Results

Basic ideas

In this study, we focus on theoretical ORFs with one inframe stop codon, termed "interrupted ORFs" (iORFs) (Figure 1a).  If we enumerate all iORFs from microbial genomes, most of the readthrough genes will be included in them.  However, the vast majority of the enumerated iORFs will be biologically meaningless.  To filter out such meaningless iORFs, we required the iORFs to have at least one homolog in other genomes, because evolutionary conservation of primary sequence is a strong indicator of functional importance.  However, this condition is not sufficient, since two major problems remain: pseudogenes and two adjacent genes.  The first problem is that even if an iORF has homologs in other species, it could be a pseudogene or a product of sequencing error.  The second problem is that adjacent genes on the same reading frame may satisfy the condition of conserved iORFs.  In particular, gene pairs within an operon are problematic because their gene arrangement is often conserved.  If the intergenic distance between two genes in an operon happens to be a multiple of three, they look like a conserved readthrough gene.

Basic ideas of the prediction method

Basic ideas of the prediction method.  (a) Schematic illustration of an interrupted ORF (iORF).  (b) Readthrough genes can be distinguished from two adjacent genes based on the results of BLAST searches.  Boxes denote iORFs, and × indicates the inframe stop codon.  Shaded regions represent actual protein-coding regions.  If an iORF codes a readthrough protein, BLAST hits from other organisms will cover the inframe stop codon.  In contrast, if the iORF consists of two adjacent genes, many hits that do not cover the inframe stop codon will be found.  To discriminate them from true readthrough genes, evolutionary information was exploited.  In order to eliminate pseudogenes and sequencing errors, conservation of iORFs and their inframe stop codons was examined.  Since pseudogenes are less conserved, and sequencing errors are relatively rare events, they will not have homologous iORFs in other species.  Even if they do, the position or type (UAA, UAG or UGA) of their inframe stop codons will not coincide.  In this way, they can be eliminated as candidates.  A drawback of this criterion is that it limits the target of our study to readthrough genes conserved across two or more species.  In other words, species-specific readthrough genes are not in the scope of this study.  To address the second problem, adjacent gene pairs were filtered out by examining boundaries of sequence alignments between iORFs and its homologs (Figure 1b).  The stop-codon-encoded amino acids of prokaryotes are usually located inside domains, the units of evolutionary sequence conservation.  Therefore, the aligned regions of readthrough proteins contain their inframe stop codon.  Based on this observation, each iORF was required to have: (i) at least one homolog from other organisms that covers the inframe stop codon and (ii) no homolog that does not cover the stop codon.  Note that, however, if the whole length of an iORF was used as a query sequence, this procedure will erroneously discard multidomain readthrough proteins.  To avoid this problem, a partial sequence around the inframe stop codon was used as a query.

Prediction procedure

The prediction schema is shown in Figure 2.  A total of 191 prokaryotes were analyzed in this study, of which 166 are bacteria and 25 are archaea.  They were selected from 328 prokaryotes with completely sequenced genomes by excluding closely related species.  From the genome sequences of the 191 organisms, all possible iORFs were enumerated.  Two conditions were imposed on the geometry of the iORFs (Figure 1a).  First, only iORFs longer than 80 codons were extracted.  Secondly, margins between the inframe stop codon and both termini of the iORF must be longer than 10 codons.  The total number of iORFs extracted under these conditions was 2,969,958.  Next, iORFs that overlap RNA genes or protein-coding genes in different reading frames were discarded.  This test significantly reduced the number of iORFs to 390,926.

A flowchart of the prediction procedure

A flowchart of the prediction procedure.  Several steps are omitted for simplicity.  Detailed explanation is given in the text.  As noted above, the target of this study is evolutionarily conserved iORFs.  Thus, it was examined whether the iORFs have homologous regions in other genomes.  The 390,926 iORFs were translated into amino acid sequences and subjected to TBLASTN 16 against the 191 genome sequences.  Instead of the whole length of the amino acid sequence, a window of 101 residues centered at the inframe stop codon was used as a BLAST query.  After the BLAST searches, iORFs that have at least one interspecific hit that contains the inframe stop codon were collected.  Whether the codon aligned to the inframe stop codon is a nonsense codon or not was neglected at this stage.  There were 94,690 iORFs that have interspecific hits.  The result of the above homology searches was also used for the boundary analysis (Figure 1b).  An iORF was discarded if there were any BLAST hits that do not cover the inframe stop codon.  A total of 26,003 iORF satisfied the above criteria.  To examine intrafamily conservation of the inframe stop codons, these iORFs were clustered into protein families based on sequence similarity.  After removal of singletons, 679 clusters with two or more members were obtained.  A cluster was discarded unless all members of the cluster had the same type of inframe stop codons (UAA, UAG or UGA).  The locations of the inframe stop codons were also required to be identical in the multiple sequence alignment of the cluster members.  These conditions reduced the number of clusters to 273.  Manual inspection of these 273 clusters revealed that they still contain many false positives that are unrelated to stop-codon-encoded amino acids.  Hence, three-step filtering procedures were applied to remove the false positives.  Briefly, the first filter assesses protein-likeliness based on the signal of purifying selection, while the second and third filters try to remove adjacent gene pairs using the pattern of BLAST alignments (for details, see Materials and Methods).  As a result of the filtering, the number of candidate clusters was reduced to 32.  Through manual inspection of the BLAST alignments, 11 clusters were discarded because they are highly unlikely to code readthrough proteins.

Known proteins in the predicted clusters

The clusters predicted by our method are summarized in Table 1.  Of the 21 clusters, 15 were known selenoproteins, and four were known pyrrolysine proteins.  To assess the sensitivity of our method, the result was compared with a list of prokaryotic selenoproteins reported by Kryukov and Gladyshev 12.  Since our target is readthrough genes conserved across two or more species, such selenoprotein families were selected from their list.  There were 15 families satisfying this criterion, but one family, proline reductase, was excluded because it was found in only one organism in our dataset.  Of the 14 families, 11 were found in our prediction result.  The three families we failed to find were SelW-like protein, peroxiredoxin and thiol:protein disulphide oxidoreductase.  SelW-like protein was below the threshold of detection, because its stop codon is near the N-terminus and the amino acid sequences of its members are too divergent.  The reason why the two other families were not detected is more complex.  Since these two families are homologous, they were grouped into an identical cluster at the clustering stage of our method.  However, the positions of selenocysteine were different between the two families (Figure 3).  The cluster was thus discarded because of an apparent lack of stop codon conservation.  To deal with a situation like this, a reexamination of the clustering threshold and subdivision of clusters will be required.

Predicted clusters of readthrough proteins

A plus sign in a locus indicates that the genomic coordinates of the iORF can be described by a concatenation of two genes or regions.  For example, "GSU2293 + downstream" means that the iORF consists of the gene GSU2293 and its downstream sequence.  HesB family was not clustered into one family, because their sequences were too short and diverged.

Selenoprotein families we failed to detect because of nonconserved location of stop codons

Selenoprotein families we failed to detect because of nonconserved location of stop codons.  Selenocysteine residues of Peroxiredoxin-like protein families constitute homologous redox motifs (TXXU and UXXC), but their positions are different between two families.  Columns are colored according to sequence conservation.  Selenocysteine residues are shown in red, and the other residues in the redox motifs are shown in yellow.  Prx; Peroxiredoxin, TPO; thiol:protein disulphide oxidereductase, Adeh; Anaeromyxobacter dehalogenans, Gmet; Geobacter metallireducens, Gsul; G. sulfurreducens, Dpsy; Desulfotalea psychrophila.  The alignments were computed using ClustalW, and the figures were generated using Jalview.  Of the four pyrrolysine proteins detected, three methylamine methyltransferases have been experimentally confirmed to contain pyrrolysine 617.  The rest is a cluster of TetR-like transcriptional regulators from Methanosarcina acetivorans and M. barkeri.  Since the genome annotation of M. acetivorans describes this protein as a gene containing an inframe amber codon, we classified it as a 'known' candidate, although it is still unclear whether it really contains pyrrolysine.  The genome annotation of M. acetivorans also includes several amber-containing genes that were absent from our prediction result.  They are a methlycobamide:CoM methylase and four transposases 18.  The reason why they were not detected is that only one species in our dataset had an amber-containing form of these proteins.  This is unavoidable because of the inability of our method to detect species-specific readthrough events.  It is the price for reliably excluding pseudogenes and sequencing errors.

Unknown candidates in the predicted clusters

The successful detection of many known proteins is encouraging, because our method relies only on general properties of proteins that contain stop-codon-encoded amino acids, but not on specific features of selenocysteine or pyrrolysine.  Therefore, unknown clusters in our candidates have possibilities for the 23rd amino acid or novel readthrough proteins.  There were two such clusters (Table 1).  The first cluster is comprised of c-type cytochromes from δ-proteobacteria Geobacter sulfurreducens and G. metallireducens.  The N-terminal part of the sequence contains five CXXCH heme-biding motifs, while the C-terminal part has no similarity with any characterized proteins.  Homology search against unfinished microbial genomes identified seven homologous proteins from four other δ-proteobacteria species.  Multiple sequence alignment of these sequences is shown in Figure 4a.  Multiple sequence alignments of novel candidate proteins  Multiple sequence alignments of novel candidate proteins.  (a) A selenoprotein candidate from Geobacter sulfurreducens and its homologs.  The possible selenocysteine residues are shown in red, and putative heme-binding motifs are underlined.  Note that sequence conservation near the selenocysteine is comparable to that of the N-terminal cytochrome domain.  A protein Dpro_2 contains yet another inframe stop codon (TAG) at the column 189.  It will be either a sequencing error or a pseudogene.  Gsul; G. sulfurreducens, Gmet; G. metallireducens, Gura; G. uraniumreducens, Gfrc; Geobacter sp. FRC-32, Dace; Desulfuromonas acetoxidans, Dpro; Delta proteobacterium MLMS-1.  (b) Hypothetical proteins from Geobacter species.  The inframe stop codons (TAG) are shown in red.  This cluster is probably an artifact of close phylogenetic relationship.  We expect that this cluster may represent a novel selenoprotein family.  This is because the inframe stop codons of these proteins are exclusively TGA, and all of the above organisms possess selenocysteine insertion machinery (data not shown).  High conservation of residues near the inframe stop codon also suggests the importance of this region.  If they are true selenoproteins, this protein family becomes a rare instance of selenoprotein that lacks non-selenocysteine homologs.  However, computational analysis of sequences immediately downstream of the inframe stop codons failed to identify SECIS elements, which is a hallmark of selenocysteine-containing genes.  Therefore, yet another possibility is that they are a highly conserved operon.  An experimental verification is necessary to distinguish these two possibilities.  The second cluster consists of two hypothetical proteins, again from G. sulfurreducens and G. metallireducens (Figure 4b).  In contrast to the first cluster, no homolog was identified from other species.  This cluster is probably a false positive and not readthrough proteins.  This is because the residues near the inframe stop codons are poorly conserved.  Moreover, the C-terminal extensions are quite short (about 20 aa).  The sequence conservation in this region can be easily explained by the close phylogenetic relationship between the two species.  In summary, although a possible selenoprotein was newly identified, there was no promising candidate for an unknown amino acid encoded by a stop codon.

Stop codon usage in the pre-filtering clusters

The above negative result could be explained if the filtering process, which is the final step of the prediction method (Figure 2), was too strict.  Although the raw output of the search for evolutionarily conserved iORFs was 273 clusters, most of them were discarded at the subsequent filtering stage.  Because we have no a priori knowledge about the 23rd amino acid, cutoff thresholds for the filtering procedures were determined based on the known readthrough proteins.  This is practically indispensable for objective classification of candidates, but there is no guarantee that unknown proteins with the 23rd amino acid will score higher than the thresholds.  To explore whether a number of good candidates lie below the thresholds, the 273 clusters were analyzed in a way independent from filtering.  If an organism has many readthrough proteins, proteins from the organism will frequently appear in the 273 clusters.  Moreover, relative usage of the inframe stop codons will deviate from that of usual termination signals in the proteome.  Figure 5 shows the discrepancies between relative usage of the inframe and C-terminal stop codons of 127 organisms in the pre-filtering clusters.  Only seven organisms had statistically significant discrepancies (P < 0.05), and all of them are known to utilize selenocysteine or pyrrolysine.

Discrepancies of stop codon usages between the inframe and C-terminal stop codons

Discrepancies of stop codon usages between the inframe and C-terminal stop codons.  The inframe stop codon usage is taken from the pre-filtering clusters, and the C-terminal usage is computed based on the annotated proteins of the organism.  Red circle: an organism with pyrrolysine, blue; selenocysteine, yellow; both pyrrolysine and selenocysteine, white; neither pyrrolysine nor selenocysteine.  The organisms are ordered by their discrepancy scores.  The discrepancy score is the negative logarithm of a p-value of Fisher's exact test.  The dotted line indicates significance level 0.05 after a correction for multiple testing.  When top ten organisms were examined, only Gluconobacter oxydans was an organism not known to have stop-codon-encoded amino acids.  An inspection of the G. oxydans iORFs in the 273 clusters revealed that their inframe stop codons are dominated by TAA, but all of them belong to a single protein cluster associated with transposable elements.  Because it seems unlikely that an insertion system of novel amino acid evolves solely for transposable elements, this organism cannot be considered as a good candidate of the 23rd amino acid.  Sensitivity of this test is not high because many organisms that utilize selenocysteine were below the defined threshold.  However, the result agrees with the filtering-dependent analysis that no candidate of the novel stop-codon-encoded amino acid is detectable in the current dataset.

Discussion

As the number of completely sequenced genomes increases, several research groups started to predict proteins that contain stop-codon-encoded amino acids through computational analyses.  Most of them are aimed at identification of selenoproteins, reflecting concerns from the scientific community and accumulated knowledge on selenocysteine.  In order to improve prediction specificity, they have fully exploited the known features of selenocysteine, such as the SECIS elements or cysteine homologs, which have cysteine in place of selenocysteine.  However, since the target of this study is the 23rd amino acid, and there is no a priori knowledge, only general properties of stop-codon-encoded amino acids can be used for prediction.  Such general-purpose algorithms have also been developed to date.  The method of Chaudhuri and Yeates 10 extracts iORFs from microbial genomes and analyzes sequence conservation around the inframe stop codon.  Their method is thus similar to ours and applicable to both selenocysteine and pyrrolysine.  Perrodou et al. 19 constructed a database of predicted recoding events in microbes.  Their method is applicable not only to stop codon readthrough but also to frameshift.  However, both of them did not apply their methods to search for novel amino acids.  Therefore, the question of the 23rd amino acid has not been investigated from the viewpoint of coding sequences.  Additionally, the previous methods cannot effectively discriminate pseudogenes from readthrough genes.  For instance, Chaudhuri and Yeates reported a homolog of cobalamin biosynthesis protein CobN as a novel candidate of pyrrolysine protein.  However, the gene is probably a pseudogene because it contains an inframe TAA codon in addition to the TAG codon, and only one species seems to have the amber-containing form of the gene.  The previous methods also assume that proteins with stop-codon-encoded amino acids will have non-readthrough homologs (i.e., homologous proteins that do not have inframe stop codons).  However, that is not necessarily true.  For example, pyrrolysine-containing monomethylamine methyltransferases adopt TIM barrel fold 6, but their primary sequences do not exhibit detectable similarity to other TIM barrel proteins because of evolutionary divergence.  Dimethylamine methyltransferases also lack non-readthrough homologs.  Yet another example is glycine reductase selenoprotein A.  Only the selenocysteine-containing form of the enzyme is currently known 20.  Therefore, it is important not to assume non-readthrough homologs for exploring novel candidates.  If any non-readthrough homologs are registered in public sequence databases, a careful annotation process of a newly sequenced genome will be able to detect readthrough genes, even though they may be annotated as pseudogenes.  However, if all members of a gene family have stop codon readthrough, correct annotation of their gene structure will be extremely difficult, and all of them will be split into two distinct genes.  The method reported here is unique in that it does not assume non-readthrough homologs.  Using this method, a systematic screening of the 23rd amino acid and other readthrough genes was carried out.  Many of the currently known selenoproteins and pyrrolysine proteins were recovered, indicating the effectiveness of this approach.  In particular, successful detection of pyrrolysine-containing methyltransferases and selenoprotein A should be noted.  However, almost no novel candidates for readthrough genes were predicted.  What can be concluded from this result?  The most likely explanation is that the 23rd amino acid does not exist, or its distribution on the tree of life is rather limited.  Although a broad spectrum of taxonomic groups has been subjected to genome sequencing, the genomes of most microbial species on the earth have yet to be determined.  The unknown amino acid may be used by these species.  Alternatively, only one organism in our dataset may have the 23rd amino acid.  This is because our method is limited to readthrough genes conserved across two or more species.  If the novel amino acid appears in younger, non-conserved sequences, our technique will miss them.  In either case, the distribution of the 23rd amino acid will be significantly narrower than that of selenocysteine, which has scattered but wide distribution 21.  This conclusion coincides with and strengthens that of the previous research on tRNA 14.  Yet another possibility is that the 23rd amino acid exists but is not encoded by stop codons.  It is well known that the genetic code varies in several organisms 22.  Thus, certain organisms may use one of the sense codons for the novel amino acid.  Because codons for most amino acids are degenerate, redefinition of one of them is feasible.  However, that possibility is beyond the scope of this study and is left as an open problem.  Bioinformatics analysis of unusual tRNA genes and codon usage may provide insights into this problem.  In addition to the 23rd amino acid, our method can simultaneously explore selenoproteins and other readthrough proteins.  A common assumption in microbial selenoprotein predictions is that selenoproteins will have cysteine homologs.  Zhang et al. 20 examined the validity of this assumption using a SECIS-based method and concluded that selenoproteins without cysteine homologs will be extremely rare.  Our method can reassess this assumption in a SECIS-independent way.  Such selenoproteins identified through our screening of nearly 200 microbial genomes were selenoprotein A and only one uncertain candidate.  Therefore, selenoproteins that lack cysteine homologs will be scarce, as previously reported.  Other readthrough proteins with canonical amino acids (i.e., proteins that have canonical amino acids at their inframe stop codons) are quite rare in prokaryotes 1.  The result reported here is in agreement, but it is not conclusive.  This is because our method assumes that stop-codon-encoded amino acid is located inside a domain, but it is unclear whether it holds true in prokaryotic readthrough with canonical amino acids.  At least, only one experimentally-confirmed example from a pathogenic strain of Escherichia coli 23, whose genome is not yet determined, does not obey this rule.  What can be concluded from our result is that this type of readthrough will be located outside of domains, such as a linker between two domains.  Such a stop codon may behave as a switch that regulates production of short and long isoforms from a single mRNA, as in readthrough genes from viruses 24.

Conclusion

To explore the possibility of a 23rd amino acid, ORFs in prokaryotic genomes were investigated in a comprehensive way.  Although many of the currently known selenoproteins and pyrrolysine proteins were successfully detected, no candidate for the 23rd amino acid was discovered.  Therefore, if such an amino acid exists, it will have limited distribution in the tree of life.  Alternatively, it may be encoded by one of the sense codons.  From the viewpoint of selenoprotein prediction, the sensitivity of our method was lower than an existing method.  However, our method has several unique features.  It is applicable to general readthrough genes and rigorously excludes pseudogenes and sequencing errors.  Moreover, it does not assume the occurrence of non-readthrough homologs in the public databases.  It will help in identification of novel readthrough genes from the rapidly expanding collection of complete microbial genomes.

Methods

Enumeration of iORFs from prokaryotic genomes

A total of 328 complete genome sequences of prokaryotes were downloaded from the KEGG FTP site 25 in April 2006.  From them, 191 representative organisms were selected by excluding close relatives.  The threshold was set to average sequence identity 90% of two house-keeping genes, DNA polymerase III α subunit and alanyl-tRNA synthetase.  From these 191 genomes, iORFs longer than 80 codons were enumerated using inhouse software, which is available from the author's web site 26.  Both upstream and downstream regions of its inframe stop codon were required to be longer than 10 codons.  Two stop codons of an iORF (i.e. the inframe and C-terminal stops) can be any combination of canonical stop codons (TAA, TAG, TGA).  However, for Mycoplasma, only TAA and TAG were used.  Three codons ATG, TTG and GTG were allowed to be start signals.  The iORFs of each organism were compared with protein-coding genes of the organism using BLASTX.  If an iORF matched any protein-coding genes (E-value < 10-3) and their reading frames did not coincide, the iORF was discarded.  Similarly, iORFs were compared with RNA genes using BLASTN, and those matched with the RNAs were removed.  Remaining iORFs were translated into amino acid sequences.  We translated all three types of nonsense codons into the one-letter code U, so as to simplify visual inspection of sequence alignments.  Although the code U is usually for selenocysteine, it will be harmless because U is automatically converted into × inside the BLAST programs.

Construction of clusters of conserved iORFs

To examine evolutionary conservation of the iORFs, a window of 101 residues around the inframe stop codon was extracted and subjected to TBLASTN searches against the above 191 genome sequences.  If there were any hits (E-value < 0.01) in other organisms, and if the hit includes 10 upstream and 10 downstream residues of the inframe stop codon, then the iORF was retained.  However, if there were any hits (E-value < 10-5) that did not cover the inframe stop codon, the iORF was discarded.  Eligible iORFs were then clustered using BLASTCLUST with score density 0.5 and minimum length coverage 0.6.  After removing singleton clusters, multiple sequence alignments of the remaining clusters were computed using MAFFT 27 with the L-INS-i option.  Subsequently, conservation of the inframe stop codons in each cluster was examined.  If the location or type of stop codons was not identical, the cluster was discarded.

Three-step filtering of the candidate clusters

The first filter examines protein-likeliness of the iORFs.  This filter is mainly designed to remove conserved non-coding sequences (CNS) immediately downstream of non-readthrough genes.  If we measure purifying selection for amino acid sequences by the ratio of nonsynonymous to synonymous substitution rates (dN/dS), a protein with a stop-codon-encoded amino acid will indicate the sign of selection, while CNS will not.  The dN/dS was calculated for each of the two parts flanking the inframe stop codon in an iORF using codeml program in the PAML package 28.  Statistical significance was estimated by likelihood ratio test 29.  The observed alignment was fitted to two distinct substitution models, one of which estimates dN/dS from the data, and the other fixes it to 1.0.  Let lfree and lfix denote log likelihood of these models.  Then, 2Δl = 2(lfree – lfix) approximately follows the χ2 distribution with one degree of freedom.  If dN/dS was less than 1.0, and the statistics 2Δl was larger than a threshold, we regard it as a sign of purifying selection.  In this study, the threshold was set to 5.0 (corresponds to P < 0.025) so that the known readthrough proteins score higher than the threshold.  For each of the above clusters, an all-against-all comparison of cluster members was performed.  If any pair exhibits such signals in both the N- and C-terminal parts, the cluster was retained.  Even if both the upstream and downstream regions of the inframe stop codon code proteins, they may be two adjacent genes instead of a readthrough protein.  The second and third filtering processes remove such genes based on BLAST alignment patterns.  Although the boundary analysis applied previously has the same goal (Figure 1b), some gene pairs escaped elimination.  To enhance sensitivity of the filters, the whole length of an iORF was used as a BLAST query instead of the partial sequence, and the size of the BLAST database was increased from the 191 nonredundant genomes to the 328 complete genomes in GenomeNet and 246 draft genome sequences downloaded from GenBank in May 2006.  The second filter inspects synteny of iORFs.  If the N- and C-terminal parts of an iORF have distinct but closely arranged BLAST hits in other genomes, it strongly suggests the iORF is actually two adjacent genes.  Translated sequences of iORFs in the pre-filtering clusters were subjected to TBLASTN searches against the genome database.  If both the best hits of the N- and C-terminal parts are statistically significant (E-value < 10-5), and distance between them is less than 1 kbp, we call these hits 'syntenic hits'.  If any syntenic hits with non-coinciding reading frames were found, the cluster was removed.  The third filter uses co-occurrence of residues around the inframe stop codon as another source of information for screening stop codon readthrough.  Suppose a window of 21 residues centered at the inframe stop codon.  In prokaryotes, most stop-codon-encoded amino acids are located inside a domain, the unit of evolutionary sequence conservation.  Therefore, in an ideal situation the presence or absence of the 21 residues in alignments will be synchronized.  In contrast, if the iORF is actually two adjacent genes, then upstream and downstream residues of the stop codon will appear separately in many alignments.  We defined a co-occurrence matrix as a 21 × 21 matrix whose (i,j)-th element represents how often residue i and j appeared simultaneously in N alignments.  The matrix elements were subsequently normalized to the number of alignments N.  By definition, the more often the upstream and downstream residues of the inframe stop codon co-occur in the alignments, the higher the density in the upper right quarter of the matrix.  If average density in the quarter was lower than 0.85, the cluster was filtered out.

Stop codon usage

For each organism, its iORFs were extracted from the pre-filtering clusters, and codon usage at the inframe stop positions was counted.  Codon usage at the C-terminal stop codons in its proteome was also computed using data of coding sequences downloaded from KEGG GENES 25.  These data were combined into a 3 × 2 matrix, and Fisher's exact test was applied.  The p-value was corrected for multiple testing using the Bonferroni correction because there were 127 organisms in the pre-filtering clusters.

Probabilistic prediction and ranking of human protein-protein interactions

Abstract

Background

Although the prediction of protein-protein interactions has been extensively investigated for yeast, few such datasets exist for the far larger proteome in human.  Furthermore, it has recently been estimated that the overall average false positive rate of available computational and high-throughput experimental interaction datasets is as high as 90%.

Results

The prediction of human protein-protein interactions was investigated by combining orthogonal protein features within a probabilistic framework.  The features include co-expression, orthology to known interacting proteins and the full-Bayesian combination of subcellular localization, co-occurrence of domains and post-translational modifications.  A novel scoring function for local network topology was also investigated.  This topology feature greatly enhanced the predictions and together with the full-Bayes combined features, made the largest contribution to the predictions.  Using a conservative threshold, our most accurate predictor identifies 37606 human interactions, 32892 (80%) of which are not present in other publicly available large human interaction datasets, thus substantially increasing the coverage of the human interaction map.  A subset of the 32892 novel predicted interactions have been independently validated.  Comparison of the prediction dataset to other available human interaction datasets estimates the false positive rate of the new method to be below 80% which is competitive with other methods.  Since the new method scores and ranks all human protein pairs, smaller subsets of higher quality can be generated thus leading to even lower false positive prediction rates.

Conclusion

The set of interactions predicted in this work increases the coverage of the human interaction map and will help determine the highest confidence human interactions.

Background

Protein-protein interactions perform and regulate fundamental cellular processes.  The comprehensive study of such interactions on a genome-wide scale will lead to a clearer understanding of diverse cellular processes and of the molecular mechanisms of disease.  Although the determination of interactions by small-scale laboratory techniques is impractical for a complete proteome on the grounds of cost and time, several experimental techniques now exist to determine protein-protein interactions in a high-throughput manner 1.  High-throughput datasets have been generated for model organisms such as yeast 23456, worm 7 and fly 89 as well as Escherichia coli 10.  In addition, the first broad-focus experimental datasets for the human interactome have recently been published 1112.  Interactions determined by high-throughput methods are generally considered to be less reliable than those obtained by low-throughput studies 1314 and as a consequence efforts are also underway to extract evidence for interactions from the literature 151617.  Analysis of the high-throughput datasets has shown that they overlap very little with each other, suggesting that their coverage is low.  Indeed, it has been estimated recently that the current yeast and human protein interaction maps are only 50% and 10% complete, respectively 18.  The low coverage and variable quality of the experimental interaction datasets have prompted many groups to investigate computational methods to predict interactions or to determine the most likely interactions seen in the high-throughput datasets.  The different approaches to predict interactions can be grouped into five main categories:  1) Predictors based on sequence and structure exploit the observation that some pairs of sequence motifs, domains and structural families tend to interact preferentially.  Some methods predict interaction from sequence-motifs found to be over-represented in interacting protein pairs 19, or by considering the physico-chemical properties and the location of groups of amino acids in the sequence 2021.  Others investigate the co-occurrence in interacting proteins of specific protein domains or their structural family classification 2223.  When three-dimensional structures are available for both proteins thought to interact, high quality predictions and additional information such as the residues involved in the interaction and their binding affinity may be estimated (reviewed in 24).  Similarly, when two proteins show clear sequence similarity to proteins that exist in a complex for which the three-dimensional structure is known, detailed predictions of the atomic-level interactions may be made.  For example, the major complexes in yeast have been predicted by this strategy 25.  2) Predictors based on comparative genomics have been exploited primarily in prokaryotes.  They consider the physical location of genes, as well as their pattern of occurrence and evolutionary rate, to predict interactions or functional relationships between protein pairs.  Some predictors make use of the observation that neighboring genes whose relative location is conserved across several prokaryotic organisms are likely to interact 26.  Other predictors exploit the observation that gene pairs that co-occur in related species or that co-evolve also tend to be more likely to interact 27282930.  In addition, domains that exist as separate proteins in some genomes but are also seen fused in a single protein in other genomes have been used to suggest the isolated domains may interact 3132.  3) Predictors based on orthology work on the assumption that the orthologs of a protein pair that are known to interact in one organism will also interact.  Such relationships are often referred to as interologs 33.  For example, at BLAST e-values below 10-10, it has been shown that 16–30% of yeast interactions can be transferred to the worm 34 while further studies have estimated that a joint e-value below 10-70 is required to transfer interactions reliably between organisms 35.  Interologs have been used to predict protein-protein interactions in human 36.  4) Predictors based on functional features exploit non-sequence information to infer interactions.  Some predictors exploit the observation that there is a significant correlation in the expression levels of transcripts encoding proteins that interact 37.  Since proteins must be co-localized in order to interact, protein subcellular localization has often been used to assess the quality of interaction datasets 3839.  Similarly, interacting proteins are also often involved in similar cellular processes, so Gene Ontology "process" and "function" annotations have been exploited to predict interactions and validate high-throughput datasets 163638.  5) Predictors have exploited similarities in the network topology of known interaction datasets to predict novel interactions.  In one study, the local topology of small-world networks has been used to assess the quality of interaction datasets and predict novel interactions 40 while Gerstein and colleagues have investigated the prediction of interactions by the identification of missing edges in almost fully connected complexes 41.  In addition to these diverse approaches, some groups have combined concepts from several of the above categories in integrative frameworks.  The first such predictor integrated co-expression data, co-essentiality as well as biological function in a naive Bayes network to provide proteome-wide de novo prediction of yeast protein interactions 37.  Subsequently, the combination of many more diverse features was investigated using different frameworks to predict yeast protein-protein interactions, increasing the prediction accuracy and allowing an assessment of the limits of genomic integration 424344.  The integration of diverse genomic features has also been useful in the investigation of the related but broader problem of predicting protein-protein associations as well as complex and pathway membership (see for example 45).  Although, many computational methods have investigated the prediction of protein-protein interactions, few have so far been applied to the human proteome.  The first large-scale prediction of the human interactome map involved transferring interactions from model organisms 36.  This resulted in over 70000 predicted physical interactions involving approximately 6200 human proteins.  A second method integrated expression data, orthology, protein domain data and functional annotations into a probabilistic framework and resulted in the prediction of nearly 40000 human protein interactions 46.  It has recently been estimated that the false-positive rates of these computational datasets as well as of available high-throughput human interaction datasets are, on average, as high as 90% and their coverage is only approximately 10%, indicating that more such efforts are needed to increase the coverage and confidence we have in current maps of the human interactome 18.  In this paper, the prediction of physical interactions between human proteins has been investigated by integrating in a Bayesian framework several different pieces of evidence including orthology, functional features and local network topology.  In order to increase the accuracy and coverage of the predictions, different types of negative data (non-interacting protein pairs) were explored to train the predictor.  The most accurate of the predictors was then used to assess the likelihood of pair-wise interaction for over 20000 human proteins from the IPI (International Protein Index) database.  These predictions provide a likelihood of interaction for over 260 million human protein pairs and lead to the prediction of over 37000 human interactions.  They should thus augment current knowledge of the human interactome as well as the understanding of the relationship between distinct cellular processes.

Results and discussion

Architecture of the predictor and training of the modules

The prediction of human protein-protein interactions was investigated in a Bayesian framework by considering combinations of individual protein features known to be indicative of interaction.  The seven individual features considered are summarized in Table 1 and detailed in the Methods section.  As indicated in Table 1, the different features were grouped into five distinct modules: Expression (E), Orthology (O), Combined (C), Disorder (D) and Transitive (T).  Figure 1 illustrates the training scheme and architecture of the method.  The Expression, Orthology, Combined and Disorder modules can calculate likelihood ratios (LR) of interaction independently and are referred to as the Group A modules (Figure 1A).  The product of their likelihood ratios is referred to as the Preliminary Score.  The Transitive module considers the local topology of the network predicted by the group A modules and thus requires the completion of their analysis to calculate its own likelihood ratios of interaction (Figure 1B).  As such, all combinations of the Group A modules can be used to predict interaction in the presence or absence of the Transitive module.  In the absence of the Transitive module, the Preliminary Score is used as the final likelihood ratio output by the predictor.

Features considered in the prediction of interactions for each module

Architecture of the predictor and likelihoods of the modules

Architecture of the predictor and likelihoods of the modules.  The predictor consists of two different parts (A and B) which are trained consecutively.  The Group A modules (shown in panel A) are trained in parallel.  The likelihood ratios (LR) for most of their states are shown in panel A (their complete likelihood ratios are available in Additional File 4).  The product of the likelihood ratios of all Group A modules considered in a given prediction is referred to as the preliminary score (PS) and can be calculated for all human protein pairs.  If the Transitive module is not considered, the final likelihood ratios assigned to all protein pairs is the preliminary score (PS).  If the Transitive module is considered, the local topology of the network determined by the assignment of preliminary scores to all protein pairs considered in the training set is used to calculate the likelihood ratios for the transitive module (shown in panel B) for every protein pair in the training set.  The final likelihood ratio is then the product of the preliminary score calculated in panel A and the likelihood ratio output by the transitive module in panel B.  For the Orthology module: YL, YM, YH: yeast low, medium and high scoring bins; FL, FM, FH: fly low, medium and high scoring bins; WL, WM, WH: worm low, medium and high scoring bins; HM and HL: medium and low scoring bins for human protein pairs that have human paralogs; > 1 organism: bin for human protein pairs that have interologs in more than one organism.  For the Combined module, –— refers to the lowest scoring bin (for the domain (Dom), post-translational modification (PTM) and subcellular localization (Loc) features), – refers to the second lowest scoring bin and +, ++, +++ refer respectively to the third highest, second highest and highest scoring bins.  The likelihood ratios of interaction are evaluated for each module by considering the relative proportions of positive and negative training examples that have a specific state (i.e. that fall in a particular bin of a module).  The datasets used to train the predictor consisted of 26896 known human protein interactions extracted from the Human Protein Reference Database (HPRD) 15 and approximately 100 times more randomly chosen protein pairs used as negative examples.  The composition of the datasets and likelihood ratio calculations are explained in greater detail in the Methods section.  Once the final likelihood ratio of interaction (LRfinal) is calculated for a given protein pair as shown in Figure 1B, it is possible to estimate the posterior odds ratio of interaction by multiplying the final likelihood ratio by the prior odds ratio of interaction.  Protein pairs that have a posterior odds of interaction above 1 are more likely to interact than not to interact, thus providing an obvious threshold to predict interacting proteins.  Estimates for the prior odds ratio of interaction vary.  Previous interaction studies on yeast and human use prior odds ratios that range from 1/600 to > 1/400 37434647.  The evaluation of this ratio is difficult because not all true interactions are known.  As detailed in Methods, the prior odds ratio for human protein interaction was explored by considering different versions and subsets of human interaction datasets.  This suggested that there is insufficient data currently available to determine a reliable ratio for human.  Accordingly, we selected a prior odds ratio of interaction of 1/400 which is similar to current estimates for yeast and is probably still quite conservative.  Thus, the likelihood ratio threshold to predict interactions is 400.

Likelihood ratios of the modules

Figure 1 summarizes the likelihood ratios computed for the five modules.  The different modules differ in the range of likelihood ratio values achieved by their different states.  The Orthology and Combined modules both have states that achieve likelihood ratios above 400 (as high as 1207 for the Orthology module and 613 for the Combined module), indicating that both these modules can, on their own, predict some interacting protein pairs with a posterior odds ratio above 1.  The Expression module follows trends seen in previous studies with increasing likelihood ratios of interaction reflecting increasing expression correlation 3746.  However, since the highest likelihood ratio for the expression datasets that we consider is 33, they are not sufficient on their own to predict interacting protein pairs with a posterior odds ratio above 1.  Similarly, but in a much more pronounced way, the Disorder module is only slightly predictive of interaction, with a maximum likelihood ratio of 1.8.  Most states of the Orthology module achieve higher likelihood ratios than the highest obtained by the Expression and Disorder modules.  This is not surprising as the transfer of interacting orthologs (known as interologs 33) from one organism to another is a popular method to predict interactions (see for example 3448), particularly in the case of organisms like human for which only a small proportion of interactions are known.  The direct transfer of interactions to human from either yeast, fly or worm does not alone result in a posterior odds ratio above 1 (as the likelihood ratios of interaction for all yeast, fly and worm bins in the Orthology module are below 400).  This is not surprising as previous studies have indicated that quite stringent joint E-values must be used to transfer interactions safely between organisms 3435.  In contrast, the consideration of human interactions paralogous to the human protein pairs under investigation results in likelihood ratios of 431 and 1034 (depending on how close the paralogs are as described in Methods) which is much higher than those obtained for any single model organism.  This agrees with a recent report that suggested protein-protein interactions are more conserved within species than across species 49.  The Combined module uses domain co-occurrence, post-translational modification (PTM) co-occurrence and subcellular localization information to predict interaction.  These features were originally investigated separately, as shown in Figure 3, but their combination into one module that considers all dependencies between them achieves higher accuracy (data not shown) and higher likelihood ratios (as can be seen by comparing to Figure 1) while still being computationally feasible.  Additionally, this combination circumvents possible problems of dependence between these features.

Likelihood ratios of the features that form the Combined module, considered separately

Likelihood ratios of the features that form the Combined module, considered separately.  The Combined module considers simultaneously three distinct features: the co-occurrence of both domains and PTMs as well as the subcellular co-localization of proteins.  Here the likelihood ratios of these features considered separately are shown.  In panel A, all domain pairs considered were given scores and likelihood ratios were estimated for different values of these scores.  Similarly, shown in panel B are the likelihood ratios for different values of PTM co-occurrence scores.  Panel C shows the likelihood ratios for protein pairs localized to different sets of cellular compartments.  Previous methods have investigated the use of co-occurring domains to predict interaction (see for example 2346).  Many pairs of domains co-occur in proteins known to interact.  When investigated as a separate feature, the chi-square score of co-occurrence of domain pairs correlates well with the likelihood of interaction of protein pairs that contain these domains, with the highest chi-square score bin obtaining a likelihood ratio of 14, as shown in Figure 3A.  Similarly, the co-occurrence of PTMs is also predictive of interaction, with its highest scoring bin obtaining a likelihood ratio of 6 as shown in Figure 3B.  Lists of high scoring domain pairs and PTM pairs are shown in Additional Files 1 and 2.  Subcellular localization has been extensively used both to assess the quality of interaction datasets 115051 and to generate examples of non-interacting protein pairs to use as negative datasets when training and testing predictors 3746.  In the present study, the use of localization was investigated as a feature predictive of interaction.  Four possible localization states were considered for protein pairs: same compartment, neighboring compartments, different non-neighboring compartments and absence of localization annotation (more details are given in the Methods section).  As shown in Figure 3C, the likelihood ratio of same compartment protein pairs was found to be twice as high as that of randomly chosen or non-annotated protein pairs whereas different non-neighboring protein pairs are more than three times less likely to interact than random protein pairs.  Individual localization features achieve low interaction likelihood ratios.  However, when integrated into the Combined module, domain, PTM and localization information together achieve likelihood ratios that are high enough to predict interaction on their own (i.e. above 400).  As expected, the highest likelihood ratio bins for the Combined module are those representing the highest combinations of the three features separately.  The transitive module enhances the preliminary likelihood score (PS) (calculated using the group A modules) by considering the local topology of the resulting network which is assessed using the neighborhood topology score as detailed in the Methods section.  The likelihood ratios for different values of the neighborhood topology score are shown in Figure 1B.  The Transitive module is highly predictive of interaction and achieves likelihood ratios as high as 229.  This module cannot be used alone as it requires as input the output of at least one group A module.  However, it can predict interacting protein pairs with a posterior odds ratio above 1.0 when used in combination with any single module in group A (as the product of the highest likelihood ratios of the transitive module and any group A module is greater than 400 as can be seen from Figure 1).

Independence of the modules

The final likelihood ratio output by the predictor is only representative of the true likelihood of interaction of a protein pair if the modules considered are independent.  If the modules were not independent, (some likelihood ratios would likely be overestimated, particularly for protein pairs that achieve simultaneously high likelihoods for non-independent features).  Conversely, some likelihood ratios would be underestimated for protein pairs achieving simultaneously low likelihoods for non-independent features.  Previous studies have demonstrated that some of the features considered here are indeed independent 43.  Independence of all modules used in our predictor was verified by calculating Pearson correlation coefficients for all pairs of modules.  As shown in Table 2, all modules considered are independent, since the highest Pearson correlation coefficients computed are well below any value considered significant.

Pairwise Pearson correlation for all modules

Accuracy of the predictors

All combinations of modules were examined to determine which of the resulting predictors achieved the highest prediction accuracy.  In order to analyze the predictions, five-fold cross validation experiments were performed and the area under partial ROC (receiver operator characteristic) curves (partial AUCs) measured.  ROC50 and ROC100 curves were selected as they consider a large enough number of positives to include all protein pairs predicted to have a posterior odds ratio above 1.0 by all the predictors investigated.  Protein pairs predicted to have a posterior odds ratio below 1.0 have an estimated true positive rate below 50% and thus are more likely not to interact than to interact.  These protein pairs are therefore not of interest in this context.  The area under all ROCn curves considered is relatively low because of the high proportion of negatives with respect to positives in the training and test sets (100:1).  Table 3 summarizes the characteristics of 19 different predictors and shows accuracy measures.  Individual modules do not achieve high scores for the areas under the ROC50 and ROC100.  In fact, all ROC50 AUC values achieved by individual modules are below 0.025 and the Expression and Disorder modules do not predict any protein pairs (positive or negative) above a posterior odds ratio of 1, which is expected as the highest likelihood ratios they achieve are lower than 400 (see Figure 1A).  As more Group A modules are considered within the same predictor, the ROCn AUC scores increase significantly, as would be expected since these features are independent (as shown in Table 2) and thus contribute different information to the prediction.  For example, the predictor that considers both the Expression and Combined modules achieves a ROC50 AUC of 0.033 compared to 0.003 and 0.022 respectively for the individual modules.  However, the Disorder module does not contribute significantly to the prediction as predictors that consider it do not, in general, do better than their counterparts that do not use it.  For example, both the Expression-Orthology predictor and the Expression-Orthology-Disorder predictor achieve a ROC50 AUC of 0.024.  The Disorder module offers the advantage of increasing the coverage of the prediction as a disorder score is calculated for all protein pairs.  However, this appears to add more noise to the prediction without increasing the accuracy.

Prediction accuracy of different combinations of modules

As the scores of the predictors increase, so do the number of interactions predicted above different posterior odds ratio thresholds (see lower portion of Table 3).  For example, the Expression-Orthology predictor achieves a ROC50 AUC of 0.024 and predicts 5670 interactions at a posterior odds ratio greater than 1 whereas the Expression-Orthology-Combined predictor achieves a ROC50 AUC of 0.044 and predicts over 15000 interactions at a posterior odds ratio above 1.  The best combination of Group A modules is the predictor consisting of the Expression, Orthology and Combined modules.  The Transitive module, which can only be used in combination with other modules, increases substantially the scores and number of interactions predicted.  The right-hand portion of Table 3 shows the accuracy measures for the highest scoring subset of predictors that consider the Transitive module.  The Transitive module enhances the prediction by identifying among protein pairs with a relatively high preliminary score those that are most likely to interact, by considering the local topology of the network around them.  For example, the ROC50 AUC rises from 0.044 to 0.075 when the Transitive module is added to the Expression-Orthology-Combined predictor, and the number of predictions above a posterior odds ratio of 1 doubles from 15330 to 34780.  Once again, the Disorder module does not contribute positively to the prediction.  Its inclusion does not increase any of the measures of accuracy considered.  The predictor that considers the Expression, Orthology, Combined and Transitive modules is the one that achieves the highest accuracy overall.  It is this predictor that is further analyzed in the next sections.

Comparison to predictions generated using alternative training sets

In this work training sets were used that comprised 100 times more negatives than positives, with the negatives randomly selected and filtered to remove any known or suspected positives (see Methods).  Other groups have used negative:positive ratios ranging from 1 to more than 600 (see for example 374752).  In addition, several groups use localization-derived negatives (i.e. protein pairs that are not annotated as being localized to the same cellular compartment) rather than randomly chosen negatives (see for example 374346).  These issues have been investigated previously 53.  Since the choice of negative training data may influence the method, the choice of different training sets in the context of the probabilistic predictor presented here was investigated to determine which type of training set offers the highest accuracy.  Table 4 compares the accuracy of predictors trained with negative:positive ratios of 1:100 and 1:1 and tested by five-fold cross validation.  Ratios greater than 100 were not considered because they are computationally infeasible given the size of our datasets and the architecture of the predictor.  To perform such a comparison, the EOCT predictor (Expression, Orthology, Combined and Transitive modules) was trained on datasets consisting of either equal numbers of positives and negatives or 100 times more negatives than positives and then tested on both types of datasets.  As shown in Table 4, the predictors trained on datasets containing 100 times more negatives than positives perform significantly better than those trained on datasets containing equal numbers of positives and negatives.  For example, the 1:1 pos:neg trained predictor achieves a ROC50 AUC of 0.0645 whereas its 1:100 pos:neg trained counterpart achieves a 0.0747 ROC50 AUC.  This could be due to the fact that the number of non-interacting protein pairs outweighs greatly the number of interacting protein pairs in cells.  When equal numbers of positives and negatives are used in training, the diversity that exists in the non-interacting protein pair space may not be captured, thus resulting in misleading likelihood ratios for the predictive modules.  It should be noted that predictors tested on datasets consisting of equal numbers of positives and negatives achieve much higher accuracy measures than those tested on datasets containing 100 times more negatives than positives.  This is because the number of positives scoring higher than the highest scoring n negatives, for a given value of n and a given predictor, will be greater if there are equal numbers of positives and negatives in the test set than if there are more negatives than positives.

Influence of the negative:positive training set ratio on the prediction accuracy

The ROCn AUCs are an average of five separate experiments (each of which is itself a five-fold cross validation experiment).  Their standard deviation is shown in parenthesis.  The effect of localization-derived negatives rather than randomly chosen negatives was also investigated to see if it would increase the prediction accuracy.  A criticism of randomly chosen negatives is that they will contain some true interactors.  However, the set of interacting pairs in the full protein pair space is small and thus the contamination rate of randomly chosen negative datasets will in fact be very low.  Contamination is probably below 1%, which is likely lower than the contamination rate of the positive dataset as discussed in 47.  Localization-derived negatives, on the other hand, should be free of contamination, if the localization annotations are complete and accurate, both conditions that are difficult to obtain as discussed in 54.  However, one can argue that localization-derived negatives might not be able to capture the full diversity of the non-interacting protein space since many proteins in the same cellular compartment do not interact.  In addition, proteins specific to a cellular compartment may have different characteristics to proteins in other compartments.  Such predictors may not generalize well when predicting on cell-wide protein pairs which consist not only of non-colocalized non-interacting pairs but also numerous protein pairs that do not interact but are present in the same cellular compartment.  These issues have been discussed previously 52.  In order to see if different types of negatives could influence the accuracy of the predictors developed here we generated negative training/test sets as in 46 by identifying all pairs of human proteins for which one protein is annotated as being nuclear and the other is annotated as being localized to the plasma membrane in the HPRD database 15.  The Combined module for these predictors only considers domains and PTMs but not subcellular localization as this would result in using this feature both in the selection of the training set and as a feature predictive of interaction.  The localization-derived negative trained predictor tested on sets containing localization-derived negatives achieves a lower accuracy than that of the random negative trained predictor tested on a test set containing randomly-generated negatives (0.0686 +/- 0.0010 vs 0.0747 +/- 0.0022).  This is most likely due to the fact that the localization-derived negative trained predictor cannot take full advantage of the Transitive module, since the network resulting from the predictions of the Group A modules likely does not sample the whole protein pair space well.  Our predictor trained with randomly generated negatives and a negative:positive ratio of 100 performs the best out of all the combinations of training sets investigated.  It is this predictor that is further analyzed in subsequent sections.

Contribution of the modules

The relative contribution of the modules to the prediction of interaction was investigated in order to gain a better understanding of the predictive power and areas of highest usefulness of the different modules.  To do this, all protein pairs were considered that achieve an estimated posterior odds ratio > 1 when the EOCT predictor was trained on the full datasets without cross-validation.  This set consists of 37606 distinct predicted interactions and is referred to as the LR400 dataset (all these interactions are listed and ranked in Additional File 3).  These protein pairs represent the most probable interactors with respect to the features considered, among all protein pairs examined by the predictor.  To investigate the individual contribution of each module, we looked at the number of interactions predicted out of all LR400 pairs as a function of the minimum likelihood ratio of each module.  As shown in Figure 4A, all modules contribute positively (i.e. contribute a likelihood ratio greater than 1.0) to the prediction of a certain proportion of the interactions in the LR400 dataset.  The Transitive module and to an even greater extent, the Combined module contribute positively to the prediction of a very high proportion of the LR400 protein pairs (73% and 91% of the LR400 interactions have likelihood ratios greater than 1 for the Transitive and Combined modules respectively).  The Transitive module provides a likelihood ratio of 91 for the prediction of over 70% of the LR400 interactions.  The Combined module provides positive evidence for the highest number of interactions of the LR400 dataset.  However, the value of the likelihood ratio it contributes is below 20 for over 50% of protein pairs in the LR400 dataset (which means that for these protein pairs, the Combined module must be used in combination with other modules to achieve a total likelihood ratio above 400).  The Combined module does, however, achieve likelihood ratios high enough to predict over two thousand interactions of the LR400 dataset on its own, less than 15% of which are present in the training set.  The Orthology module contributes to the prediction of only 8474 protein pairs in the LR400 dataset (23%).  However, a large majority (> 75%) of these 8474 predicted interactions achieve likelihood ratios above 200 from this module.  In fact, almost 40% of these LR400 interactions achieve a likelihood ratio above 400 from the Orthology module.  This indicates that most interactions predicted by the Orthology module (alone or in combination with other modules) are based on the highest scoring Orthology bins (see Figure 1A) which are the most conserved yeast interactions (whose bin achieves a likelihood ratio of 237), as well as human paralogous interactions and interactions found in more than one model organism (both of which achieve a likelihood well above 400).  Few interactions in the LR400 dataset are predicted on the basis of having interacting orthologs in worm or fly alone.  The Expression module provides positive evidence for a little less than half the predictions in the LR400 dataset.  However, as previously noted, the highest likelihood provided by this module is 33 and thus the Expression module cannot predict interaction on its own.

Contribution of the modules

Contribution of the modules.  To examine the contribution of the different modules, we plotted the number of interactions predicted among all LR400 interactions (all interactions predicted using the full predictor that obtain a likelihood ratio of interaction greater than 400) as a function of the minimum likelihood ratio of individual modules (in panel A) or of combinations of modules (in panel B).  In the case of combinations of modules (panel B), the minimum likelihood ratio is the product of the likelihood ratios of the modules considered.  Thus for example, the product of the expression and orthology ratios is greater than 1 for almost 20000 LR400 interactions and greater than 10 for approximately 10000 LR400 interactions (dark blue diamonds in panel B).  E: Expression module, O: Orthology module, C: Combined module, T: Transitive module.  Figure 4B summarizes the contributions of different combinations of modules.  The Combined and Transitive modules contribute the most to the prediction of interactions.  They alone can predict approximately 27000 of the 37606 interactions of the LR400 dataset.  When they are both present, regardless of which other modules are also present, they predict over 70% of the LR400 interactions.  When either of these two modules is absent, fewer than 12500 interactions are predicted.  In contrast, the two remaining modules (Expression and Orthology) can predict approximately 5000 interactions together.  This is interesting as many of the publicly available predicted interaction datasets mentioned in the Background section use mainly orthology transfer from model organisms to identify interactions.  As the majority of the LR400 interactions are derived from the Combined and Transitive modules, it is possible that the method is identifying a large subset of interactions that are not common to previous human protein interaction datasets.  This is discussed further in the next section.  The curve representing the full predictor (consisting of the Expression, Orthology, Combined and Transitive modules) is also represented in Figure 4B (the dark green squares).  By definition, it predicts all proteins in the LR400 dataset at likelihood ratios equal to or above 400 (this is how the LR400 dataset was generated).  The right side of the curve illustrates the number of interactions that are predicted above likelihood ratios of 400 and more.  As shown in Figure 4B, the full predictor predicts approximately 20000 interactions at a total likelihood ratio of 1600 (which is equivalent to an estimated posterior odds ratio of 4).  At a likelihood ratio of 4000, approximately 11000 interactions are predicted and at a likelihood ratio of 8000, approximately 6500 interactions are predicted.  We verified that the increasing estimated posterior odds ratios translated into better predictive value.  Figure 5 shows the true positive rate versus false positive rate for different posterior odds ratios as measured by five-fold cross validation.  As the posterior odds ratio increases, the false positive rate decreases and the relative proportion of true positives increases when compared to the proportion of false positives.  Accordingly, subsets of very high quality predictions may be generated by choosing a suitably high posterior odds ratio threshold.

True positive rate versus false positive rate for different estimated posterior odds ratios

True positive rate versus false positive rate for different estimated posterior odds ratios.  The true positive rate (TPR) versus false positive rate (FPR) is plotted for different values of the posterior odds ratio estimated for the dataset by five-fold cross-validation.  As the posterior odds ratio increases, the false positive rate decreases and the ratio of the true positive rate divided by the false positive ratio increases.  Thus, higher quality datasets can be generated by requiring higher posterior odds ratios.  The TPR is calculated as the number of true positives predicted divided by the total number of positives in the test set.  The FPR is calculated as the number of false positives predicted divided by the total number of negatives in the test set.

Comparison to other interaction datasets

The false positive rate (FPR) of our predictor was estimated by the method of D'Haeseleer and Church 1855 and used to compare it to other prediction datasets.  The Ramani interaction dataset that was automatically extracted from the literature 16 as well as all new interactions present in the October 2006 version of the manually curated HPRD database 15 (but none of the interactions also present in earlier versions of the HPRD which were used to train our predictor) were taken as reference datasets.  The D'Haeseleer and Church method compares two experimental datasets to a reference set and assumes that all intersections between the three datasets contain true positives.  It is thus possible to estimate the number of true positives predicted by an experimental dataset by comparing the number of interactions present in the different intersections of the two experimental methods and the reference dataset (for details, see 1855).  Here, we compare three human interaction prediction datasets: the Rhodes probabilistic dataset 46, the Lehner orthology-derived dataset 36 and the most accurate of our predictors (the LR400 subset of the predictor considering the Expression, Orthology, Combined and Transitive modules).  We estimated false positive rates for each of the datasets by comparing them two by two to one of the reference datasets, thus generating 4 to 6 different estimates of false positive rates for each computational dataset, as shown in Figure 6A (the two Lehner datasets were not compared to each other, which is why they have fewer FPR estimates).  The rates estimated for the Rhodes and Lehner datasets are similar to previous estimates 18.  The estimated false positive rates for the LR400, Rhodes and core Lehner are quite similar (an average of 76% FPR for both the LR400 and core Lehner datasets and 78% for the Rhodes dataset) and well below the overall average false-positive rate of 90% estimated for most available human high-throughput experimental and prediction interaction datasets 18.  It should be noted that the Rhodes, Lehner and Ramani datasets annotate interactions as a relationship between human genes and not their protein products directly.  However, not all proteins encoded by a single gene will necessarily interact with all protein products encoded by a second gene, even if one such protein pair does.  This is why we describe interactions as a relationship between two proteins, allowing for a more precise description of the interaction.  To compare our predictions to these datasets, we consider that two genes interact if at least one of their respective protein products interact.

Comparison to other interaction datasets

Comparison to other interaction datasets.  The false positive rates shown in panel A were estimated for the LR400 dataset as well as the Rhodes [46] and Lehner [36] predictions using the method described in [18, 55] by comparing them two-by-two to a reference dataset.  The number and overlap of distinct proteins (shown in B) and distinct interactions (shown in C) are shown for the LR400 dataset, the Rhodes prediction dataset and the June 2006 version of the HPRD.  In Figure 6B and 6C, we compare the number of distinct proteins and distinct interactions of the LR400 dataset to those of the Rhodes prediction dataset and the June 2006 version of the HPRD which was used to train our predictor.  The Rhodes dataset was trained using an earlier version of the HPRD.  As can be seen in Figure 6, the intersections between the three datasets considered are low, especially when comparing the interactions.  Both the Rhodes dataset and our LR400 dataset predict interactions involving many proteins that are not even present in their positive training set (the HPRD).  Many of the predictions in these two datasets concern protein pairs and proteins that are not present in other datasets, suggesting that they cover different regions of the human interaction space.  As suggested in 18, by making more such datasets available, it will be possible to increase our coverage of the interaction space and determine the most likely human interactions.  Another human interaction dataset has recently become available: the IntNetDB 56.  It was generated by integrating seven different features (four of which involve transferring interactions or characteristics of protein pairs from model organisms to human) in a probabilistic framework.  Interactions were predicted above a TP/FP ratio (number of true positives divided by the number of false positives in the test set) of 1.  Using such a threshold, the authors claim to predict 180 010 human interactions.  We do not compare our predictions to this dataset because such a threshold of TP/FP > 1 does not correspond to a posterior odds threshold > 1.  Depending on the positive-to-negative ratio used in the datasets, TP/FP > 1 might correspond to an average posterior odds ratio of 1.  In contrast, the average posterior odds ratio of our LR400 dataset is above 700.  In comparison, by using a threshold of TP/FP > 1 in our test set, we predict over 1 000 000 human interactions.  We do not believe that the quality of this large number of predictions is high enough to warrant their publication since the great majority of these protein pairs achieve a posterior odds ratio below 1.

Independent validation

Although the overlap between the LR400 dataset and the HPRD-derived positive training set is below 10% as shown in Figure 6C, the proportion of interactions common to these two sets is not equally distributed for all posterior odds ratios of interaction values.  As shown in Figure 7, while less than 3% of the protein pairs predicted to interact at posterior odds ratios between 1 and 2 overlap with the HPRD dataset used for training, this value increases to over 50% for the highest scoring subsets of the LR400 dataset.  These highest scoring predictions receive high likelihood ratios of interaction from all four predictive modules and represent the strongest examples of interaction as evaluated by our predictor.  Such examples include interactions that allow the formation of well-known protein complexes such as the proteasome, the MCM protein complex involved in the initiation of genome replication, replication factor C, the TBP/TAF complex (TBP-associated factors) and the EIF complex (eukaryotic translation initiation factors).  The highest scoring predictions in the LR400 dataset thus mainly represent interactions present in the HPRD dataset as well as interactions between proteins that have strong sequence identity to these known interacting pairs.  However, as the posterior odds ratio decreases, the overlap between the predictions and the HPRD-derived training set decreases.  Some subsets of quite high posterior odds have much smaller overlaps with the training set.  For example, interactions predicted at posterior odds ratios between 128 and 2048 have a 20 to 30% overlap with the training set as shown in Figure 7.  Although many of these novel predictions have not been previously investigated in the literature, there exists experimental evidence supporting a subset of these predictions which is not present in the June 2006 version of the HPRD used to train our predictor, thus providing independent validation of our method.  Five such validated predictions are reported here:

Overlap of different subsets of the LR400 dataset with the HPRD-derived training set

Overlap of different subsets of the LR400 dataset with the HPRD-derived training set.  The number of interactions predicted and the proportion of overlap with the training set (which was derived from the HPRD) were calculated for subsets of the LR400 dataset of different posterior odds ratios.  -TCPTP was predicted to interact with STAT6 at a posterior odds ratio of 4300.  It has been recently reported that TCPTP, the only protein tyrosine phosphatase known to localize to the nucleus, dephosphorylates STAT6 in this cellular compartment, which may in turn lead to the suppression of Interleukine-4 (IL-4) induced signaling 57.  -N-WASP and ARP3 achieve a predicted posterior odds ratio of interaction of 2700.  A recent report suggested that the IQGAP1 protein can activate N-WASP thus changing its conformation and allowing it to bind the ARP2/3 complex, which in turn directs the generation of branched actin filaments required for the extension of a lamellipodium 58.  -The VAMP3-VTI1A interaction was predicted with a posterior odds ratio of 1518.  Both these proteins are believed to be part of the SNARE (soluble N-ethylmaleimide-sensitive factor attachment protein receptor) family of proteins which are involved in membrane fusion events.  VTI1A is a trans-Golgi-network-localized putative t-SNARE 59 and VAMP3 is an early/recycling endosomal v-SNARE 60.  These two proteins were recently shown to interact, leading to their functional implication in the post-Golgi retrograde transport step 61.  -CDK2 and MCM4 were predicted to interact at a posterior odds ratio of 62.  CDK2 has recently been shown to phosphorylate MCM4, a subunit of a putative replicative helicase essential for DNA replication, on two distinct residues, leading to a change in its affinity to chromatin and its enrichment in the nucleolus 62.  -Sam68 and Smad2 achieve a predicted posterior odds ratio of 32.  This interaction has been experimentally demonstrated by large-scale yeast-two-hybrid analysis of the Smad signaling system 63.  Our probabilistic predictor therefore not only reproduces and completes well-known protein complexes but also identifies novel interactions, a subset of which have been independently validated.

Conclusion

The current human protein interaction map is estimated to be only 10% complete 18.  Here, we investigated the prediction of human protein-protein interactions in an effort to increase the coverage of the human interactome while simultaneously providing high quality predictions.  By considering several different types of orthogonal and quite distinct features including expression, orthology, combined protein characteristics and local network topology, we predicted over 37000 human protein interactions and explored a subspace of the human interactome that has not been investigated by previous large interaction datasets.  Our investigation led us to compare the influence of different training sets on the prediction accuracy.  The use of randomly generated negative training examples and large negative-to-positive ratios in the training set generated the most accurate predictors in the context of our model.  A comparison to other large human interaction datasets revealed the average false positive rate of our dataset to be 76%, which is much lower than the overall average for most large scale, currently available, human interaction datasets (experimental and computational) estimated to be 90% 18.  A subset of our novel predictions have been independently validated by identifying recent reports that experimentally investigated and confirmed that these protein pairs do interact.  We provide all our predictions ranked according to the posterior odds ratio of interaction in Additional File 3.  It is thus possible to restrict the dataset to the highest scoring protein pairs (and only choose for example, protein pairs that have an estimated true positive rate of interaction above 90%).  By making this human interaction prediction dataset publicly available, it is our hope that it will help to identify the most high-confidence interactions, leading to a more complete and accurate human interaction map.

Methods

Datasets

In order to investigate the likelihood of interaction of human proteins, 62322 human protein sequences were downloaded from the International Protein Index (IPI) database (version 3.16) 64.  Some of these proteins are alternative transcripts of the same gene but can have distinct interaction partners.  Known interactions were downloaded from the Human Protein Reference Database (HPRD; June 2006 version) 15.  Duplicate interactions and self-interactions were not considered.  Additionally, some proteins were not recovered in the conversion between different identifiers.  This resulted in 26896 distinct human protein interactions involving 7531 distinct human proteins present in the initial IPI dataset.  The 26896 interactions from the June 2006 version of the HPRD were used as the positive dataset in the training/testing of the predictor.  Two different sets of non-interacting protein pairs were investigated: the main analysis employed a randomly-generated negative dataset but this was also compared to a localization-derived negative dataset.  Both non-interacting protein datasets were cleaned by removing all protein pairs that came from the positive dataset as well as protein pairs that were annotated as interacting in other databases (DIP 65: 679 interactions, BIND 66: 2650 interactions), or predicted to interact in other studies (OPHID 67: 21815 interactions).  Of the 62322 human proteins from the initial IPI dataset, 22889 were characterized by at least one of the features that we considered to predict interaction (see the Features section).  These 22889 human proteins are encoded by 16904 distinct genes and are referred to as the Informative Protein Set.  The randomly-generated negative dataset used for the training and testing of the predictor was created by selecting protein pairs at random from the Informative Protein Set.  In contrast, the localization-derived negative dataset was created by selecting protein pairs from the Informative Protein Set for which the HPRD 15 annotates one as being primarily in the plasma membrane and the other as primarily in the nucleus.  Training and testing was performed with 5-fold cross-validation.  In addition, positive to negative ratios of 1:1 and 1:100 were considered.  The predictions were compared to the literature-mined Ramani dataset 16, the orthology-derived Lehner prediction dataset 36 and the probabilistic Rhodes prediction dataset 46.  All three datasets identify the interactions by stating the names and/or gene locus IDs of the genes that encode the interacting proteins.  In contrast, we work directly on the protein sequences and so related the gene annotations to our protein identifiers by extracting Entrez Gene IDs corresponding to the IPI protein entries from the IPI cross-reference files (for the IPI release 3.24) 64.  Ensembl gene identifiers (Ensembl 42) were also matched to Entrez Locus IDs (NCBI36) using BioMart 68.  Some gene-gene entries were not recovered in the conversion between different identifiers, or due to the deletion or replacement of some Entrez Locus IDs.  Despite this, 37714 gene-gene interactions were recovered from the Rhodes dataset and 6132 interactions from the Ramani dataset as well as 64306 and 10454 interactions from the Lehner full and core datasets respectively.

Learning method

Semi-naive Bayes classifiers were used to measure the likelihood of interaction of two proteins given the presence of the features considered.  This learning method was chosen because it allows the integration of highly heterogeneous data in a model that is easy to interpret and that can readily accommodate missing data.  The transparency of the method allows the straightforward determination of which features are most predictive of interaction at the level of the whole proteome as well as for individual protein pairs.  The prediction of protein interaction is a binary problem which can be expressed in Bayesian formalism.  We are interested in determining the posterior odds ratio of interaction of two proteins, given the presence of the features we are considering.  This posterior odds ratio can be re-written using Bayes rule: Opost = P(I|f1,..., fn)P(~I|f1,..., fn)=P(f1,..., fn|I)*P(I)P(f1,..., fn)P(f1,..., fn|~I)*P(~I)P(f1,..., fn)=P(f1,..., fn|I)*P(I)P(f,..., fn|~I*P(~I)=P(I)P(~I*P(f1,..., fn|I)P(f1,..., fn|~I)=Oprior*LR(1,...,fn) where I is a binary variable representing interaction, ~I represents non-interaction, f1 through fn are the features we are considering, Oprior is the prior odds ratio and LR is the likelihood ratio.  If the features considered are independent, the likelihood ratio LR can be calculated as the product of the individual likelihood ratios with respect to the features considered separately.  If the features are not independent, all possible combinations of all states of these features must be considered, which can be computationally quite intensive.  In the independent case, the likelyhood ratio can be calculated as: LR(f1,...,fn)=[P(f1,..., fn|I)P(f1,..., fn|~I)]=Πi=1n[P(fi|I)P(fi~I)]  The likelihood ratios for the different features considered can be estimated by evaluating the ratio of the proportion of interacting and non-interacting proteins for which a particular state of the feature is true in the training set (i.e. by determining to which bin of the feature the protein pair belongs, for every protein pair in the positive and negative training sets).  More precisely, the training step consisted of calculating the respective proportions of positive and negative examples that fall into each bin of the feature(s) considered (i.e. that have a particular state).  The likelihood ratio of interaction for a given state is simply the ratio of the proportion of all positives that have that state divided by the proportion of all negatives that have that same state.  When a particular state of a feature occurs only in positive examples (known interacting proteins), the likelihoods are set to the highest non-infinite value of any state for that feature (to avoid infinite values).  Additionally, when no data are available for a specific feature (for a given pair of proteins), the likelihood of the feature is set to 1.0.  For a detailed calculation of the likelihoods see Additional File 4.

Prior odds ratio estimate

The prior odds ratio (Oprior) is difficult to estimate because we do not know all the true interactions, even for a small subset of proteins.  The prior odds ratio of interaction for yeast was estimated by combining all protein-protein interactions (but only those related to direct physical interactions, and no entries derived by synthetic lethal-type experiments) from the BIND, DIP and GRID databases 656669.  This subset of interactions contains 36466 distinct interactions involving 5202 distinct proteins, thus resulting in a prior odds ratio of 1/370.  This is most likely a conservative estimate since a certain proportion of interactions remain unknown and so when more data become available, the prior odds ratio will increase.  For human proteins, 12191 distinct interactions were recovered, involving 5164 human proteins from the September 2005 version of the HPRD 15 and 26896 distinct interactions involving 7531 human proteins from the June 2006 version, leading respectively to prior odds estimates of 1/1093 and 1/1053.  However, taking the subset of 5164 proteins from the September 2005 version that are seen in the June 2006 version (20842 distinct interactions), gave a prior odds of interaction estimate of 1/639.  Thus, between the two releases of the HPRD, there was a large increase in the number of interactions for this subset of proteins and this is likely to continue for at least the next few releases.  Accordingly, it is reasonable to conclude that there are not enough known human interactions to calculate a realistic and stable estimate of the prior odds ratio of interactions for human.  As a consequence, a prior odds ratio of 1/400 was used for all work in the paper, which is similar to the estimate for yeast and is likely still an underestimate of the true value.

Features

Seven distinct features combined into five modules were investigated as summarized in Table 1 and described below.

1.

Expression module  Expression data were downloaded from the Gene Expression Omnibus 70.  The GDS596 dataset was used which examines gene expression profiles from 79 physiologically normal tissues obtained from various sources 71.  Expression data were recovered for 10642 distinct transcripts in 158 different arrays (2 arrays per tissue).  Pearson correlations were calculated for all 56620761 transcript pairs and correlation values were grouped into 20 bins of increasing co-expression.

2.

Orthology module  Orthology maps between human and yeast, worm and fly were downloaded from the InParanoid database 72.  Interaction datasets for model organisms were downloaded from the BIND 66, DIP 65 and GRID 69 databases.  Orthology interaction data were classified into 13 bins.  High, medium and low confidence bins were defined for human protein pairs that have interacting orthologs in either yeast, fly or worm (for a total of 9 bins).  The high confidence bins were populated by human protein pairs that have interacting orthologs that both achieve an InParanoid score of 1 (i.e. both proteins involved in an interaction in another organism are respectively the best orthology match for the two human proteins under consideration).  The medium confidence bins were populated by human protein pairs that have interacting orthologs but only one of the interacting orthologs has an InParanoid score of 1.  The low confidence bins were filled by human protein pairs that have interacting orthologs according to InParanoid but neither achieves a score of 1 (i.e. neither is the best match for the two human proteins under consideration).  The orthology module has four additional bins: two bin for human pairs that have interacting paralogs in human (a medium and a low confidence bin which use the same definition as above for the model organisms), one bin for human pairs that have interacting homologs in more than one organism (these can be orthologs in yeast, worm or fly, or paralogs in human) and one bin for human pairs that have only non-interacting orthologs.

3.

Combined module  This module incorporates three distinct features in a non-naive Bayesian framework: subcellular localization, domain co-occurrence and post-translational modification co-occurrence.  Subcellular localization  PSLT (Protein Subcellular Localization Tool) subcellular localization predictions 54 were used to classify protein pairs in one of four groups: pairs of proteins predicted to be in the same compartment, pairs of proteins predicted to be in neighboring compartments (cytosol-nucleus, endoplasmic reticulum-Golgi, Golgi-cytosol, cytosol-plasma membrane, and plasma membrane-secreted), pairs of proteins predicted in different non-neighboring compartments and pairs of proteins for which there were no localization predictions.  Neighboring compartments were chosen as compartment pairs sharing a high proportion of proteins, as investigated previously 54.  Co-occurrence of domains  The chi-square test was used as a measure of the likelihood of co-occurrence of specific InterPro domains and motifs 73 in protein pairs.  Chi-square scores were calculated for all pairs of domains/motifs that occurred in the training data and were then grouped into 5 bins of increasing value.  Additionally, Pfam 74 domain pairs known to interact from three-dimensional structures 75 were included in the highest Chi-square score bin.  When protein pairs contained more than one domain pair, the domain pair assigned to the highest Chi-square score bin was used to assign a likelihood of interaction.  Post-translational modification (PTM) pair co-occurrence  Likelihoods were assessed using a PTM pair enrichment score calculated as the probability of co-occurrence of two specific PTMs in all pairs of interacting protein pairs divided by the probability of occurrence of both of these PTMs separately: PTM_score=P(PTM[i],PTM[j]|I)P(PTM[i]|I)*P(PTM[j]|I) where PTM[i] and PTM[j] are distinct PTMs and I is the set of all interacting proteins that were used to train the predictor.  The annotations of PTMs in human proteins were downloaded from UniProt 76 and HPRD 15.  PTM instances described as "predicted", "probable" or "possible" were excluded, leaving 3439 distinct proteins with PTM annotations in the training set.  The PTM pair enrichment scores were grouped into 4 bins of increasing value.  The localization, co-occurrence of domains, and PTMs were considered simultaneously to measure their predictive power in assessing the likelihood of protein interaction.  To do this, all possible combinations of the 4 localization bins, 5 chi-square domain-co-occurrence bins and 4 PTM_score bins were investigated and are referred to as the combined module.

4.

Disorder module  It has been suggested that unstructured regions of proteins are often involved in binding interactions, particularly in the case of transient interactions 77.  Protein intrinsic disorder was predicted for all proteins considered by using the VSL2B predictor 78.  The disorder score for protein pairs was then calculated as the sum of percent disorder for each protein of the pair.  Disorder scores were grouped into 6 bins of increasing value.  The Expression, Orthology, Combined and Disorder modules are referred collectively as the Group A modules.  Likelihood ratios for each of the Group A modules are illustrated in Figure 1A (see Additional File 4 for complete likelihood ratios for every possible state of these modules and for detailed calculations of these likelihood ratios).

5.

Transitive module  The transitive module works on the premise that a pair of proteins is more likely to interact if it shares interacting partners.  It does this by considering the local topology of the network predicted by the integration of the Group A modules as depicted in Figure 2.  Thus, the transitive module takes as input the product of the likelihood ratios of all other modules considered by the predictor (as illustrated in Figure 1B).  For each pair of proteins in the training set, the product of the likelihood ratios from all other modules (referred to as the preliminary score (PS) in Figure 1) was calculated for all protein pairs neighboring the pair (i.e. all protein pairs which involve one protein from the initial protein pair under study and for which it is possible to calculate such a score).  All preliminary scores above 10 were kept.  This parameter was determined empirically.  A neighborhood topology score T was then calculated as follows:

Transitive module hypothesis

Transitive module hypothesis.  The Transitive module investigates whether two proteins (such as i and j) that share many common interactors and have few additional interactors that are not common to both proteins are more likely to interact than two proteins (such as i' and j') that share few common interactors.  T=ΣeϵEcse1+|Ei\Ec|+|Ej\Ec| where Ec is the set of edges that connect proteins i and j to their common interactors, Ei is the set of edges that involve protein i, se is the score (likelihood ratio) of edge e and Ei\Ec refers to the set difference of Ei and Ec.  For a given set of neighbors, T increases as the interactions with these neighbors become more likely (as the sum of se increases).  Additionally, the topology score T of a pair of proteins increases as the proportion of likely interactors that these two proteins share increases.  The topology scores were grouped into 5 bins of increasing value.  It should be noted that the neighborhood topology score calculated for a given protein pair does not consider the preliminary score assigned to that protein pair.  It only considers the preliminary scores of its neighbors and so is truly based on the local network topology around that protein pair.  Accordingly, the likelihood ratio the transitive module outputs for a given protein pair is independent of the likelihood ratio calculated by the Group A modules for this same protein pair.

Correlation analysis

The Pearson correlation between pairs of modules was estimated by taking 150 samples of 10000 protein pairs each and calculating the Pearson correlation of the likelihood ratios for the two modules considered, for each sample.  The reported correlation values are the average of the 150 experiments.  Samples of the protein pair space were taken instead of considering the whole space as this was more computationally tractable.

Accuracy measurements

The accuracy of the predictors was measured by performing five-fold cross validation experiments in which the datasets were randomly divided into five non-overlapping sets, four of which were used to train the predictor while the fifth was used to test the prediction accuracy.  The accuracy reported is the average measured for all combinations of training and testing sets using these five sets.  Testing was done by predicting the total likelihood scores for all protein pairs in the test set using the models computed in the training phase and then counting the number of pairs that were well predicted.  We used the area under partial ROC curves as a measure of accuracy.  Receiver operator characteristic (ROC) curves plot the true positive rate versus the false positive rate over their full range of possible values.  In some circumstances, it is more informative to use partial ROC curves (ROCn curves) which illustrate the number of true positives identified by the predictor that score higher than the n highest scoring negatives, plotted for all values from 0 to n.  There are many more negatives than positives in our datasets and this is also thought to be true for the full protein interaction networks we are modeling.  Since the aim is to identify the largest number of true interacting pairs while leaving out as many non-interacting pairs as possible, it is most informative to measure the performance of the predictor under conditions of very low false-positive rates.  Accordingly, ROC50 and ROC100 curves were analyzed because given the size of the datasets, these curves consider all the protein pairs predicted to have a posterior odds ratio above 1.0, for all the predictors investigated.  The area under ROC curves is often used as a summary measure of accuracy.  For ROCn curves, it can be calculated as AUC ROCn = 1nT*(Σi=1nTi) where i takes on values from 1 to n, T is the total number of positives in the test set and Ti is the number of positives that score higher than the ith highest scoring negative.

A novel ensemble learning method for de novo computational identification of DNA binding sites

Abstract

Background

Despite the diversity of motif representations and search algorithms, the de novo computational identification of transcription factor binding sites remains constrained by the limited accuracy of existing algorithms and the need for user-specified input parameters that describe the motif being sought.

Results

We present a novel ensemble learning method, SCOPE, that is based on the assumption that transcription factor binding sites belong to one of three broad classes of motifs: non-degenerate, degenerate and gapped motifs.  SCOPE employs a unified scoring metric to combine the results from three motif finding algorithms each aimed at the discovery of one of these classes of motifs.  We found that SCOPE's performance on 78 experimentally characterized regulons from four species was a substantial and statistically significant improvement over that of its component algorithms.  SCOPE outperformed a broad range of existing motif discovery algorithms on the same dataset by a statistically significant margin.

Conclusion

SCOPE demonstrates that combining multiple, focused motif discovery algorithms can provide a significant gain in performance.  By building on components that efficiently search for motifs without user-defined parameters, SCOPE requires as input only a set of upstream sequences and a species designation, making it a practical choice for non-expert users.  A user-friendly web interface, Java source code and executables are available at http://genie.dartmouth.edu/scope.

Backgound

The computational discovery of DNA binding sites for previously uncharacterized transcription factors in groups of co-regulated genes is a well-studied problem with a great deal of practical relevance to the biologist, since such binding sites provide targets for mutational analyses (for reviews see 123).  The position-specific variability of transcription factor binding sites makes their de novo identification challenging.  Many computational motif finding methods are based on the observation that transcription factor binding sites occur more often than expected by chance in the upstream regions of the set of genes regulated by the same transcription factor 1.  The problem thus simplifies to the identification of overrepresented motifs in a given set of upstream sequences.  Motif finding programs rely on a search algorithm to optimize a motif model (an abstract representation of a set of transcription factor binding sites).  Most recent programs represent motifs as position weight matrices (PWMs), which record the frequency of each base at every position in the motif.  Other motif finding programs have relied on the use of consensus motif models (in which every base is represented by a letter of the 15-letter IUPAC code, which accounts for degeneracies as well as single bases) or k-mismatch motif models (in which a non-degenerate word with at most k allowed mismatches is used to represent the word).  Regardless of the motif model used, a search for all overrepresented motifs of any length and degree of degeneracy leads to a dauntingly large search space.  Thus, motif finding algorithms restrict their search space by using simplified motif representations, employing heuristic search strategies that are prone to local optima, or invoking additional parameters to limit the search space and thereby pass some of the optimization process off to the user 3.  Program parameters (such as motif length, number of occurrences and orientation) that cannot be reasonably specified by the user without prior knowledge about the true binding sites are referred to as nuisance parameters 4.  Selection of the correct settings for these parameters is a crucial step in motif finding, and is often assumed to be the domain of experts.  In a recent evaluation, Hu and colleagues 4 compared the performance of five motif finders on a single prokaryotic genome, systematically exploring the effects of nuisance parameters, including expected motif length and number of occurrences.  Every motif finder they tested was found to be sensitive to values used for these parameters.  Guidance on the specific parameter settings to use for given motif finding situations is not provided in most publications presenting motif finders.  Even assuming that optimal parameter settings exist for a motif finding program for each specific situation, for the typical biologist looking to identify motifs in a set of uncharacterized sequences, acquiring such expertise is an onerous task.  Nuisance parameters complicate the interpretation of performance comparisons as well.  A recent large-scale performance comparison between thirteen different motif finding tools used expert knowledge in setting the parameters for every program 5.  Several of the programs contributing to the performance comparison were run with different parameter settings for each regulon, and in some cases, motifs were hand filtered as a post-processing step.  Such performance comparisons evaluate not just algorithms but also the expertise of the users, making it difficult for a first-time user to select a motif finder on a principled basis.  A key result of the Tompa, et al. study was the finding that all of the motif finders had roughly the same average performance under a wide range of conditions and test statistics 5.  This finding was particularly notable because the motif finders studied employed a wide range of motif representations, scoring functions and search strategies and all were operated under the most favorable conditions possible.  Although the average performance of the programs did not differ significantly, the authors found that, for each pair of programs, each program performed better than the other on some subset of the data 5.  Previous studies over smaller numbers of motif finders have found that no program clearly stands out as superior to the others and each program outperforms all others on some subset of the regulons 678.  This diversity of performance has led a number of authors to speculate that ensemble methods, comprising multiple motif finders, may lead to improvements in accuracy 158.  Ensemble methods, well known in the machine learning community 9, are typically composed of multiple methods comprising different search strategies (or the same search strategies with different initiation settings or random restarts) with a unified objective function.  The final predictions are chosen from the ensemble of methods by a learning rule, which may be as simple as finding the maximum score from all the methods, or as complex as optimizing a weighted scoring scheme from among the methods.  The construction of this learning rule is key to the performance of an ensemble learning method, as the performance of an ensemble method with an ineffective learning rule will be the average of the performance of its component algorithms.  In this context, we note that Tompa et al. 5 found that, although every motif finding program tested had some regulons on which its performance was clearly superior, it was not possible a priori to predict which motif finder represented the best choice under any given set of conditions 5.  This observation serves to illustrate the challenges to the construction of an effective learning rule.  To the best of our knowledge, only one study to date has explored ensemble learning in motif finding.  Hu, Li and Kihara 4 described a simple ensemble method wherein the component programs were random restarts of the same stochastic algorithm (such as Gibbs sampling or Expectation Maximization) and the learning rule was a voting scheme in which the results of each random restart cast a "vote" for which positions in the DNA sequence should be part of the final reported motif (hereafter, we refer to this as the HLK method).  Under this scheme, the authors found that ensemble learning resulted in an increase in performance ranging from 6 to 45%.  The HLK voting method provides a framework wherein a number of different motifs finders can be combined under the heuristic that if several motif finders make the same (or overlapping) prediction, then that prediction is accurate.  Here we present a novel ensemble motif finder based on a different conceptual approach.  Rather than randomly restarting the same search algorithm or comparing multiple search strategies that all search for the same global optimum (and are potentially vulnerable to the same local optima), our algorithm assumes that the "biological significance surface" primarily consists of three local optima, and that one of these peaks represents the global optimum.  Thus, our ensemble uses three specialized algorithms whose search spaces restrict them to each of these three local optima (BEAM for non-degenerate motifs, PRISM for degenerate motifs and SPACER for bipartite motifs).  We have previously demonstrated that the greedy search strategies employed by each of these methods allow them to reliably search their respective motif domains without the use of nuisance parameters, as the algorithms themselves efficiently optimize the parameters that are typically forced on the users 101112.  The results of these component algorithms are then combined using a learning rule that is simply the maximum score returned by each component algorithm.  To make comparisons possible, the motif scores returned by each algorithm are penalized according to the complexity of the motif.  The resulting ensemble algorithm, SCOPE, has no nuisance parameters and performs significantly better than its component algorithms.  In addition, we find that SCOPE performs favorably compared to a diverse range of existing methods and is robust to the presence of extraneous sequences in its input.

Results

Algorithm

SCOPE takes as input a set of sequences U that are upstream of a set of genes G that are thought to be coregulated.  The ultimate goal of a motif finder is to identify the specific subsequences U in U that act as binding sites for the transcription factor(s) that regulate G.  In practice, sets of binding sites are represented using a motif.  We have found that simple consensus motifs over the full IUPAC alphabet (a 15-letter code consisting of the bases A,T,C,G and all possible combinations) provide enough representational power to adequately describe U, while still allowing for an efficient search 34.  While alternative representations, such as position weight matrices (PWMs) are more expressive, their heuristic searches are prone to local optima and often do not perform well in practice @34111213@.  SCOPE has three component algorithms, BEAM, PRISM and SPACER, which search for non-degenerate, short degenerate, and long, highly degenerate and "gapped" motifs, respectively (Figure 1).  Each motif is scored considering one or both strands and the motif is marked to indicate which calculation scores higher.  The results of the three algorithms are merged and sorted.  Artifactual motifs, whose significance can be accounted for by higher scoring motifs that they overlap, are identified and removed (for details, see Additional file 1, section S1).

Flow diagram for SCOPE

Flow diagram for SCOPE.  BEAM and SPACER are run independently; PRISM runs on the top 100 motifs output by BEAM.  For yeast (whose upstream regions are standardized to 800 bp), BEAM and PRISM use the overrepresentation-KS objective function (so/ks), while SPACER's slower running time requires the simpler overrepresentation objective function (so).  The top 5 motifs from SPACER are rescored using the combined objective function.  For bacteria and Drosophila, upstream regions are defined to be the intergenic region upstream of each gene; thus, the KS objective function is not used.  The results of each program are sorted by Sig and lower scoring motifs that substantially overlap higher scoring motifs are removed.  The filtered lists of motifs from the three programs are finally merged by Sig score.  Repetitive motifs are identified and removed during all stages.  Each of SCOPE's three component algorithms seeks to maximize the same objective function over a different class of motifs.  Let M be a random variable over the full space of IUPAC words.  The statistical significance p(M = m) of a particular word m is determined by the distribution of M over the entire space of upstream sequences in the given species.  In general, we seek to maximize -log(p(M = m)).  All values of M are not, however, equally likely a priori.  For example, it is quite likely that there exists an extremely long sequence that is entirely unique to U.  (Such a unique sequence would appear to be highly significant), until we consider that we have in effect searched all possible sequences until we found one that is unique.  To correct for this multiple hypothesis testing problem, van Helden et al. 14 proposed using a Bonferroni correction, in which p(M = m) is penalized by the number of motifs N of length |m|:  Sig = -log(p(M = m)·N).  Thus, if m = "ACGT", N = 44.  We employed this same definition of Sig for BEAM, our algorithm that searches for non-degenerate motifs 10.  Defining N for degenerate or bipartite motifs raises a significant conceptual challenge.  Van Helden et al. 14 chose to use the same definition, but limited their search to a small number of degenerate bases.  In contrast, we have proposed that all characters should not be treated equally, but should be penalized in proportion to the information provided by them 1112.  By this logic, "ACGT" will not be penalized differently from "ACNNNNGT", as both have the same number of bases that contribute any information to protein-DNA binding.  Building on this intuition, one can argue that the characters "A" and "not-A" (IUPAC character "B") are roughly equivalent, while "A or G" (IUPAC character "R") is different from "A" as there are six ways to define a combination of two bases, while only four ways to define a combination of one base or three bases.  For motif m = m1m2...mn, we can therefore define N = ∏ Choose(4, |mi|), where |mi| is the number of DNA bases covered by the IUPAC character mi.  In the case were both orientations of the motif are considered, this number is adjusted to account for palindromes.  The resulting Sig score thus penalizes motifs based on their length and degeneracy, enabling fair comparisons to be made between different motif classes.

Testing  Evaluation of objective functions used by SCOPE

Each component algorithm in SCOPE efficiently searches its restricted search space, keeping SCOPE's runtime low (average runtime on our datasets were about one minute).  This efficiency allowed us to explore several objective functions for scoring the statistical significance p(M = m) of motifs.  These objective functions were as follows: position bias (based on the Kolmogorov-Smirnov, or KS, statistic), overrepresentation (a Poisson-based measure based on how often a motif occurs in U) and coverage (a Poisson-based measure based on how many upstream sequences contain the motif).  For precise definitions, see Methods.  To establish which objective function (or combination of functions) was most suitable, we tested each objective function independently of SCOPE, using a subset of the S. cerevisiae dataset.  The measure used to assess the biological relevance of a motif was accuracy, a measure of the nucleotide level overlap between a motif and the known binding sites (for details see Methods).  From each regulon from the SCPD database 15 we selected ten six-mers at random from the upstream sequences and ten six-mers at random from the collection of known binding sites for that regulon.  For each of these sampled six-mers, we calculated accuracy with respect to the known binding sites.  We also calculated the Sig score for each six-mer, using four objective functions (KS, overrepresentation, coverage and combined KS-overrepresentation).  We then plotted Sig versus accuracy for each objective function, to determine which objective functions correlated most strongly with biological relevance (Figure 2).

Correlation between accuracy and Sig scores

Correlation between accuracy and Sig scores.  Non-degenerate 6-mers from S. cerevisiae were scored according to Sig scores based on (a) Overrepresentation, (b) Overrepresentation-KS, (c) Coverage and (d) KS metrics of statistical significance.  The 6-mers were randomly sampled from both the upstream regions and the known binding sites to ensure coverage or a wide range of accuracy.  The x-axis plots the Bonferroni-corrected and log2 transformed Sig score for each metric.  The red lines indicate the 95th Sig percentile.  These plots demonstrate that overrepresentation is a closer approximation to biological relevance than coverage or KS alone.  Adding KS to overrepresentation modestly improved the correlation by 13% (as compared to overrepresentation alone) to R2 = 0.28.  To assess the degree of class separation achieved by the two objective functions, we ranked the sampled six-mers by Sig score, and calculated the percentage of motifs with high Sig scores (in the 95th percentile and above) that possessed a reasonable degree of overlap with the known binding sites (accuracy ≥ 0.10).  By the overrepresentation measure, 74.4% of high scoring motifs had accuracy = 0.10, while 79.1% of high scoring motifs by KS-overrepresentation had accuracy ≥ 0.10.  This analysis suggests that more complex objective functions may provide a better estimate of biological significance than the overrepresentation objective functions commonly used.  We thus chose to run SCOPE using the overrepresentation-KS combined objective function on the S. cerevisiae dataset, in which the upstream regions are of fixed length.  We used the overrepresentation objective function for the other species, as our upstream definitions for those species were of variable length due to the available annotations.  Because identifying the genomic positions of highly degenerate bipartite motifs is prohibitively slow, initial rankings of motifs for SPACER were computed using the overrepresentation objective function, and the overrepresentation-KS objective function was used only to produce the final ordering and scores.  Although the KS objective function is computationally expensive (linear in the frequency of the motif in the genome), the SCOPE algorithms all aggressively limit the search space, thereby making the use of this objective function – and exploration of other complex objective functions – possible.  The surprisingly low correlations between Sig and accuracy may indicate that the objective functions employed by motif finding programs are only a first approximation to biological significance.  Indeed, previous studies have reported little or no correlation between the significance measures of various motif finders and measures of accuracy 416.  Further research into more biologically accurate objective functions may yield better performance for motif discovery algorithms.

Evaluation of SCOPE performance and ensemble learning scheme

We first assessed the performance of the optimized SCOPE framework on synthetic datasets (for details, see Additional file 1, section S2).  SCOPE performed well on the synthetic datasets, correctly identifying 92% of planted motifs that are over-represented relative to background (those motifs with a Sig score of greater than 5; Figure 3).

Performance at different overrepresentation Sig values on synthetic data

Performance at different overrepresentation Sig values on synthetic data.  A motif was "found" if the top scoring motif returned by SCOPE overlapped the planted motif by at least 50%.  Different Sig values were achieved by varying the number of upstream regions, the number of motifs per upstream region, and the number of extraneous upstream regions without planted motifs.  A Sig value of 0 implies that one motif of that significance is expected by chance.  While synthetic test sets are useful in algorithmic development and initial testing, the results of such tests must be taken with a grain of salt, as they are highly dependent on the model used to generate the test sets 6.  We therefore tested SCOPE on an extensive array of regulons with known binding sites (for details of datasets, see Additional file 1, section S3).  We ran SCOPE on each regulon and, following the scoring methodology used by Sinha and Tompa 6, we computed the accuracy for each of the top three motifs reported by SCOPE against the known binding sites.  The motifs reported by SCOPE overlap to a large extent with the published cis-regulatory elements (as discussed in Additional file 1, section S3, a difference of one base pair length between the reported motif and the published cis-regulatory element results in an expected accuracy of about 0.25).  SCOPE was run on 78 regulons from S. cerevisiae, B. subtilis, E. coli and D. melanogaster.  On these datasets, SCOPE's average accuracy was 0.28, 0.29, 0.16, and 0.08 respectively.  SCOPE's reported accuracy was significantly higher than any of its component algorithms (Table 1).  Indeed, we found that SCOPE increased accuracy by 31–44% over BEAM, PRISM or SPACER alone.  This improvement was achieved by combining BEAM's high positive predictive value (PPV) with PRISM's high sensitivity (Figure 4).  Sensitivity is defined here as the fraction of the known binding sites (at the nucleotide level) predicted by the motif finder, and PPV is defined as the fraction of nucleotides predicted by the motif finder that correspond to the known binding sites (see Methods for details).

Summary results for performance comparisons between SCOPE and its component algorithms, on all regulons.

A "Win" is a regulon for which a program had the highest accuracy and that accuracy was at least 0.10.  Programs in a two-way tie are credited with 0.5 wins each, so by construction, SCOPE can at best share a win with one of the other programs.  A perfect winner-take-all ensemble method would have the same number of wins as all the component algorithms combined.  A "clear win (loss)" is a regulon for which SCOPE's accuracy was at least 0.10 higher (lower) than the other program.  The p-value reported for the paired t-test was Bonferroni-corrected to account for multiple (three) comparisons.

Average and standard error of sensitivity and PPV for the component algorithms of SCOPE on all 78 regulons

Average and standard error of sensitivity and PPV for the component algorithms of SCOPE on all 78 regulons.  Bars represent standard error.  An ensemble motif finder with a learning rule that is no better than random will provide an accuracy that is equal to the average of its three component algorithms.  To provide a basis for evaluating the performance of SCOPE's learning rule, we constructed an ensemble learning method (referred to here as BASELINE) from the results of BEAM, PRISM and SPACER, by randomly selecting one of the accuracies from these three programs for each regulon.  Over 120,000 trials, BASELINE's average performance on this dataset was 0.176 with a standard deviation of 0.013.  BASELINE's average score never exceeded that of SCOPE (p < 8.25 × 10-6).  When compared to its component algorithms, SCOPE picked the highest accuracy motif in 66% of the cases (as opposed to 33% for a random selection between three algorithms).  These results suggest that SCOPE's learning rule is highly effective, though it may certainly be improved further.  Of course, SCOPE's learning rule is extremely simple, and more complex learning rules may allow SCOPE to approach its theoretical upper bound.  One rule that may prove effective is to weight the output of each algorithm according to (for example) the frequency of occurrence of each class of motif (non-degenerate, short degenerate or long degenerate) in the species or by learning the appropriate weights on a representative training set, creating, in effect, a Naive Bayesian Network.  The training of a more complex learning rule must, however, be performed in a cross-validation framework, and the size of the available dataset of regulons will place a practical limit on the complexity of the learning rule that can be devised.

Comparison with other motif finding programs

To provide a frame of reference for SCOPE's performance, we ran ten other popular motif finders on these datasets (for details and references see Table 2).  We ran all programs directly from their websites, leaving all parameters at their defaults.  The only parameter that we specified (where available) was the species from which the background sequences were derived.  Thus, the results of this performance comparison may be interpreted as a comparison against other motif finders when those motif finders are run using their default values.

Motif discovery algorithms used in the performance comparison.

Nuisance parameters are parameters that cannot be precisely defined without knowledge of the true binding sites (such as motif length, number of occurrences and orientation).  For MotifSampler and wConsensus, the lower part of the range indicates required parameters, while the upper part indicates the total number of parameters, including "power user" parameters that the program authors stress should typically be left as default.  Motif model abbreviations: cons = consensus; PWM = position weight matrix; mis = consensus with predefined number of allowed non-position-specific mismatches.  SCOPE has no user-adjustable parameters, although its component algorithms do contain a number of internal parameters "("hyperparameters")" that govern their search over common nuisance parameters.  On synthetic datasets, we found SCOPE's component algorithms to be quite robust to the settings of these hyperparameters.  We have therefore fixed those parameters to reasonable values and do not expose them to the user 101112.  This construction means that SCOPE can only run in a default configuration.  We compared the motif finding programs using the criteria set forth in Sinha and Tompa, including average accuracy and the number of total wins (highest accuracy on a regulon, where that accuracy is at least 0.1) 6.  On this dataset, SCOPE had the highest score by both criteria (Figure 5a).  The cumulative distribution of accuracy shows that SCOPE had the most high-scoring motifs at every level (Figure 5b).  When we looked at the number of clear head-to-head wins (such a win is taken to occur when the difference in accuracy between SCOPE and another motif finder is greater than 0.1 6), we found that SCOPE scored a clear majority (82%) of clear head-to-head wins (Figure 5c).  The average accuracies of BEAM, PRISM and SPACER on this dataset were similar to those of the ten other programs.

Performance comparisons

Performance comparisons.  (a) Mean and standard error of accuracy for each of 78 regulons.  (b) Cumulative distribution of accuracy for each program.  (c) Fraction of regulons with a clear outcome (margin of difference in accuracy between two programs was greater than 0.10) won by SCOPE.  Program abbreviations and details in Table 2; performance details in tables S1 and S2 in Additional file 1.  A formal statistical analysis found that SCOPE's performance margin over the other motif finders run on this dataset was statistically significant at p < 10-5 (for details, see Additional file 1, section S3).  Corroborating the results of previously published performance comparisons @14567@, none of the other programs showed a statistically significant difference relative to the other nine.  Similarly, none of SCOPE's component algorithms outperformed the other ten programs on this dataset by a statistically significant margin.  SCOPE's high accuracy was a reflection of both high PPV and high sensitivity (Figure 6a; see Methods for a precise definition).  By these measures, SCOPE was the only program that scored highly in both sensitivity and PPV (ranking first and second respectively).  In contrast, none of the other motif finders that performed well by one criterion performed well by the other, as shown by the average ranks for each motif finder over both sensitivity and PPV (Figure 6b).

(a) Average and standard error of sensitivity and PPV for each program on all 78 regulons

(a) Average and standard error of sensitivity and PPV for each program on all 78 regulons.  In cases where the program failed to return a result, the sensitivity is 0 and the PPV is undefined.  Cases where a program did not support the species were not included.  (b) Ranks on this plot were computed by taking the average of sensitivity and PPV ranks for each program.

Performance in the presence of extraneous upstream sequences

In practice, microarray co-expression data are often used to identify genes in a particular regulon.  This approach identifies genes that are either directly or indirectly regulated by the transcription factor of interest.  Therefore, sets of genes identified from co-expression data may often contain multiple extraneous upstream sequences.  Adding sequences that do not contain binding sites decreases the signal-to-noise ratio, making motif finding more difficult 4.  We thus tested SCOPE's performance on regulons containing additional extraneous upstream sequences.  For all 33 regulons in the SCPD dataset, we added randomly selected upstream S. cerevisiae sequences such that the total number of extraneous sequences was between 0.5 and 4 times the number of true upstream sequences in the regulon.  SCOPE's accuracy on this dataset was remarkably stable in the presence of extraneous sequences.  Figure 7 shows the aggregate results of this test, with the SCPD regulons divided into three groups based on SCOPE's accuracy on the true regulon.  For each set of regulons, SCOPE's performance decayed gradually as increasing numbers of extraneous genes were added to the regulon.  These results were consistent with the relationship between the Sig score and performance on synthetic datasets (Figure 2).

Robustness of SCOPE performance on S. cerevisiae regulons containing extraneous upstream sequences

Robustness of SCOPE performance on S. cerevisiae regulons containing extraneous upstream sequences.  Increasing quantities of randomly selected upstream regions were added to each regulon.  The bold red line is the average across all regulons, while each of the other lines represent the performance of SCOPE on one-third of the total S. cerevisiae regulons.  The y-axis shows the average accuracy for each group of regulons.  The x-axis shows the ratio of extraneous upstream sequences to bona fide ones.

Discussion

The field of motif finding is saturated with a large number of algorithms representing myriad search strategies, objective functions and motif models.  Yet remarkably, performance comparisons consistently reveal disappointing performance for motif finders and fail to find any statistical significance between them.  A brief survey of the per-regulon results of these performance comparisons yields two key observations: (1) there are many regulons for which a large number of programs find a small portion of the binding sites (though not necessarily the same portion); and (2) every program has a respectable number of "wins" (i.e. every program is the best existing program on some handful of regulons 145678.  Such observations are common in many machine learning applications, and are the direct result of complex search spaces that force restrictions on either the search strategy or the representation of the solution space (in this case, the motif model used to represent the motifs).  For example, YMF and RSAT are guaranteed to find the optimal solutions in their motif space (fixed-length motifs with limited degeneracies), but that space is limited to the point that optimality provides no clear advantage over the other methods.  Conversely, the PWM-based methods have an apparently more powerful motif model 17, but their search strategies cannot guarantee optimality and often terminate at local optima.  The HLK ensemble method 4 successfully exploits the first key observation above.  By running the same (stochastic) algorithm multiple times and using a voting method, those subsequences of the binding sites that are repeatedly reported become clear while the spurious bases are eliminated.  Hu and colleagues report that this method increased accuracy and proposed that their approach may prove effective when running different algorithms as well 4.  The limitation arises, however, in regulons where only one program has a high accuracy and the others fail to find any portion of the binding sites.  In such cases, it is likely that a voting-based ensemble will follow the crowd and fail to find the true binding site.  The second observation, that all motif finders win some number of regulons and often perform roughly the same on average, is broadly consistent with a theorem in the Machine Learning field referred to as the No Free Lunch Theorem 1819.  Briefly, this theorem states that, averaged over all datasets, the performance of all search algorithms are exactly the same, with the corollary that two algorithms will have the exact same number of wins in relation to each other.  In practice, this theorem argues for the use of specialized domain knowledge 20, where available, and may suggest that similar average performance across a diversity of approaches is an indication of the diversity of the datasets themselves.  Thus, motif finders designed for one class of motifs will win on regulons containing those motifs, but will perform poorly on other regulons, while more general motif finders will tend to have more consistently mediocre performance.  In this light, SCOPE can be seen as leveraging the second key observation by embracing the No Free Lunch Theorem: rather than boost average performance by taking the average results of three general purpose algorithms, SCOPE uses highly specialized algorithms and assumes each will perform strongly on some regulons and weakly on others (and that the unified scoring metric can tell the difference).  The working hypothesis is, in effect, that the local maxima are predictable (corresponding to one of three motif classes) and exploitable (we can find the local maxima in each class and choose whichever is higher).  Consistent with this hypothesis, there was very little overlap among the component algorithms of SCOPE (each wins about 20 of the 78 regulons, with very few ties) and, by taking the maximum score from those three local maxima, SCOPE tended to choose the motif with the highest accuracy in a clear majority of the cases (66%, compared to 33% for a random learning rule).  Furthermore, SCOPE not only significantly outperformed its components on this dataset, it also outperformed a number of general purpose algorithms that seek to find the global maximum in a single pass.  Of course, based on the No Free Lunch Theorem, SCOPE's performance averaged over all theoretically possible datasets will still converge to that of the other motif finding approaches (including random guessing).  As the physical properties of transcription factors will inevitably constrain the structure of their binding sites, biologically relevant datasets comprise a subset of the space of all theoretically possible sequences.  Our test set of 78 regulons was selected in a blinded manner (for details, see Additional file 1, section S3), thus these results suggest the generalizability of SCOPE's use of domain knowledge on biologically relevant datasets from these species.  These observations are not offered as definitive proof that there are only three classes of motifs; rather, they show that power can be gained by identifying distinct motif classes and combining specialized algorithms with a unified scoring rule.  It is possible that more power could be gained by identifying other distinct motif classes and adding algorithms that explicitly search for those classes.  For example, Zinc finger transcription factors have been demonstrated to bind three triplets of nucleotides which overlap at their third base positions 21.  This observation could be leveraged by a search algorithm that explicitly searches for motifs matching this unique structure.  Thus, all nondegenerate triplets in a set of upstream regions could be scored and the highest-scoring triplets combined into a single five-mer with a two-base degeneracy (corresponding to the IUPAC characters R,Y, W, S, K or M) at the middle position.  The highest-scoring five-mers could then be combined with the highest scoring triplets to generate a seven-mer with two-base degeneracies at positions three and five.  Provided the appropriate Bonferroni correction is applied for this new class of motifs, these motifs may be easily compared with the results from BEAM, PRISM and SPACER, thereby extending the SCOPE ensemble to include a fourth class of motifs.  We note, however, that as more methods are added to SCOPE, it will be increasingly difficult to devise a scoring metric that can accurately choose the best result from among the components.  SCOPE may also serve as a complementary approach to the HLK method.  For example, the parameters of many methods can be set to search for specific classes of motifs (such as bipartite versus non-bipartite motifs).  Thus, analogous to the ensemble method described in this paper, one may build a hierarchical ensemble that first searches each motif class by the HLK method using a number of algorithms or random restarts, and then uses the SCOPE method to choose the best result from among the motif classes.  One constraint associated with such an approach is the run-time.  A second constraint associated with a hierarchical ensemble learning method is the multiplicative increase in the number of parameters associated with it, though this problem may be ameliorated by the use of parameter-free algorithms that employ restricted search spaces.  An important factor to consider when taking the best of multiple runs is the relative size of the search space.  Certainly to maintain statistical validity, some correction must be made for multiple hypothesis testing.  Furthermore, the effects of multiple testing may bias the results in favor of one of the component algorithms.  To ensure statistical validity and avoid such a bias, we developed a simple Bonferroni-like correction, which penalized every proposed motif proportional to its length and degree of degeneracy, resulting in a modest improvement of 8% in SCOPE's accuracy.  Although our test set of 78 regulons gave us enough power to find significance between SCOPE and its components or other algorithms, it did not provide enough power to disentangle the effects of small improvements (such as the Bonferroni correction, the objective function that takes position bias into account, or scoring motifs based off one or both strands), especially in the rigorous cross-validation framework necessary to decipher precisely which aspects contribute significantly to the performance.  Nevertheless, as larger datasets become available, SCOPE's efficient search strategy makes it an ideal platform for exploring the effect of focused improvements to the motif finding approach described, such as the use of complex objective functions that may better approximate biological significance.  The comparisons to other motif finding programs in this study are provided to place SCOPE's performance in the broader context of the motif finding field, particularly when viewed from the standpoint of the practicing "bench" biologist.  Any performance comparison must be interpreted with caution, since the results are highly dependent on the dataset used, the conditions of the testing and the metrics used for evaluation.  With this in mind, we sought to evaluate a wide representation of motif finders on a large number of regulons using performance metrics consistent with previous studies 67.  To the best of our knowledge, this dataset represents the largest set of biologically relevant regulons used for performance comparisons to date.  Whereas previous performance comparisons attempt to optimize the parameters of the programs in question 467 or allow expert users to tune their own programs and manually filter both the input and output 5 we intentionally made our comparisons between programs without manually optimizing any parameters for any species so as to emulate typical use conditions.  Our comparison thus complements the recent large scale study of Tompa et al., who gauge performance under optimal conditions on semi-synthetic data sets 5, as well as the study of Hu et al., who explore the effect of parameter optimization on a handful of popular motif finders 4.  Although the present philosophy of performance comparison implicitly benefits SCOPE, which has no parameters to optimize, it is arguably the most relevant comparison possible for the typical biologist.  Although previous studies have shown the potential importance of choosing parameters carefully 46, we note that the results we obtained under default settings were quite similar to those reported in previous studies (for details, see Additional file 1, section S3).  Arguably, systematic parameter optimization for each of these programs may well yield higher accuracy scores than those reported here.  However, in order to avoid the pitfall of overfitting to the dataset, such parameter optimization must be performed using cross-validation or some other resampling technique 92223.  We note that all the motif finders tested, including SCOPE, performed poorly on the Drosophila dataset.  Although SCOPE had the highest accuracy on this dataset, that accuracy was significantly less than on the bacterial and yeast data.  Especially poor performance on Drosophila was also reported in the Tompa et al. performance comparison, indicating that this difficulty is not limited to the current dataset 5.  One possible cause of poor performance in this study is that the "regulons" are derived from enhancer regions defined in an earlier computational paper 24.  Whereas a background set of promoter regions is easy to identify, it is not clear how to define a reasonable genomic sample of enhancers.  Thus, the background sequences used by SCOPE and the other programs may not be representative of the "true" background model of enhancers, leading to inaccurate statistics.  The persistently poor performance of motif finders on Drosophila regulons thus highlights the importance of using well-defined background sequences to calibrate the statistics of the objective functions being optimized.  Recently, algorithms have been reported that predict enhancer regions on a genome wide scale [2425262728].  It is possible that using such algorithms to define a collection of background enhancer sequences may improve the performance of SCOPE, as well as that of the other motif finders, on Drosophila.

Conclusion

Ensemble methods hold the potential for providing improvements in motif finding accuracy without resorting to the use of additional data (such as phylogenetic information or characterization of the domain structure of the transcription factor), which are not always available.  Typically, ensemble learning methods are plagued with certain liabilities, such as increased runtimes, logistical complexity and a multiplicity of nuisance parameters, all of which grow with the number of programs run.  In the machine learning field, ensemble methods have coexisted for many years with non-ensemble methods, with no clear superiority having been established between the two.  SCOPE serves as a proof-of-concept, demonstrating an efficient and effective approach to ensemble-based motif finding.  By dividing the search space into tractable domains, SCOPE mitigates the potential liabilities associated with ensemble methods, resulting in a program that is capable of finding cis-regulatory elements of arbitrary length, degree of degeneracy, motif orientation and frequency of occurrence.  Its strong performance, rapid runtime and freedom from nuisance parameters make it a simple and effective tool for the biologist.

Methods

Accuracy, Sensitivity and Positive Predictive Value

Each algorithm's accuracy for each regulon was measured via the Phi score (also referred to as nucleotide level performance coefficient, or nPC, in previous performance comparisons 45611.  This metric, first proposed by Pevzner and Sze 29, measures the degree of overlap between the actual instances of two motifs (or sets of motifs) m1 and m2 in the set of co-regulated upstream sequences.  The Phi score can be defined as follows: let U be a unique numbering of all the bases in the upstream sequences of a given gene set, and IU(m) ⊆ U be the set of bases that are covered by actual instances of m in U.  Phi is then defined as the ratio of the number of bases occupied by the actual instances of both the motifs, to the total number of bases occupied by the actual instances of either of the two motifs:  ΦU(m1, m2) = [IU(m1) ∩ IU(m2)]/[IU(m1) ∪ IU(m2)].  This metric therefore takes both false positives and false negatives into account at the level of the individual bases that are actually covered by the motif.  As in Hu et al. 4, we define accuracy to be the Phi score between the known and predicted binding sites.  Changing the denominator of the Phi equation to be IU(mi) yields the sensitivity (if mi represents the true binding sites) or the positive predictive value (PPV, if mi represents the reported binding sites).  See Additional file 1, section S3, for a discussion on the use of Phi score for measuring accuracy.

Objective functions for Statistical Significance

In line with other motif finders, we have used statistical significance as a surrogate for biological significance.  Since the latter cannot be defined without data that obviates the need for computational motif finding, objective functions that approximate biological significance are critical.  In this section, we detail the objective functions we used and their effect on SCOPE's performance.  For any motif m, each objective function provides a definition for p(m), the probability of observing a motif with the same sufficient statistics as m assuming a particular null model.  This p-value is used in the computation of the Sig score (see Results).

Overrepresentation

The most common statistical test in motif finding is based on overrepresentation, which can be roughly defined as the probability that a motif m that is observed C(m) times in the regulon would occur at least C(m) times in a random collection of the same number of genes.  In the context of consensus motifs, overrepresentation is expressed in terms of a multinomial model, in which each position i in each gene j is a random variable Xij that can take on any motif allowed by the particular motif model.  The probability of seeing m at least C(m) times in the regulon can be approximated by the Poisson distribution: p(m) = ∑k≥C(m) [(λke-λ)/k! ] where λ is the expectation of C(m) with respect to the null motif distribution and the number of positions in the regulon.  A detailed justification of this approach was given by Carlson et al. 11.  The expectation λ is most accurately modeled using Maximum Likelihood Estimators (MLEs) computed as the actual proportion of any given motif in the complete set of all upstream sequences in the genome 10.  These MLEs are implemented as lookups of exact substrings, which can be performed efficiently using a suffix array data structure 101112.

Coverage

A simple modification to the overrepresentation objective function is coverage, which is identical to overrepresentation with the modification that C(m) is the number of upstream regions in the regulon that have one or more instances of m and λ, the expectation of C(m), is determined from the proportion of upstream regions in the genome that contain the motif.  While this objective function prevents a single upstream region from dominating a motif's score, it fails to account for multiple instances of a binding site in a single gene that may arise due to cooperative binding.

Positional bias

Transcription factors often require their binding sites to be located in a restricted range relative to the start of transcription.  One well known example is TBP (TATA-binding protein), which localizes the RNA polymerase complex by binding the TATA-box motif roughly 25 bases upstream of the transcription start site 30.  While other examples of binding sites with positional restrictions are well known, few motif finders incorporate position in their scoring function.  In the case where all upstream regions are the same length, the Kolmogorov-Smirnov (KS) statistic provides a natural test for positional bias.  The Kolmogorov-Smirnov (KS) statistic is a non-parametric statistic that measures the probability that two samples are drawn from the same distribution.  Let X be the sample that we wish to compare to some reference sample Y.  The KS statistic is defined to be the maximum absolute difference between the unbiased cumulative distribution functions of X and Y.  The KS statistic has a well-defined distribution from which a p-value can be easily computed.  Kuiper's variation was used to increase sensitivity in the tails of the distribution 31.  In the context of motifs, we defined the test sample X for a motif m to be the set of starting positions (with respect to transcription start sites) of m in the regulon.  The reference sample Y is defined as the set of starting positions of m in all upstream regions in the genome.  Thus, pKS(m) is a measure of how m is localized differently in the regulon than in the genome as a whole.  It is also possible to define Y as the uniform distribution; however, we found that many motifs had non-uniform distributions throughout all upstream regions of the genome, possibly as an artifact of the non-uniform AT/CG distributions in upstream regions 32.

Combining overrepresentation and positional bias

Since overrepresentation and KS are independent, the probabilities can simply be multiplied together to yield the probability of randomly sampling a motif with a given degree of overrepresentation and positional bias.

Motif orientation

Many transcription factors will bind motifs on either DNA strand.  Others, such as the general transcription factor TBP (TATA-Binding Protein), require a specific orientation and will only function if bound to motifs on a specific DNA strand 30.  In scoring a motif m, a choice must therefore be made as to whether or not the reverse complement mR of m will be considered to be the same motif as m.  Most programs assume motif orientation does not matter and so define m = mR.  Such an assumption may be overly generous – as the TBP example above makes clear, the transcriptional machinery of a cell is clearly able to differentiate between the two strands.  We thus chose to attach a flag to each motif, indicating whether or not the motif should be orientation-neutral.  BEAM and SPACER thus enumerate and evaluate all motifs with both values of this flag.  SCOPE reports that orientation does matter (i.e. m ≠ mR) for 17% of the regulons in our biological test set.

Reuse of structural domain–domain interactions in protein networks

Abstract

Background

Protein interactions are thought to be largely mediated by interactions between structural domains.  Databases such as iPfam relate interactions in protein structures to known domain families.  Here, we investigate how the domain interactions from the iPfam database are distributed in protein interactions taken from the HPRD, MPact, BioGRID, DIP and IntAct databases.

Results

We find that known structural domain interactions can only explain a subset of 4–19% of the available protein interactions, nevertheless this fraction is still significantly bigger than expected by chance.  There is a correlation between the frequency of a domain interaction and the connectivity of the proteins it occurs in.  Furthermore, a large proportion of protein interactions can be attributed to a small number of domain interactions.  We conclude that many, but not all, domain interactions constitute reusable modules of molecular recognition.  A substantial proportion of domain interactions are conserved between E. coli, S. cerevisiae and H. sapiens.  These domains are related to essential cellular functions, suggesting that many domain interactions were already present in the last universal common ancestor.

Conclusion

Our results support the concept of domain interactions as reusable, conserved building blocks of protein interactions, but also highlight the limitations currently imposed by the small number of available protein structures.

Background

One way to understand a protein's function is to look at its composition of conserved domains.  Such families of related sequence regions, collected in the Pfam database 1, usually constitute structurally and functionally conserved modules.  It is assumed that binding interfaces, too, are conserved evolutionary modules that are reused between proteins of different functions and retained during evolution 23.  Therefore, domain–domain interactions are often regarded as the currency of protein–protein interactions.  Based on this assumption, Ng et al. described an approach to predict domain–domain interactions using literature curation, evolutionary history and the distribution of domains in protein interactions 4.  Wuchty et al. compared the relationship between this set of predicted interacting domain pairs to the domain coocurrence network 5.  More recently, other groups have come up with sophisticated statistical methods to estimate putatively interacting domain pairs, based on the assumption of domain reusability 678910.  However, none of these approaches offers structural evidence that the predicted domain pairs are able to form an interaction.  For complexes with known structure, it has been shown that domains can mediate interactions 1112.  Such interactions between pairs of domains are stored in the iPfam database 13.  The structural evidence lends strong support to the inferred domain pair, resulting in a high confidence set of domain pairs.  Unfortunately, the selection of complexes in the PDB database of protein structures 14 is rather small and biased 15.  There is often only a single structure that shows a certain protein pair to interact, while other complexes like haemoglobin have been crystalized dozens of times.  This makes it difficult to assess whether some domain pairs act as reusable modules in protein interactions from PDB data alone.  High-throughput experiments 161718 and extensive literature curation efforts 19 have yielded large databases of protein interactions 2021222324.  Despite the continuing growth of protein interaction databases, even the best studied protein interaction network of S. cerevisiae is thought to be incomplete and inaccurate 252627.  Given that this network already comprises around 60000 interactions, questions arise as to how such networks have evolved and how they are organised.  Furthermore, methods for assessing the quality of high-throughput experimental results are in high demand due to the error prone nature of the methods used.  In this study, we investigate how pairs of protein families taken from iPfam are distributed in experimental protein interactions from five major model species.  This allows us to address a number of questions: what proportion of each organism's protein interaction network, its interactome, can be attributed to a known domain–domain interaction?  How conserved are domain–domain pairs between species, and how many interacting domain pairs are still unknown?

Results

iPfam domain pairs are overrepresented in experimental protein interactions

We analysed the distribution of Pfam families known to interact from a PDB structure (iPfam domain pairs) in experimentally derived protein interactions (experimental interactions).  The experimental interactions were filtered to only include interactions with exactly two partners (see Methods).  The fraction of experimental interactions that contain at least one iPfam domain pair is referred to as the iPfam coverage.  Accordingly, the fraction of experimental interactions that contains any pair of Pfam domains (excluding the iPfam domain pairs) is called the Pfam coverage.  Figure 1 shows the Pfam and iPfam coverage for the analysed species as a column chart.  The number of resolved protein interactions varies greatly between species, as does the size of the underlying proteome (see Table 1).  The Pfam coverage, coloured red in Figure 1, lies between 49.46% and 66.73%.  Given that 74% of all UniProt proteins contain at least one Pfam match, this is not by itself surprising.  The iPfam coverage, shown in blue, is much smaller, ranging from 2.92% in D. melanogaster to 19.02% in H. sapiens.  In S. cerevisiae the species with the most comprehensively studied interactome, the iPfam coverage is 4.47%.

Comparison of coverage of iPfam domain pairs on protein interactions

Comparison of coverage of iPfam domain pairs on protein interactions.  For each species, the height of the column reflects the number of known protein–protein interactions in the data set.  The columns are split according to the proportion of interactions that contain an iPfam domain pair (blue), that contain any other Pfam domains on both proteins (red), and those that contain no Pfam domain pair (yellow).

iPfam domain pair coverage on protein interactions

For each species, we list the size of the proteome as defined in Integr8 and the fraction of this proteome that is represented in the protein interaction sets, followed by the total number of binary protein interactions and the fraction of those that contain an iPfam domain pair.  The last two columns show the results of the network shuffling experiments.  The relatively low iPfam coverage is by itself a disappointing finding.  However, the fact that only a small fraction of protein interactions contain known domain pairs could be a result of the scarcity of available structures of protein complexes.  Therefore, we asked whether the observed iPfam coverage is larger than would be expected by chance.  To test this, we created 1000 random networks per species using the algorithm described in Methods.  We then calculated the iPfam coverage on the protein interactions in each randomised network.  Mean and standard deviations of the randomisation experiments are shown in Table 1.  No P value (see Methods) was greater than 1.84 · 10-06.  This proves that the observed iPfam coverage is significantly higher than expected and iPfam domain pairs are enriched in real experimental protein interactions.

Few iPfam domain pairs are responsible for a majority of the coverage

To understand why iPfam domain pairs occur more often in experimental interactions than expected by chance, we analysed the two largest data sets, S. cerevisiae and H. sapiens in more detail.  In the following paragraph, we will call the experimental interactions that contain an iPfam domain pair the covered experimental interactions.  In Figure 2, we compare the distribution of iPfam domain pairs on the number of experimental interactions for E. coli, S. cerevisiae and H. sapiens.  This plot reflects how many iPfam domain pairs cover how many experimental interactions.  Domain pairs that cluster to the left of the plot can be called specific domain pairs, as they only occur in very few covered experimental interactions.  Conversely, domain pairs that cluster to the right of the plot occur in a large number of different covered experimental interactions and can be called promiscuous domain pairs.

Frequencies of iPfam domain pairs in E. coli, S. cerevisiae and H. sapiens protein interactions

Frequencies of iPfam domain pairs in E. coli, S. cerevisiae and H. sapiens protein interactions.  Each point in this graph represents a set of protein interactions.  The abscissa reflects the number of interactions in each set that contain the same iPfam domain pair.  The ordinate shows the number of distinct such sets, each defined by a different iPfam domain pair.  In both H. sapiens (blue) and S. cerevisiae (green) a small number of iPfam domain pairs covers a large fraction of the interactome, whereas in E. coli, no iPfam domain occurs in more than 4 experimental interactions at a time.  Dotted lines denote fitted monomial functions, showing that the distributions follow a power law.  All three distributions in Figure 2 resemble a power law distribution, according to the good fit of log-linear functions (log(f(x)) = k log x + log a) shown as dotted lines.  The slopes k of the H. sapiens and S. cerevisiae distributions are very similar (-1.53 and -1.60, respectively), while E. coli has a markedly smaller slope (-2.78).  This suggests that the ratio of specific to promiscuous iPfam domain pairs is very similar in S. cerevisiae and H. sapiens, whereas E. coli features fewer multiply reoccurring iPfam domain pairs.  The power law distribution of iPfam frequencies implies that the majority of covered protein interactions can be attributed to a minority of iPfam domain pairs.  51.7% of the iPfam domain pairs in S. cerevisiae and 45.3% in H. sapiens are seen in just one experimental interaction.  Conversely, 92.4% of H. sapiens and 85.4% of S. cerevisiae covered experimental interactions contain an iPfam domain pair that occurs more than once.  Even more, half of the covered experimental interactions in H. sapiens contain an iPfam domain pair that occurs in more than 16 different experimental interactions (5 for S. cerevisiae).

Degree distribution and iPfam domain pair frequency are correlated

We reasoned that if there are iPfam domain pairs that act as reusable modules in protein interactions, then (highly connected proteins should also be more likely to contain promiscuous iPfam domain pairs and vice-versa).  For each node (i.e. protein) in the filtered H. sapiens and S. cerevisiae protein interaction network, we calculated its degree, defined as the number of adjacent edges (i.e. interactions).  At the same time, we counted the number of iPfam domain pairs on the adjacent edges.  In Figure 3, we plot the mean number of iPfam domain pairs relative to the degree of the node.

Average frequency of iPfam domain pairs relative to degree of node

Average frequency of iPfam domain pairs relative to degree of node.  Each point represents a protein in the interaction networks of H. sapiens (blue) and S. cerevisiae (green).  For each protein, we calculate the degree, defined as the number of interactions the protein is involved in.  On the y-axis, we show the average number of iPfam domain pairs in edges adjacent to proteins of degree x.  We calculated a Spearman correlation of 0.68 and 0.71, for H. sapiens and S. cerevisiae.  The correlation is outlined by dotted lines.  We find that for proteins from a degree of 1 to 50, there is strong correlation in both H. sapiens and S. cerevisiae (Spearman correlation coefficients of 0.68 and 0.71, respectively) between degree and number of iPfam domain pairs on adjacent edges.  For the 1.2% of proteins in H. sapiens and 6.4% in S. cerevisiae which have a degree higher than 50, the correlation gradually diminishes.

Promiscuous domain pairs

Additional file 1 contains a list of all iPfam domain pairs and their frequencies in the experimental protein interactions, while Additional file 4 lists the frequencies of the single domains.  Interactions between protein kinase domains (Pkinase, Pfam acc. PF00069 and Pkinase_Tyr, Pfam acc. PF07714) are the most frequent iPfam domain pairs, as well as interactions involving recognition domains such as SH2 or SH3.  In S. cerevisiae, the Proteasome family (Pfam acc. PF00227, a family of peptidases) and WD40 (Pfam acc. PF00400, a repeat involved in multimer assembly) are also amongst the five most frequent iPfam domain pairs.  As expected, more frequent domains are also more likely to be found as pairs in interacting proteins.  It should be noted however that in the PDB structures, some of the observed domain pairs (Pkinase_Tyr - SH3_1, Pkinase_C - Pkinase and others) are only seen to interact within one protein (intrachain interactions) as opposed to interactions between two distinct proteins (interchain interaction).  The table in Additional file 5 lists the number of PDB structures for each iPfam domain pair, distinguishing between intrachain and interchain interactions.  Looking for example at the covered experimental interactions in H. sapiens(Additional file 1), only 8 out of the 100 most frequent iPfam domain pairs are seen in intrachain interactions exclusively, while 61 are exclusive to interchain interactions and 31 are seen in both.  A possible explanation for the occurrence of purely intrachain iPfam domain pairs in the covered experimental interactions is that they frequently cooccur together on the same protein with other iPfam domain pairs.  A list of all combinations of iPfam domains (the domain architecture) on interacting proteins is given in Additional file 2.  It reveals that certain iPfam domains such as SH2, SH3_1 or Pkinase_tyr frequently occur in the same architecture.  Without further experiments, we cannot assign the correct interacting domains with certainty.  This highlights a basic assumption of this study that could be a source of error.  We assume that interacting proteins that contain an iPfam domain pair interact through these domains.  This, of course, is not necessarily the case.  Although it has been shown that sequence similarity is linked to the mode of interaction 28, not every protein interaction that contains an iPfam domain pair is necessarily mediated by exactly this domain pair.  To gain a rough estimate of the false positive rate due to this assumption, we counted how many protein pairs in the PDB contain an iPfam domain pair that does not mediate an interaction in one complex structure but does so in another.  3671 out of a total of 5380 interacting protein pairs from the PDB contain an iPfam domain pair that does not interact in one complex structure but does so in another.  This means that for more than 32% of the protein interactions in the PDB, the iPfam domain pair assignment is correct.  For the remaining 68%, the iPfam domain pair assignments are wrong in one case but correct in another.  The real false positive rate is likely to be smaller, because some iPfam domain pairs might still independently mediate an interaction with a different, possibly unknown, partner protein.

iPfam domain pairs are enriched in S. cerevisiae complexes

We tested whether iPfam domain pairs are enriched in known protein complexes from S. cerevisiae.  This is interesting firstly because domain–domain interactions are thought to be more common in obligate interactions.  Secondly, the described modularity of known S. cerevisiae complexes lends support to the assumption that the underlying iPfam domain pairs are modular.  In fact, we find a two-fold enrichment for iPfam domain pairs in the complexes described by Gavin et al. 29.  From the 294 binary protein interactions in this data set, 24 contained an iPfam domain pair, which corresponds to a coverage of 8.16% (P value 2.7 · 10-47).  We also analysed the full dataset of protein complexes.  From 491 complexes described by Gavin et al., 157 contained at least one pair of proteins with an iPfam domain pair (31.9%).  In total we found 617 pairs of proteins that contained an iPfam domain pair.  Interestingly, we find that the distribution of iPfam domain pairs on complexes is uneven.  When we drew 617 protein pairs randomly from all possible protein pairs in the complexes, we covered 192 complexes on average, with a standard deviation of 7.22.  The probability of covering only 157 complexes is just 6.24 · 10-07.  Thus, some complexes contain a greater number of iPfam domain pairs, while other complexes do not contain any at all.  This suggests that some sets of domain pairs are specific to certain complexes or pathways.  Typical examples are the RNA polymerase II complex (IntAct id: EBI-815049) or the U1 snRNP complex which contain numerous iPfam domain pairs that are specific to these complexes.

iPfam domain pairs are conserved between species

Within the 3 to 19% of experimental interactions covered by iPfam, we analysed the conservation of iPfam domain pairs between species.  We call an iPfam domain pair conserved when the same pair is observed in experimental interactions of two different species.  The matrix in Table 2 shows the pair-wise conservation of iPfam domain pairs.  For each species, a maximum of 40% to 90% of iPfam domain pairs can also be found in another species, although not all overlaps are as large.

Matrix of mutual shared iPfam domain pairs

The Table shows the number of co-occurences of iPfam domain pairs between two species.  The right-most column lists the total number of unique iPfam pairs found in each species' experimental interactions.  Figure 4 shows a Venn diagram of the mutual overlaps between the two eukaryotes S. cerevisiae and H. sapiens and the prokaryote E. coli.  While the eukaryotes share 524 domain pairs, only 158 iPfam domain pairs are shared between S. cerevisiae and E. coli, and only 135 between E. coli and H. sapiens.  Remarkably, 53% of the observed iPfam domain pairs in E. coli are also observed in one of the two eukaryotes, and 107 iPfam domain pairs are even conserved amongst all three species.  The iPfam domains in these pairs are related to housekeeping activities such as translation, replication or ATP synthesis.  Additional file 3 contains a list of the conserved iPfam domain pairs.

Venn diagramm showing the fractions of iPfam domain pairs found in the E. coli, S. cerevisiae and H. sapiens binary protein interaction sets

Venn diagramm showing the fractions of iPfam domain pairs found in the E. coli, S. cerevisiae and H. sapiens binary protein interaction sets.  The three circles represent the iPfam domain pairs observed in the respective species.  The overlaps denote co-observed iPfam domain pairs.  The grey set in the background represents iPfam domain pairs not found in the three species.  We also compared the iPfam domain pair frequencies between H. sapiens and S. cerevisiae directly.  We derive a Spearman correlation coefficient of 0.50 for the frequencies of all 524 iPfam domain pairs that are conserved between S. cerevisiae and H. sapiens.  To test whether the correlation is an artefact of the distribution of the values, we recalculated the correlation 1000 times, each time shuffling one distribution randomly.  From these random results, we derive a P value of 3.6 · 10-30 that the observed correlation is random.  This suggests that iPfam domain pairs with a large number of occurrences in one species tend also to be more frequent in the other.

Predicting the total number of iPfam domain pairs in nature

Our analysis allow us to estimate how many iPfam domain pairs would eventually cover all protein interactions.  This corresponds to the predictions made by Aloy and Russel 2.  Similar to their approach, we make a linear estimation with the following factors:  χS The number of iPfam domain pairs observed in species S  θS The number of observed interactions in species S that contain an iPfam domain pair  ΘS The total number of observed interactions in species S  ψS The number of proteins from species S that are seen in an interaction screen  ΨS The proteome size for species S  ξS The number of Pfam domains observed in all protein of species S  Ξ The total number of known Pfam domains  We denote the estimated number of iPfam domain pairs in species S with x^S.  The formula we apply is x^S=ΨS·ΘSθS·ΨSψS  This means we scale the observed number of iPfam domain pairs to cover all observed interactions.  We then use the relative proteome coverage to estimate the total number of iPfam domain pairs in all proteins.  Finally, we follow the argument of Aloy and Russel that the number of Pfam families seen in species S indicates the fraction of the protein universe represented in the species.  We therefore predict the total number of iPfam domain pairs x^ as x^=x^S·ΞξS.  Both parameters and results of the calculation are shown in Table 3.  The estimates for the total number of iPfam domain pairs ranges from 33813 to 120511, with an average of 76918.

Prediction of total number of iPfam domain pairs

ΘS The total number of observed interactions in species S  θS The number of observed interactions in species S that contain an iPfam domain pair  ΨS The proteome size for species S  ψS The number of proteins from species S that are seen in an interaction screen  ξS The number of iPfam domain pairs observed in species S  The predicted total number of iPfam domain pairs in species S  Ξ The total number of known Pfam domains  ξS The number of Pfam domains observed in all protein of species S  The estimated total number of iPfam domains in all species  Prediction results are shown in bold font.

Discussion

iPfam coverage is low

The coverage of iPfam on experimentally derived protein interactions is low.  For S. cerevisiae, the species with the best mapped interactome, only 4.47% of the protein interactions contain an iPfam domain pair.  Even in H. sapiens, where we suspect a positive bias due to the overrepresentation of disease-related proteins in both the PDB and protein interaction databases, 81% of protein interactions do not contain an iPfam domain pair.  This reveals the limits of our understanding of the molecular structure of protein interactions.  Figure 1 also shows that a majority of protein interactions contains at least one pair of Pfam domains.   While there is no structural information about putative interactions between these pairs, this fraction can already be analysed using statistical methods to identify putative domain interactions 7910.  This in turn creates new targets for future structural genomics projects 30.  Prioritising these targets according to the number of covered experimental interactions could increase the coverage of databases like iPfam quickly.  We find, however, that iPfam domain pairs occur significantly more often in experimental interactions than would be expected by chance.  This requires that at least a subset of the iPfam domain pairs are reused in several experimental interactions.

iPfam domain pairs can act as modules

Despite the low overall coverage, iPfam domain pairs are found in more protein interactions than would be expected by chance (see Table 1).  This statistical overrepresentation suggests that certain iPfam domain pairs constitute modules of molecular recognition which are reused in different protein interactions 2.  In fact, we find a characteristic power law distribution when we plot the histogram of experimental interactions per iPfam domain pair, see Figure 2.  This underlines that a few promiscuous iPfam domain pairs are responsible for the majority of the iPfam coverage.  These iPfam domain pairs are most likely to be reusable modules.  In fact, we find the most frequent iPfam domain pairs to be recognition domains in signal transduction.  Conversely, a large number of iPfam domain pairs are specific to a small number of protein interactions.  This implies that recognition specificity amongst proteins is often achieved by maintaining an exclusive interacting domain pair.  This could pose a problem for purely statistical approaches to infer domain interactions: if for many interfaces the real interacting domain pair will only occur once in an interactome, it will be hard to elucidate this on a statistical basis.  The concept of modularity of interacting domain pairs is furthermore supported by the positive correlation between the number of protein interactions an iPfam domain pair is seen in and the connectivity of the interacting proteins.  We hypothesise that if during the course of evolution a protein is duplicated, it is likely to retain connections with other proteins which contain the same domain interaction modules.  It is clear, however, that even though recognition domains are reused in various proteins, their specificity is bound to be controlled.

Many domain–domain interfaces remain to be resolved

We tried to estimate how many iPfam domain pairs exists in all interactomes.  Our predictions lie almost an order of magnitude higher than the 10000 domain interaction types proposed by Aloy and Russel 2.  While all such estimates should be taken with caution, our results show that at best 10% of all structural domain pairs are represented in iPfam.  The statistical approaches described in the introduction can only cover a small fraction of this domaininteraction space.  Riley et al. for example report only 3005 interacting domain pairs which could be inferred from protein interactions 7.  Even under the assumption that many interactions involve short linear motifs, it seems likely that a large number of domain interactions remain to be resolved.

iPfam domain pairs are conserved during evolution

iPfam domain pairs are not only recurrent within the protein interaction network of one species.  They also appear to be conserved between species.  In a small set of protein structures from S. cerevisiae, it has been shown that interacting domain pairs are more conserved than non-interacting domain pairs 10.  Here, we call an iPfam domain pair conserved if there are protein interactions in two species which contain the same iPfam domain pair.  In a recent study 31, Gandhi et al. have assessed the conservation of protein interactions as the co-occurrence of orthologous interacting proteins.  They found only 16 orthologous interacting protein pairs that were conserved in S. cerevisiae, C. elegans, D. melanogaster and H. sapiens.  Conversely, we find that 71 iPfam domain pairs are conserved in the experimental interactions of these species.  Even between a prokaryote like E. coli and the two eukaryotes S. cerevisiae and H. sapiens there is a considerable proportion of conserved iPfam domain pairs, to the extent that 53% of the iPfam domain pairs from E. coli are also observed in a eukaryote (Table 2).  107 domain pairs are shared between E. coli, S. cerevisiae and H. sapiens.  These domains are predominantly related to transcription, translation and other basic essential cellular activities, which is in congruence with the findings of Gandhi et al.  Although the low overall iPfam coverage hampers the interpretation of our results, it looks as if there has been a diversification of domain interactions from E. coli to H. sapiens.  While more than half of the iPfam domain pairs in E. coli have been retained throughout evolution, numerous new ones seem to have emerged in eukaryotic development.  The significant positive correlation in the frequency of iPfam domain pairs conserved between S. cerevisiae and H. sapiens also suggests that the binding interfaces are more often kept or even reused rather than lost in the course of evolution.  Conversely, this also raises the question of whether one could establish a comprehensive set of domain interactions that were present in the last universal common ancestor.

Conclusion

In this study, we addressed the utility of current knowledge about structural domain interactions in order to interpret experimental protein interactions.  Disappointingly, only a small fraction of all experimental interactions can be attributed to a known domain interaction.  Within this subset of interactions, we nevertheless made several reassuring observations: structural domain pairs are enriched in experimental protein interactions.  Some of the domain pairs seem to mediate a large number of protein interactions, thus acting as reusable connectors.  This property is also conserved between species.  Taken as a whole, this further underlines that solving structures of protein complexes should be an important focus for future structural genomics projects.  Targeting the most frequent domain pairs would increase the coverage of databases such as iPfam, shedding more light onto the molecular mechanisms underpinning cellular networks.

Methods

Protein interaction data

The complete interaction sets from BioGRID 20, DIP 21, HPRD 22, IntAct 23 and MPact 24 were downloaded.  A wide range of databases were used to cover as many distinct experimental data sets as possible.  BioGRID for example contains a large manually curated set of protein interactions for S. cerevisiae 19.  Similarily, HPRD hosts a set of manually curated protein interactions for H. sapiens.  IntAct on the other hand contains results from high-throughput screens and integrates data from other protein interaction databases as part of the IMEx collaboration.  The MPact database combines the manually curated S. cerevisiae protein complexes data set formerly known as the MIPS complexes with other high-throughput interaction experiments data.  Taken together, these databases represent most of the protein interactions currently stored in machine-accessible form.  Despite great efforts to unify access to protein interaction data 32, acquiring large data sets from diverse sources is still far from trivial and error prone.  The PSI-MI XML data exchange format provided by the aforementioned databases was used to generate a local relational database of protein interactions.  All entries were mapped to UniProt 33 by either relying on existing annotations from the source databases or by pair-wise sequence alignment to all UniProt proteins from the same species as the query protein.  The direct sequence comparison was performed using pmatch, a very fast pairwise alignment algorithm developed by Richard Durbin (unpublished, source code available 34).

Species

To allow cross-species comparisons, the data were split into five distinct species sets: E. coli, S. cerevisiae, C. elegans, D. melanogaster and H. sapiens.  It should be noted that the proportion of proteins for which an interaction is known varies greatly between the species, see Table 1.  This might affect the results if there is a systematic bias on the composition of a protein interaction set.  To prevent bias from multiple alternative versions of the same protein, all interacting proteins were mapped to reference proteomes as defined by Integr8 35, again using pmatch.  An average of ≈ 16% of interaction entries were lost in the mapping process, either if no sequence was provided with the original entry or if no significant matching sequence could be found in Integr8.  The total number of missing proteins will be lower, as several entries from different databases refer to the same sequence.

iPfam

The iPfam database is derived from protein structures deposited in the PDB.  Regions in every protein structure that match a Pfam domain are scanned for interactions with residues in another Pfam domain.  All such interacting domain pairs are stored in a database together with detailed information on the residues involved 13.  Every pair of Pfam families that are found to interact in a PDB structure are called an iPfam domain pair throughout the text.  Single Pfam families that are part of an iPfam domain pair are then called iPfam domains.  For example, in PDB entry 1k9a the two iPfam domains SH2 (Pfam accession PF00017) and Pkinase_Tyr (PF07714) interact, therefore they form an iPfam domain pair.  In this study, iPfam version 21 was employed, containing 2837 iPfam domains, forming 4030 iPfam domain pairs.  Figure 5 shows the species distribution of iPfam domain pairs.  H. sapiens, E. coli and S. cerevisiae are clearly over-represented compared to the other 1113 species with less than 179 complex structures.  Some iPfam domain pairs are seen to form interactions between distinct peptide chains in the structure (interchain), while others form an interaction between two distinct domains within the same chain (intrachain).  In iPfam version 21, there are 3407 interchain and 1171 intrachain domain pairs, which means that 548 domain pairs mediate both inter- and intrachain interactions.  In this analysis, both types of domain interactions were used equivalently, assuming that intrachain interactions can become interchain interactions and vice-versa as a result of a gene-fission/fusion events.

Species distribution of iPfam domain pairs

Species distribution of iPfam domain pairs.  This pie chart shows how many iPfam domain pairs were found in PDB structures from each species.  The total number is larger than the 4030 unique iPfam pairs in the database because an iPfam pair can be found in structures from several species.

Filtering

There are many types of experiments used to derive protein interactions, with different properties and error rates.  For this analysis, solely the properties of physically interacting proteins is of interest.  Therefore, only interactions between exactly two proteins per experiment were considered.  That means all protein complex data that were derived by co-purification methods were removed, unless a particular experiment had identified exactly two binding partners.  All genetic interactions were also removed.

Random networks

Randomised protein interaction networks with identical degree distributions were generated from the original filtered experimental interaction data for each species.  In each randomisation step, a mapping is created that assigns every node a randomly chosen replacement node.  In this way the edges of the network remain in place, while the nodes are shuffled randomly.  It should be noted that the degree distribution per node is not maintained.  Instead, this behaviour simulates a network with a high false positive rate.

P values

P values for observations x were calculated as P(X ≥ x) = f(x; µ, σ), where f(x; µ, σ) is the probability density function of the normal distribution with mean µ and standard deviation σ.  µ and σ are estimated through the randomisation experiments.  The density function thus provides the probability that a value less than or equal to x is observed by chance, given the distribution estimated by a random resampling method.  Where appropriate, the inverse probability P(X > x) = 1 - f(x; µ, σ) was applied.

Two Distinct E3 Ubiquitin Ligases Have Complementary Functions in the Regulation of Delta and Serrate Signaling in Drosophila

Abstract

Signaling by the Notch ligands Delta (Dl) and Serrate (Ser) regulates a wide variety of essential cell-fate decisions during animal development.  Two distinct E3 ubiquitin ligases, Neuralized (Neur) and Mind bomb (Mib), have been shown to regulate Dl signaling in Drosophila melanogaster and Danio rerio, respectively.  While the neur and mib genes are evolutionarily conserved, their respective roles in the context of a single organism have not yet been examined.  We show here that the Drosophila mind bomb (D-mib) gene regulates a subset of Notch signaling events, including wing margin specification, leg segmentation, and vein determination, that are distinct from those events requiring neur activity.  D-mib also modulates lateral inhibition, a neur- and Dl-dependent signaling event, suggesting that D-mib regulates Dl signaling.  During wing development, expression of D-mib in dorsal cells appears to be necessary and sufficient for wing margin specification, indicating that D-mib also regulates Ser signaling.  Moreover, the activity of the D-mib gene is required for the endocytosis of Ser in wing imaginal disc cells.  Finally, ectopic expression of neur in D-mib mutant larvae rescues the wing D-mib phenotype, indicating that Neur can compensate for the lack of D-mib activity.  We conclude that D-mib and Neur are two structurally distinct proteins that have similar molecular activities but distinct developmental functions in Drosophila.

Introduction

Cell-to-cell signaling mediated by receptors of the Notch (N) family has been implicated in various developmental decisions in organisms ranging from nematodes to mammals [1].  N is well-known for its role in lateral inhibition, a key patterning process that organizes the regular spacing of distinct cell types within groups of equipotent cells.  Additionally, N mediates inductive signaling between cells with distinct identities.  In both signaling events, N signals via a conserved mechanism that involves the cleavage and release from the membrane of the N intracellular domain that acts as a transcriptional co-activator for DNA-binding proteins of the CBF1/Suppressor of Hairless/Lag-2 (CSL) family [2].  Two transmembrane ligands of N are known in Drosophila, Delta (Dl) and Serrate (Ser) [3].  Dl and Ser have distinct functions.  For instance, Dl (but not Ser) is essential for lateral inhibition during early neurogenesis in the embryo [4].  Conversely, Ser (but not Dl) is specifically required for segmental patterning [5].  Some developmental decisions, however, require the activity of both genes: Dl and Ser are both required for the specification of wing margin cells during imaginal development [6,7,8,9,10].  These different requirements for Dl and Ser appear to primarily result from their non-overlapping expression patterns rather than from distinct signaling properties.  Consistent with this interpretation, Dl and Ser have been proposed to act redundantly in the sensory bristle lineage where they are co-expressed ([11]; note, however, that results from another study have indicated a non-redundant function for Dl in the bristle lineage [12]).  Furthermore, Dl and Ser appear to be partially interchangeable because the forced expression of Ser can partially rescue the Dl neurogenic phenotype [13].  Additionally, the ectopic expression of Dl can partially rescue the Ser wing phenotype [14].  The notion that Dl and Ser have similar signaling properties has, however, recently been challenged by the observation that human homologs of Dl and Ser have distinct instructive signaling activity [15].  Endocytosis has recently emerged as a key mechanism regulating the signaling activity of Dl.  First, clonal analysis in Drosophila has suggested that dynamin-dependent endocytosis is required not only in signal-receiving cells but also in signal-sending cells to promote N activation [16].  Second, mutant Dl proteins that are endocytosis defective exhibit reduced signaling activity [17].  Third, two distinct E3 ubiquitin ligases, Neuralized (Neur) and Mind bomb (Mib), have recently been shown to regulate Dl endocytosis and N activation in Drosophila and Danio rerio, respectively [18,19,20,21,22,23,24,25].  Ubiquitin is a 76-amino-acid polypeptide that is covalently linked to substrates in a multi-step process that involves a ubiquitin-activating enzyme (E1), a ubiquitin-conjugating enzyme (E2), and a ubiquitin–protein ligase (E3).  E3s recognize specific substrates and catalyze the transfer of ubiquitin to the protein substrate.  Ubiquitin was first identified as a tag for proteins destined for degradation.  More recently, ubiquitin has also been shown to serve as a signal for endocytosis [26,27].  Mib in D. rerio and Neur in Drosophila and Xenopus have been shown to associate with Dl, regulate Dl ubiquitination, and promote its endocytosis [18,19,20,22,25,28].  Moreover, genetic and transplantation studies have indicated that both Neur and Mib act in a non-autonomous manner [18,21,22,23,25,29], indicating that endocytosis of Dl is associated with increased Dl signaling activity.  Finally, epsin, a regulator of endocytosis that contains a ubiquitin-interacting motif and that is known in Drosophila as Liquid facet, is essential for Dl signaling [30,31].  In one study, Liquid facet was proposed to target Dl to an endocytic recycling compartment, suggesting that recycling of Dl may be required for signaling.  Accordingly, signaling would not be linked directly to endocytosis, but endocytosis would be prerequisite for signaling [30].  How endocytosis of Dl leads to the activation of N remains to be elucidated.  Also, whether the signaling activity of Ser is similarly regulated by endocytosis is not known.  Neur and Mib proteins completely differ in primary structure.  Drosophila Neur is a 754-amino-acid protein that contains two conserved Neur homology repeats of unknown function and one C-terminal catalytic really interesting new gene (RING) domain.  D. rerio Mib (also known as DIP-1 in the mouse [32]) is a 1,030-amino-acid protein with one ZZ zinc finger domain surrounded by two Mib/HERC2 domains, two Mib repeats, eight ankyrin repeats, two atypical RING domains, and one C-terminal catalytic RING domain.  Both genes have been conserved from flies to mammals [18,19,33,34].  While genetic analysis has revealed that neur in Drosophila and mib in D. rerio are strictly required for N signaling, knockout studies of mouse Neur1 has indicated that NEUR1 is not strictly required for N signaling [33,34].  One possible explanation is functional redundancy with the mouse Neur2 gene.  Conversely, the function of Drosophila mib (D-mib), the homolog of D. rerio mib gene, is not known.  To establish the respective roles of these two distinct E3 ligases in the context of a single model organism, we have studied the function of the Drosophila D-mib gene.  We report here that D-mib, like D. rerio Mib, appears to regulate Dl signaling during leg segmentation, wing vein formation, and lateral inhibition in the adult notum.  We further show that D-mib is specifically required for Ser endocytosis and signaling during wing development, indicating for the first time, to our knowledge, that endocytosis regulates Ser signaling.  Interestingly, the D-mib activity was found necessary for a subset of N signaling events that are distinct from those requiring the activity of the neur gene.  Nevertheless, the ectopic expression of Neur compensates for the loss of D-mib activity in the wing, indicating that Neur and D-mib have overlapping functions.  We conclude that D-mib and Neur are two structurally distinct proteins with similar molecular activities but distinct and complementary functions in Drosophila.

Results

Isolation of D-mib Mutations

The closest Drosophila homolog of the vertebrate mib gene is the predicted gene CG5841, D-mib [18].  The D-mib mutations identified are shown in Figure 1.  A P-element inserted into the 5' untranslated region of the D-mib gene was recently isolated (http://flypush.imgen.bcm.tmc.edu/pscreen/) (Figure 1A).  Insertion of this P-element confers late pupal lethality.  Lethality was reverted by precise excision of the P-element, suggesting that insertion of this P-element is a D-mib mutation, referred to as D-mib1.  A 13.6-kb deletion that removes the entire D-mib coding region was selected by imprecise excision of this P-element.  This deletion represents a null allele of D-mib and was named D-mib2.  This deletion also deletes the 3' flanking RpS31 gene (Figure 1A).  The D-mib1 and D-mib2mutant alleles did not complement the l(3)72CdaJ12 and l(3)72CdaI5 lethal mutations that have been mapped to the same cytological interval as the D-mib gene [35].  This indicates that these two lethal mutations are D-mib mutant alleles, and they were therefore renamed D-mib3 and D-mib4, respectively.  The D-mib1 and D-mib3 mutations behave as genetic null alleles (see Materials and Methods).  In contrast, D-mib4 is a partial loss-of-function allele because flies trans-heterozygous for D-mib4 and any other D-mib null alleles are viable.  These four mutations identify the CG5841 gene as D-mib by the following evidence.  First, lethality of homozygous D-mib1 pupae is associated with the insertion of a P-element into the 5' UTR of the D-mib gene.  Second, genomic sequencing of the D-mib3 allele revealed the presence of a stop codon at position 258 (Figure 1B).  This allele is therefore predicted to produce a truncated protein devoid of the catalytic RING domain, consistent with D-mib3 being a null allele.  Genomic sequencing of the D-mib4 allele showed that this mutation is associated with a valine-to-methionine substitution at a conserved position in the second Mib repeat (Figure 1B).  Third, Western blot analysis showed that the D-mib protein was not detectable in imaginal disc and brain complex extracts prepared from homozygous D-mib1 and D-mib1/D-mib2 larvae (Figure 1C and C').  Fourth, the leaky, GAL4-independent expression of a UAS-D-mib transgene fully rescued the lethality of D-mib1/D-mib2 flies (data not shown; see also Figure 1H).  Thus, our analysis identified both complete and partial D-mib loss-of-function alleles.

Molecular and Genetic Characterization of D-mib Mutations

(A) Molecular map of the D-mib locus showing the position of the P-element inserted into the 5' untranslated region (allele D-mib1) and the 13.6 kb deletion that removes the D-mib and the RpS31 genes (allele D-mib2).  Transcribed regions are indicated with arrows, and exons are indicated with boxes.  Open reading frames are shown in black.  (B) Domain composition of D-mib and D. rerio Mib.  Both proteins show identical domain organization.  D-mib has an N-terminal ZZ zinc finger flanked on either side by a Mib/HERC2 (M-H) domain, followed by two Mib repeats, six ankyrin repeats, two atypical RING domains, and a C-terminal protypical RING that has been associated with catalytic E3 ubiquitin ligase activity.  The D-mib3 mutant allele is predicted to produce a truncated protein devoid of E3 ubiquitin ligase activity whereas the D-mib4 protein carries a mutation at a conserved position in the second Mib repeat.  (C and C') Western blot analysis of D-mib (C).  The endogenous D-mib protein (predicted size: 130 kDa) was detected in S2 cells (lane 2) and in imaginal discs from wild-type larvae (lane 3) but was not detectable in homozygous D-mib1 (lane 4) and D-mib1/D-mib3 (lane 5) third instar larvae.  The D-mib protein produced in transfected S2 cells from the cDNA used in this study (lane 1) runs exactly as endogenous D-mib (lane 2).  Panel C' shows a Red Ponceau staining of the gel with the same protein samples as in panel C.  (D–H) Wings from wild-type (D), D-mib1 (E), SerRX82/Serrev6.1 (F), D-mib2/D-mib4 (G), and UAS-D-mib2/+; D-mib1/D-mib2 flies (H).  D-mib (E) and Ser (F) mutant flies showed similar wing loss phenotypes.  The D-mib mutant phenotype could be almost fully rescued by a leaky UAS-D-mib transgene (H).  (D') and (G') show high magnification views of (D) and (G), respectively, to show that D-mib2/D-mib4 mutant flies (G') exhibited ectopic sensilla (arrowheads) along vein L3.  (I–N) Nota (I–K) and legs (L–N) from wild-type (I and L), D-mib1 (J and M), and SerRX82/Serrev6.1 (K and N) flies.  D-mib mutant flies showed a weak neurogenic phenotype (J) that was not observed in Ser mutant flies (K).  Ectopic sensory organs in D-mib mutant flies developed from ectopic sensory organ precursor cells (not shown).  D-mib (M) and Ser (N) mutant legs also showed distinct growth and/or elongation defects.  Arrows in (J) show ectopic macrochaetes.  Arrows in (L–N) indicate the joints.  Ti, tibia; t1 to t5, tarsal segments 1 to 5.

D-mib Regulates Dl Signaling

Complete loss of zygotic D-mib activity in homozygous D-mib1 and trans-heterozygous D-mib2/D-mib3, D-mib1/D-mib3 and D-mib1/D-mib2 individuals led to late pupal lethality.  Mutant pupae died as pharate adults showing ectopic macrochaetes, increased microchaete density on the dorsal thorax (Figure 1I and 1J), short legs lacking tarsal segmentation (Figure 1L and 1M), and nearly complete loss of eye and wing tissues (Figure 1D and 1E).  Tissue losses were associated with a dramatic reduction in size of the eye field and of the wing pouch in mutant discs of third instar larvae (Figure 2A–2E).  Hypomorphic D-mib2/D-mib4 mutant flies only showed ectopic sensory organs, rough eyes, small wings, and thickened veins (Figure 1D, 1D', 1G, and 1G'; data not shown).  All these phenotypes may result from reduced N signaling.  More specifically, the bristle and leg phenotypes are likely to result from reduced signaling by Dl (and not by Ser).  Indeed, a reduction in Dl-mediated lateral inhibition can result in ectopic sensory organs and increased bristle density on the body surface.  In contrast, a complete loss of Ser signaling had no effect on bristle density (Figure 1K).  Likewise, loss of Dl signaling has been shown to result in short unsegmented legs, similar to the ones seen in the absence of D-mib activity (Figure 1M), whereas a complete loss of Ser activity led to the formation of elongated unsegmented legs (Figure 1N) [36,37,38].  Finally, the vein phenotype seen in D-mib hypomorphic flies is similar to the one seen in Dlts mutant flies [39].  Together, these observations suggest that D-mib regulates Dl signaling in several developmental contexts.  Consistent with this conclusion, we have shown that D-mib binds Dl and promotes Dl signaling and that overexpression of D-mib down-regulates the accumulation of Dl at the cell surface (E. C. Lai, F. Roegiers, X. Qin, R. Le Borgne, F. Schweisguth, et al., unpublished data).

The D-mib and neur Genes Have Distinct Functions during Wing Development

(A–E) Wing imaginal discs (B–E) from wild-type (B and D), D-mib1 (C), and D-mib1/D-mib2 (E) third instar larvae stained for Cut (B and C) and wg-lacZ (D and E).  D-mib mutant discs showed a dramatically reduced size of the wing pouch (see diagram in [A] showing the different regions of the wing imaginal disc; V, ventral; D, dorsal), as well as a complete loss of Cut and wg-lacZ (red arrows in [B–E]) expression at the wing margin.  Expression of wg-lacZ in the hinge region (arrowheads in [D] and [E]) and the accumulation of Cut in sensory cells (small arrows in [B] and [C]) and muscle precursor cells (large arrowheads in [B] and [C]) appeared to be largely unaffected).  (F and F') Expression of Cut (red) at the wing margin was not affected by the complete loss of neur activity in neur1F65 mutant clones (indicated by the loss of the nuclear green fluorescent protein [GFP] marker, in green).  Bar is 50 μm in (B–E) and 20 μm in (F and F').

D-mib and neur Have Distinct Functions

We then studied in more detail the function of D-mib during wing development.  Growth of the wing pouch depends on the activity of an organizing center located at the dorsal-ventral (D-V) boundary [40,41].  This boundary is established in first instar larvae and is defined by the apterous expression boundary.  Apterous activates the expression of the Ser and fringe genes in dorsal cells.  High levels of Ser in dorsal cells activate N in trans in ventral cells and suppress N activation in cis in dorsal cells, whereas Fringe modifies N in dorsal cells such that dorsal cells located at the D-V boundary respond to Dl.  Thus, composite signaling by Ser and Dl leads to symmetric N activation in margin cells located along the D-V boundary [8,9,42,43].  N then regulates the expression of the vestigial and wingless (wg) genes that cooperate to promote growth of the wing pouch.  N also regulates expression of the cut gene in margin cells [44].  Thus, loss of N signaling results in a reduction in size of the wing pouch accompanied by the loss of cut and wg expression along the D-V boundary.  A complete loss of Cut and Wg accumulation and wg-lacZ expression was observed in the central region of third instar D-mib mutant wing discs (data not shown).  Thus, the D-mib wing phenotype may result from defective N inductive signaling at the D-V boundary.  We conclude that the activity of the D-mib gene is required for the specification of the wing margin and, hence, growth of the wing pouch.  Interestingly, wing margin formation and expression of Cut are not affected by the complete loss of neur activity (Figure 2F and 2F') [45].  Similarly, loss of neur activity had no detectable effect on leg segmentation (data not shown) and vein determination [45], two processes shown here to depend on D-mib gene activity.  We therefore conclude that D-mib and neur have distinct and complementary functions in Drosophila.

D-mib Co-Localizes with Dl and Ser at the Apical Cortex

We next studied the subcellular localization of D-mib (Figure 3).  Anti-D-mib antibodies were generated that specifically detected D-mib on Western blots (see Figure 1C) and on fixed tissues (Figure 3F–F').  Using these antibodies, we found that D-mib was detected in all imaginal disc cells (Figure 3A and 3B).  We then examined D-mib subcellular distribution in epithelial cells located along the edge of the wing discs because cross-sectional imaging affords better resolution along the apical-basal axis.  D-mib co-localized with Ser, Dl, and N at the apical cortex (Figure 3B–3D''').  Dl and Ser were also detected in large intracellular vesicles that probably correspond to multivesicular bodies in that they also stained for hepatocyte growth factor-regulated tyrosine kinase substrate [46] (Figure 3B–3C'''; data not shown).  The intracellular dots seen with the anti-D-mib antibodies were distinct from the Dl- and Ser-positive dots and appeared to result from background staining (data not shown).  The reduced cytoplasmic staining seen in D-mib mutant cells (Figure 3F–3F'') suggests that D-mib is also present in the cytoplasm.  A similar localization at the apical cortex and in the cytoplasm was seen for a functional yellow fluorescent protein (YFP)::D-mib fusion protein (see Figure 6 below).  These localization data suggest that D-mib may act at the apical cortex to regulate the activity of Dl and/or Ser.

D-mib Co-Localizes with Dl and Ser at the Apical Cell Cortex

(A and A') D-mib (green) is detected in all cells of the wing imaginal disc.  In (A), Ser is in red and Discs-large (Dlg) is in blue.  (B–D''') D-mib (green in B, B', C, C', D, and D') co-localized with Ser (red in [B and B'']), Dl (red in [C and C'']), N (red in [D and D'']), and E-Cadherin (E-Cad; blue in [D and D''']) and was found apical to Discs-large (Dlg; blue in [B, B''', C, and C''']) in notum cells located at the edges of the wing discs.  (E–E'') D-mib (green in [E and E']) co-localized with Dl (red in [E and E'']) at the apical cortex of wing pouch cells.  (F–F'') D-mib staining at the apical cortex (blue in [F and F']) was not detected in D-mib2 mutant clone (marked by loss of nuclear GFP staining; green in [F]).  Loss of D-mib activity has no detectable effect on the apical accumulation of Dl (red in [F and F'']).  Bar is 50 μm for (A and A') and 10 μm for (B–F'').

D-mib Is Required in Dorsal Cells for Margin Expression of Cut

Large dorsal clones of D-mib2 mutant cells (marked by the loss of nuclear GFP, in green) resulted in a complete loss of Cut (red) expression (A and B).  This indicates that D-mib is required for Ser signaling by dorsal cells.  In contrast, ventral clones did not prevent the expression of Cut (C and D), implying that D-mib is not strictly required for Dl signaling.  Note that mutant ventral cells abutting wild-type dorsal cells expressed Cut (arrow in [D]), indicating that D-mib is not required for N signal transduction.  Low-magnification views of the wing portion of the discs are shown in (A) and (C).  (B) and (D) show high-magnification views of the areas boxed in (A) and (C), respectively.

D-mib Regulates the Cell-Surface Level of Ser

We next examined the potential role of D-mib in regulating Dl and Ser distribution in wing imaginal discs.  We focused our analysis on the notum region since D-mib mutant discs have no wing pouch (Figure 4).  Dl and Ser co-localized both at the apical cortex and in large intracellular vesicles in wild-type cells (Figure 4A–4C').  The complete loss of D-mib activity in D-mib1 mutant discs did not detectably change the subcellular localization of Dl (Figure 4C, 4C', 4F, and 4F').  In contrast, the accumulation of Ser at the apical cortex was strongly increased (Figure 4E) and Ser accumulation in Dl-positive vesicles was dramatically reduced (Figure 4E') in D-mib1 mutant discs.  Similar results were also obtained in D-mib2 mutant clones, which showed strongly elevated levels of cortical Ser (Figure 4H) whereas the amount of Dl at the apical cortex was not detectably modified (see Figures 3F–3F'' and 4J).  Of note, loss of D-mib2 activity in clones did not block the accumulation of Ser into intracellular dots (Figure 4H').  Thus, trafficking of Ser towards this intracellular compartment is, at least in part, D-mib-independent.  We therefore conclude that the D-mib gene is required to regulate the level of Ser at the apical cortex of wing disc cells.

D-mib Is Required to Down-Regulate Ser at the Apical Cortex

(A–F') Distribution of Dl (green) and Ser (red) in the notum region of wild-type (A–C') and D-mib1 mutant (D–F') wing imaginal discs.  The boxed areas in (A) and (D) are shown at higher magnification in (B–F').  The specific loss of Ser accumulation into intracellular vesicles (compare [E'] with [B']) correlated with the elevated levels of Ser seen at the apical cortex of D-mib mutant cells (compare [E] with [B]).  (G–J') Ser (red in [H and H']) accumulated at the apical cortex (H) as well as in intracellular dots (H') in D-mib2 mutant cells (marked by the loss of nuclear GFP; green in [G]).  Cut is shown in blue (G).  The distribution of Dl (red in [J and J']) was not affected by the loss of D-mib activity.  Low-magnification views of the wing portion of the discs are shown in (G) and (I).  (H and H') and (J and J') show high magnification views of the areas boxed in (G) and (I), respectively.  Clone boundaries are outlined in (H and H') and (J and J').  Bar is 40 μm for (A, D, G), 5 μm for (B–C' and E–F'), and 10 μm for (H–J').

D-mib Is Required for Ser Endocytosis

Ubiquitin-mediated endocytosis is thought to depend on monoubiquitination.  Thus, by analogy with the function of Mib in D. rerio [18,28], we suggest that D-mib may directly monoubiquitinate Ser.  Consistent with this hypothesis, we show in a companion paper that D-mib binds Ser (E. C. Lai, F. Roegiers, X. Qin, R. Le Borgne, F. Schweisguth, et al., unpublished data).  Moreover, a mutation in the C-terminal catalytic RING domain of D-mib abolished its ability to internalize Ser in transfected S2 cells (R. L. B. and F. S., unpublished data) implying that the E3 ubiquitin ligase activity of D-mib is required for Ser internalization.  Biochemical analysis of the ubiquitination events regulated by D-mib will be needed to further define the mechanism by which D-mib regulates the endocytosis of Ser in vivo.  To test whether this specific increase in the level of Ser at the apical cortex resulted from reduced Ser endocytosis in D-mib mutant cells, we followed the endocytosis of Ser in living imaginal discs using an antibody uptake assay.  Briefly, dissected wing discs were cultured for 15 min in the presence of antibodies that recognize the extracellular part of Ser or Dl, then washed, cultured for another 45 min in medium without antibodies, and then fixed.  The uptake of anti-Ser and anti-Dl antibodies was then assessed using secondary antibodies.  The results are shown in Figure 5.  Using this assay, we found that anti-Ser-and anti-Dl antibodies were internalized in wild-type epithelial cells (Figure 5A–5C'').  The complete loss of D-mib activity in D-mib1 wing discs did not significantly change the internalization of anti-Dl antibodies (Figure 5D'', 5E'', and 5F''), indicating that D-mib is not required for Dl endocytosis in this tissue.  However, the loss of D-mib activity strongly inhibited the endocytosis of anti-Ser antibodies (Figure 5E').  Moreover, high levels of anti-Ser antibodies were seen at the apical surface (Figure 5D' and 5F'), confirming that D-mib mutant cells accumulate high levels of Ser at their surface.  We therefore conclude that D-mib is specifically required for the endocytosis of Ser in wing discs.

D-mib Is Required for Ser Endocytosis

Localization of the anti-Ser (red) and anti-Dl (green) antibodies that have been internalized by wild-type (A–C'') and D-mib1 mutant (D–F'') cells in the notum region of wing discs.  (A–A'') and (D–D'') show apical sections and (B–B'') and (E–E'') show basal sections.  (C–C'') and (F–F'') show confocal z-sections.  The z-section axes are shown with a double-headed arrow in (A) and (D).  Internalized anti-Ser and anti-Dl antibodies co-localized in wild-type cells.  In contrast, high levels of anti-Ser antibodies were detected at the cell surface of D-mib mutant epithelial cells whereas anti-Dl antibodies were efficiently internalized.  Bar is 10 μm for all panels.

D-mib Regulates Ser Signaling

The regulation of Ser endocytosis by D-mib suggests that D-mib may regulate Ser signaling.  Ser expression is restricted to dorsal cells in second instar wing imaginal discs [7,10,44,47,48].  Ser in dorsal cells signals across the D-V boundary to activate N in ventral cells [8,9].  If D-mib is required for Ser signaling during wing development, then loss of D-mib activity in dorsal cells should affect the specification of the wing margin in a non-autonomous manner.  Loss of D-mib activity in large dorsal clones of D-mib2 mutant cells resulted in a loss of Cut expression at the D-V interface (Figure 6A and 6B).  The lack of Cut expression in wild-type ventral cells abutting the D-V boundary indicates that D-mib is required for Ser signaling by dorsal cells and acts in a non-autonomous manner to activate N in ventral cells.  Conversely, loss of D-mib activity in large ventral clones (Figure 6C and 6D) did not disrupt margin specification, indicating that D-mib is not strictly required for Dl signaling by ventral cells.  However, a narrowing of the Cut-positive margin was observed (Figure 6D), suggesting that D-mib contributes to regulating the level of Dl signaling.  Of note, ventral D-mib mutant cells expressed Cut, implying that D-mib is not required for N signal transduction.  We next tested whether expression of D-mib in dorsal cells is sufficient to rescue the D-mib wing phenotype.  D-mib was expressed in dorsal cells of D-mib2/D-mib3 mutant discs using Ser-GAL4.  Similarly to the expression of the Ser gene, Ser-GAL4 expression is restricted to dorsal cells in second/early third instar larvae and is weakly expressed in ventral cells in mid/late third instar larvae, i.e., after margin cell specification [49,50].  Expression of D-mib in dorsal cells was sufficient to rescue growth of the wing pouch and of the expression of Cut in margin cells in D-mib mutant discs (Figure 7A).  This result confirmed that D-mib regulates Ser signaling by dorsal cells.  A similar rescue was observed with a YFP::D-mib protein (Figure 7B–7B''), indicating that YFP::D-mib is functional.  YFP::D-mib localized at the apical cortex and in the cytoplasm (Figure 7C–7 D'''), as seen for endogenous D-mib (see Figure 3).  YFP::D-mib co-localized with Dl and Ser at the apical cortex of cells expressing low levels of YFP::D-mib.  However, cells expressing high levels of YFP::D-mib showed a strong reduction in the level of both Dl and Ser at the cortex (Figure 7C–7C'''), further indicating that D-mib down-regulates the levels of both Ser and Dl at the apical cortex (E. C. Lai, F. Roegiers, X. Qin, R. Le Borgne, F. Schweisguth, et al., unpublished data).

Expression of D-mib in Dorsal Cells Is Sufficient to Rescue the D-mib Mutant Phenotype

(A) Expression of D-mib (green) in dorsal cells, using Ser-GAL4, rescued the growth of the wing pouch and margin Cut (red) expression in D-mib2/D-mib3 mutant discs.  (B–D''') Ser-GAL4-driven expression of YFP::D-mib (green) rescued the D-mib2/D-mib3 phenotype and strongly reduced the level of Dl (blue in [B, B', C, C'', D, and D'']) and Ser (red in [B, B'', C, C''', D, and D''']) in dorsal cells.  (C–D''') are high-magnification views (apical [C–C'''] and basal [D–D''']) of the disc shown in (B–B'').  YFP::D-mib co-localized with Dl and Ser at the apical cortex in cells expressing only low levels of YFP::D-mib.  Bar is 50 μm for (A–B'') and 10 μm for (C–D''').

D-mib Acts Downstream of Ser and Upstream of Activated N

The functional assay was then used to genetically position the requirement for the D-mib gene activity relative to Ser and N (Figure 8).  Expression of an activated version of N, Ncdc10 [51], led to the activation of Cut and promoted growth in dorsal cells of D-mib2/D-mib3 mutant discs (Figure 8C).  This indicates that D-mib acts at a step upstream of N activation.  By contrast, elevated levels of Ser expression failed to restore Cut expression and growth of the wing pouch in D-mib2/D-mib3 mutant larvae (Figure 8B).  This confirms that Ser signaling requires the activity of the D-mib gene, i.e., that D-mib acts downstream of Ser.

Expression of Neur in Dorsal Cells Is Sufficient to Rescue the D-mib Mutant Phenotype

D-mib2/D-mib3 mutant discs expressing GFP (A) (GFP staining not shown), Ser (B), Ncdc10 (C), or Neur (D) under the control of Ser-GAL4 were stained for Cut (red).  Expression of Ser in dorsal cells did not rescue the D-mib2/D-mib3 wing pouch mutant phenotype (compare [B] with [A]), consistent with D-mib being required for Ser signaling.  By contrast, expression of Ncdc10, an activated version of N, led to the deregulated growth of the dorsal compartment and the expression of Cut in most dorsal cells (C), indicating that activated N acts downstream of D-mib.  Expression of Neur in dorsal cells was sufficient to compensate for the loss of D-mib activity (D).  Bar is 40 μm for all panels.

neur and D-mib Functions Partially Overlap

The different requirements for neur and D-mib gene activity may suggest that Neur and D-mib have distinct molecular activities.  Alternatively, this difference may reflect a difference in gene expression.  Consistent with the latter hypothesis, the neur gene is not expressed in wing pouch and wing margin cells, where it is not required, and appears to be expressed only in sensory cells [52], where it is required.  By contrast, D-mib appears to be uniformly expressed in imaginal discs.  To test this hypothesis, we examined whether the forced ubiquitous expression of the neur gene can suppress the D-mib loss-of-function phenotype.  Expression of Neur, using actin-GAL4, restored growth of the wing pouch and formation of the wing margin (data not shown).  Moreover, expression of Neur in dorsal cells, using Ser-GAL4, was sufficient to rescue growth of the wing pouch as well as the expression of Cut in margin cells in D-mib mutant discs (Figure 8D).  We conclude that ectopic expression of Neur compensates for the loss of D-mib activity.  In a converse experiment, we found that the neur-driven expression of D-mib, using neurPGAL4, did not rescue the cuticular neurogenic phenotype of neurPGAL4/neur1F65 embryos.  Three UAS-D-mib transgenic lines were tested, and none showed detectable rescue whereas the two UAS-neur lines used as positive controls either fully or partially rescued the cuticular neurogenic phenotype of neurPGAL4/neur1F65 embryos (data not shown; UAS-D-mib+/+, neurPGAL4+/+ embryos developed normally).  This indicates that a key function of Neur in the embryo cannot be provided by D-mib.  We therefore suggest that Neur and D-mib functions overlap but are not strictly identical.

Discussion

Many recent studies have revealed that endocytosis plays multiple roles in the regulation of N signaling (reviewed in [2]; see also [53,54]).  Here, we show that the conserved E3 ubiquitin ligases Neur and D-mib have similar molecular activities in the regulation of Dl and Ser endocytosis but distinct developmental functions in Drosophila.  Our analysis first establishes that D-mib regulates Ser signaling during wing development.  First, clonal analysis revealed that the activity of the D-mib gene is specifically required in dorsal cells for the expression of Cut at the wing margin.  Second, expression of D-mib in the dorsal Ser-signaling cells was sufficient to rescue the D-mib mutant wing phenotype.  Third, results from an in vivo antibody uptake assay indicated that the endocytosis of Ser (but not of Dl) was strongly inhibited in D-mib mutant cells.  This inhibition correlated with the strong accumulation of Ser (but not Dl) at the apical cortex of D-mib mutant cells.  Thus, an essential function of D-mib in the wing is to regulate the endocytosis of Ser in dorsal cells to non-autonomously promote the activation of N along the D-V boundary.  By analogy, the defective growth of the eye tissue may similarly result from the lack of Ser signaling and of N activation along the D-V boundary [55].  Because D-mib co-localizes with Ser at the apical cortex of wing disc cells, acts in a RING-finger-dependent manner to regulate Ser endocytosis in S2 cells (R. L. B. and F. S., unpublished results), and physically associates with Ser in co-immunoprecipitation experiments (E. C. Lai, F. Roegiers, X. Qin, R. Le Borgne, F. Schweisguth, et al., unpublished data), D-mib may ubiquitinate Ser and directly regulate its endocytosis.  Our analysis further suggests that endocytosis of Ser is required for Ser signaling.  This conclusion is consistent with observations made earlier showing that secreted versions of Ser cannot activate N but instead antagonize Ser signaling [56,57].  Thus, endocytosis of both N ligands appears to be strictly required for N activation in Drosophila.  Different models have been proposed to explain how endocytosis of the ligand, which removes the ligand from the cell surface, results in N receptor activation (discussed in [17,20,21,30]).  Interestingly, the strong requirement for Dl and Ser endocytosis seen in Drosophila is not conserved in Caenorhabditis elegans, in which secreted ligands have been shown to be functional [58,59].  Noticeably, there is no C. elegans Mib homolog, and the function of C. elegans neur (F10D7.5) is not known.  We speculate that endocytosis of the ligands may have evolved as a means to ensure tight spatial regulation of the activation of N.  Our analysis also establishes that the activity of the D-mib gene is required for a subset of N signaling events that are distinct from those that require the activity of the neur gene.  We have shown that the D-mib gene regulates wing margin formation, leg segmentation, and vein formation, whereas none of these three processes depend on neur gene activity ([45,60]; this study).  Conversely, the activity of the neur gene is essential for binary cell-fate decisions in the bristle lineage [22] that do not require the activity of the D-mib gene (no bristle defects were seen in D-mib mutant flies).  The activity of the neur gene is also required for lateral inhibition during neurogenesis in embryos and pupae [4,45,61].  This process is largely independent of D-mib gene activity since the complete loss of D-mib function only resulted in a mild neurogenic phenotype in the notum.  These data thus indicate that the neur and D-mib genes have largely distinct and complementary functions in Drosophila.  Whether a similar functional relationship between Neur and D-mib exists in vertebrates awaits the study of the D. rerio neur genes and/or of the murine Mib and Neur genes.  The functional differences observed between D-mib and neur cannot be simply explained by obvious differences in molecular activity and/or substrate specificity.  First, both Neur and D-mib physically interact with Dl ([20]; E. C. Lai, F. Roegiers, X. Qin, R. Le Borgne, F. Schweisguth, et al., unpublished data) and promote the down-regulation of Dl from the apical membrane when overexpressed (E. C. Lai, F. Roegiers, X. Qin, R. Le Borgne, F. Schweisguth, et al., unpublished data).  Furthermore, Dl signaling appears to require the activity of either Neur or D-mib, depending on the developmental contexts.  We have shown here that specific aspects of the D-mib phenotype in legs and in the notum cannot simply result from loss of Ser signaling and are consistent with reduced Dl signaling, suggesting that D-mib regulates Dl signaling.  Consistent with this interpretation, overexpression studies indicate that D-mib up-regulates the signaling activity of Dl, whereas a dominant-negative form of D-mib inhibits it (E. C. Lai, F. Roegiers, X. Qin, R. Le Borgne, F. Schweisguth, et al., unpublished data).  We note, however, that no clear defects in Dl subcellular localization and/or trafficking were observed in D-mib mutant cells.  It is conceivable that the contribution of D-mib to the endocytosis of Dl is masked by the activity of D-mib-independent processes that may, or may not, be linked to Dl signaling.  We have also shown that, reciprocally, Neur and D-mib may similarly regulate Ser.  Neur and D-mib were shown to similarly promote down-regulation of Ser from the cell surface when overexpressed (E. C. Lai, F. Roegiers, X. Qin, R. Le Borgne, F. Schweisguth, et al., unpublished data).  Moreover, D-mib binds Ser (E. C. Lai, F. Roegiers, X. Qin, R. Le Borgne, F. Schweisguth, et al., unpublished data) and regulates Ser signaling (this study).  Whether endogenous Neur binds and activates Ser remains to be tested.  However, the ability of Neur to rescue the D-mib mutant wing phenotype when expressed in dorsal cells strongly indicates that Neur can promote Ser signaling.  Together, these data indicate that Neur and D-mib have similar molecular activities.  D-mib and Neur may have identical molecular activities but distinct expression patterns, hence distinct functions at the level of the organism.  Consistent with this possibility, D-mib is uniformly distributed in imaginal discs, whereas Neur is specifically detected in sensory cells [52].  Importantly, the rescue of the D-mib mutant phenotype by ectopic expression of Neur strongly supports this interpretation.  This result further suggests that Neur can regulate Ser signaling.  Consistent with this idea, overexpression of Neur in imaginal discs resulted in a strong reduction of Ser accumulation at the apical cortex (data not shown).  Thus, despite their obvious structural differences, Neur and D-mib appear to act similarly to promote the endocytosis of Dl and Ser.  Nevertheless, our observation that D-mib could not compensate for the loss of neur activity in the embryo indicates that D-mib and Neur have overlapping rather than identical molecular activities.  In conclusion, Neur and D-mib appear to have similar molecular activities in the regulation of Dl and Ser endocytosis but distinct developmental functions in Drosophila.  The conservation from Drosophila to mammals of these two structurally distinct but functionally similar E3 ubiquitin ligases is likely to reflect a combination of evolutionary advantages associated with: (i) specialized expression pattern, as evidenced by the cell-specific expression of the neur gene in sensory organ precursor cells [52]; (ii) specialized function, as suggested by the role of murine MIB in TNFα signaling [32]; (iii) regulation of protein stability, localization, and/or activity.  For instance, Neur, but not D-mib, localizes asymmetrically during asymmetric sensory organ precursor cell divisions [22].

Materials and Methods

Flies.

The D-mib1 mutation corresponds to the EY97600 P-element insertion generated by the Gene Disruption Project (http://flypush.imgen.bcm.tmc.edu/pscreen/).  The D-mib2 allele was selected as w- D-mib mutant derivative by imprecise excision of the EY97600 P-element.  The precise breakpoints of the D-mib2 deletion were determined by sequencing a PCR fragment amplified from genomic DNA prepared from D-mib2 homozygous larvae.  The l(3)72CdaJ12 and l(3)72CdaI5 alleles originally isolated by [35] failed to complement the D-mib1 and D-mib2 mutations and were renamed D-mib3 and D-mib4.  The D-mib1, D-mib2, and D-mib3 alleles appear to be genetically null alleles since the phenotypes of D-mib1/D-mib1 and D-mib1/D-mib3 mutant pupae are indistinguishable from the ones seen in D-mib1/D-mib2 and D-mib2/D-mib3 pupae.  Sequence analysis of the D-mib3 and D-mib4 alleles was carried on PCR products prepared from genomic DNA prepared from D-mib3/D-mib2 and D-mib4/D-mib2 mutant pupae.  Genomic DNA from l(3)72Cda/D-mib2 mutant pupae was used as control for polymorphism.  D-mib2 mutant clones were generated in y w hs-flp;FRT2A D-mib2/FRT2A M(3)i55 ubi-nlsGFP larvae.  neur1F65 mutant clones were generated as previously described [22].  UAS-D-mib and UAS-YFP::D-mib lines were generated via standard P-element transformation.  These constructs were derived from the SD05267 cDNA obtained from ResGen (Invitrogen, Carlsbad, California, United States).  Cloning details for these constructs are available upon request.  UAS-Dl (gift from M. Muskavith), UAS-Ser (gift from R. Fleming), UAS-Neur (gift from C. Delidakis), UAS-Ncdc10 (gift of T. Klein), Ser-GAL4 lines, and Ser mutant alleles are described in FlyBase (http://flybase.bio.indiana.edu/).

Antibodies.

Dissected imaginal discs were fixed in 4% paraformaldehyde (15 min) and incubated with antibodies at room temperature in PBS 1X with 0.1% TritonX-100.  Rabbit polyclonal anti-D-mib antibodies were raised against the CYNERKTDDSELPGN peptide (CovalAb, Lyon, France).  Immunopurified anti-D-mib antibodies (rabbit 541) were used (immunofluorescence, 1:100; Western blot, 1:1,000).  Other primary antibodies were mouse anti-Cut (2B10; Developmental Studies Hybridoma Bank [DSHB, Iowa City, Iowa, United States]; 1:500); rat anti-DE-Cadherin (gift from T. Uemura; 1:50); guinea pig anti-Discs-large (gift from P. Bryant; 1:3,000); anti-β-galactosidase (Cappel [MP Biomedicals, Irvine, California, United States]; 1:1,000); mouse anti-DeltaECD (C594.9B; DSHB; 1:1,000); mouse anti-NotchECD (C548.2H; DSHB; 1:1,000); rat anti-Ser (gift from K. Irvine; 1:2,000); rat anti-Ser (gift from S. Cohen; 1:200); rabbit anti-Ser (gift from E. Knust; 1:10); and guinea pig anti-Senseless (gift from H. Bellen; 1:3,000).  Cy2-, Cy3-, and Cy5-coupled secondary antibodies were from Jackson Laboratory (Bar Harbor, Maine, United States).  Alexa488-coupled secondary antibodies and phalloidin were from Molecular Probes (Eugene, Oregon, United States).  Images were acquired on a Leica (Wetzlar, Germany) SP2 microscope and assembled using Adobe Photoshop (Adobe Systems, San Jose, California, United States).

Endocytosis assay.

Third instar larvae wing discs were dissected in Schneider's Drosophila medium (Gibco BRL, San Diego, California, United States) containing 10% fetal calf serum (Gibco BRL).  Wing discs were cut between the wing pouch and the thorax to facilitate antibody diffusion.  Wing discs were cultured for 15 min with mouse anti-Dl (C594–9B at 1:100) and rat anti-Ser antibody (1:500; from K. Irvine).  Following three medium changes and a 45-min chase period, wing discs were fixed and incubated with secondary antibodies.

RAG1 Core and V(D)J Recombination Signal Sequences Were Derived from Transib Transposons

Abstract

The V(D)J recombination reaction in jawed vertebrates is catalyzed by the RAG1 and RAG2 proteins, which are believed to have emerged approximately 500 million years ago from transposon-encoded proteins.  Yet no transposase sequence similar to RAG1 or RAG2 has been found.  Here we show that the approximately 600-amino acid “core” region of RAG1 required for its catalytic activity is significantly similar to the transposase encoded by DNA transposons that belong to the Transib superfamily.  This superfamily was discovered recently based on computational analysis of the fruit fly and African malaria mosquito genomes.  Transib transposons also are present in the genomes of sea urchin, yellow fever mosquito, silkworm, dog hookworm, hydra, and soybean rust.  We demonstrate that recombination signal sequences (RSSs) were derived from terminal inverted repeats of an ancient Transib transposon.  Furthermore, the critical DDE catalytic triad of RAG1 is shared with the Transib transposase as part of conserved motifs.  We also studied several divergent proteins encoded by the sea urchin and lancelet genomes that are 25%-30% identical to the RAG1 N-terminal domain and the RAG1 core.  Our results provide the first direct evidence linking RAG1 and RSSs to a specific superfamily of DNA transposons and indicate that the V(D)J machinery evolved from transposons.  We propose that only the RAG1 core was derived from the Transib transposase, whereas the N-terminal domain was assembled from separate proteins of unknown function that may still be active in sea urchin, lancelet, hydra, and starlet sea anemone.  We also suggest that the RAG2 protein was not encoded by ancient Transib transposons but emerged in jawed vertebrates as a counterpart of RAG1 necessary for the V(D)J recombination reaction.

Introduction

The immune system of jawed vertebrates detects and destroys foreign invaders, including bacteria and viruses, by a specific response to an unlimited number of antigens expressed by them.  The antigens can be identified after they are specifically bound by surface receptors of vertebrate B and T immune cells (BCRs and TCRs, respectively).  Because the vast repertoire of BCRs and TCRs cannot be encoded genetically, ancestors of jawed vertebrates adopted an elegant combinatorial solution [1].  The variable portions of the BCR and TCR genes are composed of separate V (variable), D (diversity), and J (joining) segments, which are represented by fewer than a few hundred copies each.  In a B and T cell site-specific recombination reaction, commonly known as V(D)J recombination, one V, one D, and one J segment are joined together into a single exon encoding the variable antigen-binding region of the receptor.  In addition to this combinatorial diversity, further diversity is generated by small insertions and deletions at junctions between the joined segments.  In V(D)J recombination, DNA cleavage is catalyzed by two proteins encoded by the recombination-activating genes, approximately 1040-amino acid (aa) RAG1 and approximately 530-aa RAG2 [2,3].  The site specificity of the recombination is defined by the binding of RAG1/2 to RSSs flanking the V, D, and J segments [4].  All RSSs can be divided into two groups, referred to as RSS12 and RSS23, and consist of conserved heptamer and nonamer sequences separated by a variable spacer either 12 ± 1 (RSS12) or 23 ± 1 (RSS23) bp long [4–7].  During V(D)J recombination, RAG1/2 complex binds one RSS12 and one RSS23, bringing them into juxtaposition, and cuts the chromosome between the RSS heptamers and the corresponding V and D, D and J, or V and J coding segments [3,8].  A rule requiring that efficient V(D)J recombination occur between RSS12 and RSS23 is known as the “12/23” rule [1].  Even prior to the discovery of RAG1 and RAG2, it had been suggested that the first two RSSs were originally terminal inverted repeats (TIRs) of an ancient transposon whose accidental insertion into a gene ancestral to BCR and TCR, followed by gene duplications, triggered the emergence of the V(D)J machinery [4].  Later, this model was expanded by the suggestion that both RAG1 and RAG2 might have evolved from a transposase (TPase) that catalyzed transpositions of ancient transposons flanked by TIRs that were precursors of RSSs [9].  This model has received additional support through observations of similar biochemical reactions in transposition and V(D)J recombination [10,11].  Finally, it was demonstrated that RAG1/2 catalyzed transpositions of a DNA segment flanked by RSS12 and RSS23 in vitro [12,13] and in vivo in yeast [14].  In vertebrates, in vivo RAG-mediated transpositions are strongly suppressed, probably to minimize potential harm to genome function.  So far, only one putative instance of such a transposition has been reported [15].  However, given the lack of significant structural similarities between RAGs and known TPases, the “RAG transposon” model [9,12,13,16] remained unproven.  Here we demonstrate that the RAG1 core and RSSs were derived from a TPase and TIRs encoded by ancient DNA transposons from the Transib superfamily [17].  The Transib superfamily is one of ten superfamilies of DNA transposons detected so far in eukaryotes [17].  Like other DNA transposons, Transib transposons exist as autonomous and nonautonomous elements.  The autonomous Transib transposons are 3–4 kb long and code for an approximately 700-aa TPase that is not similar to TPases from any other transposon superfamilies.  Computational analysis of Transib elements, including their numerous insertions into copies of other transposons, demonstrated that Transib transposons are flanked by 5-bp target site duplications (TSDs), which also distinguishes this superfamily from all the others [17].  Transib transpositions are expected to be catalyzed by the binding of the TPase to TIRs of autonomous and nonautonomous transposons [17].  As discussed in this paper, in addition to the fruit fly (Drosophila melanogaster) and African malaria mosquito (Anopheles gambiae) genomes, in which Transib transposons were originally discovered, these genes are also present in diverse animals (Table S1), including other species of fruit fly (e.g., Drosophila pseudoobscura, Drosophila willistoni), yellow fever mosquito (Anopheles aegypti), silkworm (Bombyx mori), red flour beetle (Tribolium castaneum), dog hookworm (Ancylostoma caninum), freshwater flatworm (Schmidtea mediterranea), hydra (Hydra magnipapillata), sea urchin (Strongylocentrotus purpuratus), and soybean rust (Phakopsora pachyrhizi).  Genomes of plants and vertebrates seem to be free of any recognizable Transib transposons (Figure 1).

Schematic Presentation of Transib transposons, RAG1, RAG2, and RAG1-Like Proteins in Eukaryotes

The basic timescale of the evolutionary tree is based on published literature [49–51].  Red circles mark species in which Transib TPases were found.  Gray squares indicate RAG2; orange and blue ellipses show the RAG1 core and RAG1 N-terminal domain, respectively.  Overall taxonomy, including common and Latin names, is reported on the right side of the figure.  A question mark at the lamprey lineage indicates insufficient sequence data.  A lack of any labels means that the Transib TPase and RAG1/2 are not present in the sequenced portions of the corresponding genomes.  Among branches lacking Transib TPases, only lamprey and crocodile genomes are not extensively sequenced to date.  In sea anemone, the RAG1 core–like protein is capped by the ring finger motif, which also forms the C-terminus in the RAG1 N-terminal domain.  In fungi, the Transib TPase was detected in soybean rust only.

Results

Detection of Similarity between Transib TPases and RAG1

Using protein sequences of seven known Transib TPases (Transib1 through Transib4 and Transib1_AG through Transib3_AG from D. melanogaster and A. gambiae, respectively) [17] as queries in a standard BLASTP search against all GenBank proteins, we found that the approximately 60-aa C-terminal portion of the Transib2_AG TPase was 35%-38% identical to the C-terminal portion of the RAG1 core (Figure S1).  However, this similarity was only marginally significant (E = 0.07 where the E-value is an expected number of sequences matching by chance; Table 1).  In another search against GenBank, using PSI-BLAST [18] (see Materials and Methods) with the Transib2_AG TPase as a query, we found that two unclassified proteins (GenBank gi 30923617 and 30923765; annotated as hypothetical proteins) and RAG1s constituted the only group of any GenBank proteins similar to the Transib2_AG TPase (Table 1).  The statistical significance of similarity between the TPase and RAG1s was measured by Ei = 0.025, where Ei is the E-value threshold for the first inclusion of RAG1 sequences into the PSI-BLAST iterations [18] (Materials and Methods).  The observed improvement in significance of the Transib/RAG1 similarity (from E = 0.07 in BLASTP to Ei = 0.025 in PSI-BLAST; Table 1) was due to the fact that both 151-aa and 123-aa hypothetical GenBank proteins were apparent remnants of Transib TPases (approximately 40% identity to the Transib2_AG TPase, E < 10-10 in BLASTP).  The RAG1 proteins appeared to be more similar to the position-specific scoring matrix (PSSM) created by PSI-BLAST based on multiple alignment of the Transib2_AG TPase and two Transib TPase-like proteins, than to the solo Transib2_AG TPase in the BLASTP search.  Given the latter observation, we decided to improve the quality of the PSSM constructed by PSI-BLAST for different Transib TPase sequences.  To achieve that, we combined protein sequences of the seven known Transib TPases with the set of all GenBank proteins.  As a result, Ei-values for matches of RAG1s to a new PSSM based on alignment of nine Transib TPases (the two GenBank TPase-like proteins plus seven added TPases) noticeably dropped in comparison with the Ei-values obtained for the PSSM constructed in the previous step based on alignment of the three TPases (Table 1).  To support the observation that Ei-values of matches between RAG1s and the Transib TPase PSSM decrease as the number of TPase sequences used for construction of the PSSM increases, we identified six new Transib TPases (Transib5, Transib3_DP, Transib4_DP, Transib1_AA, Transib2_AA, Transib3_AA; Figure S2).  During the next step of the PSI-BLAST analysis, the original GenBank set was combined with 13 Transib TPases.  Again, Ei-values of matches between RAG1s and the new PSSM derived from multiple alignment of 15 Transib TPases (the two GenBank proteins plus all our TPases) were much smaller (approximately 10-6–10-3; Table 1) than those obtained based on the PSSM constructed from the nine TPases at the preceding step (approximately 10-3–10-2).  In the final step, we identified one more set of five new Transib TPases (Transib1_DP, Transib2_DP, Transib4_AA, Transib5_AA, and Transib1_SP).  When all 18 TPases were combined with the original GenBank set, the Ei values of matches between RAG1s and the Transib PSSM dropped significantly further (10-9–10-4; Table 1).  During the final revision of this manuscript, we identified an intermediate RAG1-like sequence in Hydra magnipapillata, called RAG1L_HM, which is significantly similar to both RAG1 and Transib TPase, as shown later.  This direct result represents an independent validation of our analysis.  The PSI-BLAST PSSM of Transib TPases approximates conservation/variability of the Transib TPase consensus sequence.  The more diverse the TPases used in determining the PSSM, the more accurate is the approximation; some of the insect Transib TPases are less than 30% identical to each other, as shown in Figure 2.  The RAG1 Ei values decreased as the number of Transib TPases used for the PSSM construction increased due to the fact that RAG1 evolved from a Transib TPase.  In all cases, the E values obtained after several rounds of iterations were less than 10-20 at the point of convergence.  Nearly the entire sequences of several Transib TPases, excluding their 100–140-aa N-terminal domains, converged with an approximately 600-aa portion of RAG1 defined by positions approximately 360–1010 (Figure S3).  This portion of RAG1 corresponds to the “RAG1 core,” hereafter numbered relative to human RAG1 (residues 387–1011), which along with RAG2 is known to be sufficient to perform V(D)J cleavage even after deletions of the 383-aa N-terminal and 32-aa C-terminal portions of RAG1 [19,20].  During studies reported here, we identified 11 additional new families of Transib transposons and TPases (see Figure S2) that are well preserved in the genomes of fruit flies (Transib5 in D. melanogaster; and Transib1_DP, Transib2_DP, Transib3_DP, and Transib4_DP in D. pseudoobscura), mosquitoes (Transib1_AA, Transib2_AA, Transib3_AA, Transib4_AA, and Transib5_AA from A. aegypti) and sea urchin (Transib1_SP).  Transib1_SP is the first Transib transposon identified outside of insect genomes.  A well-preserved 4132-bp Transib1_SP element (contig 7839, positions 376–4506) is flanked by a 5-bp CGGCG TSD, and it encodes a 676-aa TPase (two exons) that is most similar to the Transib2 TPase (34% identity).  Based on the currently available sequence data, we also reconstructed portions of TPases that were missed in previous studies [17] (Materials and Methods; see Figure S2).  Using the Transib1_SP TPase as a query in TBLASTN searches against all GenBank sections (NR, HTGs, WGS, dbGSS, dbEST, dbSTS, and Trace Archives) we also found diverse Transib TPases in silkworm, red flour beetle, dog hookworm, freshwater flatworm, soybean rust, and hydra (Table S1).  At the same time, recently sequenced genomes of honeybee, roundworms, fish, frog, mammals, sea squirts, plants, and fungi (except soybean rust) do not contain any detectable Transib transposons (see Figure 1).  The observed patchy distribution could be caused by horizontal transfers and extinctions of Transib transposons in eukaryotic species.

Significance of Similarities between the Transib TPases and RAG1 Core

The first column lists all 18 Transib TPases used as queries in our analysis, and the shaded areas indicate those added to the original set of all GenBank proteins in subsequent PSI-BLAST searches.  The original GenBank set included two incomplete Transib TPase-like proteins.  Column 2 lists E-values of best matches between RAG1s and Transib TPases detected in BLASTP searches against the original GenBank set.  Column 3 reports Ei-values of best matches between RAG1s and a PSSM derived from the chosen query sequence and the two GenBank TPase-like proteins in PSI-BLAST searches against the original set of all GenBank proteins (see Materials and Methods).  Columns 4–6 report the Ei-values for best matches between RAG1s and a Transib-derived PSSM after adding 7, 13, and 18 Transib TPases to the GenBank set, respectively.  The numbers of the PSI-BLAST iterations after which the entire RAG1 core significantly aligned with the TPases are indicated in parentheses.  Ei-values greater than 1 are indicated by dashes.  Each empty cell indicates that the corresponding TPase query was not used at the particular stage of PSI-BLAST analysis.

Diversity of the Transib TPases and RAG1 Core–Like Proteins in Animals

The phylogenetic tree was obtained by using the neighbor-joining algorithm implemented in MEGA [44].  Evolutionary distance for each pair of protein sequences was measured as the proportion of aa sites at which the two sequences were different.  Its scale is shown by the horizontal bar.  Bootstrap values higher than 60% are reported at the corresponding nodes.  Species abbreviations are as follows: AA, yellow fever mosquito; AG, African malaria mosquito; BF, lancelet; CL, bull shark; DP, D. pseudoobscura fruit fly; FR, fugu fish; HM, hydra; HS, human; NV, starlet sea anemone; SP, sea urchin; XL, frog.  (Transib1 through Transib5 are from D. melanogaster fruit fly.)

Common Structural Hallmarks of the Transib TPase and RAG1 Core

All three core residues from the catalytic DDE triad in the RAG1 proteins (residues 603, 711, and 965) that are necessary for V(D)J recombination [21,22] are conserved in the Transib TPases (Figures 3 and Figure S3).  This includes the distances between the second D and E residues, which are much longer in Transib transposons (206–214 aa) and RAG1 (253 aa) than in DDE TPases from other studied superfamilies (e.g., approximately 35-aa in Mariner/Tc1 [23], 2-aa in P [23], approximately 35-aa in Harbinger [24], with hAT as an exception (325-aa, [25]).  Moreover, each catalytic residue is a part of a motif that is conserved in the Transib TPases and RAG1 (motifs 4, 6, and 10 in Figures 3 and Figure S3).  The RAG1 core is composed of the N-terminal region and the central and C-terminal domains ([26,27].  The N-terminal region includes the RSS nonamer-binding regions (residues 387–480), referred to as NBR [28,29].  The two terminal motifs of RAG1 NBR are conserved in the Transib TPases (Figure S3), which indicates that they may be important for their binding to the Transib TIRs during transposition (the RSS-like structure of TIRs is described below; Figure 4).  The central domain of the RAG1 core (residues 531–763) includes two aspartic acid residues from the DDE triad and is also thought to be involved in binding to the RSS heptamer and RAG2 [30,31].  The C-terminal domain of RAG1 (residues 764–1011) is the portion of RAG1 that is most conserved between RAG1 and Transib TPases.  In addition to the catalytic activity attributed to the last residue of the DDE triad, this domain has a strong nonspecific DNA-binding affinity because it binds to coding DNA upstream of the RSS heptamer, and is thought to be involved in RAG1 dimerization [26,27].  This domain is predicted to function analogously in Transib transposons.  Several other motifs conserved in Transib TPases and RAG1 include aa residues that have been shown experimentally to be important for specific functions in V(D)J recombination (Figure S3).  Based on this information, the function of these motifs in Transib TPases is expected to be similar to that in RAG1.  Among the most conserved motifs, motif 5 (see Figures 3 and Figure S3) is of particular interest because its function is not known yet but is expected to play a role both V(D)J recombination and Transib transposition.  In conjunction with detailed studies of the Transib superfamily, we also analyzed the remaining nine known superfamilies of DNA transposons defined by diverse TPases (see Table 1 in [24]).  Some of these TPases, including Mariner, Harbinger, P, and hAT, also contain the catalytic DDE triad [23].  However, based on PSI-BLAST searches, no significant similarities between these nine TPases and RAG1 protein were found (data not shown).  Therefore, given that the only significant similarity of the RAG1 core was to the Transib TPase, the RAG1 core was re-confirmed as belonging to the Transib superfamily.  In addition to the statistically significant similarity between the approximately 600-aa RAG1 core and Transib TPases, there are two other lines of evidence suggesting evolution of the V(D)J machinery from Transib DNA transposons.  They include the characteristic TSDs and structure of the TIRs discussed in the next two sections.

Multiple Alignment of Ten Conserved Motifs in the RAG1 Core Proteins and Transib TPases

The motifs are underlined and numbered from 1 to 10.  Starting positions of the motifs immediately follow the corresponding protein names.  Distances between the motifs are indicated in numbers of aa residues.  Black circles denote conserved residues that form the RAG1/Transib catalytic DDE triad.  The RAG1 proteins are as follows: RAG1_XL (GenBank GI no. 2501723, Xenopus laevis, frog), RAG1_HS (4557841, Homo sapiens, human), RAG1_GG (131826, Gallus gallus, chicken), RAG1_CL (1470117, Carcharhinus leucas, bull shark), RAG1_FR (4426834, Fugu rubripes, fugu fish).  Coloring scheme [43] reflects physiochemical properties of amino acids: black shading marks hydrophobic residues, blue indicates charged (white font), positively charged (red font), and negatively charged (green font); red indicates proline (blue font) and glycine (green font); gray indicates aliphatic (red font) and aromatic (blue font); green indicates polar (black font) and amphoteric (red font); and yellow indicates tiny (blue font) and small (green font).  The species abbreviations for the Transib transposons are as follows: AA, yellow fever mosquito; AG, African malaria mosquito; DP, D. pseudoobscura fruit fly.  (Transib1 through Transib5 are from the fruitfly D. melanogaster.)

Structural Similarities between the Transib TIRs and V(D)J RSS Signals

The species abbreviations are: AA, yellow fever mosquito; AG, African malaria mosquito; DM, D. melanogaster fruit fly DP, D. pseudoobscura fruit fly; SP, sea urchin.  (Transib1 through Transib5 are from the fruit fly D. melanogaster.)  (A) Frequencies of the most frequent nucleotides at each position of the consensus sequence of the 5' TIRs of transposons that belong to 20 families of Transib transposons identified in fruit flies and mosquitoes.  The RSS23 consensus sequence is shown immediately under the TIRs consensus sequence.  The most conserved nucleotides in the RSS23 heptamer and nonamer, which are necessary for efficient V(D)J recombination, are highlighted.  The 23 ± 1 bp variable spacer is marked by Ns.  (B) Non-gapped alignment of consensus sequences of 5' TIRs from 21 families of Transib transposons.  (C) The 12/23 rule follows from the basic structure of TIRs of the consensus sequences of transposons that belong to the Transib5, Transib2_AG, TransibN1_AG, TransibN2_AG, and TransibN3_AG families.  The 5' TIRs of these transposons are aligned with the corresponding 3' TIRs.  Structures of the 5' and 3' TIRs resemble RSS12 and RSS23, respectively.

Similar Length of TSDs and Target Site Composition in Transib and RAG1/2-Mediated Transpositions

It has been known that RAG1-mediated transposition in vitro, both intermolecular and intramolecular, is most frequently accompanied by 5-bp TSDs [12,13].  In one study [12], 35 of 38 (92%) TSDs generated during RAG-mediated intermolecular transposition were 5 bp long, and the remaining 8% were either 4 or 3 bp long.  Also, 69% of 36 TSDs recovered during RAG-mediated intramolecular transpositions were 5 bp in length; of the remaining ones, 28% were 4 bp and 3% were 3 bp long.  In another study [13], six of six TSDs detected in the intermolecular transposition were 5 bp long.  Intramolecular transposition mediated by murine RAG1/2 proteins was also studied recently in vivo in yeast [14].  Again, 60% of TSDs recovered in 26 events were 5 bp long [14].  Given the predominance of 5-bp TSDs, it is striking that Transib transposons belong to the only superfamily of eukaryotic DNA transposons with 5-bp TSDs generated upon insertions into the genome [17,24].  To illustrate the characteristic 5-bp TSDs, we show copies of Transib transposons with intact 5' and 3' TIRs from diverse families of Transib transposons present in the D. melanogaster, D. pseudoobscura, A. gambiae, and S. purpuratus genomes (Figure S4).  Moreover, some families show high target site specificity, e.g., Transib-N1_AG and Transib-N2_AG integrate preferentially at cCASTGg and cCAWTGc, respectively (TSDs are capitalized).  RAG1/2-mediated transpositions also show significant target specificity, presumably reflecting the original specificity of the Transib TPase [12].  Indigenous properties of the Transib TPase, that were not related directly to RAG1 functions, including those responsible for the precise 5-bp length of TSDs, might have been altered during evolution of RAG1, leading to occasional 4-bp and 3-bp TSDs that are atypical for Transib transposons.  Both RAG1/2-mediated and Transib transpositions show strong preference for GC-rich target sites [12–14,32], even though genomes hosting Transib transposons are AT-rich (Figure S4; Table 2).

Structure of Transib TIRs

The structure and conservation patterns of the 38-bp termini of Transib transposons from 21 different families closely resemble those of RSSs, suggesting that the latter were derived from termini of ancient Transib transposons (Figures 4 and S4).  The 38-bp consensus TIR of Transib transposons consists of a conserved 5'- CACAATG heptamer separated by a variable 23-bp spacer from an AAAAAAATC-3' nonamer.  This corresponds closely to the structure of RSSs, which are composed of the conserved heptamers 5'- CACAGTG separated by a variable 22-bp spacers from ACAAAAACC-like nonamers [1,5–7].  Only bases at positions 1 through 3 in the heptamer and at positions 5 and 6 in the nonamer are universally conserved in RSSs and absolutely essential for efficient V(D)J recombination [5–7].  The corresponding positions are perfectly conserved in all Transib transposons (Figure 4A and 4B; excluding the 85% conserved position 34 in the Transib consensus that corresponds to position 5 in the RSS nonamer).  The probability of the observed match between the RSS and Transib termini to occur by chance is less than 10-3 (see Materials and Methods).  Although most Transib families are represented by transposons flanked by TIRs similar to RSS23 (Figure 4A), several families include transposons with 5' and 3' termini similar to RSS12 and RSS23, respectively (Figure 4C).  Therefore, even the 12/23 rule [1] can be derived directly from the sequence structure of known Transib transposons.

RAG1 Core–Like Sequences in the Sea Urchin, Lancelet, Starlet Sea Anemone, and Hydra Genomes

Using RAG1 proteins as query sequences in a WU BLAST search against sea urchin contigs sequenced at Baylor College (see Materials and Methods), we identified eight proteins approximately 30% identical to portions of the RAG1 core and approximately 50% identical to each other (see Figures 2, 5, and S5).  Only one protein is present in two copies, which are 94% identical to each other at the DNA level (contigs 81987 and 6797).  Both copies appear to be encoded by pseudogenes damaged by a stop codon at the same position of each protein.  Interestingly, the 6,690-bp contig 6797 harbours two additional defective pseudogenes coding for different RAG1 core–like proteins (Figure 5).  We also identified a 597-aa protein sequence encoded by a single open reading frame (contig 29068, positions 1157–2944), which is 28% identical to nearly the entire RAG1 core (positions 461–1002 in the human RAG1, Figure S5).  Extensive analysis of the flanks failed to show any hallmarks of putative transposons that might be associated with this RAG1-like protein, and we did not find any evidence indicating that other RAG1 core–like proteins are encoded by transposable elements (Figure 5).  Using FGENESH [33], we detected that the RAG1 core–like open reading frame (ORF) in the contig 29068 forms a terminal exon (positions 1154–2947) of an incomplete hypothetical gene composed of two exons (internal and terminal; see Figure S6).  The 3' terminal portion of the internal exon encodes a protein sequence that appears to be marginally similar to an approximately 50-aa fragment of the RAG1 core (positions 394–454 in human RAG1; Figure S5).  The RAG1 core–like protein in whole genome shotgun (WGS) contig 12509 (Figure 5) also seems to be encoded by the last exon starting at position 1650 of a hypothetical RAG1-like gene.  Although the two proteins are only 38% identical to each other, they share common features: (1) their N-terminal portions are missing and the RAG1-like sequences start at positions 17 or 18; (2) in both proteins the first aa residue overlaps with the acceptor splice site; and (3) their similarity to RAG1 starts at positions corresponding to position 470 of the human RAG1.  Remarkably, the acceptor splice site positions in the sea urchin RAG1 core–like proteins closely correspond to those in RAG1 from teleosts (i.e., most of the living ray-finned or bony fish), in which RAG1 is split by an intron at position homologous to Gly460 in human RAG1 [34].  Using the same RAG1 query sequences in a TBLASTN search against WGS trace sequences from the lancelet (Branchiostoma floridae) genome recently sequenced at the Joint Genome Institute (see Materials and Methods), we found that the lancelet genome encodes protein sequences approximately 35% identical to the RAG1 core (Figure S5; RAG1L_BF; BLASTP E-value is equal to 10-34).  Again, as in the case of the sea urchin sequences, the lancelet RAG1 core–like elements show no hallmarks of transposons (data not shown).  However, unlike highly conserved RAG1 proteins, the RAG1 core–like proteins are remarkably diverse (see Figure 2).  During the second review of the manuscript of this article, we were kindly informed by Dr.  Hervé Philippe of a RAG1 core–like sequence present the starlet sea anemone (Nematostella vectensis).  After that, we screened all available Trace Archives (Materials and Methods) and detected additional RAG1-like proteins.  In starlet sea anemone, several approximately 1000-bp WGS trace sequences were found (e.g., GenBank Trace Archive IDs 668021618, 558173651, 568641192, and 599572062), which encode protein, called RAG1L_NV, that is approximately 30% identical to the human RAG1 core (positions 284–802, TBLASTN, 10-26 < E < 10-7).  We also found several approximately 1000-bp WGS trace sequences of Hydra magnipapillata (Trace Archive IDs 688654311, 647073738, 666995387, 687186526, 688683890, and 688948453), coding for protein sequences 26%-30% identical to the RAG1 core (positions 753-995, E-value is approximately equal to 10-7 in a BLASTX search against GenBank).  Using these trace sequences, we partially assembled a hydra gene, called RAG1L_NM, which encodes the RAG1 core–like protein.  Remarkably, the hydra RAG1L_NM protein turned out to be significantly similar to the Transib TPase (26% identity; E-value is approximately equal to 10-14 in a BLASTX search against GenBank proteins combined with the Transib TPase sequences).  Therefore, the hydra RAG1 core–like protein provides the first direct link between the RAG1 core and Transib TPase.

Schematic Structure of the Sea Urchin RAG1-Like Sequences

Contig accession numbers are shown in the left column.  Inverted complement contigs are marked by “c” followed by the contig number.  In each contig, RAG1-like proteins (white rectangle) are schematically aligned with the human RAG1 core (top rectangle).  Nucleotide positions of the RAG1-like sequences are shown beneath the white rectangles.  Three pairs of recently duplicated sequences (nucleotide identity is higher than 95%) are underlined by red, green, and black lines, respectively.  Transposable and repetitive elements detected in the flanking regions are marked by painted rectangles.  Names of these elements are shown above the rectangles.  Asterisks denote stop codons in the corresponding RAG1-like sequences.  BLASTP E-values characterizing similarities between the sea urchin and RAG1 proteins are shown above the white rectangles.  Multiple alignment of these protein sequences is reported in Figure S5.

N-Terminal–Like Domain of RAG1 in the Sea Urchin, Lancelet, Starlet Sea Anemone, and Hydra Genomes

A separate analysis of the assembled sea urchin sequences yielded seven sequences encoding three diverse proteins that were significantly similar to the 380-aa N-terminal domain of RAG1 (BLASTX, E < 10-4), excluding the 100-aa N-terminus (Figure 6).  The first 305-aa protein is encoded by contig 1226, and its recently duplicated copies are on contigs 1219 and 1222 (approximately 95% identical to each other at the protein level).  The second, 195-aa protein (contig 83099) is the shortest.  It is only approximately 26% identical to the first protein and more than 90% identical at the DNA level to its duplicate on contig 86231.  We also found a third protein on contig 768 that contains unique motifs in its N-terminal regions that best match the homologous regions of RAG1.  Furthermore, we found that unassembled WGS trace sequences encode two other proteins, P4_SP and P5_SP, similar to the N-terminal RAG1 domain (Figure 6).  By analyzing the lancelet WGS traces, we also found that the lancelet genome encodes five different proteins similar to the N-terminal domain of RAG1 (BLASTP E values in searches against all GenBank proteins were in a range of 10-14–10-7).  DNA sequences coding for these proteins, P1_BF through P5_BF, were manually assembled from overlapping WGS sequences (data available upon request).  The proteins detected in the sea urchin and lancelet genome share a ring finger motif as well as two novel motifs matching the N-terminal RAG1 domain (Figure 6) and remotely resembling C-x2-C zinc finger motifs.  The new conserved motifs are H-x3-L-x3-C-R-x-C-G and D-x3-I-h-P-x2-F-C-x2-C, and their function remains to be determined.  It is thought that the ring finger motif of RAG1 functions as a zinc-binding domain, is involved in dimerization [30,35], and acts as an E3 ligase in the ubiquitylation [36].  It also likely that the N-terminal RAG1 and RAG1-like proteins share an additional conserved motif W-x-p-h-x(3–6)-C-x2-C that resides between conserved motif 2 and the ring finger (Figure 6).  None of the sea urchin and lancelet proteins align to the approximately 100-aa N-terminus of RAG1, which may indicate that this portion is missing from the genome or highly diverged and difficult to detect.  It is also worth noting that this portion corresponds to a separate exon in some teleosts (see Discussion).  The ring finger motif itself is also present in several sea urchin proteins unrelated to RAG1 but significantly similar to diverse proteins associated with immune and developmental systems as well as regulation of transcription.  To test whether the reported sea urchin sequences represent a true RAG1-like match, we cut off the ring finger motif and repeated the BLASTP search against all GenBank proteins.  Even without the finger, the remaining portions of the sea urchin sequences were significantly similar to the corresponding portions of RAG1.  BLASTP E-values were 9×10-9, 7×10-5, and 10-3 for the P5_SP, P4_SP, and 768_SP sequences, respectively; because both the low-complexity filter and composition-based statistics were applied, the corresponding E-values were estimated very conservatively.  BLASTP searches of the sea urchin sequences against all GenBank proteins, excluding RAG1, detected only the ring finger domain of the sea urchin sequences.  E-values of these matches were much higher than the E-values of similarities to the RAG1 proteins (SP_768: 0.04 versus 7×10-7; SP_86231: 3·10-4 versus 7×10-7; SP_1226: 10-4 versus 2×10-7; P4_SP: 10 versus 2×10-7; P5_SP does not have ring finger and matches RAG1 only, E-value == 9×10-7).  Based on the same approach, our study found that the starlet sea anemone and hydra genomes also encode several families of the N-terminal RAG1 domain that appear to be separate from the RAG1 core–like proteins (data not shown).  The only exception was the already mentioned sea anemone RAG1 core–like sequence.  The approximately 90-aa N-terminus of the latter sequence is the ring finger (E < 10-7, multiple BLASTP matches against known ring fingers in GenBank).

Multiple Alignment of the RAG1 N-Terminal Domain and Sea Urchin Protein Sequences

RAG1_HS, RAG1_PD, RAG1_SS, RAG1_RM, and RAG1_LM mark the human (GenBank accession number NP_000439), lungfish (AAS75810), pig (BAC54968), stripe-sided rhabdornis or Rhabdornis mysticalis bird (AAQ76078), and latimeria (AAS75807) proteins, respectively.  The sea urchin and lancelet proteins are marked by “_SP” and “_BF” following the identification numbers of the corresponding contigs.  Protein sequences assembled from the sea urchin and lancelet WGS Trace Archives are denoted as P4-P5_SP and P1-P5_BF, respectively.  Three conserved motifs are underlined and numbered.  The third conserved motif is known as the ring finger.  Distances from the protein N-termini are indicated by numbers.

Discussion

The significant similarity between the Transib TPases and RAG1 core, the common structure of the Transib TIRs and RSSs, as well as the similar size of TSDs characterizing transpositions of Transib transposons and transpositions catalyzed by RAG1 and RAG2, directly support the 25-year-old hypothesis of a transposon-related origin of the V(D)J machinery.  Previously, the “RAG transposon” hypothesis was open to challenge by alternative models of convergent evolution.  Because there were no known TPases similar to RAG1, it could be argued that RAG1 independently developed some TPase-like properties, rather than deriving them from a TE-encoded TPase [24].  These arguments can now be put to rest.  As shown in this paper, the RAG1 core was derived from a Transib TPase, but given the low identity between the Transib TPase and the RAG1 core (14%–17%) it is not clear whether the ancestral transposon was a member of the group of canonical Transib transposons preserved in modern genomes of insects, hydra, and sea urchin (see Figure 1), or a member of an unknown group of Transib transposons that encoded a TPase that was more similar to RAG1 core than to the canonical TPase from the currently known Transib transposons.  Furthermore, after its recruitment, the RAG1 core most likely went through a period of intensive transformations due to diversifying/positive selection, which further decreased its similarity to Transib TPase.  Afterwards, the RAG1 genes continued to evolve at a slow and steady pace under stabilizing selection, as indicated by the observed conservation of the RAG1 core (79% identity between sharks and mammals).  Some of the intermediate stages of RAG1 evolution can be inferred from analysis of the sea urchin in which RAG1-like proteins were recently observed [37], and from analysis of the lancelet, starlet sea anemone, and hydra genomes.  Based on the presence of stop codons disrupting some of the RAG1-like sequences, it has been suggested [37] that the sea urchin sequences represent remnants of transposable elements.  Typically, TPase-coding autonomous DNA transposons are present in only a few complete copies per genome.  At the same time, sequences homologous to their terminal portions, including specific TIRs, are usually abundant due to the proliferation of nonautonomous DNA transposons fueled by the TPase expressed by the corresponding low-copy autonomous elements.  Therefore, even if only 30% of the sea urchin genome has been sequenced to date, it is expected that the regions flanking the TPase portions of potential autonomous elements should be similar to numerous nonautonomous elements.  So far, we have found no evidence of such similarities.  Detailed analysis of regions flanking the sea urchin RAG1-like DNA coding sequences revealed a variety of different transposable elements inserted in the proximity of the coding sequences (see Figure 5).  Nevertheless, based on the orientations and relative positions of these transposons, none of them appears to be associated with the RAG1-like sequences (see Figure 5).  We also could not identify the 5-bp TSDs and TIRs characteristic of the Transib superfamily.  Still, given that only one third of the sea urchin genome is currently assembled as a set of contigs longer than several thousand nucleotides (the remaining portion is represented by short WGS sequences), we cannot rule out the possibility that the sea urchin RAG1-like proteins are remnants of an unknown branch of Transib transposons.  Given that the genomes of lancelet, hydra, and starlet sea anemone are currently available only as unassembled WGS traces, the question whether the corresponding RAG1-like sequences are remnants of transposons or genes/pseudogenes must be left open.  The alternative possibility is that the sea urchin RAG1 core–like sequences represent diverse genes and pseudogenes that belong to a rapidly evolving multigene family.  This opens the tantalizing possibility that the RAG1 core was recruited from a Transib TPase in a common ancestor of Bilaterians and Cnidarians, and subsequently lost in nematodes, insects, and sea squirts (see Figure 1).  Furthermore, given that the sea urchin, lancelet, hydra, and starlet sea anemone genomes harbor several highly divergent N-terminal–like domains, separate from the RAG1 core–like sequences and known transposable elements, it is very likely that the N-terminal–like domains of RAG1 also form a multigene family that can be traced back to a common ancestor of Deuterostomes (see Figure 1).  If so, then both N-terminal and core domains of RAG1 might have been derived from different genes present in a common ancestor of Deuterostomes.  Alternatively, the N-terminal domain of RAG1 might have been derived from a separate, unknown transposon.  The N-terminal domain of RAG1 has long been viewed as distinct from the core domain due to its lack of direct involvement in the V(D)J recombination reaction.  In the sea urchin, lancelet, hydra, and starlet sea anemone genomes, the RAG1 core–like sequences and the N-terminal domain–like sequences do not appear to be linked to each other or to any other proteins.  The only notable exception is the anemone RAG1 core–like protein sequence, which is capped by the 90-aa ring finger motif.  Taken together with the fact that only the RAG1 core is significantly similar to Transib TPase, the data suggest that the vertebrate RAG1 represents a fusion of once separate proteins.  This is consistent with the observation that in teleosts, (bony fish) the RAG1 gene is divided into exons by either one or two introns.  As a result, the RAG1 core is split into separate exons at the aa position that corresponds to position 460 in the human RAG1gene [29,34,38].  The core-like sequences encoded by the sea urchin WGS sequence contigs 29068 and 12509 correspond to either the second or third RAG1 exon in teleosts (depending on the number of introns), which is remarkably consistent with the fusion model.  The same model predicts that the N-terminal domain of RAG1 could also be assembled from two separate domains based on the presence of the second intron in some teleosts, splitting the N-terminal domain into the 102-aa N-terminal subdomain and the rest [34].  As indicated above, this subdomain, corresponding to the first exon in the genes split by two introns, appears to be missing in the sea urchin, lancelet, hydra, and starlet sea anemone N-terminal–like proteins.  It may be encoded by a separate exon that is difficult to detect given its short length and the high level of sequence divergence between these species and vertebrates, or it might have been added in vertebrates.  Similarly, the RAG1 core–like protein in the sea urchin genome is shorter in its N-terminal part than the core domain in vertebrates and the corresponding Transib TPase.  Again, it is unclear if this part is not present in sea urchins or simply undetectable due to its small size and the high sequence divergence.  It is currently believed that both RAG1 and RAG2 proteins were originally encoded by the same transposon recruited in a common ancestor of jawed vertebrates [3,12,13,16].  However, none of the Transib transposons identified so far encode any proteins other than the Transib/RAG TPase.  Also, we could not find any RAG2-like sequences in the recently sequenced sea urchin, lancelet, hydra, and sea anemone genomes, which encode RAG1-like sequences.  Autonomous DNA transposons from the MuDR, Harbinger, and En/Spm superfamilies are each known to encode a second regulatory protein [23,24], whereas some transposons from these superfamilies encode the TPase only.  Therefore, it is in principle possible that an ancient vertebrate Transib that was a direct ancestor of the RAG1 core also encoded a second protein, the direct ancestor of RAG2.  Nevertheless, the apparent lack of RAG2-like proteins in the sequenced portion of the sea urchin, lancelet, hydra, and sea anemone genomes, as well as in Transib transposons suggests that RAG2 was introduced in a separate event in jawless vertebrates.  However, given the low 30% identity between the RAG1 and sea urchin/lancelet/sea squirt RAG1-like proteins, we cannot exclude the possibility that the ancestral RAG2 protein went through a period of strong diversification driven by positive selection, and it can no longer be identified by sequence comparisons but may still be present in invertebrates.  In any case, the origin of the V(D)J recombination system in jawless vertebrates appears to be a culmination of earlier evolutionary processes rather than an isolated event associated with insertion of a single transposon.  If so, detailed studies of individual components, including active Transib transposons and invertebrate proteins homologous to RAG1 elements can bring new breakthroughs in our understanding of evolutionary and mechanistic aspects of V(D)J recombination.  The observed sequence similarity between the RAG1 and Transib TPase protein can help to identify aa residues in the TPase that are crucial for transposition of Transib transposons.  For instance, on the basis of the TPase comparison to RAG1 (see Figures S1 and S3), we were able to identify correct positions of the last two aa residues in the DDE catalytic triad (see Figure 2 in [17]), missed in our previous study due to insufficient data.  Interestingly, only two cysteines of the zinc finger B (ZFB) C2H2 motif in RAG1 (residues 695–761) involved in its binding to RAG2 [30,31] are perfectly conserved in the Transib TPases (motif 7; see Figures 3 and Figure S3).  The remaining portion of the ZFB motif was probably lost in TPases of insect Transib transposons, which do not encode RAG2-like proteins.  Notably, two ZFB cysteines are part of the conserved SxxCxxC motif, and mutations of the serine from the same motif cause severe defects in RAG1 transpositions in vitro [32].  Therefore, the presence of serine in this motif is expected to be crucial to Transib transpositions.  After submission of our manuscript, additional biochemical evidence favoring evolution of V(D)J recombination from transposable elements was reported [25].  Analogously to V(D)J recombination, transposition of the fly Hermes transposon, which belongs to the hAT superfamily, is also characterized by a double-strand break via hairpin formation on flanking DNA and 3' OH joining to the target DNA [25].  However, although the observed biochemical relationship between the hAT TPase and V(D)J recombination is a step forward in our understanding of transposition reaction, several arguments strongly suggest that V(D)J machinery evolved from a Transib rather than from hAT transposon.  First, as we mentioned previously, there is no significant sequence identity between hAT TPases and RAG1, even if one employs a PSI-BLAST search with most relaxed parameters (i.e., E < 10, no filters, no composition-based statistics).  Second, although RAG1/2-mediated transpositions are characterized by 5-bp (sometimes 4-bp) TSDs, all known hAT transposons are characterized by 8-bp TSDs.  Third, unlike in the case of Transib transposons, TIRs of hAT transposons are different from RSS both in terms of DNA sequence similarities and their conservation patterns (Figure S7).  Fourth, hAT- and RAG1/2-mediated transpositions differ dramatically in terms of the GC content of their target sites: Unlike Transib transposons and RAG1 transpositions occurring in GC-rich DNA, hAT transposons tend to be integrated into AT-rich regions (Table S2).  All four arguments strongly favor evolution of V(D)J machinery from a Transib transposon.  Most likely, the Transib transpositions are also characterized by hairpin intermediates formed by the ends of the donor DNA double-strand breaks, as observed during V(D)J recombination and hAT transposition.

Materials and Methods

DNA and protein sequences.

Assembled D. pseudoobscura sequences were downloaded from the Human Genome Sequencing Center at Baylor College of Medicine through the Web site at http://hgsc.bcm.tmc.edu/projects/drosophila/ on 2 March 2004.  Preliminary A. aegypti sequence data were obtained from The Institute for Genomic Research through the Web site at http://www.tigr.org on 4 March 2004.  Assembled D. melanogaster sequences were downloaded from the Berkeley Drosophila Genome Project at http://www.fruitfly.org/sequence/download.html on 17 February 2004.  Partially assembled S. purpuratus contig sequences were downloaded on 12 August 2004 from the Baylor College of Medicine through the Web site at ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Spurpuratus/blast/Spur20030922-genome.  In addition to the assembled contigs, Baylor College of Medicine, Human Genome Sequencing Center (http://www.hgsc.bcm.tmc.edu) produced an approximately 8-Gb set of short unassembled WGS sequences, called “traces”, which cover nearly the entire sea urchin genome.  We downloaded these sequences from the GenBank Trace Archive at the National Center for Biotechnology Information (NCBI; ftp://ftp.ncbi.nih.gov/pub/TraceDB/strongylocentrotus_purpuratus/) on 17 November 2004.  Also, we downloaded an approximately 5-Gb set of unassembled traces that cover almost completely the 600-Mb genome of Florida lancelet (ftp://ftp.ncbi.nih.gov/pub/TraceDB/branchiostoma_floridae/; 3 December 2004).  These sequences were produced and deposited in the GenBank Trace Archive by Department of Energy Joint Genomic Institute (http://www.jgi.doe.gov/).  All other DNA and protein sequences were accessed from GenBank (NCBI) through the server at http://www.ncbi.nih.gov/Genbank/ and from Ensembl (EMBL-EBI and Sanger Institute) via the server at http://www.ensembl.org.  Sequences of the Transib1 through Transib4 and Transib1_AG through Transib3_AG transposons [17] were obtained from the D. melanogaster (drorep.ref) and A. gambiae (angrep.ref) sections of Repbase Update [39] at Genetic Information Research Institute (http://www.girinst.org).

Sequence analysis.

Computer-assisted identification and reconstruction of the Transib transposons was done as described previously [17,40–42].  DNA sequence analysis including local sequence alignments, multiple alignments, and reconstruction of the Transib consensus sequences was done using software developed at Genetic Information Research Institute (available upon request) and WU-BLASTN 2.0 (http://blast.wustl.edu).  To avoid background noise introduced by mutations, Transib relics, whose TPase-coding regions contained numerous stop codons and indels, were ignored unless several copies were available.  (We included in the analysis incomplete relics of the Transib2–5_AA TPases represented by single DNA copies.)  Prediction of putative exons and introns encoded by the Transib consensus sequences was done with FGENESH [33] (at http://www.softberry.com).  Multiple alignments of distantly related RAG1 and Transib TPase protein sequences were created by T-Coffee [40].  Shading and minor manual refinements of the aligned sequences were done using Genedoc [43].  Phylogenetic trees were produced by using MEGA3 [44].  Some of the sea urchin sequences encoding the RAG1 N-terminal domain were assembled from traces based on the Baylor BAC-Fisher server at http://www.hgsc.bcm.tmc.edu/BAC-Fisher/ (the results of assembly were verified manually).  All GenBank proteins were downloaded from ftp://ftp.ncbi.nih.gov/blast/DB/fasta/nr (February 2004) and were combined into a single set with the identified Transib TPases.  No Transib TPases had been deposited or annotated previously in GenBank, except for two short hypothetical proteins predicted automatically during annotation of the D. melanogaster genome: 151-aa gi:30923617 and 123-aa gi:30923765.  These proteins are apparent fragments of Transib TPases encoded by relics of Transib transposons, including Transib5_DM.  A standalone 2001 version of PSI (Position-Specific Iterating)-BLAST [18,45] was used for detection of proteins that were significantly similar to TPases encoded by Transib and other superfamilies of DNA transposons.  The PSI-BLAST program [18,45] is much more sensitive than a regular BLAST search due to the use of PSSM) PSI-BLAST first performs a standard BLASTP search of a protein query against a protein database and constructs a multiple alignment of matches exceeding a certain E-value threshold (called Ei value for the inclusion of sequences into PSI-BLAST iterations).  From this alignment, a PSSM is constructed.  The PSSM is a weight matrix indicating the relative occurrence of each of the 20 aa at each position in the alignment.  This new PSSM is used as the score matrix for a new BLAST search in a second iteration.  The process is repeated for a specific number of iterations or until convergence, when no additional proteins are added on successive iterations.  The use of a PSSM in place of a fixed generic substitution matrix such as BLOSUM62 results in a much more sensitive BLAST search [18,45].  Important practical aspects of using PSI-BLAST were recently described [46].  To ensure that a conservation profile for the Transib TPases and RAG1 proteins was not produced by a systematic error, we employed a procedure of “step-wise” PSI-BLAST iterations.  In this procedure we studied dependence of Ei values on the number of the Transib TPases combined with the GenBank proteins.  The following protocol describes the procedure: (1) Use a GenBank set combined with N number of Transib TPases (in our studies, N was equal to 7, 13, and 18), (2) run PSI-BLAST against GenBank combined with TPases using each TPase as a query or seed, (3) select only Transib TPase sequences with E-values less than 10-5 to define the PSSM, (4) take the best E-value (Ei) obtained by PSI-BLAST for RAG1s when PSSM is constructed without RAG1, then (5) repeat these operations for different numbers (N) of TPases.  Significant convergence of RAG1 and Transib TPases was observed to be independent of the particular type of substitution matrix (the same result was observed for both BLOSUM62 and PAM70 matrixes).  To avoid detection of false similarities caused by simple repeats and coiled coils, the PSI-BLAST search was performed using stringent conditions with the SEG [47] and COILS [48] filters masking all low-complexity regions and coiled coils, respectively; composition-based statistics [45] were also employed.  The probability P1 that the 5' terminus of a transposon from a particular Transib family would match by chance an RSS at its most conserved positions (positions 1-3 in the RSS heptamer, and positions 5 and 6 in the RSS nonamer) was estimated based on the following formula: P1 = fC × fA × fC × fA × fA, where fC (0.2) and fA (0.3) are frequencies of C and A in a set of 38-bp 5' termini of Transib transposons from 21 families (see Figure 4).  The value of P1 is 0.001, indicating a significant similarity between Transib TIRs and RSS.  Indeed, given that these five positions conserved in RSS are conserved in all TIRs from 21 families of Transib transposons, and the average identity between these 38-bp TIRs is only 49%, the chance of randomly matching these positions in TIRs from all 21 families is extremely small.  TBLASTN searches against the Trace Archive were performed by using the BLAST client (blastcl3 or netblast at ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/, which accesses the NCBI BLAST search engine.  Names of all available Trace Databases were taken from a list of databases at http://www.ncbi.nlm.nih.gov/blast/mmtrace.shtml.

Preferential Insertion of Transib transposons into GC-Rich Sites

Each of the 35-bp insertion sites corresponds to two 20-bp DNA fragments flanking a genomic Transib element at its 5' and 3' termini.  One of the 5-bp TSDs flanking the 3' terminus of a Transib was excluded in each case.  Analogously, the 15-bp insertion sites were composed of two 10-bp flanking fragm.

Supporting Information

Similarity between C-Terminal Portions of the Transib2_AG TPase and RAG1

Two examples extracted from the NCBI BLASTP output illustrate similarity between the approximately 60-aa C-terminal portions of the Transib2_AG TPase (which we used as a query in a BLASTP search against all GenBank proteins) and the RAG1 core.

Multiple Alignment of Transib TPases

The catalytic DDE triad is marked by black rectangles.  Amino acids are shaded on the basis of their physiochemical properties according to the color scheme implemented in Genedoc [43]: Black shading marks hydrophobic residues, blue indicates charged (white font), positively charged (red font), and negatively charged (green font); red indicates proline (blue font) and glycine (green font); gray indicates aliphatic (red font) and aromatic (blue font); green indicates polar (black font) and amphoteric (red font); yellow indicates tiny (blue font) and small (green font).  The species abbreviations are as follows: SP, sea urchin; DP, D. pseudoobscura fruit fly; AG, African malaria mosquito; AA, yellow fever mosquito.  Transib1 through Transib5 are from the D. melanogaster fruit fly genome.

Multiple Alignment of the RAG1 Core and Transib TPase Proteins

The shading scheme is the same as in Figure S2.  The catalytic DDE triad is marked by black rectangles.  RAG1 aa whose replacements resulted in previously detected defects of V(D)J recombination [31] are marked by color rectangles indicated below the alignment blocks; red indicates DNA binding defect; green indicates nicking defect; cyan indicates hairpin defect; blue indicates joining mutants; yellow indicates catalytic mutants; gray indicates joining/transposition.  Presence and absence of corresponding residues in the Transib TPases are indicated by + and -, respectively.  Conserved motifs are marked by lines numbered from 1 to 10.  The species abbreviations are as follows: DP, D. pseudoobscura fruit fly; AG, African malaria mosquito; AA, yellow fever mosquito; GG, chicken; HS, human; XL, frog; CL, bull shark; FR, fugu fish.

TSDs in Transposons from Different Transib Families

For each family, DNA copies of transposons are aligned to the corresponding consensus sequence.  The consensus sequence is shown in the top line.  Dots indicate nucleotide identity with the consensus sequence; hyphens represent alignment gaps.  Internal portions of transposons are not shown and are marked by xxx.  TSDs are highlighted.  Coordinates of the reported elements are shown in the first two columns (sequence name, beginning to end).  (A) TransibN1_AG family from mosquito.  (B) TransibN2_AG family from mosquito.  (C) TransibN3_AG family from mosquito.  (D) TransibN1_DP family from fruit fly.  (E) Hopper family from fruit fly.  (F) TransibN1_DM family from fruit fly.  (G) TransibN1_SP family from sea urchin.

Multiple Alignment of the RAG1 Core and RAG1 Core–Like Proteins Encoded by the Sea Urchin and Lancelet Genomes

The shading scheme is the same as in Figure S2 and S3.  The species abbreviations are as follows: SP, sea urchin; BF, lancelet; HS, human; CL, bull shark; GG, chicken; XL, frog; FR, fugu fish.  The lancelet RAG1L_BF protein is encoded by several overlapping WGS trace sequences (for example, GenBank Trace Archive identification numbers 543943730, 538583629).

RAG1-Like Protein SP_29068 in the Sea Urchin Contig 29068

(A) Exon/intron structure of the SP_29068 gene is reported based on the FGENESH prediction.  (B) Alignment of the predicted protein and human RAG1 (29% identity, E = 10-43.  The intron in SP_29068 is inserted between residues shaded in green and red.  Gly460 that harbors the intron in the teleost RAG1 is shaded in black.

Structure of hAT 5' Termini

Non-gapped alignment of consensus sequences of 5' termini of transposons from 22 different families is shown beneath the RSS23 consensus sequence, composed of the RSS heptamer and nonamer.  The most conserved nucleotides in the heptamer and nonamer, which are necessary for efficient V(D)J recombination, are highlighted.  Among the necessary RSS nucleotides, only one, marked by a + corresponds to a nucleotide that is 100% conserved in hAT transposons.  The critical third nucleotide of the hAT 5' termini is always G, as opposed to C in the RSS heptamer.  It is also clear from the alignment that the hAT termini do not have any second conserved block, which is expected to be preserved if RSSs have evolved from hAT termini.  Hobo (GenBank number X04705), Homer (AF110403), Hermes (L34807), Ac9 (K01904), Tam3_AM (X55078), TAG1 (L12220), Pegasus (U47019) are active hAT transposons from fruit fly, Queensland fruit fly, house fly, maize, snapdragon, thale-cress, and African malaria mosquito, respectively.  HOPPER_BD is from oriental fruit fly (GenBank AF486809).  The consensus sequences of hAT-1N_DP and hAT-1N_DP (nonautonomous transposons from fruit fly, D. pseudoobscura); HAT1N_DR, hAT-2n1_DR, and hAT-N19_DR (nonautonomous transposons from zebrafish); CHARLIE1A and CHESHIRE (human); hAT-N1_SP (sea urchin); ATHAT1, ATHAT7, and ATHAT10 (thale-cress); PegasusA, HATN4_AG, and hAT-2N_AG (African malaria mosquito) were reported in Repbase Update.

Transib TPase in Eukaryotes

Columns 1 and 2 list common and Latin names of species whose genomes contain Transib TPase sequences.  Column 3 shows GenBank sections collecting corresponding sequences: "NR", "WGS", "EST", and "HTGS" are names of GenBank sections; "tr" stands for “Trace Archives.” Column 4 shows a range of E-values of matches between the sea urchin Transib TPase (Transib1_SP) and TPases encoded by the listed species that were detected in TBLASTN searches against corresponding sections of GenBank.  Matches to the Transib TPase observed for Oryza sativa indica (seven sequences from Trace Archives, 10-48 < E < 10-13) were discarded as a likely sequencing contamination, based on the fact that these sequences were over 80% identical to Hydra magnipapillata traces (the hydra Trace Archive dataset contains over 100 sequences matching the TPase, and hydra Transib TPase sequences are also present in the dbEST section of GenBank).  Analogously, matches to the Transib TPase detected in the AC011430 HTGs and AADC01054609 WGS GenBank sequences, which were annotated as portions of the human genome, were discarded as products of contamination (these sequences contain 100% identical copies of the non-long terminal repeat (LTR) retrotransposon G2_DM [17] from D. melanogaster).

GC Content of Target Sites for hAT Transposons

The table shows that hAT transposons are inserted preferentially into GC-rich sites.  Each of the 35-bp insertion sites corresponds to two 14-bp and 13-bp DNA fragments flanking a genomic hAT element at its 5' and 3' termini; one of the 8-bp TSDs (flanking the 3' terminus of a transposon) was excluded in each case.  Analogously, the 15-bp insertion sites were composed of two 4-bp and 3-bp flanking fragments.  (1) GenBank accession number U47019; (2) Repbase Update, the angrep.ref section; (3) GenBank X04705; (4) Repbase Update, the drorep.ref section; (5) Repbase Update, spurep.ref; (6)Repbase Updates, the zebrep.ref section.  Copies of Pegasus, HATN4_AG, and HAT2N_AG were identified in the mosquito A. gambiae genome; Hobo and hAT-1N_DP in the D. melanogaster and D. pseudoobscura fruit fly genomes, respectively; HAT-1N_SP in the sea urchin genome; and HAT1N_DR, HAT-2N1_DR, and HAT-N19_DR in the zebrafish genome.

A Role for Adenosine Deaminase in Drosophila Larval Development

Abstract

Adenosine deaminase (ADA) is an enzyme present in all organisms that catalyzes the irreversible deamination of adenosine and deoxyadenosine to inosine and deoxyinosine.  Both adenosine and deoxyadenosine are biologically active purines that can have a deep impact on cellular physiology; notably, ADA deficiency in humans causes severe combined immunodeficiency.  We have established a Drosophila model to study the effects of altered adenosine levels in vivo by genetic elimination of adenosine deaminase-related growth factor-A (ADGF-A), which has ADA activity and is expressed in the gut and hematopoietic organ.  Here we show that the hemocytes (blood cells) are the main regulator of adenosine in the Drosophila larva, as was speculated previously for mammals.  The elevated level of adenosine in the hemolymph due to lack of ADGF-A leads to apparently inconsistent phenotypic effects: precocious metamorphic changes including differentiation of macrophage-like cells and fat body disintegration on one hand, and delay of development with block of pupariation on the other.  The block of pupariation appears to involve signaling through the adenosine receptor (AdoR), but fat body disintegration, which is promoted by action of the hemocytes, seems to be independent of the AdoR.  The existence of such an independent mechanism has also been suggested in mammals.

Introduction

Adenosine deaminase (ADA) is an enzyme present in all organisms that catalyzes the irreversible deamination of adenosine and deoxyadenosine to inosine and deoxyinosine.  It is a critically important enzyme for human survival because its congenital absence causes severe combined immunodeficiency disease (SCID).  ADA deficiency accounts for about 20% of all types of SCID [1].  It is one of the most severe human immunodeficiencies and is associated with depletion of all three major categories of lymphocytes: T cells, B cells, and natural killer cells, resulting in impaired cellular immunity and decreased production of immunoglobulins [2].  Without intervention, the affected individuals die from opportunistic infections within the first few months of life.  ADA occurs as a soluble monomer in all human cells, but also exists as “ecto-ADA,” bound to the membrane glycoprotein CD26/dipeptidyl peptidase IV, and it has been suggested that this form of ADA regulates extracellular adenosine levels [3].  ADA deficiency is accompanied by greatly elevated levels of the ADA substrates adenosine and deoxyadenosine, both of which are biologically active purines that can have a deep impact on cellular physiology.  Adenosine is not just a metabolite; it is also a signaling molecule that regulates numerous cellular functions by binding to G protein-coupled adenosine receptors (A1, A2a, A2b, and A3 in mammals) that can regulate intracellular cyclic adenosine monophosphate [4].  Deoxyadenosine is a cytotoxic metabolite released by various cell populations that undergo programmed cell death; it can kill cells through a mechanism that includes disturbances in deoxynucleotide metabolism [5].  Extracellular adenosine is now considered an important stress hormone that is released in excessive amounts in the vicinity of immune cells during both systemic and cellular stress [6].  The predominant source of extracellular adenosine during systemic activation of the stress system is the sympathetic nervous system [7].  Specific inflammatory stimuli such as bacterial products are also capable of triggering adenosine release from immune cells [8].  These data are in line with evidence demonstrating a dramatic increase in extracellular adenosine levels under conditions associated with multiple organ failure, which is the cause of 50%–80% of all deaths in surgical intensive care units [6].  ADA is not the only adenosine deaminase in mammalian cells.  Recently, the cat eye syndrome critical region protein 1 (CECR1) gene was identified and shown to encode a protein representing a subfamily of proteins related to but distinct from classical ADAs [9].  The duplication of a small region of chromosome 22 containing this gene is associated with “cat eye syndrome,” a disorder characterized by hypoplastic kidneys, congenital heart malformation, and anomalous pulmonary venous connections.  The founding member of this subfamily is encoded by insect-derived growth factor(IDGF) [10], and homologs have been described in various organisms [11–14].  We have previously found six Drosophila genes with sequence similarity to the CECR1 subfamily [15].  Their products are mitogenic on Drosophila cells, and at least two of them (ADGF-A and ADGF-D) exhibit strong ADA activity, which is necessary for their mitogenic function.  We therefore named them adenosine deaminase-related growth factors (ADGFs).  We also demonstrated that adenosine functions as a negative signal for cell proliferation and concluded that ADGFs stimulate cell growth in vitro by depletion of extracellular adenosine [16].  Drosophila also contains a gene, termed Ada, with sequence similarity to human ADA, but as we have previously shown the product of this gene is most likely not an active ADA [16].  In this report we show that a null mutation in Drosophila ADGF-A gene leads to dramatically increased levels of adenosine and deoxyadenosine in the larval hemolymph.  This increase leads to larval death associated with the disintegration of fat body and the development of melanotic tumors.  We present a detailed analysis of the hematopoietic defects associated with the adgf-a mutation, show a genetic interaction of this mutation with signaling through the Drosophila adenosine receptor (AdoR, encoded by the gene CG9753) and with regulation of premetamorphic changes by ecdysone, as well as a genetic interaction of ADGF-A with a major innate immunity regulator—the Toll signaling pathway.

Results

Mutation in the ADGF-A Gene Causes Larval Death and Melanotic Tumors

We produced mutations in five of the six ADGF genes by homologous recombination mutagenesis [17] and showed that loss of the most abundantly expressed gene, ADGF-A, leads to death in the larval or pupal stage.  Under optimal conditions (20–30 isolated homozygous larvae per vial), about 60% of larvae homozygous for the adgf-a mutation reach the third instar.  Development during the third larval instar is significantly delayed, and wandering homozygous larvae usually appear 2 d after their heterozygous siblings, which start wandering at about 5 d of development.  Some homozygous third-instar larvae can be found alive in the vial even after 10 d of development.  Mutant third-instar larvae show fat body disintegration (Figure 1A and 1B) and multiple melanotic tumors (Figure 1C), predominantly in the caudal part of the body and accompanied by disintegration of the fat body.  Melanization of the lymph glands was never observed in these larvae, and the imaginal discs and brain appear normal.  Less than 30% of homozygotes eventually pupate.  Homozygous pupae usually die soon after pupariation; in some cases they develop normal head and thorax imaginal structures; however, abdominal parts usually do not develop.  There is also an abnormal curvature (to the right) of the pupal abdomen (Figure 1D).  Less than 2% of mutant pupae develop normally and eventually emerge as adults without any obvious abnormalities besides the abdominal curvature; some of them are sterile.  To confirm that the mutant phenotype is caused solely by a mutation in the ADGF-A gene, we created transgenic flies carrying the ADGF-A gene under a heat-shock promoter (HS-ADGF-A).  The adgf-a homozygous flies carrying the HS-ADGF-A construct showed survival rates significantly higher than adgf-a even without heat shock, probably due to leaky expression of the HS-ADGF-A construct (Figure 2A).  However, while non-heat shocked animals still produced many melanotic tumors, only 22% of animals that were heat shocked as late embryos/early first instar developed these tumors (Figure 2B).  This result confirms that the mutant phenotype is caused by the mutation in the ADGF-A gene.  This conclusion is further supported by the even more efficient rescue achieved by expression of transgenically provided ADGF-A in the lymph glands using the Gal4/UAS system (see below).

adgf-a Mutant Phenotype

(A and B) Fat body disintegration visualized by GFP expression driven by Cg-Gal4 driver in the fat body.  While adgf-a/+ heterozygous third instar larvae have normal flat layers of fat body (A),adgf-a mutant showed extensive fat body disintegration into small pieces of tissue (B).  (C) Multiple melanotic tumors present in adgf-a mutant third-instar larva.  (D) An adgf-a mutant pupa with typical abdominal curvature.

Rescue of the adgf-a Mutant Phenotype by Expression of ADGF-A in Different Tissues

(A) Percentage of pupae (blue bars) and adult flies (purple bars) demonstrating the larval and pupal survival, respectively, of the adgf-a mutant flies rescued by expression of transgenic ADGF-A in different tissues.  Along the x-axis (which is shared with [B]), the rescue experiments are shown (marked by the Gal4 driver used for expression of ADGF-A except for first three sets of bars—the first set presents only an adgf-a mutant, the second an adgf-a mutant carrying HS-ADGF-A construct without heat shock, and the third with heat shock) and the y-axis represents percentage of pupae and adult flies out of the total number of transferred first-instar larvae of particular genotype.  Each experiment was repeated at least four times (with 20–30 animals in each vial) and the standard error is shown.  (B) Percentage of late third-instar larvae with melanotic tumors.  The x-axis is shared with (A) (described above).  The y-axis shows the percentage of larvae with tumors out of all larvae of each genotype examined for (A).

The adgf-a Mutant Phenotype Is Associated with Elevated Levels of Adenosine and/or Deoxyadenosine

Using liquid chromatography and mass spectrometry of deproteinated hemolymph samples, we measured adenosine concentrations in hemolymph of mutant and wild-type third-instar larvae.  The adenosine concentration in the adgf-a mutant was 1.14 ± 0.26 μM compared to less than 0.08 μM in the wild type, and the deoxyadenosine concentration in mutants was 1.66 ± 0.99 μM compared to an undetectable level in the wild type.

The Catalytic Activity of ADGF-A Is Required for Its Function

To test whether the function of ADGF-A in vivo is also dependent on its catalytic activity, we produced two versions of the UAS-ADGF-A construct [18]: one carrying wild-type cDNA of ADGF-A and one carrying an ADGF-A cDNA with a mutation causing a substitution of two amino acids (H386G and A387E) in the catalytic domain [16].  Two different lines carrying the wild-type UAS-ADGF-A expression construct together with an Actin-Gal4 driver (providing ubiquitous expression) both completely rescued the mutant phenotype, whereas larvae with UAS-ADGF-A but without the driver showed the typical mutant phenotype.  However, neither of the two lines carrying the mutated version of the UAS-ADGF-A (producing full-length protein detected by anti-myc antibody; see Materials and Methods) showed any rescue of the mutant phenotype.  This result therefore demonstrates that the catalytic activity of ADGF-A is required for its function in vivo.

Hemocyte Development Is Affected in the adgf-a Mutant

We investigated the number and morphology of hemocytes (blood cells) in the hemolymph of the adgf-a late third-instar larvae (Figures 3 and 4).  These larvae contain an average of seven-fold more hemocytes in circulation than wild-type larvae (Figure 3).  In contrast to normal larval plasmatocytes, which remain rounded after settling down on the substrate (Figure 4A), most of the cells in the adgf-a mutant (more than 75%) are strongly adhesive and, after they are deposited in a drop of hemolymph on a microscope slide, develop filamentous and membranous extensions (Figure 4B–4D).  An average of 7% of hemocytes in the adgf-a mutant are lamellocytes (Figures 3 and 4E), large flat cells that are not present in circulation of wild-type larvae under normal conditions [19].  Crystal cells were also detected in excess, with mutant larvae carrying several hundred while there are fewer than a hundred of these cells in the wild type (Figure 5).  The lymph glands normally do not release hemocytes into the hemolymph before metamorphosis [20]; instead, they are released during metamorphosis when the lymph glands disperse [19].  However, the lymph glands of adgf-a mutant larvae are already dispersed in the late third instar.  This process is similar to normal metamorphic changes, in which the hemocytes are first released from the front lobes, and the posterior lobes disperse later.  To analyze hemocytes in living larvae, we used the Hemolectin marker (Hml) [21].  We compared the number and distribution of hemocytes stained by GFP in flies carrying hml-Gal4 UAS-GFP in wild-type and mutant backgrounds.  While there are relatively few hemocytes, mostly free-floating in the hemolymph, in early third-instar wild-type larvae (see Figure 4I), a much higher number of hemocytes, which are mostly attached to the tissues under the integument (described as sessile hemocytes in [19]), was observed in mutant larvae (see Figure 4J).  A similar behavior was detected later in wild-type larvae, toward the end of the third instar (see Figure 4H).  At this stage, the Hml marker disappeared from the most of the hemocytes in mutants (see Figure 4F and 4G).

Number of Circulating Hemocytes in Late Third-Instar Larvae

Genotypes are shown along the x-axis, and the number of hemocytes/larva along the y-axis.  Each bar shows the number of all circulating hemocytes, and the gray part of the bars represent the lamellocyte population.  Each count was repeated five to ten times and the standard error is shown.

Hemocyte Abnormalities in adgf-a Mutant Larvae

(A–E) Differential interference contrast microscopy of living circulating hemocytes (magnification 200×; scale bar, 10 μm).  Round, nonadhesive plasmatocytes from wild-type larva (A).  Hemocytes from the adgf-a mutant developing filamentous extensions (B and C) or membranous extension surrounding the cell (D).  Large flat lamellocyte from the adgf-a mutant (E).  (F and G) Differential interference contrast and fluorescent microscopy (merged image) of living circulating hemocytes stained by the Hml-GFP marker (magnification 100×; scale bar, 10 μm).  While most of the cells from wild-type larvae are GFP-positive (F), just few of the cells from late third instar adgf-a larvae are stained by GFP at this stage (G).  (H–J) Fluorescence microscopy of living larvae with Hml-GFP stained hemocytes (magnification 40×; scale bar, 100 μm).  Posterior part of late third-instar wild-type larva (H).  Middle sections of early third-instar larvae of wild type (I) and adgf-a mutant (J).

Crystal Cells in Late Third Instar Larvae

Crystal cells were visualized by heating larvae of different genotypes at 60 °C for 10 min [46].  (A) Wild-type larva, (B) adgf-a single mutant, (C) adoR adgf-a double mutant (scale bar, 0.5 mm).

The adgf-a Mutant Phenotype Is Rescued by Expression of ADGF-A in the Lymph Glands

To distinguish which tissues require ADGF-A expression for proper development, we tested for rescue of adgf-a lethality by expressing ADGF-A in specific subsets of larval tissues.  A transgenic line carrying the UAS-ADGF-A construct on Chromosome II was crossed to lines expressing the Gal4 driver [18] in different tissues (Table 1).  Since ADGF-A is normally expressed in the larval lymph glands [16], and the mutant phenotype is characterized by abnormal hemocyte development, special consideration was given to lines expressing the Gal4 driver in the lymph glands and/or circulating hemocytes.  No line expressing the Gal4 driver exclusively in the lymph glands has been reported, so we used a combination of lines sharing in common the feature of Gal4 driver expression in the lymph glands.  The results (see Figure 2 and Table 1) clearly demonstrate that expression of ADGF-A in the lymph glands (driven by Cg-Gal4,e33C-Gal4, or c564-Gal4), but not in any other tissue examined, is necessary and sufficient to fully rescue the adgf-a lethality.  In e33C-Gal4/UAS-ADGF-A, strong expression of ADGF-A in all lobes of developing lymph glands (but not in circulating hemocytes) reduces the number of hemocytes in the hemolymph to almost normal levels (see Figure 3).  The number of hemocytes is also reduced, but to a lesser extent in larvae rescued by Cg-Gal4/UAS-ADGF-A.  However, when assayed by survival rate and melanotic tumor formation, the rescue by Cg-Gal4 is full and similar to that of e33C-Gal4 (see Figure 2).  The difference in effectiveness may be explained by the different expression patterns of the drivers.  Cg-Gal4 is expressed only in certain compartments of lymph gland lobes containing relatively mature hemocytes, and strongly in most circulating hemocytes [22, 23].  The C564-Gal4 driver is not expressed as strongly as e33C-Gal4, but is still uniformly expressed in the lymph glands; it also fully rescued the mutant phenotype.  We have tried two different insertions of the Dot-Gal4 construct.  The Dot-Gal411C on Chromosome II, which shows weak expression [24], did not rescue the phenotype, but a Dot-Gal443A insertion on Chromosome X, which shows stronger expression, rescued approximately half of the mutant animals (Figure 2).  Nearly all rescued individuals were males, suggesting that expression of the Gal4 driver was influenced by X-chromosome dosage compensation, and expression in females heterozygous for Dot-Gal4 was not strong enough for rescue.  Expression of ADGF-A in salivary glands and fat body (as well as in other tissues) is not required for full rescue, as demonstrated by use of the Cg-Gal4,Dot-Gal4, but especially by e33C-Gal4 driver, and is also not sufficient to rescue the phenotype at all, as demonstrated by T110-Gal4 and Lsp2-Gal4 (Table 1).  Since ADGF-A is strongly expressed in embryonic mesoderm [16], we have tried to rescue the phenotype by the expression of ADGF-A in embryonic and larval muscle cells using the Dmef2-Gal4 driver [25].  No rescue of the phenotype, including body shape of escaping pupae, was observed.  The only line showing significant (but not complete) rescue of adgf-a survival without expression in the lymph glands was GawB5015 (see Figure 2), which expresses the Gal4 driver very strongly and specifically in the ring gland and salivary glands (as well as very weak and spotty expression in imaginal discs [unpublished data]).  However, expression of ADGF-A driven by GawB5015 does not prevent the formation of melanotic tumors (see Figure 2B).

Ablation of Hemocytes in Mutant Larvae Reduces Fat Body Disintegration and Melanotic Tumor Formation

The l(3)hematopoiesis missing(l[3]hem) mutation reduces cell division in larval proliferating tissues and thus dramatically reduces the number of hemocytes in larvae.  It also suppresses the hemocyte overproliferation and associated defects observed in the hopscotchTumorous-lethal mutant [26].  We therefore used the l(3)hem1 mutation to test whether the reduction of hemocyte number in the adgf-a mutant affects the phenotype.  We recombined this mutation onto the chromosome containing the adgf-a mutation and found that in homozygous l(3)hem1,adgf-a double mutants the number of hemocytes is significantly reduced compared to the adgf-a single mutants (see Figure 3).  Furthermore, while 90% of adgf-a mutant larvae showed disintegration of fat body, only 40% of l(3)hem1,adgf-a double mutants (total number of counted animals was 82) show the disintegration (Figure 6A).  Similarly, melanotic tumor formation is significantly suppressed by l(3)hem1, with only 55% of double mutants showing melanotic tumors compared to more than 83% in adgf-a (Figure 6A).  However, the delay in development and block of pupariation (Figure 6B), as well as the pupal body shape, were not influenced by this mutation.  This shows that the effect on hemocyte development is related to only one other aspect of the adgf-a phenotype—namely, fat body disintegration—and the developmental arrest of adgf-a mutants is probably independent of this process.

   Suppression of the adgf-a Mutant Phenotype by Mutations in Other Genes

(A) Percentage of late third-instar larvae with melanotic tumors (black bars) and fat body disintegration (green bars).  The x-axis (which is shared with [B]), shows the genotype.  The y-axis shows the percentage of larvae with tumors and fat body disintegration.  (B) Survival rate of double mutants compared to single adgf-a mutant.  The y-axis shows the percentage of the pupae (blue bars) and adult flies (purple bars) demonstrating the larval and pupal survival, respectively.  Each experiment was repeated at least four times (with 20–30 animals in each vial) and the standard error is shown.

Block in Activation of Macrophages Suppresses Disintegration of Fat Body

Previous results suggest that fat body disintegration might be caused by the action of hemocytes.  Embryonic macrophages express the scavenger receptor encoded by croquemort(crq), which allows them to bind and remove apoptotic corpses [27].  We therefore tested whether a mutation in the crq gene would block the suggested interaction between hemocytes and fat body in adgf-a mutant larvae.  We used the mutation crqKG01679, caused by a P-element insertion in the first untranslated exon of crq, which leads to pupal lethality.  The number of crystal cells was not increased and lamellocytes were not detected in crq, adgf-a double mutants (see Figure 3).  The double mutants showed a lower number of circulating hemocytes than the single mutant, but there was still a significant increase in this number compared to wild type (see Figure 3), and the cells showed increased clumping.  None of the double-mutant larvae showed either disintegration of fat body or melanotic tumor formation (Figure 6A).  Even the adgf-a mutant larvae heterozygous for the crq mutation (crq/CyO GFP; adgf-a/adgf-a) showed significant suppression of the fat body disintegration, with most of the tissue staying compact in bigger pieces and never disintegrating to single adipose cells; melanotic tumors were rarely observed.  This shows that the block of the putative interaction between fat body and macrophage-like cells (which are still present in double mutants) suppresses the fat body disintegration, further strengthening the hypothesis that the disintegration is caused by hemocytes.  In addition, the absence of lamellocytes and the normal number of crystal cells in the double mutant strongly suggest that the differentiation of these cells and thus melanotic tumor formation is a secondary reaction to fat body disintegration, rather than a primary effect of the adgf-a mutation.

Mutation in a Putative Adenosine Receptor Suppresses the Block of Pupariation in adgf-a

We have identified a putative homolog of the mammalian adenosine receptor family in the Drosophila genome, AdoR, and produced a null mutation in this gene using homologous recombination (adoR; ED, unpublished data).  The adoR mutants are fully viable.  We used this mutant to test the hypothesis that the increased level of adenosine in the adgf-a mutant contributes to the mutant phenotype by its effect on signaling through the adenosine receptor.  The results show that introducing the adoR mutation into the adgf-a background significantly increases pupariation, as well as adult emerging rate, compared to the adgf-a single mutant (Figure 6B).  When the earlier lethality was avoided by picking up larvae after molt to the third instar, the pupariation rate of adoR, adgf-a double mutant was comparable to wild type as well as to the single adgf-a mutant treated with ecdysone (Figure 7A).  Development during the third instar is much less delayed in the double mutant, with most of the larvae pupariating within 1 d after their heterozygous siblings (Figure 7A).  The adoR mutation also significantly reduced melanotic tumor formation in the adgf-a mutant (see Figure 6A), but disintegration of the fat body appeared at the same rate as in the single mutant (see Figure 6A).  While the number of macrophage-like cells in circulation is not significantly changed in the double mutant, the number of lamellocytes is decreased (see Figure 3), but the number of crystal cells is normal (see Figure 5A and 5C).  These results demonstrate that adenosine signaling through the adenosine receptor is involved in the developmental arrest of adgf-a mutant, but that it does not play a role in fat body disintegration and macrophage differentiation.

Ecdysone Regulation of Development in adgf-a

(A) Larvae of different genotypes were collected after L2/L3 molt, and the number of puparia was counted at different time points (x-axis: hours after egg laying).  The y-axis shows the percentage of puparia out of all collected third-instar larvae (three vials each with 30 animals; the standard error is shown).  (B and C) Ring gland morphology in arrested adgf-a larvae.  Approximately 8-d old mutant larva (i.e., 3 d after normal pupariation) with very extensive fat body disintegration (note the transparency of larva in the middle part with small white pieces of fat body) (B).  The ring gland dissected from this larva (C) shows morphology of the normal ring gland before the degenerative changes of prothoracic gland starts (compare to schematic diagram to the left of [C], from [28]).  (D–F) Expression of GFP-marked glue protein (SgsΔ3-GFP) in salivary gland of the adgf-a mutant larvae and pupae.  All late third-instar larvae express the glue protein as shown on dissected salivary gland (D).  Some mutants show typical expulsion from the glands with GFP totally external to the puparial case (E), while others do not expel glue proteins even after puparium formation (F).

Hormonal Regulation in the adgf-a Mutant

The delayed development and low pupariation rate in the adgf-a mutant larvae (see Figures 2A and 7A) could be caused by an effect on hormonal regulation of development.  The main source of developmental hormones in the Drosophila larva is the ring gland, composed of the prothoracic gland, corpus allatum, and corpus cardiacum [28].  The prothoracic gland releases the steroid molting hormone ecdysone, which is converted to an active form, 20-hydroxyecdysone (20E), by the fat body as well as some of the target organs [29].  The block of pupariation in the adgf-a mutant suggested that the level of ecdysone in these larvae might not be sufficient to initiate pupariation.  To test this possibility, we tried to rescue the phenotype by feeding mutant larvae 20E, which can initiate pupariation in the ecd1 mutant, which has an extremely low level of ecdysone [30,31].  The results (Figure 7A) clearly demonstrate that the adgf-a mutant larvae are responsive to ecdysone and that this treatment restores the pupariation frequency to almost wild-type level.  The delay in development is also significantly reduced (Figure 7A).  Since the adgf-a mutant shows certain precocious metamorphic changes (macrophage differentiation and fat body disintegration), we speculated that a reduced ecdysteroid level could be caused by precocious degeneration of the prothoracic part of the ring gland.  However, the overall structure of the ring gland is not visibly affected even in the oldest larvae (10 d, i.e., 5 d after the heterozygous siblings pupariated) with a fully disintegrated fat body (Figure 7B and 7C).  We also used a transgenic line carrying the SgsΔ3-GFP construct, which was previously used to monitor the effects of ecdysteroid levels on glue protein expression in salivary glands [32].  All analyzed adgf-a mutant larvae carrying the SgsΔ3-GFP construct showed normal expression of Sgs-GFP in salivary glands (Figure 7D).  Mutants that pupariated usually showed typical GFP expectoration, indicating the presence of a high premetamorphic peak of ecdysteroids (Figure 7E).  In some cases, GFP was secreted into the lumen of salivary glands, but was not expectorated (Figure 7F), which is similar to the defect seen in animals expressing the dominant-negative form of ecdysone receptor driven by the Sgs3-Gal4 driver [33].  These results demonstrate that the target tissues of adgf-a mutants are normally responsive to ecdysteroids and that they are probably capable of releasing ecdysteroids, although the level of ecdysteroids might vary.?>

ADGF-A Genetically Interacts with Toll Signaling Pathway

The antimicrobial response of Drosophila includes at least two distinct signaling pathways [34]—the Toll signaling pathway, which leads to the activation of two nuclear factor kappa B (NF-κB) factors, Dorsal-related immunity factor (DIF) and dorsal (DL); and the immune deficiency protein pathway activating the third NF-κB factor, Relish (REL).  A zygotic null mutation in cactus (cact; a Drosophila inhibitor of NF-κB) leads to hyperproliferation of hemocytes, melanotic tumor formation, disintegration of fat body, and slower larval development, with 60% larval lethality, as well as a thin body-shape phenotype [35].  All of these phenotypes are strikingly similar to the abnormalities seen in adgf-a mutants, which was our first clue as to a possible interaction of ADGF-A with the Toll signaling pathway.  We hypothesized that the activity of ADGF-A is suppressed by Toll signaling, resulting in similar phenotypes of the adgf-a mutation and constitutive activation of Toll pathway.  To test this hypothesis, we crossed transgenic flies carrying ADGF-A gene under the control of a heat-shock promoter on Chromosome II (HS-ADGF-A) with cactE8 (a lethal allele of cact on Chromosome II, which, in combination with cactD13, results in a zygotic null combination, or, with cactIIIG, results in zygotic hypomorphic combination).  Overexpression of ADGF-A in animals with a hypomorphic cact combination (cactE8/cactIIIG) increased the adult survival rate almost 4-fold (Figure 8A).  The rescue could be increased by multiple heat shocks before pupariation to 7-fold (unpublished data).  The suppression of melanotic tumor formation is also significant (from more than 80% down to 26%, Figure 8B).  The most severe cact null mutation (cactE8/cactD13), leading to developmental arrest in larvae (less than 8% pupate), is partially rescued in animals with overexpression of ADGF-A when the pupariation rate is increased 3-fold (Figure 8A).  These results demonstrate that ADGF-A overexpression can partially rescue the effects of constitutively active Toll signaling in larvae, mainly the developmental arrest, but also the melanotic tumor formation, in the case of hypomorphic cact mutants.

Genetic Interactions of Toll Signaling and ADGF-A

Survival rate and melanotic tumor formation were compared in mutants in the Toll signaling pathway and in similar mutants with overexpression of ADGF-A using the HS-ADGF-A construct.  (A) The bar graph shows the percentage of the pupae (blue bars) and adult flies (purple bars) demonstrating the larval and pupal survival of each genotype.  The x-axis shows the genotypes and is shared with (B).  Flies heterozygous for the cact mutation were used as a control.  (B) Percentage of late third instar larvae presenting melanotic tumor formation.

Discussion

ADA Deficiency in Drosophila Causes Abnormal Hemocyte Development, Melanotic Tumor Formation, Fat Body Degeneration, and Delayed Development

We have established an ADA deficiency model in Drosophila in order to study the effects of altered adenosine levels in vivo.  We produced a loss-of-function mutation in the ADGF-A gene, which produces a product (ADGF-A) with ADA activity.  When homozygous, the mutation causes abnormal hemocyte development, leading to melanotic tumor formation [36], as well as fat-body disintegration associated with death during the larval stage or delayed transition to the pupal stage of development.  In agreement with our previous study using cells cultured in vitro [16], here we have shown that ADA enzymatic activity is essential for ADGF-A function in vivo, when this function is assayed by testing for rescue of the mutant phenotype.  Just as increased levels of both ADA substrates, adenosine and deoxyadenosine, are found in blood of SCID patients [5], adgf-a mutant larvae also have elevated levels of adenosine and deoxyadenosine, indicating that the mutant phenotype is caused by disturbance in the turnover of these nucleosides.  Expression of ADGF-A only in the lymph glands is sufficient to fully rescue the mutant phenotype, indicating that the hemocytes within the lymph glands play a major role in regulation of adenosine levels in the hemolymph.  A similar regulatory role has also been attributed to blood cells in humans [5].  This suggests a function for ADGF-A within the lymph gland.  However, ADGF-A behaves as a soluble growth factor and could be released from the lymph gland to activate targets elsewhere in the larval body.  Our results show that ADGF-A functions by limiting the level of extracellular adenosine, and in this way the protein could have a systemic function even if it were restricted to its tissue of origin.  Although our tests did not exclude a role for ADGF-A in circulating hemocytes (which constitute a separate lineage from the lymph gland hemocytes [20]), we showed that expression of ADGF-A in circulating hemocytes is not required for rescue of the adgf-a mutant phenotype, since e33C-Gal4/UAS-ADGF-A-which expresses ADGF-A in the lymph gland but not in circulating hemocytes-fully rescued the phenotype.

ADGF-A Is Involved in Hemocyte Differentiation in the Lymph Glands

Late third-instar larvae homozygous for the adgf-a mutation contain, on average, seven times more hemocytes in circulation than wild-type larvae, and most of these cells show strong adhesive properties compared to normal larval plasmatocytes, which remain rounded after settling down on the substrate.  Although these cells share other characteristics with plasmatocytes, they are normally not seen in circulation until they are released from the lymph glands at the onset of metamorphosis under the regulation of ecdysone to serve as phagocytes for histolysing tissues during metamorphosis—thus, they are referred to as pupal macrophages [19].  In agreement with the presence of these cells in circulation, at least the first lobes of the lymph glands are usually completely dispersed in late third-instar mutant larvae.  This indication of precocious metamorphic changes [36] in the mutant is further supported by the finding that hemocytes aggregate in a segmental pattern in early rather than late third instar (see Figure 4H–4J), and that the hemocytes lose expression of Hemolectin in late third-instar larvae rather than at the onset of metamorphosis (see Figure 4G) [21].  Recent studies show that the Toll signaling pathway, which is already known to be involved in the control of innate immunity of both Drosophila and mammals [34], may also be involved in the control of hemocyte differentiation in the Drosophila larva.  Constitutive activation of Toll signaling leads to developmental arrest and hematopoietic defects associated with melanotic tumor formation [35], similar to the phenotype of the adgf-a mutant.  Our work also shows that forced expression of the ADGF-A gene can rescue the effects of overactive Toll signaling, suggesting that ADGF-A might function downstream of Toll signaling to control its effects.  This conclusion is consistent with the existence of a putative binding site for Dorsal (one of two known effectors of Toll signaling) in the ADGF-A promoter (Figure 9).  It will be important to explore this connection further, since recent studies suggest an interaction between adenosine signaling and the NF-κB signaling pathway, which is the mammalian counterpart of the Toll pathway [37].

Schematic Map of the ADGF-A Gene with Promoter Analysis

The ADGF-A gene contains four exons and two transcriptional starts [17,47].  We analyzed sequences preceding both transcriptional starts for the presence of known transcriptional factor binding sites using the software program Gene2Promoter (Genomatix Software GmbH).  Selected sites are represented by color bars in approximate positions of promoter regions.  The legend under the sequence show the names of transcription factors binding to matching colored binding sites.

Precocious Fat-Body Disintegration Caused by Mutant Hemocytes

One of the most remarkable features of the adgf-a mutant phenotype is the disintegration of the fat body in third-instar larvae, another indication of precocious metamorphic changes since the disintegration normally occurs much later, during pupal life.  Furthermore, our study of this mutant provides strong evidence that the fat body disintegration is promoted by the action of hemocytes.  Fat body disintegration was significantly suppressed when the hemocyte number was reduced using the l(3)hem1 mutation [26], and fully blocked by the croquemort (crq) mutation [27] which affects a CD36-related receptor (Croquemort) expressed on macrophages and required in phagocytosis of apoptotic cells.  Human CD36 is a scavenger receptor which, in combination with the macrophage vitronectin receptor and thrombospondin, binds apoptotic cells.  A similar role of Croquemort for removing histolysing tissues during Drosophila metamorphosis has not yet been tested, but seems likely since the crq mutant used in this study (crqKG01679) is lethal in pupae.  The idea that hemocytes are involved in fat body dissociation in Drosophila is further supported by work on the flesh fly Sarcophaga.  Natori's group showed that proteinase cathepsin B was released from pupal hemocytes when they interacted with the fat body, and that this enzyme digested the basement membrane of the fat body, causing the tissue to dissociate [38,39].  They also showed that the interaction of hemocytes with the fat body is mediated by a 120-kDa membrane protein localized specifically on pupal hemocytes [40].  This protein was suggested to be a scavenger receptor, but it does not seem to be homologous to Drosophila Croquemort (unpublished data).  Work by Franc et al. [27] is consistent with the idea that more than one scavenger receptor is involved in this process.

Possible Signaling Role for Adenosine

The precocious metamorphic changes that appear to occur in response to elevated adenosine in the adgf-a mutant larvae lead to the suggestion that adenosine may act as a regulatory signal for these processes during normal development.  One possibility is that adenosine acts as a downstream effector of ecdysone-regulated prepupal changes, and that the increase in adenosine concentration is mediated by ecdysone-induced down-regulation of ADGF-A expression.  This is supported by the presence of multiple sites for ecdysone-inducible transcription regulators in the ADGF-A promoter (Figure 9).  Adenosine could serve as a signal for macrophage differentiation, and the lack of adenosine deaminase activity due to the adgf-a mutation could cause precocious differentiation of these cells in mutant larvae.  We are now carrying out direct tests of the idea that the differentiation of hemocytes in mutant larvae is caused by elevated adenosine.  If confirmed, this effect would have general significance, since in ADA-deficient mice, inflammatory changes in the lungs include an accumulation of activated alveolar macrophages [41], and this could also be mediated by elevated adenosine.

Elevated Adenosine Delays Development and Inhibits Pupariation

The elevated adenosine in the adgf-a mutant larvae leads to precocious changes (hemocyte differentiation and fat body disintegration) resembling those normally occurring at the time of metamorphosis, but it also is associated with an apparently opposite effect, in that it causes a significant delay in progress through the third larval instar and a decrease in the frequency of successful pupariation (formation of the puparium from the larval cuticle), which is one of the earliest steps in metamorphosis.  We conclude that the mutation has additional effects on the hormonal regulation of development.  One possible explanation for the developmental delay and failure to pupariate is that the adgf-a mutation affects the production or release of ecdysteroid hormones from the major endocrine organ of the Drosophila larva—the ring gland.  This is supported by the fact that pupariation rate and survival of the adgf-a mutant can be significantly improved by expression of transgenic ADGF-A in the ring gland and salivary glands.  We suggest that this somehow interferes with the regulation of hormone release.  Other mutants with hormonal dysregulation show delayed larval development and failure to pupariate [42,43].  Presumably the elevated adenosine in the adgf-a mutant blocks the production or release of ecdysone from the ring gland by an unknown mechanism.  This idea is supported by our finding that both pupariation rate and survival of the adgf-a mutant can also be improved by feeding the mutant larvae with 20E in the diet (see Figure 7A).  Thus it is clear that the adgf-a mutant is arrested in development due to an effect of the mutation on hormone production from the ring gland.  The arrest of development in the adgf-a mutants was significantly suppressed by loss of the adenosine receptor caused by the adoR mutation: larvae simply homozygous for adgf-a pupated after two or more days, whereas larvae also homozygous for adoR pupated within 1 d after their heterozygous siblings (see Figure 7A).  Therefore, adenosine signaling through the AdoR must play a role in the developmental arrest of the adgf-a mutant, and this is most likely mediated by signaling to the ring gland, where AdoR is expressed (ED, unpublished data).  The mutation in AdoR does not block macrophage differentiation and fat-body disintegration, so this effect must involve another, as yet uncharacterized mechanism independent of AdoR signaling.  Work using adenosine-receptor deficient mammalian cells also suggested the existence of a novel, undefined adenosine signaling mechanism [44].  However, we cannot exclude the role of elevated deoxyadenosine in these effects.  Drosophila, now with the advantage of the well-characterized adgf-a mutant, could serve as an ideal model system in which to investigate this mechanism.

Concluding Remarks

In our previous work using cells cultured in vitro, we showed that, as in mammals, adenosine can block proliferation and/or survival of some Drosophila cell types [16].  In the present work, we have established a Drosophila model to study altered levels of adenosine and deoxyadenosine in vivo, and we have shown that loss of ADGF-A function causes an increase of these nucleosides in larval hemolymph.  Although the adgf-a mutation leads to larval or pupal death, we have shown that this is not due to the adenosine or deoxyadenosine simply blocking cellular proliferation or survival, as the experiments in vitro would suggest.  Rather, this mutation leads to an increase in number of hemocytes at the end of larval development due to the differentiation and release of hemocytes from the lymph glands.  Hemocytes also differentiate and are released from the lymph glands during systemic infection [19].  Together with our result suggesting an interaction between Toll signaling and ADGF-A, this leads to the hypothesis that adenosine controls hemocyte differentiation in response to infection, and that it signals through the adenosine receptor to postpone the next developmental step, metamorphosis.  This would be consistent with the role of adenosine as a “stress hormone” in mammals [6].  A similar process of hemocyte differentiation and release from the lymph glands normally takes place at the onset of metamorphosis, when pupal macrophages remove histolyzing tissues.  The ADGF-A promoter contains consensus binding sites for effectors of both Toll and ecdysone signaling.  This raises the possibility that adenosine plays a role in the control of metamorphosis as well as in the response to stress.

Materials and Methods

Fly strains and genetics

For standard procedures, flies were raised at 25 °C on a standard cornmeal-agar-yeast-molasses diet supplemented with 0.3% Nipagin to retard mold growth.  Oregon flies were used as the wild-type Drosophila strain, but in most cases the y w strain was used as a control since most mutations were carried in the y w background.  A mutation in the ADGF-A gene on Chromosome III was obtained as described earlier [17].  In this study, the mutation described as adgf-akarel was used in all experiments and is referred to here as adgf-a.  A mutation in the adenosine receptor gene on Chromosome III was produced by the ends-out targeting method (ED, unpublished data) and is referred to here as adoR.  Transgenic flies carrying HS-ADGF-A,UAS-ADGF-Amyc, and UAS-mutADGF-Amyc construct (see description below) were produced by a modified P-element transformation method [45].  HS-ADGF-A,UAS-ADGF-Amyc[2A],UAS-ADGF-Amyc[7A],UAS-mutADGF-Amyc[1A], andUAS-mutADGF-Amyc[3B], all insertions on Chromosome II, were isolated and used in this work.  The following markers and mutations were obtained from the Bloomington stock center, accessible at http://fly.bio.indiana.edu/ (stock numbers provided in parentheses): Hml-GFP marker (Hml-Gal4/UAS-GFP) expressing GFP in embryonic and larval hemocytes on Chromosome II (BL-6397), the l(3)hem1 mutation on Chromosome III (BL-6184), and the crqKG01679 mutation in the crq gene on Chromosome II (BL-14900).  Mutants in Toll signaling pathway were obtained from Dr. S. Govind: cactE8,cactIIIG, and cactD13 mutations in the cact gene on Chromosome II.  The Gal4/UAS [18] system was used for protein misexpression.  The following were obtained from the Bloomington stock center (stock numbers in parenthesis): Cg-Gal4 on Chromosome II (BL-7011), Pw+mW.hs=GawB5015 on II (BL-2721), Pw+mW.hs=GawBc564 on II (BL-6982), Pw+mW.hs=GawBT110 on II (BL-6998), Hml-Gal4 on II (BL-6396), Dot-Gal443A on X (BL-6903), Dot-Gal411C on II (BL-6902), and Lsp2-Gal4 (BL-6357) on III.  The Pen2.4-GAL4e33C lethal insertion on Chromosome III was obtained from Dr. N. Perrimon's lab, and the Dmef2-Gal4 driver on II from Dr. A. Michelson.  Expression information of these Gal4 drivers is provided in Table 1.  A stock carrying the ubiquitous actin-Gal4 driver (P-actin-Gal4 UAS-GFP/CyO; lethal insertion on Chromosome II) was obtained from Dr. R. Sousa.  To recognize homozygous larvae, balancer chromosomes with the GFP marker were used: CyO Pw+mW.hs=Ubi-GFP.S65TPAD1 (BL-4559) and TM3 Pw+mC=ActGFPJMR2 Ser (BL-4888).  Transgenic flies SgsGFP-1 (insertion on Chromosome X) and SgsGFP-2 (insertion on Chromosome II) containing the chimeric gene construct SgsΔ3-GFP were obtained from Dr. A. J. Andres.  For expression of ADGF-A using the HS-ADGF-A construct, flies were heat shocked as late embryos/early first instars at 37 °C for 30 min.  In all rescue experiments, 30 freshly hatched homozygous first-instar larvae were selected using a GFP dissecting microscope and transferred into fresh vials (at least four vials for each variant).  They were left to develop at 25 °C and examined as wandering third-instar larvae, pupae, and adults.

Ecdysone treatment

Mutant larvae were raised on plates with yeast paste at 25 °C and transferred to vials with glucose-yeast medium (control) or with glucose-yeast medium containing 20-hydroxyecdysone (H-5142; Sigma-Aldrich, St.  Louis, Missouri, United States) at a concentration of 0.5 mg/ml shortly after the L2/L3 molt.  Numbers of puparia were counted at 12-h intervals after the 120-h time point (when the first control larvae start to pupariate).  The ecd1 flies (Bloomington stock BL-218) served as a control for the functional 20E diet [31]: flies were raised at 22 °C (permissive temperature for the temperature-sensitive ecd1 mutation) and transferred to vials with control or 20E-containing diet and raised at 29 °C (restrictive temperature).

Fat body observation

Living late third-instar larvae were washed and examined in PBS using a standard dissecting microscope with transmitted light.  For finer analysis, the fat body was dissected from larvae in PBS and observed using a dissecting microscope.  GFP-stained fat body was observed in living, etherized larvae in PBS solution on a standard microscopic slide with a coverslip under a fluorescence microscope.

Hemocyte counts and observations

Circulating hemocytes were obtained by opening two late third-instar larvae in 30 μl of PBS.  This allowed us to collect all hemolymph from the larvae in a defined volume.  The solution with circulating hemocytes was mixed by gently pipetting, and part was transferred into the chamber of an improved Neubauer hemocytometer.  Cell number was recounted to one animal equivalent.  Hemocyte morphology was observed by differential interference contrast microscopy of living cells in Shields and Sang Insect Medium (Sigma-Aldrich) obtained by the same procedure as for counting.  To observe hemocyte morphology, samples were analyzed at least 10 min after the deposition of solution with hemocytes, in order to allow the cells to adhere to the surface of the slide.  Crystal cells were visualized by heating larvae at 60 °C for 10 min in a water bath [46].  GFP-stained hemocytes were observed in living, etherized larvae in PBS solution on a standard microscopic slide with a coverslip under the fluorescence microscope or by deposition of hemocytes in PBS as for counting and observing under the fluorescence microscope.

Transgenic constructs

Wild-type cDNA for ADGF-A was amplified by PCR using proofreading DNA polymerase (ProofStart; Invitrogen, Carlsbad, California, United States) from a pOT2 vector containing the ADGF-A EST-clone (GH08276) using the following primers: 5'- CGTCTAGAATGTCGCCAGTCATCCGCC-3' (5' end primer with XbaI tail) and 5'- GCTGATCATCAATCGATCCGTTGACTGGGGGA-3' (3' end primer with BclI tail).  The PCR product was cloned into the pGEM-T Easy vector (Promega, Madison, Wisconsin, United States), and the resulting plasmid (ADGF-A-pGEM) was cut by NotI/SpeI restriction enzymes.  The ADGF-A fragment was then cloned into the pCaSpeR-hs) vector cut by NotI/XbaI to obtain the HS-ADGF-A construct.  The myc tag was added to the C terminus of the ADGF-A protein for detection by anti-c-Myc antibody (Sigma-Aldrich).  To produce a UAS-ADGF-Amyc construct, the ADGF-A fragment was amplified (by ProofStart from pOT2 vector) using the following primers: 5'- AATCTCGAGCTCATCATGTCGCCAGTCATC-3' (5' end with XhoI tail) and 5'- TATCTAGATCGATCCGTTGACTGGGGG-3' (3' end with XbaI tail).  The fragment was cut by XhoI/XbaI and cloned into the pUAST vector modified by MZ.  The sequence encoding the myc-tag 5'- GAGCAAAAGCTCATTTCTGAAGAGGACTTG-3' plus a stop codon was inserted into XbaI site of pUAST (using the XbaI site on the 5' end and the NheI site on the 3' end) cut by XhoI/XbaI.  A mutated version—UAS-mutADGF-Amyc—was prepared in the same way as UAS-ADGF-Amyc, but pBLUESCRIPT containing mutated ADGF-A cDNA was used as a template.  The mutated version of ADGF-A (carrying a mutation causing the substitution of two amino acids—His386 and Ala387 for Glu and Leu, respectively) in the catalytic domain, shown to abolish adenosine deaminase activity [16], was prepared by recombinant PCR using the following recombinant primers: 5'- TCTACTTCGAGCTCGGAGAAACAAACTGGTTCGGT-3' and 5'- CTCCGAGCTCGAAGTAGAAATCAATGTCATCG-3' and the same 5' and 3' end primers as above.

Adenosine and deoxyadenosine concentrations measurement

The detection method used liquid chromatography and mass spectrometry (LC/MS method) of deproteinated hemolymph samples.  Larval hemolymph was collected from several larvae and centrifuged to pellet the hemocytes.  1 μl of hemolymph was diluted in 99 μl of buffer.  The sample was introduced in CH3CN-0.05% TFA (50:50) either via a syringe pump at 3 μl/min or via an RP-C18 150 mm × 1 mm Symmetry C8 column at 50 μl/min employing an LCQ electrospray ion source operated at 4.2 kV.  The peaks were then identified using the electrospray MSN mass spectra obtained by the collision-induced decomposition of the MH+ ion and its product ions in a series of MSN experiments that were performed with the ion trap mass spectrometer.  The sugar moiety was cleaved off the adenosine molecule and produced ion with a molecular weight of 136 (adenine), which was then detected by MS.

A comparison of programmed cell death between species

Abstract

Key components of the programmed cell death pathway are conserved between Caenorhabditis elegans, Drosophila melanogaster and humans.  The search for additional homologs has been facilitated by the availability of the entire genomic sequence for each of these organisms.  Programmed cell death is a conserved, gene-directed mechanism for the elimination of unnecessary or dangerous cells from an organism.  The core cell death pathway was first defined through genetic analysis in the nematode Caenorhabditis elegans and has since been found to be conserved in species as diverse as Drosophila melanogaster and Homo sapiens.  The rapid progress of the various genome sequencing projects has greatly accelerated the discovery of homologs for key components of the cell death pathway as well as for its regulators.  The C. elegans genome sequence was completed two years ago [1], and both the Drosophila [2] and human genomes are essentially completely sequenced at this point.  Although many components of the programmed cell death pathway are conserved between species, there are some important differences as well.

The core cell death pathway

Programmed cell death has long been known to be a part of normal development, but it was not until the discovery of the first genes essential for the phenomenon that our understanding of the events leading up to the deliberate elimination of a cell began to take shape.  Before this time, programmed cell death was defined by a set of specific morphological characteristics, including chromatin condensation, nuclear shrinkage and blebbing of the plasma membrane that could be observed in dying cells; the term apoptosis was coined to distinguish this type of cell death from necrotic deaths resulting from injury [3,4].  Four genes make up the core programmed cell death pathway in C. elegans, three of which, egl-1 (egg laying abnormal), ced-3 (cell death abnormal), and ced-4, are pro-apoptotic and one, ced-9, is anti-apoptotic (reviewed in [5]).  There was considerable excitement in the field when potential mammalian and Drosophila homologs for ced-3, ced-4, ced-9, and egl-1 were discovered.  The CED-3 protein is one of a continuously growing family of specific cysteine proteases, termed caspases, that are thought to be the executioners of programmed cell death.  At least fourteen mammalian caspases have been identified, and they are grouped into two classes on the basis of their proteolytic specificities (reviewed in [6]).  Class 1 caspases are mainly involved in cytokine maturation, while Class 2 caspases act mainly in apoptosis.  Class 2 has been further subdivided into two groups: upstream or initiator caspases (group 1), and downstream or effector caspases (group 2).  Initiator caspases are thought to be at the beginning of a proteolytic cascade that amplifies the cell death signal and results in the activation of the effector caspases.  Initiator caspases usually have long pro-domains, while effector caspases have short pro-domains.  Drosophila has at least eight caspases, five of which have been at least partially characterized (Dcp-1, Dcp-2/Dredd, drICE, Dronc, and Decay) and three uncharacterized ones found encoded in the genome sequence [7].  C. elegans appears to have only a single caspase - CED-3 - that is essential for all developmental cell death, despite having three other caspases in its genome [8,9].  Three Drosophila caspases - Dcp-2/Dredd, Dronc, and one known only as a sequence in the genome database - have long pro-domains and are thus likely initiator caspases; another four - Dcp-1, drICE, Decay, and another one found in the genome database - have short pro-domains and are probably effector caspases.  In addition, Dcp-1 has a substrate specificity that is very similar to that of two effector caspases, mammalian caspase 3 and C. elegans CED-3 [10].  Interestingly, Dronc appears to have a substrate specificity that is so far unique among caspases: while all other known caspases have only been shown to cleave after aspartate residues, Dronc can also cleave after glutamate residues [11].  This unusual substrate specificity may explain why Dronc is resistant to inhibition by the pan-caspase inhibitor p35.  On the basis of current data, it seems that CED-4 functions to help activate the caspase CED-3, and CED-9 blocks this activation through physical interaction with CED-4.  Endogenous CED-4 is normally localized to the mitochondria by CED-9, unless EGL-1 is expressed [12,13].  If EGL-1 is expressed, the interaction between CED-4 and CED-9 ceases, and CED-4 translocates to the nuclear membrane where it activates CED-3, resulting in programmed cell death.  Only one mammalian CED-4 homolog, Apaf-1, has been extensively characterized to date, but it too aids in caspase activation.  Like CED-4, Apaf-1 requires dATP for caspase activation, but Apaf-1 requires cytochrome c in addition [14].  Knock-out studies have shown that mice deficient for Apaf-1 have reduced cell death in certain tissues, further supporting a role for Apaf-1 in programmed cell death.  Drosophila has recently been shown also to have a CED-4/Apaf-1 homolog, named Dark/HAC-1/Dapaf-1 (reviewed in [15]).  The Drosophila homolog is more similar to Apaf-1 than to CED-4: it has the WD repeats found in Apaf-1, which potentially function in binding regulatory proteins such as cytochrome c.  Like Apaf-1 and CED-4, loss of function mutations in dark/hac-1/dapaf-1 result in a reduction in developmental programmed cell death.  Intriguingly, the transcription of the Drosophila CED-4/Apaf-1 homolog is upregulated in response to both X-ray and ultraviolet irradiation, suggesting that death-inducing stimuli can feed into the cell death pathway at this step [16].  CED-9 and EGL-1 belong to a large family of proteins related to the mammalian anti-apoptotic protein Bcl-2.  This family - with at least 19 members, divided into pro- and anti-apoptotic subgroups - has been largely defined by protein-protein interactions among its members (reviewed in [17]).  CED-9 has mammalian counterparts in the anti-apoptotic subgroup of Bcl-2 family members, while EGL-1 has mammalian counterparts in the pro-apoptotic subgroup.  Besides binding to Apaf-1, just as CED-9 binds and thereby regulates CED-4 activity, mammalian Bcl-2 family members have been proposed to regulate mitochondrial homeostasis and the release of pro-apoptotic factors such as cytochrome c (reviewed in [17]).  The rate of mitochondrial release of cytochrome c in mammalian cells undergoing apoptosis was recently measured and found to be rapid [18].  Indeed, most mitochondria released all their cytochrome c within a one minute period in a temperature-independent manner, suggesting that an enzymatic transport mechanism is probably not involved.  There is no clear evidence for cytochrome c release during apoptosis in C. elegans or Drosophila, however.  Nevertheless, a specific epitope is uncovered on cytochrome c in Drosophila mitochondria during apoptosis and this may be sufficient to help activate Dark/HAC-1/Dapaf-1; and death-inducing stimuli result in an increase in cytochrome c in Drosophila cultured-cell lysates [19,20].  There is no apparent need for cytochrome c release in C. elegans, since CED-4 does not require it to activate CED-3.  Recently, a pro-apoptotic Drosophila Bcl-2 family member was identified with the help of the database of genomic sequence; this gene is most closely related to mammalian Bok and potentially regulates apoptosis in the fly [21,22,23,24].  The Drosophila Bok homolog interacts with several anti-apoptotic, but not with several pro-apoptotic, Bcl-2 family members and, therefore, possibly functions by antagonizing pro-survival proteins.  Ectopic expression of Bok protein promotes apoptosis in transgenic flies as well as in cultured cells.  While one group reports that the caspase inhibitor p35 can block apoptosis induced by Drosophila Bok [22], another group did not see p35 inhibition [23].  This difference may be due to the slightly different constructs used in the experiments and the different assay systems.  A third group shows that peptide caspase inhibitors can block Drosophila Bok-mediated apoptosis in cell culture, lending support to a model in which expression of Bok protein leads to caspase activation [24].  Interestingly, ectopic expression of Drosophila Bok sensitizes the developing eye to cell death induced by ultraviolet irradiation [21].  Genetically, this Drosophila Bcl-2 family member appears to function upstream of, or in parallel to, the Drosophila ced-4/apaf-1 homolog and downstream of, or in parallel to, the Drosophila regulators of apoptosis, reaper, hid, and grim [22].  The core components of the cell death pathway are illustrated in Figure 1.

The core pathway of programmed cell death.

Multiple pathways lead to the activation of the executioners of death, the caspases (reviewed in [53]).  IAPs (inhibitors of apoptosis) have been shown to block the conversion of pro-caspases into active enzymes, and Reaper, Hid, Grim, and Diablo/Smac prevent IAPs from carrying out this protective function.  Caspases can also be activated with the aid of Apaf-1, which in turn appears to be regulated by cytochrome c and dATP.  The Bcl-2 family appears to function in regulating the release of pro-apoptotic components from mitochondria as well as by possibly inhibiting Apaf-1 directly.  This pathway integrates knowledge gained in multiple species, showing that apoptosis appears to be regulated in a similar manner regardless of the organism.  One notable exception is that the C.elegans homologs of IAPs do not appear to function in programmed cell death.

Regulation of programmed cell death

Studies of Drosophila have yielded great insights into how programmed cell death is regulated (reviewed in [25]).  In addition to the core pathway, several essential regulators have been characterized in Drosophila.  A region in the genome which contains the genes reaper, head involution defective (hid), and grim has been shown to be essential for virtually all developmental cell death in the Drosophila embryo.  The three proteins encoded by these genes have no significant homology to any other known proteins except for a small stretch of similarity at the amino terminus that is shared by all three but no other known proteins.  Each of these three genes can induce programmed cell death when overexpressed in the developing Drosophila eye or in various tissue-culture assays.  In addition, Reaper, Hid and Grim are able to induce apoptosis in heterologous systems, suggesting that vertebrate homologs of these proteins exist (reviewed in [26]).  Further evidence for the existence of vertebrate homologs of Reaper, Hid, and Grim come from observations that these three proteins are able to bind a class of inhibitors of apoptosis, or IAPs.  IAPs were first discovered in baculovirus, but have since been shown to play a vital role in blocking apoptosis in Drosophila as well as in mammals (reviewed in [27]).  Also, Hid has been shown to be negatively regulated by survival signals mediated by receptor tyrosine kinases through the Ras and mitogen-activated protein (MAP) kinase pathway, which is also well conserved in mammals [28,29].  More recently, reports of a potential functional mammalian analog of Reaper, Hid, and Grim have been published [30,31].  Although Diablo/Smac shares no sequence homology with Reaper, Hid, or Grim, it too can bind IAPs, preventing them from inhibiting caspase activation, and thereby inducing apoptosis.  Furthermore, Diablo/Smac interacts through its amino terminus with the BIR (baculovirus IAP repeat) domains of IAPs, similar to Reaper, Hid, and Grim [32,33].  An important difference between the function of Diablo/Smac and Reaper, Hid, and Grim is that Diablo/Smac is not pro-apoptotic by itself when overexpressed, but instead requires a death-inducing stimulus to promote apoptosis, unlike the three Drosophila proteins.  Table 1 provides a summary of the conserved regulators of cell death across species.  In contrast to cell death effectors such as caspases, the cell death regulators Reaper and Grim are specifically expressed several hours before apoptosis in cells destined to die (reviewed in [26]).  The expression of reaper has been shown to be regulated by distinct stimuli, including X-irradiation, steroid hormone signaling and a block in cell differentiation ([34,35,36] and A.-F. Lambin and H.S., unpublished observations).  Recently, a Drosophila p53 ortholog was identified by searching the genome database (reviewed in [37]), and it was shown to bind a specific region of the reaper promoter, thereby potentially regulating reaper transcription in response to ionizing radiation.  Furthermore, the reaper promoter was also shown to contain a response element for the complex made up of the steroid hormone ecdysone and its receptor [36].  The observation that transcription of both reaper and dark/hac-1/dapaf-1 can be induced by ionizing radiation suggests that these death-inducing signals can feed into the cell death pathway at multiple points.  Unlike reaper and grim, hid appears to be regulated not only transcriptionally but also post-translationally by the Ras-MAP kinase pathway [28,29].  This additional level of control for hid could explain the observation that hid is expressed in cells that are doomed to die as well as in cells that continue to survive [38].  A second class of key apoptotic regulators in Drosophila is the IAP family.  IAPs contain at least one, and up to three, BIR domains, which are important for protein-protein interactions, and often a carboxy-terminal RING domain (reviewed in [27]).  While several IAPs have been implicated in inhibiting apoptosis, proteins with single BIR domains can also have functions in cell cycle regulation and cytokinesis [39,40].  In fact, Saccharomyces cerevisiae contains neither caspases nor an apoptotic program, but it does have an IAP-like protein with a single BIR domain that functions in regulating cell division [41], suggesting that the IAPs are diverse family with only some members being anti-apoptotic: hence this protein family is more accurately referred to as BIR-containing proteins (BIRPs).  Three Drosophila BIRPs have been shown to be inhibitors of apoptosis, Diap1, Diap2, and Deterin [42,43,44,45].  Of these, Diap1 has been most extensively characterized; it can block cell death caused by the ectopic expression of reaper, hid, and grim (reviewed in [26]).  In addition, Diap1 is able to block the cell death induced by the Drosophila caspases Dcp-1 or drICE in Saccharomyces, as well as apoptosis induced by the expression of drICE in insect cell culture [46,47,48].  Loss-of-function mutants in diap1 have a very striking early embryonic phenotype: virtually all cells display an apoptotic morphology, as well as displaying DNA fragmentation seen by TUNEL labeling, four to five hours into development [47,49,50].  This phenotype is especially noteworthy since it manifests itself several hours before the first developmental cell deaths are normally observed in the Drosophila embryo, and it clearly shows that diap1 is vital for normal development.  In general, IAPs are thought to function at the caspase activation step in the cell death pathway, binding to the inactive, pro-domain-containing caspase zymogen and preventing it from being processed into the active enzyme (reviewed in [27]).  Reaper, Hid, Grim, and Diablo/Smac can physically interact with IAPs, thereby inhibiting the anti-apoptotic activity of IAPs and allowing caspase activation.  Further insight into a possible mechanism for IAP function was recently gained when IAPs were observed to have ubiquitin ligase activity [51,52].  It was shown that certain cellular IAPs can be degraded in a proteasome-dependent manner prior to thymocyte apoptosis.  These IAPs are also capable of auto-ubiquitination, an activity shown to require the RING domain [52].  In addition, IAPs can mono-ubiquitinate several caspases in vitro, possibly marking them for degradation, and again, this activity requires the RING domain [51].  The exact role of the ubiquitination pathway in regulating apoptosis is still unclear.  Even the role of the RING domain in IAPs is open to debate: mutations in the RING domain of endogenous diap1 can be either pro- or anti-apoptotic, depending on the death-inducing stimulus ([49] and J. Agapite, L. Goyal, K. McCall and H.S., unpublished observations).

Conservation of key regulators of programmed cell death between species

Strengths and limitations of genomic analysis

The completed genomes of C. elegans and Drosophila have already yielded a wealth of information.  New cell death gene homologs are being described at a very rapid pace, often by several laboratories at once.  Using information from the genome project, C. elegans was shown to contain several potential caspases in addition to CED-3, but none of these has been shown to function in programmed cell death [9].  In addition, while C. elegans has two BIR-domain-containing proteins, neither appears to regulate cell death: BIR-1 has been demonstrated to be essential for cytokinesis instead [39].  Hence, C. elegans seems to have a mechanism for regulating caspase activation that differs from that of mammals and Drosophila, in that it relies solely on the two Bcl-2 family members [53].  For Drosophila, the genome project played a key role in the identification of the Bcl-2 family member(s) and the CED-4/Apaf-1 homolog, showing that the core cell death pathway is conserved in Drosophila as well.  Various other homologs for cell death genes, including three additional caspase-like sequences and a second Bcl-2 family member have been identified in the Drosophila genome sequence [7] and will undoubtedly be published in the near future.  And analysis of the human genomic sequence is sure to lead to a better understanding of the number of mammalian homologs for the various components of the cell death pathway.  While it is clear that the search for homologs of known cell death genes has been greatly facilitated by the availability of the entire genome sequence for C. elegans, Drosophila, and humans, there are limitations to this strategy.  For instance, the small size of the Reaper protein - only of 65 amino acids - makes it extremely difficult to identify homologs on the basis of genomic sequence information in other species: current gene-finding programs typically require an open reading frame of at least 70 amino acids.  Furthermore, there is considerable divergence even among cell death proteins with clearly homologous functions, such as CED-4/Apaf-1 and even CED-9/Bcl-2, and sequence homology between these genes was recognized only after the genes and their products had been functionally characterized.  Nevertheless, in combination with traditional approaches including genetics and biochemistry, knowledge of the genome sequence for model organisms will undoubtedly continue to drive our rapidly increasing insights into the molecular basis of programmed cell death.

Multiple RNA-protein interactions in Drosophila dosage compensation

Abstract

From worms to humans, recognizing and modifying a specific chromosome is essential for dosage compensation, the mechanism by which equal X-linked gene expression in males and females is achieved.  Recent molecular genetic and biochemical studies have provided new insights into how regulatory factors in Drosophila are recruited and assembled on the X chromosome, leading to the essential hypertranscription of its genes.

Distinct mechanisms of dosage compensation

Dosage compensation, which ensures that the expression of X-linked genes is equal in males and females, is an essential process in organisms that have sex chromosomes.  Because sex chromosomes appeared relatively late in evolution, dosage compensation evolved independently in different organisms and is accomplished by distinct mechanisms.  For example, in Caenorhabditis elegans, dosage compensation occurs in the homogametic hermaphrodite (XX) by the down-regulation of virtually all genes on the two X chromosomes by about 50%.  In mammals, dosage compensation also occurs in the homogametic sex (the female) by X inactivation, whereby an entire X chromosome forms a distinct, heterochromatic, transcriptionally inactive nuclear structure known as the Barr body.  Consequently, each gene on the single active X chromosome in female cells and the corresponding gene on the single X chromosome in male cells are expressed at equal levels.  In Drosophila, dosage compensation is achieved in the heterogametic male by a twofold chromosome-wide up-regulation (hypertranscription) of essentially all genes on the single X chromosome.  In C. elegans and Drosophila, several proteins with different molecular functions involved in dosage compensation have been identified.  Much attention has been dedicated to the elusive question of how the compensated or inactivated chromosomes are recognized by these proteins, and three recent papers investigating some of these aspects in Drosophila have provided new insights into this mystery [1,2,3].  At least one common theme has emerged in Drosophila and mammals of how the dosage-compensated chromosomes are marked and eventually recognized by regulatory proteins: the use of non-coding RNAs transcribed from genes located on the X chromosome itself [4,5,6,7].  These RNAs, Xist (X-inactive specific transcript) in mammals, and roX1 and roX2 (RNA on the X) in Drosophila, are structurally unrelated, yet they share the intriguing property of remaining tightly associated with the X chromosome.  In mammals, Xist RNA is transcribed only from the inactive X, with which it associates at its site of synthesis and then spreads over the entire chromosome through an unknown mechanism [8].  It is assumed that Xist RNA provides a mark for specific histones (for example, histone macroH2A1.2), as well as proteins deacetylating histones and methylating many of the X-linked genes at GpC dinucleotides [9,10,11]; all these events are important for maintaining the silenced state of the inactivated X.

The Drosophila dosage compensation complex

In Drosophila, the dosage compensation complex (DCC) is composed of five proteins encoded by the male-specific lethal genes, male-specific lethal-1, -2, -3 (msl1, msl2 and msl3), maleless (mle) and males absent on the first (mof), and at least two non-coding RNAs, roX1 and roX2 (for a review see [12]).  The MSL proteins colocalize to hundreds of sites along the single male X chromosome; they all are essential for the hypertranscription of the X-chromosomal genes in males, as male but not female animals die during development when they are mutant for any one of the five msl genes.  Hypertranscription is a consequence of at least one chromatin modification: the acetylation of histone H4 at Lys16, which is thought to 'loosen' chromatin structure, thereby allowing the general transcription machinery easier access to the regulatory regions of most X-linked genes [13].  The role of the roX genes and roX RNAs in this process is still unclear.  Localization studies of MSL proteins in male nuclei carrying autosomal roX1 or roX2 transgenes showed that the roX genes or RNAs recruit the entire set of MSL proteins to their transgenic location and can lead locally to acetylation of histone H4 [1,14].  Moreover, the DCCs can spread, by several hundred kilobases, into neighboring autosomal DNA.  These experiments indicated that the roX genes might function as nuclear entry sites for the assembly of the MSL proteins on the X chromosome.  Little was known, however, about the specific role of the roX RNAs during the formation of the DCC.  The three recent papers [1,2,3] have now started to address the role of the roX genes in this process and investigated which MSL proteins interact with the roX RNAs.  Meller et al. [1] took advantage of the fact that dosage compensation can be initiated and analyzed in females ectopically expressing the male-specific MSL2 protein.  These females hypertranscribe their two X chromosomes and therefore die during development, but they can be rescued if any other msl mutation is present.  It has been shown previously, using a similar strategy, that the DCC is assembled in an ordered sequence (Figure 1).  For example, MSL1 and MSL2 are the first proteins to bind to the X chromosome; in the absence of all other MSL proteins they associate with about 35 sites, the so-called chromatin entry sites.  MSL1 and MSL2 are interdependent, however: they require each other for binding [15].  MLE appears to be the next protein to join the growing DCC as it is found at reduced levels at some of these 35 sites when the remaining two proteins, MSL3 and MOF, are absent [16].  These latter two proteins require all the other MSL proteins to be present to enable them to associate with the X chromosome.  Importantly, all MSLs are required to generate the normal DCC distribution to the hundreds of sites throughout the entire X chromosome.  Meller et al. [1] have now analyzed and compared roX RNA and MSL protein distribution in females ectopically expressing MSL2 but lacking other MSL proteins.  They found that roX1 and roX2 RNAs are associated with the X chromosome at different stages and sites during the assembly of the DCC.  When the MLE protein, which encodes an RNA helicase, was absent, both roX RNAs were found only at their site of transcription but not in any other of the 35 entry sites where MSL1 and MSL2 were present, suggesting that neither of the roX RNAs can be integrated in a minimal MSL1-MSL2 complex.  When MSL3 was absent in these females, however, roX2 but not roX1 RNA was found at the entry 35 sites.  These findings indicate that roX2 might be incorporated into this partial DCC complex in an MLE-dependent manner and that roX1 RNA is incorporated at a later stage, together with MSL3 and perhaps MOF.  Further evidence that at least roX2 RNA is an integral part of the DCC comes from independent immunoprecipitation experiments from all three groups [1,2,3] using extracts from Drosophila S2 cells that express all the MSLs and roX2 RNA.  Surprisingly, Akthar et al. [2] found that DCC lacking MLE but containing the remaining MSLs still precipitated roX2 RNA.  The ordered assembly data from Meller et al. [1] as well as the fact that MLE is the only protein in the DCC featuring a known RNA-binding domain pointed towards the RNA helicase MLE as the primary candidate for a partner for the roX2 RNA.  Akthar et al. [2] turned their focus towards the MOF protein as a potential candidate for interaction with roX2 RNA.  MOF is the crucial component that links the DCC to the X-chromosome-specific acetylated form of histone H4.  MOF encodes an 827 amino-acid protein containing an amino terminus of unknown function, a chromodomain, a zinc finger and a carboxy-terminal histone acetyltransferase (HAT) domain.  The HAT domain of MOF is responsible for histone H4 acetylation, both in vivo and in vitro, as shown by Smith et al. [3].  Interestingly, immunoprecipitation assays using extracts from SL-2 cells transfected with wild-type or mutant mof genes indicated that the chromodomain is essential for the specific interaction between MOF and roX2, but that the amino-terminal region and HAT domain were not essential for this interaction.  Akthar et al. [2] then tested the putative RNA-binding property of MOF directly using electromobility shift assays.  MOF appeared to bind to RNA rather non-specifically, but preferred RNA to DNA.  In agreement with the immunoprecipitation results, they found that an intact chromodomain is essential for RNA binding, whereas the amino terminus or the HAT domain of MOF was not essential.  Furthermore, they also showed that MSL3, which features two chromodomains on its own, also shows roX RNA binding properties.  Although the specificity of these interactions remains to be investigated, the findings of Akthar and coworkers [2] provide a new possible role for chromodomains involved in RNA binding.  Chromodomains are modules of about 50 amino acids found in a number of proteins from yeast to mammals; the function of these proteins vary, but they are most often associated with gene silencing and chromatin remodeling events; significantly, many chromodomain-containing proteins are associated on chromosomes with heterochromatin or heterochromatin-like regions (reviewed in [17]).  The specific role of the chromodomain is unknown, but chromodomain swapping experiments in Drosophila suggest that they might be protein interaction modules [18].  Thus, the data by Akthar et al. [2] suggests a new and rather unexpected role for these modules in RNA binding.  It is intriguing to speculate that the chromodomain of other remodeling proteins also exert their activity through RNA interactions and that RNAs might be more common than generally appreciated in transcriptional regulation.

Ordered assembly of MSL proteins and roX RNA increases stability of the DCC.

The first step of assembly involves the recognition of about 35 chromosomal entry sites by a preDCC consisting of MSL1 and MSL2.  This recognition does not require any of the known roX RNAs.  In the next step, roX2 and MLE enter the complex to form a more stable primary DCC (PrimDCC) at these 35 sites.  Spreading throughout the entire X chromosome requires the formation of the complete complex by addition of MOF and MSL3.  The sequential addition of new components, particularly the roX2 RNA, might induce changes in the structure of already incorporated components (illustrated by ovals becoming circles), increasing the stability of the DCC.  MLE might be removed in vitro (for example, by elevated ionic strength) without destroying the entire complex, because of the stabilizing presence of MOF-MSL3.  The role of roX1 RNA is not clear, but it is integrated late in this process, together with or after MOF and MSL3; roX1 might provide additional stability to the mature DCC (MatDCC).  So, roX1 and roX2 might be partially redundant.

Multiple RNA-protein interactions within the DCC

How can we reconcile the findings of Akthar et al. [2] and Meller et al. [1]?  First, it might be relevant that the co-immunoprecipitations and the in vitro binding experiments were performed with SL-2 cell extracts, cells in which dosage compensation appears not to be necessary; SL-2 cells, like other Drosophila cell lines, can become aneuploid without reduced viability.  The soluble DCCs from these extracts are therefore not necessarily functional and might differ from those in males.  Second, the interaction between roX RNA and MOF protein appears to lack specificity.  The lack of specificity might be attributed to a number of reasons, such as the absence of other MSL components, the presence of other RNAs interacting with MOF, or worse, it might reflect a property of MOF without functional significance: that is, MOF might not contact RNA at all in vivo.  Keeping this cautious note in mind, a more attractive alternative could be envisioned.  The main proposition of this model is that roX2 can interact with both MLE and MOF, and perhaps even other MSL proteins, as both roX RNAs are certainly large enough to accommodate binding sites for several proteins [6].  If such a scenario is correct, each of the interactions between different components of the DCC might be quite weak, and perhaps not even highly specific, but in the context of additional interactions becomes more stable.  The addition of each new component might therefore strengthen existing interactions by optimizing contacts between binding surfaces through adjustments in the higher-order structure of the components (Figure 1).  Evidence from co-immunoprecipitation experiments indicates that MSL1 and MSL2 interact directly with each other, a notion supported by the MSL1-MSL2 association at the 35 chromosomal entry sites in the absence of any other MSLs (the preDCC; Figure 1).  The addition of both MLE and roX2 RNA might allow the formation of a primary complex (the primDCC), which might be stabilized by the direct interaction of a structural motif of roX2 RNA with the MLE protein, as well as interactions of this complex with MSL1 and MSL2.  The primDCC might then be reinforced by the addition of the remaining two proteins, MSL3 and MOF, one or both of which might recognize other motifs within the roX2 RNA, to form the mature complex (matDCC).  Finally, matDCC might be further stabilized by the addition of roX1 RNA, which could interact with several of the MSLs and perhaps roX2 RNA as well.  Significantly, assembly and disassembly of the complex might not necessarily follow the same order.  For example, once the complex is formed, (removal (for example, by high ionic strength) of a single component brought in relatively early (such as MLE) might be possible without affecting the rest of the complex).  One major task at hand now will be to resolve the specificities of the proposed interactions between the various MSL proteins and the roX RNAs.  This could be done by carrying out detailed in vitro studies, which might not be easy to perform; many MSL proteins are difficult to express stably in bacteria or eukaryotic expression systems.  An alternative might be to establish a heterologous in vivo system such as yeast, where protein-protein and RNA-protein interactions can be readily detected using sensitive reporter assays.  A second important question is whether the roX genes have the same, overlapping or complementing functions.  Although roX1 mutant males are fully viable [7], it was suggested that roX1 mutant males lacking a large number of additional genes, including roX2, have a lethal dosage-compensation phenotype because the DCC fails to assemble on the X chromosome [19].  Unfortunately, no roX2 mutations have been recovered yet; they would be important tools for validating the significance of the roX RNAs in vivo.  Furthermore, roX1 and roX2 are only the first two of about 35 chromatin entry sites, although perhaps they are the most important ones.  The nature of the remaining 33 sites is entirely unclear and there is, as yet, no evidence that they contain other roX-like RNAs.  This brings us to the largest of all mysteries, namely how the DCC is spread along the X chromosome.  One possibility is that the remaining 33 or so sites are 'stations' that serve as spreading facilitators in the form of DNA elements.  Thus, primDCCs and/or mature (mat) DCCs, originating from the roX entry sites, might hop from one 'station' to the next and eventually reach all the entry sites distributed along the entire X chromosome [1].  From these sites, they then reach into the neighboring chromatin regions by yet another mechanism.  One final thought: the dosage compensation machinery in C. elegans acts on the two X chromosomes in the hermaphrodite by twofold down-regulation of gene expression.  Six proteins are essential for this process and they associate with the X chromosomes; four of them are related to factors involved in chromosome condensation during mitosis in other systems.  Not unlike in Drosophila, the DCC complex of C. elegans is also assembled in an ordered sequence, but no RNAs have been identified that might be involved in this assembly process [20]; either there will be yet another surprising turn in this multifaceted process or the CroX (C. elegans roX) RNAs wait to be discovered.