School of Mathematics, Statistics and Applied Mathematics, National University of Ireland Galway, Galway, Ireland

Department of Statistical Sciences, Faculty of Science, University of Cape Town, Cape Town, South Africa

Centre for HIV and STIs, National Institute for Communicable Diseases of the National Health Laboratory Services, Johannesburg, South Africa

School of Pathology, University of the Witwatersrand, Johannesburg, South Africa

Institute of Infectious Diseases and Molecular Medicine, Division of Medical Virology, Faculty of Health Sciences, University of Cape Town and NHLS, Cape Town, South Africa

Beth Israel Deaconess Medical Center, Division of Viral Pathogenesis, Harvard, USA

Biomedical Informatics Research Division, eHealth Research and Innovation Platform, Medical Research Council, Tygerberg, South Africa

Computer Science Division, Department of Mathematical Sciences, University of Stellenbosch, Stellenbosch, South Africa

Los Alamos National Laboratory, Los Alamos, NM, USA

Vaccine Research Center, National Institute of Allergy and Infectious Diseases, NIH, Bethesda, MD, USA

Department of Surgery, Duke University Medical Center, Durham, NC, USA

Centre for the AIDS Programme of Research in South Africa, University of KwaZulu-Natal, Durban, South Africa

Centre for the AIDS Programme of Research in South Africa, Columbia University, Columbia, USA

Santa Fe Institute, Santa Fe, NM, USA

Abstract

Background

Identification of the epitopes targeted by antibodies that can neutralize diverse HIV-1 strains can provide important clues for the design of a preventative vaccine.

Methods

We have developed a computational approach that can identify key amino acids within the HIV-1 envelope glycoprotein that influence sensitivity to broadly cross-neutralizing antibodies. Given a sequence alignment and neutralization titers for a panel of viruses, the method works by fitting a phylogenetic model that allows the amino acid frequencies at each site to depend on neutralization sensitivities. Sites at which viral evolution influences neutralization sensitivity were identified using Bayes factors (BFs) to compare the fit of this model to that of a null model in which sequences evolved independently of antibody sensitivity. Conformational epitopes were identified with a Metropolis algorithm that searched for a cluster of sites with large Bayes factors on the tertiary structure of the viral envelope.

Results

We applied our method to ID_{50} neutralization data generated from seven HIV-1 subtype C serum samples with neutralization breadth that had been tested against a multi-clade panel of 225 pseudoviruses for which envelope sequences were also available. For each sample, between two and four sites were identified that were strongly associated with neutralization sensitivity (2ln(BF) > 6), a subset of which were experimentally confirmed using site-directed mutagenesis.

Conclusions

Our results provide strong support for the use of evolutionary models applied to cross-sectional viral neutralization data to identify the epitopes of serum antibodies that confer neutralization breadth.

Background

A successful HIV-1 vaccine is likely to require the induction of neutralizing antibodies that can prevent infection. HIV-1 entry into host cells is mediated by the HIV-1 envelope glycoprotein, which forms a trimeric structure on the surface of the virus. Each of these envelope “spikes” consists of three identical, non-covalently associated heterodimers of surface gp120 and transmembrane gp41. Antibodies that bind the envelope can be detected within eight days of infection

The HIV-1 envelope has evolved an array of mechanisms that hinder binding by neutralizing antibodies. The envelope glycoprotein is genetically variable, conformationally flexible and heavily glycosylated, resulting in either escape from antibody recognition or shielding of neutralization sensitive sites

Broad serum neutralization could potentially be mediated by a polyclonal set of neutralizing antibodies that accumulate over time and target several distinct regions of envelope

B cell epitope prediction is complicated by the conformation-dependent nature of antigen-antibody binding. Although more than 90% of antibody epitopes are estimated to be conformational in nature, most experimental and computational methods are designed to identify only linear epitopes

Our method incorporates neutralization sensitivities directly into a phylogenetic model of molecular evolution. Amino acid sites at which the pattern of evolution correlated with changes in neutralization sensitivity across the phylogenetic tree were identified. We hypothesized that many of the sites that were associated with changes in the neutralization sensitivity of multiple viruses lie within antibody epitopes. In order to identify sites in the alignment that were most likely to influence neutralization sensitivity, we used Bayes factors to compare the fit of a model that allows the amino acid frequency at each site to depend on neutralization sensitivities (epitope model) to that of a model which assumes independence (non-epitope model). A contiguous set of sites on the primary sequence that favored the epitope model provided evidence for a B cell epitope (which could be linear or conformational), while a set of sites with large Bayes factors that were clustered in three-dimensional space provided evidence of a conformational epitope. We used this approach to predict epitopes targeted by the broadly neutralizing antibodies present in the sera of seven HIV-1 subtype C-infected individuals enrolled in the CAPRISA Acute Infection study

Results

We have previously identified sera from seven HIV-1 subtype C-infected women in the CAPRISA 002 cohort that showed substantial neutralization breadth at 3 years post-infection against a panel of 42 viruses from subtypes A, B and C _{50}). The neutralization breadth, as measured by the overall percentage of viruses neutralized (ID_{50} > 20) by each serum, ranged from 82% (CAP255) to 94% (CAP206) (Figure _{50} exceeding 10,000 against some viruses (Figure

**ID_{50} titers for each serum sample.** **(A)** Heat map of ID_{50} titers clustered by the viral phylogeny. The percentages below the heat map indicate the percentage of panel viruses that were neutralized (ID_{50} > 20) by the antibodies in each serum. **(B)** Histograms of the natural logarithm of ID_{50} titer. The distribution for CAP256 was notably different from that of the other sera.

Identification of amino acid residues targeted by the broadly neutralizing serum of CAP256

We first analyzed the data from CAP256 which showed the highest neutralization titers (Figure

Given a coding sequence alignment and a phylogenetic tree for the virus panel, we used evolutionary models to identify sites in the alignment at which the pattern of evolution along branches of the tree was dependent on the neutralization titer of the virus at the tip of each branch (see Methods for details). Scaled Bayes factors (denoted as 2ln(BF_{k}) for HXB2 amino acid position

Our model provided striking evidence of an association between neutralization sensitivity and the amino acids present at sites 166 (2ln(BF_{166}) = 25.4) and 169 (2ln(BF_{169}) = 10.7) (see Figure

**Scaled Bayes factors for CAP256.** Neutralization titers were strongly associated with sites 166 and 169 when the ConC reference sequence was used. These sites have previously been shown to contribute significantly to the epitope targeted by CAP256 antibodies _{50}: white indicates no or negligible evidence, light grey indicates moderate evidence and dark grey indicates strong evidence _{50} are annotated with their HXB2 position and the amino acid found to be enriched among sensitive (high titer) viruses at that site.

**Figure S1.** Model predictions for CAP256. Scaled Bayes factors using the (A) autologous CAP256, (B) CAP210 and (C) CAP45 reference sequences. (D) Bayesian evolutionary-network model. (E) Posterior probabilities of a conformational epitope using the three-dimensional model with a ConC reference sequence. The posterior probabilities are shaded as described in the legend for Figure

**Figure S2.** Model predictions for CAP8. Scaled Bayes factors using the (A) ConC, (B) Q23 and (C) TRO reference sequences. (D) Bayesian evolutionary-network model. (E) Posterior probabilities of a conformational.

**Figure S3.** Model predictions for CAP257. Scaled Bayes factors using the (A) ConC and (B) Q842 reference sequences. (C) Bayesian evolutionary-network model. (D) Posterior probabilities of a conformational epitope using the three-dimensional model with a ConC reference sequence. The posterior probabilities are shaded as described in the legend for Figure

**Figure S4.** Model predictions for CAP255. Scaled Bayes factors using the (A) autologous CAP255, (B) TRO and (C) Q23 reference sequences.

**Figure S5.** Model predictions for CAP177. Scaled Bayes factors using the (A) ConC, (B) Q23 and (C) TRO reference sequences. (D) Bayesian evolutionary-network model.

**Figure S6.** Model predictions for CAP206. Scaled Bayes factors using the (A) ZM197, (B) autologous CAP206, (C) CAP45, (D) Q23, (E) COT6 and (F) TRO reference sequences. (G) Bayesian evolutionary-network model.

**Figure S7.** Model predictions for CAP248. Scaled Bayes factors using the (A) ConC, (B) CAP45 and (C) DU156 reference sequences. (D) Bayesian evolutionary-network model.

**Figure S8.** Scaled Bayes factors for the CAP256 serum obtained after imputing ancestral titers with the median observed titer and titers reconstructed with Felsenstein’s method.

| **Serum** | **HXB2 position** | **Reference residue**^{‡} | **Scaled Bayes factor (LFDR)** | **Backbone** | **ID_{50} fold effect** |
|---|---|---|---|---|---|
| CAP256 | 166^{†} | Arg | 25.4 (0.0001) | CAP45 | 261.3* |
| | | | | ConC | 22.2* |
| | 169^{†} | Lys | 10.7 (0.083) | CAP45 | 365.8* |
| | | | | ConC | 15.3* |
| CAP8 | 24 | Met | 6.4 (1.000) | NT | |
| | 295 | Asn | 11.9 (1.000) | TRO | 0.4 |
| | 316^{†} | Thr | 9.1 (1.000) | ConC | 1.8 |
| | 535 | Ile | 6.0 (1.000) | NT | |
| CAP257 | 166 | Arg | 6.3 (0.738) | ConC | 1.8 |
| | 295 | Asn | 7.0 (0.665) | Q842 | 0.1 |
| | 648 | Glu | 6.1 (0.757) | Du156 | 1.2 |
| | | | | RHPA | 2.9* |
| | 702 | Leu | 6.2 (0.747) | NT | |
| CAP255 | 332 | Asn | 8.0 (0.622) | Q23 | 4.0* |
| | | | | TRO | 2.9* |
| | | | | ConC | 0.2 |
| | 334 | Ser | 6.8 (0.750) | TRO | >12.4* |
| | | | | ConC | 0.3 |
| | 351 | Glu | 6.8 (0.750) | NT | |
| | 837 | Phe | 10.1 (0.366) | NT | |
| CAP177 | 209 | Thr | 8.8 (0.388) | TRO | 0.4 |
| | 332^{†} | Asn | 7.3 (0.573) | ConC | 1.7 |
| | | | | Q23 | 11.0* |
| | | | | TRO | >2.8* |
| | 334^{†} | Ser | 7.8 (0.511) | ConC | 2.1* |
| | | | | TRO | 0.2 |
| | 683 | Lys | 6.3 (0.689) | NT | |
| CAP206 | 150 | Met | 6.7 (0.457) | NT | |
| | 655 | Lys | 7.3 (0.384) | NT | |
| CAP248 | 85 | Val | 6.5 (1.000) | NT | |
| | 340 | Glu | 6.0 (1.000) | ConC | 2.2* |
| | | | | CAP45 | 0.6 |
| | 651 | Asn | 7.4 (1.000) | ConC | 2.0* |
| | | | | CAP45 | 1.3 |
| | | | | Du156 | 2.4* |
| | 659 | Asp | 8.9 (1.000) | ConC | 2.9* |
| | | | | CAP45 | 0.5 |

^{†}Sites with ^{‡}Amino acid found to be significantly enriched among sensitive (high titer) viruses based on our evolutionary model. *Fold effect ≥ 2. NT = Not tested.

We also found weak evidence of an association with neutralization sensitivity at sites 162, 163, 167, 176, 177, 181, 183, 187, 193 and 194 in V2 (2 < 2ln(BF_{k}) < 6). The fact that our model assigns moderately large scaled Bayes factors to this cluster of sites is encouraging. However, there were many other sites throughout gp120 and gp41 with scaled Bayes factors in this intermediate region (2 < 2ln(BF_{k}) < 6), as might be expected when multiple model comparisons are performed. To account for these multiple comparisons, we computed the local false discovery rate (LFDR) associated with the Bayes factor at each site (see Methods). As expected, the large Bayes factors at sites 166 and 169 were highly likely to be true positives with low LFDRs of 0.0001 and 0.083, respectively (Table

Prediction of sites targeted by PG9/PG16-like antibodies

In order to explore the utility of our model, we tested two additional sera predicted to have a similar specificity to CAP256, but with considerably lower titers. Four sites were found to be strongly associated (2ln(BF_{k}) ≥ 6) with neutralization by CAP8 serum antibodies (see the scaled Bayes factors in Table _{50} titers. Site 316 in V3 was also found to be significantly associated with neutralization sensitivity (2ln(BF_{316}) = 9.1). Mutation of this residue in the ConC backbone resulted in a large drop in ID_{50} titers from 11,000 to 6,000. Mutations in the V3 region are known to modulate neutralization sensitivity of the conserved V2 epitope recognized by PG9/PG16-like antibodies. Therefore, although no sites within the conserved C-strand in V2 were detected, the detection of residue 316 in V3 was consistent with a known neutralizing specificity in the CAP8 serum, which was previously shown to target a PG9/PG16-like trimer-specific epitope 2ln(BF_{24}) = 6.4) in the envelope signal peptide and site 535 (2ln(BF_{535}) = 6.0) in gp41 were also found to be associated with titer, though we do not expect these to contribute to CAP8 antibody neutralization and therefore did not test them experimentally.
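The fold effects in the table can be read as a ratio of ID_{50} titers before and after mutation. The sketch below assumes that definition (wild-type titer divided by mutant titer), which reproduces the 1.8-fold value reported for site 316 in ConC; the function names are illustrative, and the cutoff of 2 is the one given in the table footnote.

```python
def fold_effect(wild_type_titer: float, mutant_titer: float) -> float:
    """Ratio of wild-type to mutant ID50 titer (assumed definition)."""
    if wild_type_titer <= 0 or mutant_titer <= 0:
        raise ValueError("titers must be positive")
    return wild_type_titer / mutant_titer


def is_confirmed(fold: float, cutoff: float = 2.0) -> bool:
    """A prediction counts as confirmed when the fold effect reaches the cutoff."""
    return fold >= cutoff


# Site 316 in the ConC backbone: titers dropped from 11,000 to 6,000.
conc_316 = fold_effect(11_000, 6_000)
print(round(conc_316, 1), is_confirmed(conc_316))  # 1.8 False
```

Under this reading, the drop at site 316 falls just short of the two-fold criterion, which is why it carries no asterisk in the table.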

For CAP257 serum, we identified signals at positions 166 (2ln(BF_{166}) = 6.3) and 295 (2ln(BF_{295}) = 7.0) in the V2 and V3 regions of gp120, respectively, and at positions 648 (2ln(BF_{648}) = 6.1) and 702 (2ln(BF_{702}) = 6.2) in gp41. Sites 166, 295 and 648 were tested with site-directed mutagenesis and the substitution of alanine at site 648 in the RHPA backbone was found to reduce ID_{50} titers by more than 2 fold (Table

Prediction of sites associated with N332-dependent broadly neutralizing antibodies

Two CAPRISA sera that targeted another broadly neutralizing antibody epitope were tested with our model. The Bayes factors for CAP255 serum identified three sites in the C3 region of gp120 (sites 332, 334 and 351) (see Figure 2ln(BF_{k}) ≥ 6) with neutralization titer, namely positions 332 and 334. For site 332, all of the reference sequences contained an asparagine at this position and yielded a large scaled Bayes factor of 8.0, suggesting that viruses with this amino acid were sensitive to CAP255 antibody neutralization. This result was supported by site-directed mutagenesis in the Q23 (4.0 fold) and TRO (2.9 fold) envelope backbones but not in ConC. Our model predictions also showed a large scaled Bayes factor of 6.8 at site 334, which forms part of the same N-linked glycosylation motif, when the reference residue was serine. The involvement of this site was confirmed experimentally in the TRO envelope backbone (>12.4 fold) only. The Q23 reference sequence has a threonine at this position which yields a Bayes factor close to zero, suggesting that this amino acid is not enriched among the sensitive viruses in the panel. This is perhaps surprising, given that a threonine at this position also permits the attachment of an N-linked glycan at site 332. The identification of 332 and 334 as contributing to the CAP255 epitope confirms previous mapping data showing that these antibodies are dependent on the N332 glycan in the C3 region

**Scaled Bayes factors for CAP255.** There was strong evidence of an association with ID_{50} titer at sites 332, 334 and 351 when the ConC reference sequence was used. The predictions at sites 332 and 334 were tested and validated experimentally (see Table

In addition to sites 332 and 334, we also obtained a large scaled Bayes factor of 6.82 at position 351 when this site contained an isoleucine in the reference sequence for the CAP255 data. Our model also predicted site 837 in gp41. While we do not expect that this position lies within a CAP255 epitope, it is possible that amino acid changes in gp41 contribute to neutralization sensitivity by influencing epitope accessibility through conformational changes to the envelope complex

In addition to the three sites in the C3 region with 2ln(BF_{k}) ≥ 6, several surrounding sites were weakly associated with CAP255 neutralization sensitivity and had 2 < 2ln(BF_{k}) < 6 for at least one reference sequence (see Figure

**Three-dimensional predictions for CAP255.** **(A)** Amino acid residues that were weakly (2 ≤ 2ln(BF_{k}) < 4), moderately (4 ≤ 2ln(BF_{k}) < 6) and strongly (2ln(BF_{k}) ≥ 6) associated with ID_{50} titer using the ConC reference sequence are shown with light, intermediate and dark green, respectively. There is evidence for a cluster of sites with moderately large Bayes factors on the three-dimensional surface, as might be expected of a B cell epitope. **(B)** Posterior probabilities of a conformational epitope using the three-dimensional Metropolis algorithm. The surface of the protein (PDB ID: 2B4C) was shaded from dark blue (posterior probability = 0) to red (posterior probability = 1) according to the posterior probability assigned to each amino acid residue. There was evidence for a conformational epitope involving the C3 region (residues in the light blue region have posterior probabilities of approximately 0.2).

It is possible that the association with neutralization sensitivity at some of these spatially-proximal sites might be a consequence of compensatory mutations. To investigate this, we used the Bayesian graphical model of Poon et al. _{50} titer. A network graph indicating the direct associations (with posterior probabilities > 0.75) of all sites with 2ln(BF_{k}) > 4 using the CAP255 data is shown in Figure _{50} titer and therefore that none of our predicted associations were likely to be due to compensation for resistance-imparting mutations elsewhere. Resistance to CAP255 serum could therefore be attributed to mutations at several sites that either constitute the binding interface of a single antibody or represent independent targets of multiple antibodies. This was the case for all sites with 2ln(BF_{k}) > 6 across all sera (see the supporting information for each serum).

**Bayesian evolutionary-network model for CAP255.** The red node corresponds to ID_{50} titer and all other nodes represent sites in the HIV-1 envelope. Nodes with scaled Bayes factors > 6 are shaded in dark green, while nodes with scaled Bayes factors between 2 and 6 are shaded in light green. An edge connecting two nodes indicates that there is a direct association between the two nodes. Edges are labeled with the estimated posterior probability of an interaction between the nodes they connect. Only sites with scaled Bayes factors > 4 or posterior probabilities of an association with such a site > 0.75 are shown. Since all of the sites with Bayes factors > 4 are directly connected to the ID_{50} node, none of the predicted associations could be attributed to compensatory mutations.

For CAP177 serum, our model predicted four sites that influenced neutralization sensitivity with strong support (2ln(BF_{k}) ≥ 6). These included sites 209 in the C2 region, 332 and 334 in the C3 region and 683 in the membrane proximal external region of gp41 (Table 2ln(BF_{332}) = 7.3). This result was experimentally validated in the Q23, TRO and, to a lesser extent, ConC envelope backbones (see Table _{50} titer, although this same mutation produced a five-fold increase in ID_{50} titers with the TRO backbone. Nonetheless, these experimental results confirmed the importance of this site as a determinant of neutralization titer. This is in line with experimental mapping studies in CAP177, which, like CAP255, showed the presence of antibodies recognizing 332-dependent PGT-like epitopes, though these differ subtly from those in CAP255 in their dependence on variable loops 2ln(BF_{209}) = 8.8) influenced neutralization sensitivity to CAP177 sera could not be experimentally validated in the envelope backbone tested here. We did not attempt to validate site 683 (2ln(BF_{683}) = 6.3).

Overall, our model was very effective in predicting sites associated with sensitivity to PGT-like antibodies that are dependent on the glycan at position 332 in both CAP255 and CAP177. These specificities appear to be relatively common among broadly neutralizing sera

Identification of sites forming part of epitopes in gp41

For CAP206, our model predicted two amino acid positions that were strongly associated with antibody neutralization across six different reference sequences (see Table 2ln(BF_{150}) = 6.7), was in the V1 region, while the other prediction, site 655 (2ln(BF_{655}) = 7.3), was near the membrane proximal external region in gp41. The latter prediction is consistent with available mapping data, showing that neutralization breadth in CAP206 is mediated largely by antibodies targeting this region

For the final serum examined, CAP248, mapping studies have shown that the serum antibodies target a quaternary epitope that has not yet been defined, but is not PG9/PG16-like 2ln(BF_{651}) = 7.4) and 659 (2ln(BF_{659}) = 8.9) adjacent to the membrane proximal external region of gp41 and neutralization sensitivity (Table 2ln(BF_{85}) = 6.5) and 340 (2ln(BF_{340}) = 6.0) in gp120 as significantly associated with titer. The prediction at position 340 was confirmed by mutagenesis, while site 85 was not tested experimentally.

Discussion

We have developed a novel computational approach to identify amino acid residues in HIV-1 envelope glycoproteins that are targeted by serum neutralizing antibodies. The method can be used to identify neutralizing antibody epitopes when neutralization sensitivity to the serum is available for a large panel of sequenced viruses. Such data will become increasingly available as large-scale efforts to investigate HIV-1 neutralization serotypes are undertaken. The method is an extension of the evolutionary model of Lacerda et al.

This method was applied to neutralization datasets of seven HIV-1 subtype C serum samples that were previously shown to have neutralization breadth _{50} were available from a multi-clade panel of 225 pseudoviruses. Our model identified two to four sites per sample and 24 predictions across all sera that were strongly associated with neutralization sensitivity. We were able to confirm ten of the fifteen sites that we tested using site-directed mutagenesis. In many cases, these corresponded to sites that had previously been linked to antibody neutralization in these and other broadly neutralizing sera. This included two positions in the V2 region (166 and 169) that contributed to a trimer-specific PG9/PG16-like epitope

Our model is similar to the one developed by Gnanakaran et al.

For comparison with our results, we applied the method of Gnanakaran et al.

**Table S1.** Reference sequences.

**Table S2.** Estimates used to compute LFDRs.

**Table S3.** Significant associations obtained with the method of Gnanakaran et al. [15].

**Table S4.** Sites with scaled Bayes factors ≥ 6 using reconstructed titers.

The effects of experimental amino acid mutations in different backbones were highly variable, as has been observed in numerous previous studies

To account for multiple testing, we computed the local false discovery rate (LFDR) associated with each of our predictions; that is, the probability that a site is incorrectly found to be associated with titer given the codon data at the site and the prior probability of no association (see Methods). To compute the prior probability of no association, we first estimated the probability of obtaining a positive scaled Bayes factor when no association exists by randomly shuffling the assignment of titers to sequences. For computational reasons, we were only able to perform 100 such permutations. We then made the conservative assumption that a site that is truly associated with titer will always have a positive scaled Bayes factor. Consequently, the LFDR estimates reported in Table
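The LFDR calculation described above can be sketched as follows. This is a reconstruction from the verbal description, not the authors' code: it assumes the LFDR is the posterior probability of the null model obtained from Bayes' theorem, and that the prior `p0` follows from the mixture relation P(B > 0) = p0·P_N(B > 0) + (1 − p0)·P_E(B > 0) with the conservative choice P_E(B > 0) = 1, where B is the scaled Bayes factor. The function names, the clipping of `p0` to [0, 1] and the input proportions are illustrative assumptions.

```python
import math


def prior_null(p_positive: float, p_positive_null: float) -> float:
    """Prior probability of the null model, assuming P_E(B > 0) = 1.

    p_positive:      observed proportion of sites with a positive scaled Bayes factor
    p_positive_null: the same proportion after permuting titers (null estimate)
    """
    if p_positive_null >= 1.0:
        return 1.0
    p0 = (1.0 - p_positive) / (1.0 - p_positive_null)
    return min(max(p0, 0.0), 1.0)  # clipping is an assumption of this sketch


def lfdr(scaled_bf: float, p0: float) -> float:
    """Posterior probability of the null model given a scaled Bayes factor 2 ln(BF)."""
    if p0 >= 1.0:
        return 1.0
    if p0 <= 0.0:
        return 0.0
    bf = math.exp(scaled_bf / 2.0)
    return 1.0 / (1.0 + (1.0 - p0) / p0 * bf)


# Hypothetical proportions, chosen only to illustrate the calculation.
p0 = prior_null(p_positive=0.30, p_positive_null=0.26)
print(lfdr(25.4, p0), lfdr(6.0, p0))
```

Under these assumptions, a prior estimate that reaches 1 forces every LFDR for that serum to 1 regardless of the size of the Bayes factor.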

To identify amino acid positions that collectively provide support for the existence of a conformational epitope, we introduced a model in which all sites within a sphere on the tertiary structure evolved according to the epitope model, while the evolution of all other sites was described by the non-epitope model. A Metropolis algorithm was used to explore the posterior density of the center and radius of the sphere and thereby determine the most likely location and size of a conformational epitope on the envelope structure. This approach predicted a known epitope in the C3 region targeted by the CAP255 neutralizing serum. No three-dimensional clusters with high posterior probability were predicted in CAP177, CAP248 or CAP206. The three-dimensional prediction algorithm did, however, find evidence for an epitope targeted by the CAP256, CAP8 and CAP257 antibodies in the V3 region of gp120 (largest posterior probability of 0.367, 0.654 and 0.361, respectively; see Additional file
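The sphere search can be illustrated with a small random-walk Metropolis sampler. This is a simplified sketch rather than the implementation used here: it assumes sites are independent and that each site inside the sphere contributes a fixed log Bayes factor (epitope versus non-epitope model), whereas the full method evaluates phylogenetic likelihoods. The flat priors, the bounding box for the center, the maximum radius, the starting point and the toy coordinates are all illustrative assumptions.

```python
import math
import random


def log_posterior(center, radius, coords, log_bf, lo, hi, r_max):
    """Unnormalized log posterior of a sphere under flat priors.

    Sites inside the sphere follow the epitope model, so each contributes its
    log Bayes factor relative to a configuration in which every site is null.
    """
    in_box = all(l <= c <= h for c, l, h in zip(center, lo, hi))
    if not in_box or not 0.0 < radius <= r_max:
        return -math.inf
    return sum(lb for xyz, lb in zip(coords, log_bf)
               if math.dist(xyz, center) <= radius)


def metropolis_sphere(coords, log_bf, n_iter=5000, step=1.0, r_max=15.0, seed=0):
    """Return, for each site, the fraction of iterations in which it lay inside the sphere."""
    rng = random.Random(seed)
    lo = [min(c[d] for c in coords) for d in range(3)]
    hi = [max(c[d] for c in coords) for d in range(3)]
    # Start at the site with the largest log Bayes factor (a convenience choice).
    center = list(coords[max(range(len(coords)), key=lambda i: log_bf[i])])
    radius = r_max / 2.0
    current = log_posterior(center, radius, coords, log_bf, lo, hi, r_max)
    hits = [0] * len(coords)
    for _ in range(n_iter):
        prop_center = [c + rng.gauss(0.0, step) for c in center]
        prop_radius = radius + rng.gauss(0.0, step)
        proposed = log_posterior(prop_center, prop_radius, coords, log_bf, lo, hi, r_max)
        # Symmetric proposal, so the acceptance ratio is the posterior ratio.
        if rng.random() < math.exp(min(0.0, proposed - current)):
            center, radius, current = prop_center, prop_radius, proposed
        for i, xyz in enumerate(coords):
            if math.dist(xyz, center) <= radius:
                hits[i] += 1
    return [h / n_iter for h in hits]


# Toy structure: five clustered sites favoring the epitope model, eight distant sites against it.
cluster = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 0.0)]
background = [(x, y, z) for x in (15.0, 30.0) for y in (15.0, 30.0) for z in (15.0, 30.0)]
coords = cluster + background
log_bf = [3.0] * len(cluster) + [-2.0] * len(background)
posterior = metropolis_sphere(coords, log_bf)
```

The per-site inclusion frequencies play the role of the posterior probabilities shaded on the envelope structure; in this toy example the clustered sites receive much higher values than the distant ones.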

The unobserved titers at all ancestral nodes represent a missing data problem. Unfortunately, integration over all of these unknown variables is not a computationally feasible option for large phylogenies. Instead, one must resort to an imputation procedure. In the model presented here, the ancestral nodes were all assigned the median value of the observed titers at the tips of the phylogeny. We obtained very similar results when codon evolution along the internal branches was modeled with a standard, titer-independent MG94×HKY85 model. Although the internal branches alone cannot provide evidence of an association with neutralization sensitivity in our current model implementation, they are informative about the parameters of the evolutionary model and may therefore increase the power of our method to detect associations based on evolution along the terminal branches. This further distinguishes our approach from other phylogenetically-corrected methods, such as that of Gnanakaran et al.

To our knowledge, this is the first attempt to identify conformational B cell epitopes on a tertiary structure based on the evolutionary history of a panel of viruses. In principle, our model could be combined with other structure- and mimotope-based methods that assign scores to residues based on their structural properties and peptide-binding affinities.

Conclusions

Our method is an effective tool for detecting sequence positions that contribute to neutralizing antibody sensitivity, even within quaternary epitopes on the trimeric envelope complex. However, in the absence of information on the host immune responses experienced by each of the viruses in the panel, we cannot determine whether the correlations between amino acid frequencies and neutralization phenotype at these sites are a consequence of selective immune pressure or random genetic drift. Furthermore, sites that influence neutralization sensitivity through insertions and deletions that alter epitope binding affinity or through shifts in glycosylation patterns will not be detected by our approach

Methods

Ethics statement

The CAPRISA Acute Infection study was reviewed and approved by the research ethics committees of the University of KwaZulu-Natal (E013/04), the University of Cape Town (025/2004) and the University of the Witwatersrand (MM040202). All participants provided written informed consent for study participation.

Neutralization and sequence data

We analyzed neutralization datasets from seven HIV-1 subtype C-infected women from the CAPRISA 002 Acute Infection Study _{50}). The data for CAP256 was used to decide on the optimal modeling strategy.

HIV-1 gp160 sequences from 225 panel viruses were codon aligned with the hidden Markov model implemented in the HIVAlign tool of the Los Alamos National Laboratory (LANL) HIV database (

HIV-1 gp160 alignment.

GenBank accession numbers and neutralization titers for the virus panel.

The alignment was screened for recombination using GARD

A single, maximum likelihood phylogeny for the panel of viruses was inferred with PhyML

Evolutionary models

Our computational approach to identifying sites targeted by broadly neutralizing antibodies is an extension of the method of Lacerda et al.

We adapted the evolutionary model of Halpern and Bruno, in which the rate q^{k}_{ij} of substitution from codon i to codon j at site k is proportional to the product of the mutation rate μ_{ij} and the probability of fixation f^{k}_{ij} relative to that of a neutral mutation; that is, q^{k}_{ij} ∝ μ_{ij} × f^{k}_{ij} / (1/N),

where

We parameterized the site-specific equilibrium frequencies as

where π_{k} is the equilibrium frequency of the reference amino acid at site k, φ_{j} is the equilibrium frequency of codon j and φ_{Γ} = ∑_{i ∈ Γ} φ_{i}, with Γ the set of codons that encode the reference amino acid. The factor involving the φ_{j} terms distributes the reference amino acid frequency among the codons that encode it in such a way as to maintain the codon usage bias observed over the entire alignment.

We used an HKY85 model for the mutation rate μ_{ij}, with the nucleotide equilibrium frequencies estimated empirically from the full alignment. Because we fitted this model to coding sequences, our estimates will not only reflect the mutational process, but will also capture selection induced through the genetic code. We do not expect that our model results will be sensitive to this misspecification. Ideally, the mutation parameters should be estimated from non-coding nucleotide sequences φ_{j} expected in the absence of selection.

Assuming a time-reversible mutation process, substituting (2) into (1) yields ^{
k
}
_{
ij
} = 1/_{
k
} of the reference amino acid at site _{
k
} × 1/_{
k
} = 1 indicates that nonsynonymous and synonymous mutations are fixed with the same probability at site _{
k
} < 1 and _{
k
} > 1 imply purifying and diversifying selection at site

A similar codon model that distinguishes between diversifying and directional selection was recently considered by Murrell et al.

To identify codon sites associated with sensitivity to antibodies, we allowed the equilibrium frequency π_{k} of the reference amino acid to depend on neutralization titers. More specifically, let π_{0k} be the equilibrium frequency of the reference amino acid at site k for sensitive (high ID_{50} titer) viruses and let π_{1k} denote this frequency for resistant (low ID_{50} titer) viruses. We set π_{k} = t π_{0k} + (1 – t) π_{1k}, where t is the scaled ID_{50} neutralization titer. Since the frequency of the reference residue among sensitive viruses was expected to be at least as large as that among resistant viruses, we constrained π_{0k} ≥ π_{1k}, where equality implied that a site was not associated with antibody neutralization.

For the model defined in Equation (3), nonsynonymous substitutions toward the reference amino acid occur at a higher rate if the relative equilibrium frequency of the reference residue π_{k}/(1-π_{k}) is larger than its expected value in the absence of directional selection, φ_{Γ}/(1-φ_{Γ}). Substitutions in the opposite direction are favored if the converse is true. Thus, a neutralization-resistant virus (π_{0k} > π_{1k}. Similarly, substitutions in favor of the reference residue will be more likely at these sites among neutralization-sensitive viruses (

The value of _{50} titer was treated as a continuous variable and mapped onto the [0,1] scale to obtain values for _{50} titer distribution, we defined

Although Equation (3) defines a time-reversible model along each branch of the phylogeny, the process is not time reversible when considered over the entire tree. Consequently, the location of the root node and initial codon frequencies must be specified for valid likelihood-based inference. This specification is trivial when the median titer is used to define

Given an inferred topology and F1×4 codon equilibrium frequencies, all model parameters, except π_{0k} and π_{1k}, were estimated by fitting a GY94 × HKY85 model in HyPhy _{k} drawn from a three-category general discrete distribution. Fixing the set, _{k} | π_{0k}, π_{1k}) of the codon data D_{k} at site π_{0k}, π_{1k}): 0 ≤ π_{1k} ≤ π_{0k} ≤ 1 using Felsenstein’s pruning algorithm. To compute the likelihood of our epitope model, we treated π_{0k} and π_{1k} as nuisance parameters with a flat joint distribution and integrated them out of the likelihood function using the adaptive numerical integration routine available in the cubature R package. We also computed the likelihood of a non-epitope model that does not allow for an association between an alignment site and neutralization sensitivities by setting π_{0k} = π_{1k} = π_{k} and integrating π_{k} out of the likelihood function with π_{k} ~
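As a rough illustration of the marginalization step, the following Python sketch averages a per-site likelihood over the two frequency parameters (called `pi0` and `pi1` in the code) on the region 0 ≤ pi1 ≤ pi0 ≤ 1 under a flat joint prior. It is not the authors' implementation, which used the adaptive routine in the cubature R package. The function names, the midpoint grid rule, the toy likelihood and the uniform prior used for the non-epitope model are all assumptions introduced here.

```python
def integrated_likelihood_epitope(likelihood, n=200):
    """Average likelihood(pi0, pi1) over the triangle 0 <= pi1 <= pi0 <= 1.

    A uniform point (u, v) on the unit square maps to a uniform point
    (max(u, v), min(u, v)) on the triangle, so a midpoint rule on the
    square integrates against the flat joint prior (density 2).
    """
    total = 0.0
    for i in range(n):
        u = (i + 0.5) / n
        for j in range(n):
            v = (j + 0.5) / n
            total += likelihood(max(u, v), min(u, v))
    return total / (n * n)


def integrated_likelihood_null(likelihood, n=200):
    """Average likelihood(pi, pi) over pi in [0, 1].

    The uniform prior on the shared frequency is an assumption of this sketch.
    """
    return sum(likelihood((i + 0.5) / n, (i + 0.5) / n) for i in range(n)) / n


def toy_likelihood(pi0, pi1):
    # Hypothetical stand-in for the pruning-algorithm likelihood at one site.
    return pi0 * (1.0 - pi1)


p_epitope = integrated_likelihood_epitope(toy_likelihood)
p_null = integrated_likelihood_null(toy_likelihood)
```

In practice the likelihood argument would wrap a phylogenetic likelihood evaluation, and an adaptive rule would usually be preferred to a fixed grid when the integrand is sharply peaked.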

The data-based evidence in support of the epitope model relative to the non-epitope model at site

where P_{E}(D_{k}) and P_{N}(D_{k}) are the integrated likelihoods under the epitope and non-epitope models, respectively. According to Kass and Raftery, a scaled Bayes factor B_{k} < 6 can be interpreted as positive but weak evidence against the null model, while B_{k} ≥ 6 indicates strong evidence against the null model. We used the latter criterion to identify sites for experimental validation.
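The evidence thresholds can be expressed as a short Python sketch. It assumes that the scaled Bayes factor is twice the natural logarithm of the ratio of integrated likelihoods, as on the Kass and Raftery scale; this scaling, the function names and the label used for non-positive values are assumptions made for illustration, while the cut-off of 6 is taken from the text.

```python
import math


def scaled_bayes_factor(log_lik_epitope, log_lik_null):
    """Return 2 * ln(BF) from two log integrated likelihoods (assumed scaling)."""
    return 2.0 * (log_lik_epitope - log_lik_null)


def evidence_category(scaled_bf, strong_threshold=6.0):
    """Classify a scaled Bayes factor using the cut-off described in the text."""
    if scaled_bf >= strong_threshold:
        return "strong"
    if scaled_bf > 0.0:
        return "positive but weak"
    return "none"  # label for non-positive values is an assumption


# Hypothetical integrated likelihoods for one site.
b_k = scaled_bayes_factor(math.log(5 / 12), math.log(1 / 6))
category = evidence_category(b_k)
```

Working with log likelihoods avoids numerical underflow when the integrated likelihoods are very small.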

By performing model comparisons at each of 818 amino acid positions per serum, we anticipated that some sites with a scaled Bayes factor B_{k} ≥ 6 would be false positives. To address this issue, we computed the local false discovery rate associated with each of our scaled Bayes factors. The local false discovery rate (LFDR) is defined as the probability of the null (non-epitope) model given the codon data D_{k} at site k and the prior probability p_{0} of the null model. The LFDR for site

and rearranging terms to obtain

Use of Equation (4) required an estimate of the prior probability p_{0} of the null model, which we obtained from the following relation

where P_{N}(B_{k} > 0) and P_{E}(B_{k} > 0) denoted the probability of a positive scaled Bayes factor B_{k} if the data were generated under the non-epitope and epitope models, respectively. The probability P(B_{k} > 0) can be approximated as the proportion of observed scaled Bayes factors greater than zero for any particular serum sample. To approximate P_{N}(B_{k} > 0), we permuted the assignment of ID_{50} titers to the pseudoviruses and recorded the proportion of scaled Bayes factors greater than zero. For this purpose, we considered only sites at which the frequency of the most prevalent amino acid was <95%, since invariant sites do not provide evidence for either model. The median proportion over 100 such permutations was used as an estimate of P_{N}(B_{k} > 0). (The mean proportion was not substantively different.) These estimates are reported in Additional file. Given P_{E}(B_{k} > 0), the prior probability p_{0} of the null model may be computed from Equation (5) and used to obtain the LFDR for a given Bayes factor. We set P_{E}(B_{k} > 0) = 1 to obtain an upper bound on the LFDR. Our reported false discovery rates should therefore be regarded as conservative.
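The calculation described above can be sketched in Python as follows. The sketch assumes that the relation used to estimate the prior probability of the null model is the usual mixture identity (the observed proportion of positive scaled Bayes factors equals the prior-weighted average of the proportions expected under the two models), and that the LFDR is the posterior probability of the null model obtained from Bayes' theorem. The function names and the example numbers are hypothetical.

```python
from statistics import median


def null_proportion(permuted_scaled_bfs):
    """Median, over permutations, of the fraction of scaled Bayes factors > 0."""
    return median(sum(b > 0 for b in bfs) / len(bfs) for bfs in permuted_scaled_bfs)


def estimate_p0(prop_observed, prop_null, prop_epitope=1.0):
    """Solve prop_observed = p0 * prop_null + (1 - p0) * prop_epitope for p0.

    prop_epitope = 1 gives the conservative (largest) value of p0.
    """
    if prop_epitope <= prop_null:
        raise ValueError("prop_epitope must exceed prop_null")
    p0 = (prop_epitope - prop_observed) / (prop_epitope - prop_null)
    return min(1.0, max(0.0, p0))  # clip to a valid probability


def local_fdr(bayes_factor, p0):
    """Posterior probability of the null model, with bayes_factor = P_E / P_N."""
    return p0 / (p0 + (1.0 - p0) * bayes_factor)


# Hypothetical proportions: 60% of observed and 50% of permuted values are positive.
p0 = estimate_p0(prop_observed=0.6, prop_null=0.5)
lfdr = local_fdr(bayes_factor=4.0, p0=p0)
```

Note that `local_fdr` takes the unscaled Bayes factor; a scaled value would first have to be converted back according to whatever scaling was applied.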

Conformational epitope prediction

The method described above can identify individual sites associated with neutralization titer. A set of such predictions in close proximity on the primary sequence provides evidence for a linear B cell epitope. Similarly, a set of predictions clustered on the three-dimensional structure provides evidence for a conformational epitope. For any set of amino acids,

We introduced a simple spherical model of an epitope that can identify sets of sites that support the epitope model and are clustered in three-dimensional space. A Metropolis algorithm was used to explore the posterior density of the location and size of the sphere (see

We implemented the Metropolis algorithm in Mathematica 8, using the 2B4C Protein Data Bank structure for gp120 and the likelihoods computed for each site with HyPhy

A subsequent sphere was then proposed as described above and the process repeated until the Markov chain converged to its stationary distribution. We used a burn-in period of 10 000 iterations, after which spheres were sampled from their posterior distribution at every 100th iteration for 500 000 iterations.
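The sampling schedule (burn-in followed by thinning) can be illustrated with a generic Metropolis sampler in Python. This is only a sketch and not the Mathematica implementation used in the study: the Gaussian random-walk proposal, the bounded flat prior on the sphere centre and radius, and the toy target (the sum of log Bayes factors of sites falling inside the sphere) are all assumptions introduced for illustration.

```python
import math
import random


def metropolis(log_post, start, propose, n_iter=500_000, burn_in=10_000,
               thin=100, rng=None):
    """Metropolis sampler with a symmetric proposal, burn-in and thinning."""
    rng = rng or random.Random()
    state, lp = start, log_post(start)
    samples = []
    for it in range(burn_in + n_iter):
        cand = propose(state, rng)
        lp_cand = log_post(cand)
        # Accept with probability min(1, posterior ratio).
        if lp_cand >= lp or rng.random() < math.exp(lp_cand - lp):
            state, lp = cand, lp_cand
        # After burn-in, keep every thin-th state.
        if it >= burn_in and (it - burn_in + 1) % thin == 0:
            samples.append(state)
    return samples


def propose_sphere(state, rng, step=1.0):
    # Symmetric Gaussian random-walk step on (x, y, z, radius).
    return tuple(v + rng.gauss(0.0, step) for v in state)


def make_log_post(coords, log_bfs, max_radius=20.0, bound=50.0):
    """Hypothetical target: sum of log Bayes factors of sites inside the sphere."""
    def log_post(state):
        x, y, z, r = state
        if not 0.0 < r <= max_radius or max(abs(x), abs(y), abs(z)) > bound:
            return -math.inf  # outside the assumed flat prior
        return sum(lb for (a, b, c), lb in zip(coords, log_bfs)
                   if (a - x) ** 2 + (b - y) ** 2 + (c - z) ** 2 <= r * r)
    return log_post


# Toy coordinates and log Bayes factors for three hypothetical sites.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (30.0, 30.0, 30.0)]
log_bfs = [3.0, 2.0, -4.0]
samples = metropolis(make_log_post(coords, log_bfs), (0.0, 0.0, 0.0, 5.0),
                     propose_sphere, n_iter=2000, burn_in=200, thin=100,
                     rng=random.Random(1))
```

With the default arguments the sampler discards 10 000 iterations and then retains every 100th state over 500 000 further iterations, giving 5 000 retained spheres.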

Identification of coevolving sites

We used a modified version of the evolutionary network model of Poon et al.

Phylogenetically-corrected Fisher’s exact tests

We compared the predictions of our evolutionary model to those of the signature detection method of Gnanakaran et al. _{50} titers were less than the first quartile, less than the median or less than the third quartile. Ancestral sequences were reconstructed by maximum likelihood using a general time reversible model of nucleotide evolution with site-specific rate heterogeneity. For each possible ancestral amino acid at each site, a contingency table was constructed to test whether the proportion of neutralization sensitive viruses was significantly different between extant viruses that contained the ancestral amino acid at the site versus those that did not. Due to the large number of tests performed,
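The contingency-table test at a single site can be sketched in Python as shown below. The sketch covers only the final step: it builds the 2×2 table from per-virus indicators and computes a two-sided Fisher's exact test p-value. It does not perform the ancestral reconstruction or any adjustment for multiple testing, and the function names and example data are hypothetical.

```python
from math import comb


def contingency_table(has_ancestral, is_sensitive):
    """Return (a, b, c, d): rows = ancestral residue present/absent,
    columns = neutralization sensitive/resistant."""
    pairs = list(zip(has_ancestral, is_sensitive))
    a = sum(1 for h, s in pairs if h and s)
    b = sum(1 for h, s in pairs if h and not s)
    c = sum(1 for h, s in pairs if not h and s)
    d = sum(1 for h, s in pairs if not h and not s)
    return a, b, c, d


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical indicators for eight extant viruses at one site.
has_ancestral = [True, True, True, True, False, False, False, False]
is_sensitive = [True, True, True, False, True, False, False, False]
p_value = fisher_exact_two_sided(*contingency_table(has_ancestral, is_sensitive))
```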

Experimental validation

A subset of the residues predicted to be associated with neutralization sensitivity were tested for their effect on neutralization sensitivity in one to three sensitive envelope backbones per serum sample using site-directed mutagenesis. Residues of interest were mutated to alanine in all cases except for position 169, which was mutated to glutamic acid as described previously _{50}) causing 50% reduction of relative light units. The effect of mutations on neutralization sensitivity was calculated as a fold change in neutralization titers of the mutant virus compared to the unmutated parental clone. A model prediction was regarded as validated if a mutation at the corresponding site produced at least a twofold reduction in ID_{50} titer in at least one backbone.
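The validation criterion can be written as a small Python sketch. The function names and the titer values in the example are hypothetical; the twofold threshold and the requirement of at least one backbone are taken from the text.

```python
def fold_reduction(parent_id50, mutant_id50):
    """Fold reduction in ID50 titer of the mutant relative to the parental clone."""
    return parent_id50 / mutant_id50


def is_validated(titer_pairs, min_fold=2.0):
    """True if any backbone shows at least a min_fold reduction in ID50 titer.

    titer_pairs: iterable of (parent_id50, mutant_id50), one pair per backbone.
    """
    return any(fold_reduction(p, m) >= min_fold for p, m in titer_pairs)


# Hypothetical titers for one site tested in two envelope backbones.
validated = is_validated([(400.0, 350.0), (1000.0, 250.0)])
```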

Software availability

The likelihoods of the evolutionary models were computed on a computer cluster using the parallel processing capabilities of HyPhy and the R Language and Environment for Statistical Computing. We are currently working on an online tool to implement this methodology. All computer code is freely available from the corresponding author.

Availability of supporting data

The data set supporting the results of this article is provided in Additional file

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

CW and SSAK conceived, implemented and led the CAPRISA 002 Acute Infection study. LM, CW and CS conceived and designed the experiments. PLM, ESG, MN, MM, CKW, DS, MS, RTB and JM performed the experiments. ML, NKN, BM, BTMK and MK analyzed data. SSAK, HG and KG contributed reagents and materials. ML, PLM, LM, CW and CS wrote the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We thank the participants in the CAPRISA Acute Infection cohort, the CAVD HIV Specimen Cryorepository (HSC), the clinical and laboratory staff at CAPRISA for providing specimens and SCHARP for data management (Blake Wood, Mark Bollenbeck, Linda Harris). The authors would like to thank Dr AFY Poon for use of his HyPhy scripts to implement the graphical model of co-evolution.