Wednesday, December 3, 2014

Pathway analysis

The goal of this post is to utilize the Ingenuity pathway analysis (IPA) to evaluate our gene of interest ACADM.  In previous posts I have described this gene locus and the disease associated with it.  The IPA will be used to search for drugs that may interact with the product of the gene locus and review pathways associated with the locus and its products.

Using the "Genes and Chemicals" search tool and the identifier "ACADM" the entry of our gene of interest was located.  Unfortunately there were no drugs associated with the gene as seen below in the far right cell:
Additionally, linking out to the gene information page also showed the drugs table to be missing.  This is not surprising given the mechanism of action of the gene and its disease, Medium-chain acyl-coenzyme A dehydrogenase deficiency (MCADD).  The disorder manifests itself when periods of fasting occur and the body attempts to utilize fatty acids as a source of energy.  The failure to produce functional ACADM protein results in a failure to generate energy from these fatty acids, and the metabolism starts to fall apart.  This can be prevented through the use of glucose and simple carbohydrates.  As diet has been shown to be enough to manage the disease, there would be little incentive for a drug targeting this site to be developed.


To begin the analysis of the pathways, ACADM was added to a new pathway construct.  Using the "build" panel and selecting the "grow" tools, molecules were limited to only those that have direct interactions and are found in humans, with the molecule types [biologic drug, chemical(8)] excluded.  All molecules both upstream and downstream were allowed.  The limit was left at 10 for the first pass of the analysis, but this proved not to be a limiting factor as only two molecules were returned, PPARGC1A and ESRRA (no trimming was required), as seen here on the left using the "auto-layout".  On the right we see the sub-cellular layout, indicating that both molecules act from the nucleus into the cytoplasm on ACADM:

Both of the observed molecules were of the [E] type, or expression, meaning that the activity of these two proteins increases RNA expression from this locus, and blocking the activity of these proteins reduces the amount of RNA seen from this locus.  No activation, no inhibition, and no protein-protein interaction was seen.
Switching to the "Overlay" tab and selecting Canonical Pathways showed a number of pathways recognized, as seen below:
However, only 2 were from ACADM: Fatty Acid B-oxidation I and Leucine Degradation I.  As failure of Fatty Acid B-oxidation I is what actually causes the disease MCADD, it was chosen for further analysis.  The view of the pathway can be seen here:
With ACADM as the purple triangle.  Both of the pathways passing through ACADM are of the type RE, meaning they are enzymatic reactions, part of the breakdown of fatty acids into energy.  The report for this pathway indicates that "Although enzymes of the pathway handle both short and long chain fatty acids, it is the long chain compounds that induce the enzymes of the pathway. Each turn of the cycle removes two carbon atoms until only two or three remain. When even-numbered fatty acids are broken down, a two-carbon compound remains, acetylCoA. When odd number fatty acids are broken down, a three-carbon residue results, propionylCoA." (link)  As with the gene itself, no drugs are indicated that interact with this pathway.

Monday, December 1, 2014

Genome Analysis Part:2

Our patient will present for genome testing with the disease ACYL-CoA DEHYDROGENASE, MEDIUM-CHAIN, DEFICIENCY OF; ACADMD and with the variant dbSNP:rs77931234 (as discussed in previous posts) in the gene ACADM.

In the NCBI browser the variant of interest can be seen here at position 75761161 in GRCh38:

Here a larger view can be seen of the exon the variant is in:

The locations for GRCh37 and GRCh38 can be seen here from the dbSNP entries:


The VCF entry for a homozygous mutation would look as follows:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CB00001
1 75761161 rs77931234 A G 25 PASS NS=1;DP=35;DB GT:GQ:DP 1/1:52:35
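
To make the columns of that entry concrete, here is a minimal Python sketch that splits the record and reads the genotype.  The function names are my own for illustration, the INFO column is written here as NS=1;DP=35;DB (an assumption, and it is not interpreted by the code), and a real VCF file is tab-separated with more header lines than shown:

```python
# Column names from the header line, and the example record from this post.
# INFO is written as NS=1;DP=35;DB here (an assumption); it is not parsed below.
HEADER = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CB00001".split()
RECORD = "1 75761161 rs77931234 A G 25 PASS NS=1;DP=35;DB GT:GQ:DP 1/1:52:35".split()


def parse_record(header, fields):
    """Map column names to values and unpack the sample column using FORMAT."""
    row = dict(zip(header, fields))
    sample_name = header[-1]
    keys = row["FORMAT"].split(":")
    row[sample_name] = dict(zip(keys, row[sample_name].split(":")))
    return row


def zygosity(gt):
    """Classify a diploid genotype string such as '0/1' or '1|1'."""
    alleles = gt.replace("|", "/").split("/")
    if all(a == "0" for a in alleles):
        return "homozygous reference"
    if len(set(alleles)) == 1:
        return "homozygous alternate"
    return "heterozygous"


row = parse_record(HEADER, RECORD)
print(row["ID"], row["REF"], ">", row["ALT"], zygosity(row["CB00001"]["GT"]))
```

The genotype 1/1 means both copies carry the G allele, which is what the reference lab reported for this patient.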




Genome Analysis Part:1

The goal of this post is to view the current state of clinical whole genome sequencing (WGS) and clinical whole exome sequencing (WES) from the point of view of an individual living with a disease whose diagnosis does not completely explain their symptoms.

The patient in this example has been previously diagnosed with Medium-chain acyl-coenzyme A dehydrogenase deficiency through phenotype characterization, which was confirmed by molecular diagnosis (PCR) at a reference lab showing the patient to be homozygous for the mutation rs77931234.  The diagnosis was made when he was young, and his disease has been well managed.  Around puberty new symptoms began to appear which are not normally associated with the known disease phenotype.  The patient and family have tried a variety of medications with mixed results, often with symptoms reappearing after six months or so of stability.  A battery of both metabolite and genetic tests has been run with no clear diagnosis, but metabolite values are often outside of normal ranges. The family has been on a classic "diagnostic odyssey" with no clear explanation.


Fortunately, this family are patients at a regional medical center in a large metropolitan area and have been referred to the Individualized Medicine Clinic.  After being seen in the clinic and returning for a follow-up, the family is presented with 4 options for genome-scale testing in an attempt to find possible genetic causes for the additional symptoms:

1 - Participation in a clinical trial offering full exome analysis for the patient and the parents at no cost
2 - Run a full genome for the patient only and try to get costs covered by insurance (4-6 months)
3 - Run a full genome on the patient and pay out of pocket ($5-$10k)
4 - Use a direct to consumer service and perform analysis of the raw results

The family, unsure of which option is best, has sent an email describing the situation and asking for advice on what they should do.

My recommendation will be colored by my experience as a bioinformatician working in a genome center who has had experience with research protocols, clinical genotyping, WGS and WES analysis.  I would strongly recommend the family go with option 1 "clinical trial offering full exome to the family", for the following reasons.

While we are quickly approaching a future of genome-enabled medical care, with a deep understanding of how the genome works and how changes affect health, we have not reached that point.  This is the primary problem with options 2 and 3.  These options both have the additional problem of "who owns the data": if the patient pays for it, how would a sequencing center deliver this information for interpretation?  If the insurance pays for it, do they have a right to dig through the data and adjust your rates based on the findings?  Sending a hard drive to a patient (or a clinician) who is unfamiliar with the standard file formats or tools used to evaluate the data is not a real solution.  The additional complication of a genome returning on the order of 4-5 million variants or more makes manual interpretation impossible.  As WGS tools advance and become more common, a deeper understanding of the genome will emerge.  However, we currently rely on the data compiled in the previous century, which focused heavily on gene and pathway analysis and on how single variants would impact these systems.  As a result, most of the variants associated with pathogenic conditions are contained inside the gene sequence itself, leaving much of the intergenic sequence dark and unexplored, and variation in these regions poorly characterized.  Option 4 (direct to consumer), while possibly cheaper, means you have to find an informatician and possibly contract them to do the analysis, and then follow up by finding a clinician to return some interpretation of the variants found.  The problem is compounded by the variety of vendors, prices and quality of data produced.  23andMe is actually a very reliable data producer; unfortunately, interpretation of results for medical care is hamstrung by the FDA's recent threats against the company.
Those companies that provide WGS or WES as a service require careful evaluation, as many cut costs (and corners) by providing a lower threshold of data returned (say 10x coverage instead of the industry-accepted 25-40x for WGS), reducing the confidence of calls in complex gene regions.

Many of these problems are solved by option 1, the clinical exome trial.  First, it limits data analysis to gene exons, which is where most of our variant knowledge exists, meaning that variants found through this tool will likely have been seen and characterized in a publicly available database such as OMIM, dbSNP, or the Exome Variant Server, etc.  The recommendation of exome trio analysis is also part of the presentation.  Having the parents sequenced in parallel with the child allows for the removal of the variants which were inherited from two unaffected parents, leaving those that are likely the source if the disease is genetic in origin.  While trio analysis also has a long way to go for the detection of heterozygous gene dropout, it remains one of the most powerful tools in genetic analysis.  Being part of a clinical study implies that there will be other sequencing runs done, and having the data be part of a well quality-controlled pool and workflow is just as important as the trio analysis.
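
As a rough illustration of the trio idea, the Python sketch below drops any variant where an unaffected parent carries at least as many copies as the child.  The variant names and genotypes are made up, the rule assumes full penetrance, and real trio pipelines handle many more cases (missing calls, X-linked variants, compound heterozygotes), so treat this as a toy under those assumptions:

```python
def alt_count(gt):
    """Number of alternate alleles in a diploid genotype string such as '0/1'."""
    return sum(1 for a in gt.replace("|", "/").split("/") if a not in ("0", "."))


def trio_candidates(calls):
    """Keep variants the child carries in more copies than either unaffected parent."""
    kept = []
    for variant_id, (child, mother, father) in calls.items():
        c, m, f = alt_count(child), alt_count(mother), alt_count(father)
        if c == 0:
            continue  # child does not carry the variant at all
        if c <= max(m, f):
            continue  # an unaffected parent carries the same (or a larger) dose
        kept.append(variant_id)
    return kept


# Hypothetical calls: (child, mother, father)
toy_calls = {
    "var_shared_het": ("0/1", "0/1", "0/0"),
    "var_recessive":  ("1/1", "0/1", "0/1"),
    "var_de_novo":    ("0/1", "0/0", "0/0"),
    "var_absent":     ("0/0", "0/1", "0/0"),
    "var_parent_hom": ("1/1", "1/1", "0/1"),
}
candidates = trio_candidates(toy_calls)
print(candidates)
```

With these toy calls only the recessive pattern (both parents carriers, child homozygous) and the de novo pattern survive the filter.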

While I would love to recommend that everybody get a WGS analysis done, for a single patient trying to make sense of the complex and rapidly changing world of genetic analysis, the current state of the art would suggest exome analysis as the most reasonable method with the greatest clinical application.



Sunday, November 9, 2014

Molecular Diagnostic test design

The goal of this post is to review the known clinical variants as annotated in ClinVar, focusing on the pathogenic variants assigned to my pet gene ACADM.  Following the review of all possible pathogenic variants, the previously discussed variant of interest LYS304GLU, created by the genomic variation dbSNP:rs77931234, will be used to design a diagnostic assay using restriction enzymes and a simulation of the resulting gel profile in DNAStar.

Using the NCBI ClinVar search tool with "ACADM" as the gene symbol returned 36 variants.  The clinical significance varied, with 2 having "conflicting interpretations", 8 being "benign", 5 of "uncertain significance", 1 identified as a "risk factor", and the remaining 23 all tagged as "pathogenic".  Of the 23 "pathogenic" variants, 20 were single nucleotide polymorphisms (SNPs) and 3 were variations of large structural deletions of chromosome 1 (1p32.1-31.1, 1p32.3-31.1, 1p31.1-13.3), which are not unique to this gene or location.  Some of the variants tagged as pathogenic were associated with the disease Medium-chain acyl-coenzyme A dehydrogenase deficiency, a disease known to be associated with this locus, while the remaining had no associated condition provided.  As for the recognition of these variants, no variants have been marked (as of 11/8/2014) by either a professional society or an expert panel.  Most were marked only as having been submitted by a single submitter, but 8 were tagged as being submitted from multiple sources.

Using Genetests.org and searching for the gene ACADM returned the associated disease "Medium Chain Acyl-Coenzyme A Dehydrogenase Deficiency" and the 1p31 locus.  A large number of tests were returned, both in the US and globally.  Limiting results to the US only still returned a large number of results, covering both prenatal and carrier testing and a variety of types: 2 biochemical, 43 molecular, and 18 tests for this gene which are part of a larger panel.  The methods varied a great deal, from PCR and array-based deletion, duplication, and copy number variation detection to 15 capillary sequencing assays, 4 genotyping assays by microarray and bead-based methods, and 13 "Next Gen" sequencing tests, which are single assays that cover many genes.  There were many core facilities but also a large number of Children's Hospitals all across the US.  Given these results I would characterize testing for this gene and its disease as widely available.

For the construction of the simulated assay, the reference sequence for the ACADM gene locus was obtained for genome build GRCh38.  This was obtained from the NCBI gene finder, where the FASTA sequence was downloaded for NC_000001.11:75724347-75763679.  The sequence was uploaded into DNAStar's SeqBuilder tool, and the location of the variation was found by using dbSNP to locate the page for dbSNP:rs77931234, the variation of interest.  The variation flanking sequences "tctggtaactcattctagctagttcaactt" and "cattgccatttcagccagcataaatgatat" were used to locate the base of interest.  The location of the variant can be seen in the image here:
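
The flank search can also be sketched in a few lines of Python.  The function names are my own, and the reference here is a short toy string built from the two flanks rather than the real NC_000001.11 slice.  Both strands are checked because the flanks may be reported on the opposite strand from the sequence you downloaded, which appears to be the case here since the flanks only match the GCAATGAAAGTT context once reverse complemented:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def locate_variant(reference, left_flank, right_flank):
    """Return (0-based index, strand) of the single base between the flanks, or None."""
    ref = reference.upper()
    left, right = left_flank.upper(), right_flank.upper()
    orientations = (
        ("+", left, right),
        ("-", reverse_complement(right), reverse_complement(left)),
    )
    for strand, first, second in orientations:
        start = ref.find(first)
        while start != -1:
            idx = start + len(first)  # the base right after the first flank
            if ref[idx + 1: idx + 1 + len(second)] == second:
                return idx, strand
            start = ref.find(first, start + 1)
    return None


LEFT = "tctggtaactcattctagctagttcaactt"
RIGHT = "cattgccatttcagccagcataaatgatat"

# Toy reference: padding + flanks in the opposite orientation around an 'A'.
toy_reference = "GGGG" + reverse_complement(RIGHT) + "A" + reverse_complement(LEFT) + "CCCC"
hit = locate_variant(toy_reference, LEFT, RIGHT)
print(hit, toy_reference[hit[0] - 6: hit[0] + 6])
```

Against the real download, the returned index plus the start coordinate of the slice would give the genomic position of the base.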

The reference sequence of interest is "GCAATGAAAGTT" with the base of interest highlighted in blue, and its variation of A->G is "GCAATGGAAGTT" with the variation seen in orange.  In order to find an enzyme which would cut the sequence of interest, the NEBcutter web tool and Restriction Mapper were both used to try to find a restriction enzyme which would be affected by the variation.  Two enzymes were returned by both tools, BsrDI and AgsI.  BsrDI cuts right at the site of the variation; however, the site of recognition is not impacted by the variation.  The location of cutting and the location of recognition can be seen in the image here:
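
The check for whether a variant creates or destroys a recognition site can be sketched as below.  The BsrDI recognition sequence is written as GCAATG from my memory of the supplier listing, so treat it as an assumption to verify, and "ATGAAA" is a made-up site used only to show what a diagnostic result would look like:

```python
import re

# IUPAC codes expanded to regex character classes (only the codes used here).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGTRYSWN", "TGCAYRSWN")


def reverse_complement(site):
    return site.upper().translate(COMPLEMENT)[::-1]


def site_positions(seq, recognition):
    """0-based starts of every match to the recognition site on either strand."""
    seq = seq.upper()
    hits = set()
    for site in (recognition.upper(), reverse_complement(recognition)):
        pattern = "".join(IUPAC[base] for base in site)
        # Lookahead so that overlapping matches are all reported.
        hits.update(m.start() for m in re.finditer(f"(?={pattern})", seq))
    return sorted(hits)


def is_diagnostic(ref, alt, recognition):
    """True when the variant changes the set of recognition sites."""
    return site_positions(ref, recognition) != site_positions(alt, recognition)


REF = "GCAATGAAAGTT"
ALT = "GCAATGGAAGTT"
print(site_positions(REF, "GCAATG"), site_positions(ALT, "GCAATG"))
print(is_diagnostic(REF, ALT, "GCAATG"))   # assumed BsrDI site
print(is_diagnostic(REF, ALT, "ATGAAA"))   # hypothetical site, for contrast
```

Because the variant base sits just outside the GCAATG recognition sequence, both alleles keep the site, which matches what the two web tools showed.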


By using the tools in SeqBuilder to display all of the enzymes that cut a specific region, I was also able to find MspJI, AbaSI, AgsI, and many others as seen here:

Unfortunately none of these were specific to cutting only this location, and many of these enzymes cut multiple locations within the same 20 base pair region, eliminating them as useful for detection of the variation.  To simulate a gel, an enzyme which cuts the sequence in fewer than 7 locations will be used, and one cutting locus will be altered.  In order to simulate the gel, the mRNA sequence for ACADM (the transcript encoding NP_000007.1) will be used.  It is much smaller than the genomic sequence, containing only ~2600 bases.  The enzyme we will use for the simulation will be the previously identified BsrDI.  While it isn't useful for detecting our variation, it cuts near it and it only cuts the sequence in one other place, meaning that the loss of one site should be observable in a gel.
The digestion of the reference strand results in three bands; two are very close to each other, and a tech would need to be sure to run the gel long enough to see these bands separate.  Destruction of the BsrDI site reduced the number of bands to 2, with the two similar sized pieces appearing as one large band at roughly 375 bases.
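
The band pattern logic is simple enough to sketch in Python.  The molecule length and cut positions below are made-up numbers chosen to mimic "two close bands plus a third", not the real ACADM fragment sizes from the DNAStar simulation:

```python
def digest(length, cut_positions):
    """Fragment lengths for a linear molecule cut after each listed position."""
    cuts = sorted({p for p in cut_positions if 0 < p < length})
    edges = [0] + cuts + [length]
    return [right - left for left, right in zip(edges, edges[1:])]


# Hypothetical 1000 base molecule with two cut sites.
reference_bands = digest(1000, [390, 800])
# Same molecule after one site is destroyed.
mutant_bands = digest(1000, [800])

print(sorted(reference_bands, reverse=True))  # three bands, two close in size
print(sorted(mutant_bands, reverse=True))     # two bands
```

Losing a site always merges two neighboring fragments into one larger fragment, so the band count drops by one and a new, heavier band appears.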

While it was unfortunate that the pathogenic variation could not be captured using an RFLP, a TaqMan primer and probe design could also be tried for the detection of this specific mutation.







Saturday, October 11, 2014

Variation evaluation using 3D modeling - ACADM with CN3D

In a previous post the impact of a single amino acid substitution was evaluated using the DNAstar software tools.  By loading a reference protein sequence and looking at the predicted structure, then simulating a substitution and evaluating the predicted structural changes that the variation would cause, we had hoped to observe a tangible change in the predicted secondary structure.  Unfortunately, substituting the amino acid LYS304GLU (LYS329GLU) into the ACADM sequence did not appear to cause a significant change in the protein.  In this post we will look at the 3 dimensional structure of the protein as determined by crystallography, compare it with the DNAstar secondary structure prediction to determine the accuracy of the algorithmic structural prediction, and philosophize about the possible impact of the variant.

To begin, an appropriate crystal structure needed to be selected.  Using the NCBI "structure" search tool with 'ACADM' as the search term, 8 structures were returned.  After limiting to Taxonomy: Homo sapiens, the first sequence was selected as it was the only option which did not contain a variant.  The PDB ID for the sequence is 1T9G, and it has available structures for Cn3D and PDB.  The Cn3D file was downloaded, as was the Cn3D software version 4.3.1.

An image of the protein (space filling model) can be seen here:

Because the protein is large and takes up a great deal of space a tube model can be seen here:


The protein appears to have 7 subunits, each with its own sequence.  Assuming the sequence is correct, this would help explain the difficulty in predicting a folding change caused by a single amino acid substitution.  Because of the complexity of many subunits interacting together to catalyze a reaction, a small change which may not appear to affect a single subunit could have a subtle impact on how the subunits fold into each other, resulting in the larger failure of the enzyme and the clinical phenotype presentation.

Seven subunits come together to form the larger enzyme complex.  Of these 7, 4 are identical and have a sequence matching the protein sequence of ACADM.  A space filling model of a single unit can be seen here:

To evaluate the location of our amino acid substitution, the original sequence, without the substitution was used to find the identical region.  Here we see the original sequence on top and the same sequence below with the amino acid substitution added, the amino acid substitution which was evaluated previously can be seen below in red.


The reference sequence in yellow was used to identify the same location in the protein structure and it can be seen here highlighted in yellow:

To view the secondary structure of the protein, the tube model can be seen below.  On the left is an overall view of the protein as a tube model and on the right is a zoomed-in view of the region of interest.

To refresh our memory from the last post here is the image of the DNAstar prediction of the secondary structure, with the amino acid of interest highlighted in black:

When comparing the structural prediction algorithms of DNAstar to the crystal structure, we can see that the algorithms, for the most part, are correct.  The Garnier-Robson and the Chou-Fasman algorithms correctly predicted the alpha-turn helix structure at the location of interest while also correctly predicting the absence of a Beta-sheet or flex region.  The Eisenberg algorithm, however, failed to predict the Alpha-helix and instead predicted a Beta-sheet in the area of the variation.

The prediction of the protein secondary structure by DNAstar appeared to be correct when compared to the crystal structure; however, no change in the secondary structure was observed when the variation was substituted in and the structure was reevaluated.  As stated earlier, the active protein is actually composed of 7 subunits, and 4 of these subunits are composed of our protein of interest ACADM.  By highlighting the region where our variation would occur across the same 4 subunits, we can see it here in yellow (yellow arrows are used to point to the region of interest).

These regions appear to be in close proximity to the other proteins in the final form of the active protein complex.  Given the proximity of these residues and the change in charge caused by the substitution, it is possible that this substitution actually interferes with the forming of the complex itself.  It would be interesting to design an assay which would label the subunits while preserving the subunit interactions with each other.  A gel could be run to check for size separation, with the reference sequence appearing in two positions: a smaller signal from the single subunits which have yet to be incorporated into a protein complex, and a second signal from the larger protein complex.  The assay could be repeated with cells containing the genetic mutation which results in the amino acid substitution of interest.  The smaller signal should still be detected, but if the substitution interferes with the complex formation there should be no signal from the larger protein complex.

There are many ways the amino acid change could affect the protein.  It could block binding of the substrate itself, or interfere in some complex way with the catalytic site of the larger protein complex, or there may even be other more abstract interactions which could cause a problem.  Based on the DNAstar results and the observations of the structural view, there don't appear to be structural changes in the single subunit.




Monday, October 6, 2014

ACADM gene and its disease

My pet gene for this semester is ACADM. It is located at 1:76,190,042-76,229,354 (39,313 base pairs (bp)) and contains 12 exons (1263 bp), which translate to a 421 amino acid (45 kilodalton) protein. Only 1263 bp are exonic; adding 6 bp per exon (3 on each side) for splicing bases gives a total of 1335 bp (= 1263 + (6*12)), which means that only a sparse 3.4% (1335 / 39,313 ≈ 0.034) of the bases inside of this genomic region actually 'do something' to make the protein.
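
The arithmetic behind those numbers, written out as a short Python check (variable names are my own; the inputs are the figures quoted in this post):

```python
start, end = 76_190_042, 76_229_354   # locus coordinates on chromosome 1
locus_bp = end - start + 1            # inclusive length of the locus

exonic_bp = 1263
exons = 12
splice_bp = 6 * exons                 # 3 bp on each side of every exon
functional_bp = exonic_bp + splice_bp

fraction = functional_bp / locus_bp
amino_acids = exonic_bp // 3          # one codon per amino acid

print(locus_bp, functional_bp, f"{fraction:.1%}", amino_acids)
```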

The gene encodes a dehydrogenase enzyme that degrades medium-chain fatty acids. Mutations resulting in a deficiency of the enzyme cause the, cleverly named, disorder Medium-chain acyl-coenzyme A dehydrogenase deficiency. Interestingly, I could find no reports of overproduction of the enzyme. The deficiency results in an "intolerance to prolonged fasting, recurrent episodes of hypoglycemic coma with medium-chain dicarboxylic aciduria, impaired ketogenesis, and low plasma and tissue carnitine levels. The disorder may be severe, and even fatal, in young patients" (Matsubara et al., 1986). For the assignment we will be modeling the variation for allele .0001 MCAD DEFICIENCY LYS304GLU created by the genomic variation dbSNP:rs77931234. This mutation may also be known as LYS329GLU (K329E) because the protein sequence itself is a precursor protein. In fact, it will be the K at 329 that we will need to change to an E in the sequence we use.

We will start by modeling the secondary structure of the reference protein sequence NP_000007.1, then the amino acid change LYS304GLU will be inserted where appropriate and we will again model the secondary structure and compare the differences. The rs77931234 mutation occurs towards the beginning of exon 11 as seen in these UCSC screen caps (with rs77931234 highlighted in black).

The protein sequence was loaded into DNASTAR's EditSeq application and the Lysine (K) at position 304 of the mature protein (position 329 in the precursor sequence, highlighted in black below) was changed to a Glutamic Acid (E).
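
The same edit can be sketched in Python with a guard that catches the 304 versus 329 numbering mix-up.  The peptide below is a made-up toy sequence, not NP_000007.1, and the function name is my own:

```python
# Offset between mature-protein and precursor numbering, taken from the two
# names of the variant (LYS304GLU vs LYS329GLU).
LEADER_LENGTH = 329 - 304


def substitute(protein, position, expected, replacement):
    """Swap the residue at a 1-based position after checking what is there now."""
    idx = position - 1
    if protein[idx] != expected:
        raise ValueError(
            f"expected {expected} at position {position}, found {protein[idx]}"
        )
    return protein[:idx] + replacement + protein[idx + 1:]


toy_peptide = "MAAKLV"                        # hypothetical sequence
mutated = substitute(toy_peptide, 4, "K", "E")
print(mutated, LEADER_LENGTH)
```

If the wrong numbering scheme is used on the real sequence, the residue check will most likely fail and raise instead of silently editing the wrong position.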

Here is a quick screen shot of the sequence alignment following the sequence change.

It isn't particularly surprising that this single amino acid change has a significant impact on the protein function; Lysine is a strongly basic amino acid and Glutamic acid, the substitution, is an acidic amino acid.  The first step in comparing the two sequences is to load them into SeqBuilder and generate some statistics. To do this, the protein sequence was loaded into SeqBuilder and the entire protein sequence was selected; then, by opening the "sequence" menu and clicking on "Statistics", we can determine the Isoelectric Point and the Charge of the protein at a pH of 7.0. For the reference sequence the Isoelectric Point is 8.369 and the Charge is 5.546.  Following the substitution of the Lysine with the Glutamic Acid, this changes to an Isoelectric Point of 8.055 and a Charge at pH 7.0 of 3.550. Using these simple stats we can see that the mutation would have an impact on the enzyme at a basic biochemical level. Unfortunately these high level statistics were the only metric I found that differentiated between the sequences.  Below are the sequences as they appear when loaded into the Protean tool in DNASTAR.  The black bar highlights the location of the mutation in each view.  The first is the reference sequence, the second is the mutated sequence.
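
The size of the charge shift is about what a single K to E swap should give.  Here is a minimal Henderson-Hasselbalch sketch in Python; the side-chain pKa values (10.5 for Lysine, 4.25 for Glutamic acid) are textbook-style assumptions and are not necessarily the values DNASTAR uses:

```python
PKA_LYS = 10.5   # assumed side-chain pKa for Lysine
PKA_GLU = 4.25   # assumed side-chain pKa for Glutamic acid


def basic_side_chain_charge(pka, ph):
    """Average charge of a basic group (+1 when protonated)."""
    return 1.0 / (1.0 + 10 ** (ph - pka))


def acidic_side_chain_charge(pka, ph):
    """Average charge of an acidic group (-1 when deprotonated)."""
    return -1.0 / (1.0 + 10 ** (pka - ph))


ph = 7.0
delta = acidic_side_chain_charge(PKA_GLU, ph) - basic_side_chain_charge(PKA_LYS, ph)
reported_delta = 3.550 - 5.546   # change in Charge reported by SeqBuilder

print(round(delta, 3), round(reported_delta, 3))
```

Losing one positive side chain and gaining one negative side chain moves the net charge by almost exactly 2 units, in line with the drop from 5.546 to 3.550.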

NORMAL

MUTATED(LYS329GLU)

Using the algorithms available in the Protean tool suite, no detectable impact on the protein's structure was observed.  The high probability of the region of the mutation being part of an Alpha structure in the reference was not impacted by the presence of the mutation.  The Kyte-Doolittle Hydrophobicity plot (and probability) was unchanged by the mutation as well.  Because of the change in the charge of the overall protein, a further inspection of the hydrophobicity using the Kyte-Doolittle algorithm and plots was carried out with the following plots observed, again with the region of change highlighted in black, reference on top and variation on the bottom.
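
One way to see why the hydropathy plot barely moves is to compute it directly.  The Python sketch below uses the published Kyte-Doolittle scale with a 9 residue window (the window size is my assumption, not necessarily the Protean default) on a made-up peptide rather than the ACADM sequence:

```python
# Kyte-Doolittle hydropathy values per residue.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy(seq, window=9):
    """Sliding-window mean hydropathy, one value per full window."""
    scores = [KYTE_DOOLITTLE[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


toy_reference = "AGSTLVKALGSTAVL"                  # hypothetical peptide with one K
toy_mutant = toy_reference[:6] + "E" + toy_reference[7:]

profile_ref = hydropathy(toy_reference)
profile_mut = hydropathy(toy_mutant)
max_shift = max(abs(m - r) for r, m in zip(profile_ref, profile_mut))
print(round(max_shift, 3))
```

Lysine (-3.9) and Glutamic acid (-3.5) are both strongly hydrophilic on this scale, so the swap changes one residue by 0.4 and each window average by about 0.04, far too small to see on a plot.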

Again no appreciable change was observed.  The Chou-Fasman algorithm was used to inspect the region as well, with the results seen here:
While subtle probability shifts appear at adjacent locations, no major structural change can be detected using these algorithms.

While clinical evidence has repeatedly observed this mutation in the presence of low enzymatic activity and pathologic phenotype, none of the tools or algorithms were able to detect a major change in the protein structure caused by this variant. 

There are many other possibilities as to how this mutation could impact the function of this protein. As mentioned previously, this protein sequence is actually a precursor protein and requires further editing and manipulation.  This amino acid change could impact that reaction by blocking the catalytic site, meaning the protein is never able to take on its fully functional form.  It is also possible that the way the protein binds to the fat molecule itself may be altered just enough to prevent the catalytic activity of the enzyme.

As I learn to use more of the DNAstar tools I will continue to evaluate this mutation and attempt to untangle its impact.


Monday, September 1, 2014

First Post

This post is intended to test the interface for generating posts as well as the layout for when posts appear. It is also a test of my ability to insert raw html into the page! While I started this blog in response to a class, this has been something I have been interested in doing for a while. My undergrad in clinical lab science and masters in bioinformatics taught me a lot of technical skills; however, the ability to communicate my ideas, outside of formal science writing, has been left wanting. I am excited to use this as a forum for discussion and debate about a variety of topics as well as to respond to posts around the web. My spelling and grammar are always a work in progress, so I will apologize in advance. Thanks for reading and please join in the conversation.