Posts

Custom Exomiser reports

Users can now run Exomiser using parameters of their choosing. For example, users can set the allele frequency, variant type, inheritance mode, and model organisms. To run Exomiser, choose Reports -> Exomiser from the menu. Results will be emailed to you when the analysis completes.

gnomAD annotated search: De novo & Mendelian inheritance filters

Similar to the pipeline annotated search, the gnomAD annotated search now supports de novo and Mendelian inheritance filters.

v2.0 and v2.1 dropped from CPI28 release

As of Jan 29, 2020 (CPI28 release), datasets generated from pipeline versions v2.0 and v2.1 have been dropped from the database in favor of the newer pipeline versions v2.3 and v2.38. Those who would still like to access the older pipeline versions can do so on a request basis, or can request to have their data reanalyzed using newer versions of the pipeline.

Truncated alignment (BAM) files for offline viewing

We've introduced a more efficient way of downloading the alignment files without having to download the entire BAM by focusing on the variant of interest. We've added a packaged download of all the needed files (VCF, BAM, indexes) for offline viewing. The original BAM is truncated to include 1000 bps before the start coordinate and 1000 bps after the end coordinate of the variant. These files can be found in the 'Datasets' section under the heading of "Variant alignment files". Rather than waiting for hours/days, you can download in seconds/minutes.
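
The padding rule described above can be sketched in a few lines of Python. This is only an illustration, not the code used by the database: the function names, the 1-based coordinates, and the clamping at position 1 are assumptions.

```python
def truncation_window(start: int, end: int, padding: int = 1000) -> tuple[int, int]:
    """Return the region kept around a variant: `padding` bases on each side.

    Coordinates are assumed to be 1-based, so the lower bound is clamped at 1.
    """
    if end < start:
        raise ValueError("end must not be smaller than start")
    return max(1, start - padding), end + padding


def region_string(chrom: str, start: int, end: int, padding: int = 1000) -> str:
    """Format the padded window as a 'chrom:start-end' region string."""
    lo, hi = truncation_window(start, end, padding)
    return f"{chrom}:{lo}-{hi}"


if __name__ == "__main__":
    # Hypothetical variant spanning chr1:5000-5010
    print(region_string("chr1", 5000, 5010))
```

A region string in this form is the kind of input an alignment tool would need to extract only the reads around the variant.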

gnomAD annotated search

We've introduced an exciting new way to browse and filter variants using annotations provided by gnomAD for exomes. The existing search, which is based on the pipeline-provided annotations, is unchanged and can still be used in the usual way. The new search looks at the same data, but presents it in a different way. There are several differences between the 2 search mechanisms. The primary difference is that the pipeline annotated search is never updated and represents the annotations that were current at the time the data were analyzed. On the other hand, the gnomAD annotated search is updated quarterly with every new release of gnomAD across all previously analyzed patients. By using the latest frequencies and gene names, we can go back and review cases that were previously unsolved. There are several other differences that should be noted: The pipeline annotated search displays ONE variant linked to ONE transcript. With the gnomAD annotated search, we are displaying transcripts …

Exomiser analysis available

What is Exomiser? Taken from the website: "The Exomiser is a tool that finds potential disease-causing variants from whole-exome/genome data. Starting from a VCF file and a set of phenotypes encoded using the Human Phenotype Ontology (HPO) it will annotate, filter and prioritise likely causative variants. The program does this based on user-defined criteria such as a variant's predicted pathogenicity, frequency of occurrence in a population and also how closely the given phenotype matches the known phenotype of diseased genes from human and model organism data." Our tests indicate that Exomiser ranks the correct gene in the top spot (#1) 52% of the time, in the top 5 78% of the time, and in the top 10 91% of the time. How to use? Download the attachment, unzip/decompress it, and open the file ending with HTML or TSV. The genes are listed from highest ranking downwards. Currently we use some default settings in Exomiser, but it can be re-analy…

GRCh38 assembly now supported

We have finally made the switch to the latest GRCh38 assembly for bioinformatics analysis. The search now includes the option of choosing assembly versions GRCh38 and the older GRCh37. This means users can still browse our database for both GRCh37 and GRCh38 assemblies. We have done some comparisons between the 2 assemblies for some previously analyzed samples and have noticed differences in pathogenicity scores and inclusion/exclusion of variants.

Data storage and compression using CRAM

As our database expands in scale, managing and storing large genomic sequence data has become a challenge, particularly with the large BAM files. As we head towards cheaper sequencing costs, we are anticipating a tsunami of data as researchers switch from exome to whole genome sequencing. In preparation, we've taken early steps of further compressing our BAMs using CRAM compression (lossless). Our database system manages its disk space autonomously: if our allocated disk space reaches a threshold of 80%, it will automatically convert the oldest BAMs to CRAMs and archive them to tape storage, making way for newer datasets. Our testing has shown that the CRAM compression format saves roughly 30% and will provide significant cost savings. Users who wish to have access to an archived BAM can click a button in our web interface and the system will automatically restore the CRAM from tape and convert it back to BAM.
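
The threshold rule can be illustrated with a small Python sketch. It only shows the selection logic. The `BamFile` record, the `select_for_archive` name, and the choice to keep archiving until usage falls back below the threshold are assumptions, since the post does not say how many files are archived per run.

```python
from dataclasses import dataclass


@dataclass
class BamFile:
    name: str
    size_gb: float
    age_days: int


def select_for_archive(files: list[BamFile], capacity_gb: float,
                       threshold: float = 0.80) -> list[BamFile]:
    """Pick the oldest BAMs to archive while disk usage is at or above the threshold."""
    used = sum(f.size_gb for f in files)
    selected = []
    # Oldest files first
    for f in sorted(files, key=lambda f: f.age_days, reverse=True):
        if used / capacity_gb < threshold:
            break
        selected.append(f)
        used -= f.size_gb  # assumed: archiving to tape frees the file's disk space
    return selected


if __name__ == "__main__":
    files = [
        BamFile("a.bam", 30, 400),
        BamFile("b.bam", 30, 200),
        BamFile("c.bam", 25, 10),
    ]
    for f in select_for_archive(files, capacity_gb=100):
        print("archive", f.name)
```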

Search filter: Clinical diagnosis and provisional variants

We've added 2 new search filters for Clinical diagnosis and Provisional variants. By combining these 2 filters, we can look for all patients that fall under the same disease category and look for provisional variants they may have in common. This is useful in a research context where we may be able to find a pattern of variants that may be influential in disease pathogenesis. Both the Clinical diagnosis and the Provisional variant information are pulled in from our related Patient Database. Therefore, for these filters to be useful, the values must first be specified in the Patient Database prior to use in the search.

Predictive relatedness, sex and ancestry reports

We've recently added a new step in our pipeline to use the genetic data to predict the degree of relatedness, sex and ancestry. This is particularly useful as a quality check to spot potential sample mix-ups, poor DNA quality, contamination, errors in the patient details provided, etc. In the event of a possible error, users are automatically notified by email, with the reports attached for further investigation. We are currently running the reports retrospectively for all of our previous datasets and have already found some data entry errors. In such cases, we may want to rerun the pipeline analysis, as such errors can affect the variant prioritization. These reports are also available as downloads in the 'Datasets' section.

Prediction filtering can be separated by logical OR/AND

Previously, when combining filters on predictions and scores, the search joined the filters with a logical AND by default. Users can now specify the logical operator (AND/OR) between the prediction and score filters. For example, you can ask for all the variants that have a PolyPhen prediction of 'probably damaging' OR a ClinVar prediction of 'pathogenic' in a single query. Previously, this required running a separate search for each of PolyPhen and ClinVar.
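
To make the difference between the two operators concrete, here is a minimal Python sketch. It is not the search implementation: the `combine` helper, the dictionary keys, and the sample variants are all made up for illustration.

```python
from typing import Callable, Iterable

Variant = dict
Predicate = Callable[[Variant], bool]


def combine(predicates: Iterable[Predicate], operator: str = "AND") -> Predicate:
    """Join several filter predicates with a logical AND or OR."""
    preds = tuple(predicates)
    op = operator.upper()
    if op == "AND":
        return lambda v: all(p(v) for p in preds)
    if op == "OR":
        return lambda v: any(p(v) for p in preds)
    raise ValueError(f"unsupported operator: {operator}")


# Hypothetical prediction filters
polyphen_damaging = lambda v: v["polyphen"] == "probably damaging"
clinvar_pathogenic = lambda v: v["clinvar"] == "pathogenic"

# Hypothetical variants
variants = [
    {"id": "v1", "polyphen": "probably damaging", "clinvar": "benign"},
    {"id": "v2", "polyphen": "benign", "clinvar": "pathogenic"},
    {"id": "v3", "polyphen": "probably damaging", "clinvar": "pathogenic"},
    {"id": "v4", "polyphen": "benign", "clinvar": "benign"},
]

either = combine([polyphen_damaging, clinvar_pathogenic], "OR")
both = combine([polyphen_damaging, clinvar_pathogenic], "AND")

or_ids = [v["id"] for v in variants if either(v)]
and_ids = [v["id"] for v in variants if both(v)]
```

With OR, a variant is returned if it satisfies either filter. With AND, it must satisfy both, which is what the search did by default before this change.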

Search profiles - New filters

Previously, only gene lists were supported in the Search Profile feature. We've now added support for storing lists of mutation types (exon, splice, missense, nonsense), ExAC frequency, and gnomAD frequency.

Improved structural variation prioritisation

Matt Field has made some significant improvements to the prioritisation of structural variants (SV), and we've updated our database to reflect those changes. They include a combined report for both SV callers, prioritisation of SVs where exons are most likely to be impacted, a max length filter applied to most SV types, and a flag for whether the event is novel or known. These changes dramatically reduced the number of high priority SVs from >3000 to around 90, with 449 medium priority SVs. Please note that we do not retrospectively re-analyse and update the SV reports for any of the previous records. This only affects new data.

Handling control samples

We don't necessarily want to see variants from our control samples in the database, but at the same time we still want to be able to download the VCFs and do SNP validation to ensure we don't have sample mix ups. We've created a separate page of control samples and their corresponding VCF for download.

Automated archiving of BAM files

Keeping BAM files available for download is a real challenge, and we face constant pressure to free up disk space as more projects come onboard. We've come up with a way to automatically archive BAM files that are older than 1 year to tape storage without any human intervention.

Health reports: GWAS

In addition to Clinvar and Snpedia, we've recently added the GWAS Catalog to the health reports, based on the rsNumbers for a patient. GWAS is particularly useful in a research context: it compares variant frequencies in the affected population against a control (healthy) population, using statistical analysis to establish a hypothetical link between variants and disease traits. In the health report under GWAS, we've added the following columns: disease traits, studies, risk allele, initial sample size, replication sample size, p-value and risk allele frequency. False positives (a false association between variant and disease) are not uncommon in GWAS due to uncontrolled biases, so it's important to consider whether any replication studies were done to give more confidence in the hypothesized association.

Health reports: Clinvar & Snpedia

We've added a new feature where users can generate health reports, downloadable in an Excel format, from multiple data sources including Clinvar and Snpedia, based on the patient's rsNumbers/variants and genotype. The health reports indicate the patient's risk factor associated with a particular disease/trait. A report can take up to 20 mins to generate, and an email is sent with the health report attached. Magnitude is a subjective measure of interest ranging between 0-10. The higher the number, the more significant. A magnitude score of 2 or higher is probably worth investigating. A magnitude score of 4 or higher is definitely worth investigating. More info at: https://www.snpedia.com/index.php/Magnitude.

Excluding variants from search based on Patient study codes

When doing our own variant analysis, we often seek variants that are shared between affected individuals, and we already provide this capability using the 'shared' filter. We recently added a new filter to take this search one step further by removing variants found in the unaffected individuals (usually from the same family). There is a new textbox called 'Exclude variants' where users can add patient study codes to exclude the variants found in these individuals from the variants found in the other individuals in a single search operation. Keep in mind that each person carries thousands of variants, so filtering in this way can be quite slow if no other filters are applied. We recommend applying as many filters as possible to narrow the search before using this functionality.
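
Conceptually, combining the 'shared' filter with 'Exclude variants' is a set intersection followed by a set difference. The Python sketch below illustrates that idea only. The function name, the study codes, and the variant identifiers are hypothetical, and the real search runs against the database rather than in-memory sets.

```python
def shared_minus_excluded(variants_by_patient: dict[str, set[str]],
                          included: list[str],
                          excluded: list[str]) -> set[str]:
    """Variants shared by all `included` patients and absent from every `excluded` patient."""
    if not included:
        raise ValueError("at least one included patient is required")
    # Intersection: variants carried by every included (affected) individual
    shared = set.intersection(*(set(variants_by_patient[p]) for p in included))
    # Difference: drop anything seen in an excluded (unaffected) individual
    for p in excluded:
        shared -= set(variants_by_patient[p])
    return shared


# Hypothetical study codes and variant identifiers
family = {
    "A1": {"v1", "v2", "v3"},   # affected
    "A2": {"v2", "v3", "v4"},   # affected
    "U1": {"v3", "v5"},         # unaffected
}

candidates = shared_minus_excluded(family, included=["A1", "A2"], excluded=["U1"])
```

Here `v2` and `v3` are shared by both affected individuals, but `v3` is removed because the unaffected relative also carries it.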

GnomAD ethnic frequencies exportable

We've added a new option for users to export gnomAD ethnic frequencies to Excel, which includes South Asian, East Asian, African American, Jewish, non-Finnish European, Finnish and other minor allele frequencies (MAF). It's optional because we don't actually store the gnomAD frequencies in our database and have to fetch them from elsewhere, which makes the export slower, especially when exporting thousands of variants. It's best to filter as much as you can before enabling this option.

Affected statuses: Database vs Pipeline

We recently introduced a new filter called 'Pipeline affected status'. This is not to be confused with the other 'Affected status' or 'Disease status' filter, which is taken from our Patient Database. The 'Pipeline affected status' differs in that you can reconfigure the pipeline to use a different affected status from what is set in the database to produce different cohort reports. This is useful in cases where the affected status applies to multiple phenotypes or diagnoses and you want to do repeated cohort analysis under different conditions.

Phenotype to Genotype based variant searching

Expand your variant search based on known phenotype-genotype relationships. This filter only works if you have specified patient ids in the filters. The phenotypes collected from the specified patients are used to query OMIM for gene relationships. A new tab called 'Phenotype-Genotype' is displayed in the results showing the relationships between phenotypes and genes. This only works well for patients that have a good number of phenotypes captured in our databases.

RS number filter

Users can now search by rsNumbers in our search filters.

Variants from cohort reports are now included

Previously, only the variants from the SNV, INDEL and SV reports were included in our database. We've recently rebuilt our database to include all variants found in the cohort report, even the questionable ones of poor quality, because there are some suggestions of an inheritance pattern discovered during the pipeline pedigree analysis. This means more variants for you to browse than there were before.

Gene interactions - Genes don't work in isolation, and your gene lists shouldn't either

Genes don't work in isolation, and your gene lists shouldn't either. Researchers will often have a list of known genes to look for when prioritizing variants based on the patient's clinical diagnosis. But what should you do if no candidate variants can be found solely based on your gene list? There are many approaches, but one option is to expand the gene list based on known gene interactions and pathways. We rely on the highly curated database called BioGRID to expand the gene list to include the network of genes known to interact either directly or through protein-to-protein interactions. To use this new feature, tick the new 'Gene interactions' checkbox to expand your gene-based search in this way.
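
The expansion idea can be sketched in Python as a one-hop lookup over interaction pairs. This is a simplified illustration, not the actual BioGRID integration: the pair list, the placeholder gene names, and the restriction to direct neighbours are assumptions.

```python
from collections import defaultdict


def build_neighbors(pairs: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Index undirected interaction pairs by gene."""
    neighbors: dict[str, set[str]] = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def expand_gene_list(genes: list[str], pairs: list[tuple[str, str]]) -> set[str]:
    """Return the original genes plus every gene that directly interacts with one of them."""
    neighbors = build_neighbors(pairs)
    expanded = set(genes)
    for gene in genes:
        expanded |= neighbors.get(gene, set())
    return expanded


# Placeholder interaction pairs (not real BioGRID records)
interactions = [
    ("GENE_A", "GENE_B"),
    ("GENE_B", "GENE_C"),
    ("GENE_D", "GENE_E"),
]

expanded = expand_gene_list(["GENE_A"], interactions)
```

Starting from `GENE_A`, the expanded list gains `GENE_B` but not `GENE_C`, because this sketch only follows one interaction step.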

Search profiles

Users can now create their own search profiles as a way of storing commonly used search filters, without having to choose the same options over and over again. One example is to include your gene lists in a search profile. The search profiles are associated with the user only and are not shared.

BAI - BAM index files downloadable

The BAM index files, known as BAI files, are now available for download along with the BAM file. This is particularly useful when using IGV on your desktop.

New search filters: gnomAD frequency and INDEL ExAC frequencies

The bioinformatics pipeline has been updated to include gnomAD frequencies and INDEL ExAC frequencies. Any new data generated from Dec 2017 onwards will have these new fields. However, none of the previously analyzed datasets will have them. They will have to be reanalyzed if you want these new fields populated. To go along with these new fields, we've added the new gnomAD frequency filter to our search page.

Exon coverage search

The sequencing and alignment process isn't perfect and often there are regions of poor coverage as a result of the pipeline analysis. Previously we made the coverage reports available for download as part of our datasets as 'exonReports'. We've taken it a step further by allowing users to search through these coverage reports based on gene, patient ID and coverage type (NO_COVERAGE, POOR_COVERAGE, PARTIAL_COVERAGE). To use this new feature, in the menus, choose 'Search exon coverage'. Furthermore, we added a new tab to display exon coverage to go along with the variant search results. The tab will only have results if users search by Patient ID and Gene. This way users can browse variants and the coverage results side-by-side providing a broader view over the quality of the variants being presented. In particular, this will be useful for difficult to diagnose patients for which no causal variants have been identified, where potentially disease-causing variants …

CACPIC Frequencies exportable

Chinese frequencies calculated from our healthy Chinese controls are now exportable to Excel as an optional column. We've made it optional because these frequencies are calculated at runtime during the export process and can delay the completion of the export. If you are not interested in the CACPIC frequencies, leave the checkbox unticked.

Supplementary information available for download

As part of our datasets for download, we've added some supplementary information (generated as TXT files by the pipeline) to go along with the VCF and BAM files. The files contain information such as the cutoffs used to qualify variants as a PASS, which include things like read depth cutoffs, median quality score cutoffs and so on. Furthermore, the files also contain general statistics about total variants passed, the proportion of variants that are exonic or at splice sites, the number of distinct genes, and averages of read depth and median quality. Another file called the 'readReport.summary' includes information about how many reads were paired, mispaired, aligned, and unaligned. The supplementary files can be found under the 'Datasets' section.

Measuring Variant Conservation with GERP Score and Siphy

The more annotations, the better! We've recently added 2 new annotations to assist with variant prioritisation as a measure of variant conservation and these are GERP scores and Siphy. GERP stands for Genomic Evolutionary Rate Profiling. Conceptually, GERP is a method for the identification of slowly evolving regions in a multiple sequence alignment, defined as ‘constrained elements’. "Constrained elements are identified by comparing the observed to the expected rates of evolution for each window, and defining all those regions whose collective observed rates of evolution are significantly lower than would be expected under a null model." More simply, it is a score used to calculate the conservation of each nucleotide in multi-species alignment with ranges from -12.3 to 6.17, with 6.17 being the most conserved. Positive scores (observed fewer than expected) indicate that a site is under evolutionary constraint. Negative scores may be weak evidence of accelerated rates o…

Allele Frequency filtering

The pipeline does the best it can in assigning variants a zygosity based on the allele frequencies and counts. Usually the cutoff is around 90%. However, below this threshold the zygosity becomes less clear. Hence we now allow users to filter by allele frequency, which is particularly useful in cases where zygosity is not clear. Users can filter on the VARIABLE allele frequency as well as the REFERENCE allele frequency.
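
As an illustration of this kind of filter, the Python sketch below derives the two allele frequencies from read counts and keeps calls that fall inside a chosen range. It assumes the frequencies are simple fractions of supporting reads. The field names, function names and sample counts are hypothetical, and the pipeline's own calculation may differ.

```python
def allele_frequencies(ref_count: int, alt_count: int) -> tuple[float, float]:
    """Return (reference allele frequency, variant allele frequency) from read counts."""
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("no reads cover this position")
    return ref_count / total, alt_count / total


def filter_by_variant_af(calls: list[dict], min_af: float, max_af: float) -> list[dict]:
    """Keep calls whose variant allele frequency lies within [min_af, max_af]."""
    kept = []
    for call in calls:
        _, alt_af = allele_frequencies(call["ref_count"], call["alt_count"])
        if min_af <= alt_af <= max_af:
            kept.append(call)
    return kept


# Hypothetical calls: v1 is clearly above a ~90% cutoff, v3 sits in the unclear zone
calls = [
    {"id": "v1", "ref_count": 1, "alt_count": 19},
    {"id": "v2", "ref_count": 10, "alt_count": 10},
    {"id": "v3", "ref_count": 5, "alt_count": 15},
]

ambiguous = [c["id"] for c in filter_by_variant_af(calls, 0.6, 0.9)]
```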

gnomAD frequencies based on ethnicity

Previously, the gnomAD frequency shown reflected the European frequency. We now display the gnomAD frequency from all ethnicities under the 'Latest annotations' tab.

Gene synonym searching

Previously, searching by gene was based on an exact match of the provided gene name, without considering how gene names evolve over time. Genes are often given synonyms, or old gene names are replaced with new ones. With this change, users have the option to expand gene name searching to include synonyms. The expanded list of synonyms is shown on the results page. Our testing has shown this to greatly affect the results returned.

CACPI control frequency calculation correction

Previously, there was an error in how the CACPI control frequency was calculated. This has now been corrected.

Provisional Variants

Users can now save variants against a particular individual as provisional variants to be ranked and prioritised. Once added to the patient, these provisional variants can be used to nominate suspected variants for Sanger sequencing confirmation. When confirmed, the variant can then move on to the next stage and be marked as the 'Genetic Diagnosis'.

Shared variants as a percentage

Previously, when filtering for shared variants between individuals, the variants returned were always present in 100% of the specified individuals. This has now been changed to allow users to specify the sharing of variants between individuals AS A PERCENTAGE. For example, if users want to know all variants shared between 2 out of 3 individuals, they can use a percentage value of 66%. The results returned will be variants shared by AT LEAST 2 out of 3 individuals. This is especially useful in large cohorts of unrelated individuals known to share similar phenotypes, where there may be a suspicion of multifactorial gene expression or other complex gene interactions.
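
The percentage rule can be sketched in Python as a carrier count compared against a minimum share. This is an illustration only: the function name, patient codes and variant identifiers are hypothetical, and it assumes the comparison is "at least the given percentage".

```python
from collections import Counter


def shared_variants(variants_by_patient: dict[str, set[str]], min_percent: float) -> set[str]:
    """Variants carried by at least `min_percent` percent of the given patients."""
    n = len(variants_by_patient)
    if n == 0:
        return set()
    # Count how many patients carry each variant
    carriers = Counter(v for vs in variants_by_patient.values() for v in set(vs))
    return {v for v, count in carriers.items() if 100 * count / n >= min_percent}


# Hypothetical cohort of 3 individuals
cohort = {
    "P1": {"v1", "v2"},
    "P2": {"v1", "v3"},
    "P3": {"v1", "v2"},
}

at_least_two_of_three = shared_variants(cohort, 66)
in_everyone = shared_variants(cohort, 100)
```

With 3 individuals, 2 carriers correspond to about 66.7%, so a threshold of 66% keeps variants found in 2 or 3 of them, while 100% reproduces the previous behaviour.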

Zygosity exportable

The zygosity status assigned to each variant is now exported to Excel.

Login Changes: Australian Access Federation integration

We've recently changed our authentication to use the Australian Access Federation (AAF). The main benefit is that users from other universities across Australia can use their own institutional credentials to access our database, without having to create a new username/password. However, we still support generic usernames and passwords for our friends overseas. The other benefit of making this switch is a better user experience in terms of data integration from our ecosystem of databases via single sign-on. We can seamlessly pull information from various places to provide an aggregated view of data. Australian users should use the [Login via AAF] option, while others should use [Basic Login].

IGV viewer for whole family

Our first release of the IGV viewer only allowed a single individual to be displayed at a time, with a single VCF and single BAM file loaded up. With the latest release, IGV viewer now loads all family members.

SNP Validation

The process of sequencing and analysis goes through a series of steps with much human involvement, and is therefore prone to error. In order to verify that a patient does indeed have a particular variant, we can go through a step of SNP Validation for a batch of patients. We've created a tool that can be invoked through our web portal to determine the smallest combination of SNPs required to uniquely identify a patient within a batch for a predefined SNP panel. This process requires that the BAM file be available and can take a while, depending on the size of the batch.

Chinese frequencies

Displays the frequency of a variant in the Chinese population using our own Chinese healthy controls database. Currently only SNVs are supported. INDELs and structural variants are UNSUPPORTED (at this stage) and will show 0%.

Coverage reports available

The sequencing and variant calling process can sometimes be selective in the regions covered in the genome. Therefore it's also important to know about what regions were not well covered to consider false negatives. These coverage reports are now included as part of the 'summary report' downloads. Look for the file names with 'exonReport'.

On-demand recent annotations

At the push of a button, users can retrieve the most recent annotations for a variant from Ensembl VEP. This supplements the annotations provided by the pipeline, which may not be completely up to date with regard to their source of information. Some of the new annotations include Clinical significance, PubMed IDs and links, Mutation Taster predictions, rsNumbers and much more. Because of their dynamic nature, these annotations are unfortunately NOT searchable, but are there to supplement the annotations from the pipeline.

Restore BAM files

BAM files are difficult to manage because of their large size. We've employed a new technique that conserves expensive disk space by pushing BAM files onto archived tape drives. However, that doesn't mean you can't retrieve them. We've devised a way to automatically restore archived BAM files on demand through the web portal. Simply click on the 'Restore from archive' link when viewing a variant. Users will get an automated email to notify them that the BAM file has been restored and is ready for download or IGV viewing.