NATIONAL CANCER INSTITUTE - CANCER.GOV

FAQ-Bioinformatics

Frequently Asked Questions  

What types of bioinformatics services does the CCR-SF Bioinformatics group offer?
What analyses does the CCR-SF Bioinformatics group perform?
Sequencing depth and experimental design questions?
What is required to assure timely processing and delivery of my data?
What types of analysis workflows does the CCR-SF use to perform analyses?
What types of data formats will I receive from CCR-SF?
How do I analyze the data?
How large are the data delivery files?
How are the data files delivered?
How long is the data made available to download?
What is the yield per lane for different sequencing platforms?

Answers  

What types of bioinformatics services does the CCR-SF Bioinformatics group offer?

Here at CCR-SF, our mission is to provide the highest quality of sequencing data to our customers. We work closely with investigators to help get their NGS projects off the ground. The services we provide include:

  • Provide experimental design consultation including sequencing technology recommendation, library protocol consultation, sequencing coverage and cost estimate, etc.
  • Perform QC, secondary and tertiary data analysis for sequencing data from different platforms, including Illumina, PacBio, Oxford Nanopore, and BioNano.
  • Develop robust and reproducible analysis workflows/pipelines based on application types and sequencing technologies.
  • Support the adoption of new sequencing protocols and new technology development.
  • Provide training to customers for NGS technology and data analysis.

New Services:

  • Single Cell Analysis – supports whole transcriptome as well as 3’ and 5’ capture-based technologies such as 10x Genomics and BD Rhapsody scRNA-seq. Analysis support includes cell subpopulation identification and differential analysis across conditions, as well as single-cell genomic analysis and detection of epigenetic markers.
  • Structural Variation Detection and Genome Assembly – utilizes long-read technologies such as PacBio and Oxford Nanopore, or BioNano optical mapping technology, to detect genetic variations or rearrangements in the structure of chromosomes.
  • Full Length Transcriptome Analysis – utilizes PacBio Iso-Seq for discovery of full-length transcripts and novel splice variants, and direct RNA sequencing and analysis using Oxford Nanopore technology.

What analyses does the CCR-SF Bioinformatics group perform?

Currently we offer primary and secondary analyses for all NGS projects, including initial base-calling, demultiplexing, data quality control, and reference genome alignment of NGS reads. We also offer tertiary analyses on a limited basis for certain R&D projects, which may include de novo assembly, structural variant analysis, isoform detection, and single cell analysis. For all projects, we ensure that every sequencing run we deliver meets our QC standards for yield and base-call quality. Additional QC metrics are based on application-specific protocols we have established at CCR-SF.

Sequencing depth and experimental design questions?

Coverage requirements vary by application, library protocol, sequencing platform, and project-specific considerations. To provide the best approach for your project, a meeting is set up between you and representatives from our sequencing facility to make recommendations on sequencing platform, library protocol, and other needs.
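As a rough planning aid before that meeting, mean sequencing depth is commonly estimated as total sequenced bases divided by genome size. The sketch below is an illustration only, not a CCR-SF tool: the function names are hypothetical, and the estimate ignores duplicates, unmapped reads, and uneven coverage, so real requirements should still be confirmed with the facility.

```python
import math


def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Estimate mean depth as (reads * read length) / genome size.

    num_reads counts individual reads, so a paired-end fragment counts twice.
    """
    if genome_size <= 0 or read_length <= 0:
        raise ValueError("genome_size and read_length must be positive")
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads whose total bases reach the target mean depth."""
    if genome_size <= 0 or read_length <= 0:
        raise ValueError("genome_size and read_length must be positive")
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    # Assumed example values: a ~3.1 Gb genome and 150 bp reads.
    genome = 3_100_000_000
    print(reads_needed(30, 150, genome))
    print(expected_coverage(400_000_000, 150, genome))
```

The result is a lower bound on reads to request; applications such as RNA-seq or ChIP-seq are usually planned by read count per sample rather than by genome-wide depth.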

For assistance in planning your experiment or to discuss specifics of your project, please contact Bao Tran (tranb2@mail.nih.gov). For bioinformatics consultation, please contact Yongmei Zhao (yongmei.zhao@nih.gov). Please see the following web links for experimental design best practices:

ATAC-seq Best Practices: https://informatics.fas.harvard.edu/atac-seq-guidelines.html

ChIP-seq, RNA-seq, and Exome and Whole Genome-seq: https://ccbr.ccr.cancer.gov/project-support/experimental-design-best-practices/

Whole genome sequencing and Structural Variation Detection Best Practices: coming soon

Single cell RNA-seq: coming soon

What is required to assure timely processing and delivery of my data?

We recommend an initial consultation with the CCR-SF Bioinformatics group to discuss data analysis requirements and to establish expectations. It is also important to specify the reference genome version and annotation build for projects with human or mouse genome mapping requirements. For other reference-based sequencing projects, you will need to provide us with the reference sequences (FASTA file format or weblink).

If you have any questions regarding your preferred data processing options, please contact Yongmei Zhao (yongmei.zhao@nih.gov) or the SF Bioinformatics team via email at CCRSF_IFX@nih.gov.

We currently provide analyses based on sequencing application type. We have designed and implemented in-house data analysis pipelines that integrate platform/vendor specific data analysis tools with popular open-source software.

What types of analysis workflows does the CCR-SF use to perform analyses?

Currently available custom data analysis pipelines:

  • PacBio Long-read Sequencing
  • Single Cell Analysis

What types of data formats will I receive from CCR-SF?

For projects using the Illumina sequencing platform, you will receive a PDF report containing a summary of the sequencing project (i.e. library and sequencing protocols, sequencing result summary, application-based QC metrics, and software details) and an Excel file containing the detailed data analysis results. Depending on the application, you will also receive an HTML QC report file containing detailed QC statistics and plots for the analysis workflows included for that specific application. In addition, you will receive the pass-filtered raw sequence reads in FASTQ format and the reference alignment data in BAM format. BAM files contain base-call and quality score information for all pass-filtered reads, as well as alignment information for reads that have mapped to the reference genome. Additional application-specific data files are listed under the deliverable data file types below.
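After downloading, a quick sanity check of the FASTQ files is to count reads and bases and compare them with the delivered summary report. The following is a minimal sketch using only the Python standard library; it assumes the common four-line FASTQ record layout (header, sequence, "+", qualities) and the function names are hypothetical, not part of any CCR-SF pipeline.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def fastq_summary(path):
    """Return (number of reads, total bases) for a four-line-per-record FASTQ file."""
    n_reads = 0
    n_bases = 0
    with open_text(path) as handle:
        while True:
            header = handle.readline()
            if not header:          # end of file
                break
            if not header.strip():  # tolerate stray blank lines between records
                continue
            seq = handle.readline().rstrip("\r\n")
            plus = handle.readline()
            qual = handle.readline().rstrip("\r\n")
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError(f"malformed FASTQ record near read {n_reads + 1}")
            if len(seq) != len(qual):
                raise ValueError(f"sequence/quality length mismatch at read {n_reads + 1}")
            n_reads += 1
            n_bases += len(seq)
    return n_reads, n_bases
```

For example, calling fastq_summary on a delivered file (a placeholder name such as "sample_R1.fastq.gz") returns a tuple that can be checked against the read counts in the QC report. BAM files are binary and need a dedicated tool or library to inspect.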

For projects using the PacBio sequencing platform, the data delivery choice is driven by the specific needs of the project. For example, when circular consensus processing is performed, the raw subreads BAM file, run definition XML files, and the consensus reads (CCS) are included in the data delivery package. If alignment and variant calling are performed, the resulting data are provided as BAM and VCF files. Files containing the intermediate results of pipeline processing (such as the read-to-cluster mapping for Iso-Seq) are sometimes included as well. Beyond that, we are happy to deliver any of the files produced by our processing upon request. The content of the data delivery package should be discussed at project definition time.

For standard projects, the deliverable data file types are:

  • Sequencing FASTQ/FASTA files
  • Alignment BAM files or Assembly files
  • Data QC Statistics Reports
  • Mapping or Assembly Statistics

For projects with secondary and application specific analysis, the deliverable data file types are:

Exome-seq or WGS Structural Variants Discovery:

  • Raw FASTQ files
  • Alignment BAM files
  • SNP/Indel and SV variant call VCF files
  • Structural variant call BED file
  • Variant annotation files
  • QC and Variant analysis statistics reports
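To get a first overview of a delivered VCF file, the records can be tallied by variant class using the REF and ALT columns. The sketch below is a simplified illustration under stated assumptions: it uses only the standard library, classifies alleles by length and by symbolic or breakend notation, and the function names and the example file name are hypothetical. It is not a substitute for the variant statistics reports listed above.

```python
from collections import Counter


def classify_allele(ref: str, alt: str) -> str:
    """Classify one REF/ALT pair from a VCF record."""
    # Symbolic alleles such as <DEL> and breakend notation with brackets
    # are how structural variants are usually written in VCF.
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "symbolic_or_breakend"
    if len(ref) == 1 and len(alt) == 1:
        return "snv"
    if len(ref) == len(alt):
        return "mnv"
    return "indel"


def count_variants(lines) -> Counter:
    """Count ALT alleles by class from an iterable of VCF text lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, header, and blank lines
        fields = line.rstrip("\r\n").split("\t")
        ref, alts = fields[3], fields[4]  # fixed VCF columns: REF, ALT
        for alt in alts.split(","):
            if alt in (".", "*"):
                continue  # no alternate allele, or spanning-deletion placeholder
            counts[classify_allele(ref, alt)] += 1
    return counts


# Usage with a hypothetical compressed file:
#   import gzip
#   with gzip.open("sample.vcf.gz", "rt") as handle:
#       print(count_variants(handle))
```

Multi-allelic records contribute one count per ALT allele, so the totals can exceed the number of lines in the file.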

PacBio Iso-Seq:

  • Raw data: subreads BAM, CCS reads BAM or FASTQ files
  • QC and statistics reports: MultiQC report, SQANTI report and Kraken contamination check report files
  • Analysis data: high quality clustered isoforms, full length cDNAs, SQANTI filtered results including BAM files, GTF files, and classification table

PacBio De novo Assembly:

  • Raw data: subreads BAM, CCS reads BAM or FASTQ files
  • QC and statistics reports: MultiQC report, assembly report and Kraken contamination check
  • Analysis data: polished assembly contigs files

PacBio Long Amplicon Sequencing:

  • Raw data: subreads BAM, CCS reads BAM or FASTQ files
  • QC and statistics reports: MultiQC report, Kraken contamination check report files
  • Analysis data: Clustered long amplicon consensus, Phasing and variant analysis file

PacBio Structural Variant Sequencing Analysis:

  • Raw data: subreads BAM, CCS reads BAM or FASTQ files
  • QC and statistics reports: MultiQC report, Kraken contamination check report files
  • Analysis data: mapped BAM file, structural variant VCF file

PacBio HLA Genotyping:

  • Raw data: subreads BAM, CCS reads BAM or FASTQ files
  • QC and statistics reports: MultiQC report, Kraken contamination check report files
  • Analysis data: mapped BAM files, standard HLA typing report

Single Cell RNA:

  • Cell Ranger output
  • Seurat clustering
  • SingleR annotations
  • Nozzle Report

Single Cell ATAC:

  • Cell Ranger ATAC output
  • Signac clustering

Single Cell CNV:

  • Cell Ranger DNA output

How do I analyze the data?

The SF typically provides primary, secondary, and sometimes tertiary data analysis, which includes delivery of the FASTQ pass-filtered raw read files and alignment BAM files, gene quantification count files, or variant analysis VCF files to the customer. Investigators are expected to arrange their own downstream analyses not offered by the SF bioinformatics group. For investigators interested in performing their own bioinformatics in-house, there are several commercial software options from Illumina, PacBio, and third-party vendors. In addition, a large number of open-source NGS software tools are freely available from Biowulf and other online computing sources.

For investigators in need of assistance with downstream NGS data analyses, the CCR Collaborative Bioinformatics Resource (CCBR) provides expert bioinformatics data analysis for the Center for Cancer Research at the NCI free of charge. To contact the CCBR, please submit a request through the CCBR Project Submission Form at https://ccbr.ccr.cancer.gov/project-support/.

How large are the data delivery files?

Because NGS is still a rapidly evolving field, this answer changes regularly. Please contact the bioinformatics group for current data delivery file size information.

How are the data files delivered?

Please contact the bioinformatics group to discuss your options. The original sequence, alignment, and analysis files are available to download through the CBIIT DME system. To access your project data in the DME system, please email CCRSF_IFX@nih.gov to get your NIH account linked to the DME system. You will need to register an account for each lab member planning to log in. Please follow the DME tutorial to access and download your project data.

If you or your collaborator does not have an NIH account, we recommend that you register an account at GlobusFTP (https://www.globus.org/) in order to transfer data via the GlobusFTP site. Please see the following tutorial on registering an account and transferring data:
https://helix.nih.gov/Documentation/globus.html

If you have any issues setting up a Globus account or transferring data via the shared endpoint, please contact us at CCRSF_IFX@nih.gov.

How long is the data made available to download?

How long data files remain on the CBIIT DME system depends on the data life cycle defined by the data policy implemented at CBIIT. Currently, data are available online for 2 years after the initial project data generation. For data files uploaded to Globus, we make data available for up to 2 weeks starting from the date of our data delivery email announcement. It is the responsibility of the investigator, laboratory contact, or bioinformatics contact to ensure that they have retrieved their data promptly. To maintain sufficient data storage for upcoming projects, the Globus analysis files are then archived and stored for an additional four weeks.

If your data is no longer available for download, please contact the SF bioinformatics group and we can re-run the data processing and alignment as necessary. However, please note that it may take longer to receive the re-analyzed data due to resource conflicts with current production runs. Whenever possible, it is best to download the data in a timely manner after receipt of the delivery notice.

What is the yield per lane for different sequencing platforms?

Please refer to the Illumina sequencing platforms website for specifications: https://www.illumina.com/systems/sequencing-platforms.html

The actual sequencing performance parameters may vary based on sample type, sample quality, and clusters passing filter.

For further questions, please contact us directly at CCRSF_IFX@nih.gov.