


Project report: Hybrid genome assemblies of 26 strains (Escherichia sp. and Pseudomonas sp.)

To the attention of: First name LAST NAME, Company - CITY, Country

Analysis/Writing: First name LAST NAME, Bioinformatics Engineer

Corrections/Validation: First name LAST NAME, Ph.D., Operations Manager

Main results

Figure 1 - Summary of the quality of the results


Each category is scored out of 5: (Data) data quality and contamination; (Contiguity) number and size of contigs; (Completion) assembly completeness; (Correctness) assembly errors; (Annotation) annotation.


The goal of the project was to assemble the genomes of 26 strains of bacteria using high-throughput sequencing data from Oxford Nanopore and Illumina technologies.

Overall, the assembly metrics for your samples are good, with good accuracy and contiguity. For sample D46, the assembly is more fragmented.

Based on the completion metrics, the assemblies can be considered complete, although two of your samples, D22 and D24, have lower results.

Moreover, the total sizes of the assemblies are close to those of the reference genomes.

Project description

This report describes all the bioinformatics analyses performed following high-throughput sequencing with Oxford Nanopore and Illumina technologies. The objective was to produce de novo hybrid assemblies of 26 strains of Escherichia sp. and Pseudomonas sp., and to annotate the assembled genomes.

This process was carried out in the following stages:

  • First, the quality of the raw data was checked before and after the cleaning steps, which comprised detection and removal of adapters and trimming of low-quality bases.
  • After cleaning, the genomes of the 26 strains (Escherichia sp. and Pseudomonas sp.) were assembled.
  • The advantage of a hybrid approach combining short and long reads is that the assemblies can be corrected with polishing methods, which also corrects sequencing errors.
  • Then, the quality of these assemblies was evaluated with four categories of metrics: contiguity, correctness, completion and contamination.
  • Finally, an annotation was performed.
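
As an illustration of the contiguity category mentioned above (number and size of contigs), the sketch below computes a few common summary values from a list of contig lengths. N50 is used here as a widely known contiguity metric; this report does not state which exact metrics were computed, so the function names and the choice of N50 are assumptions made for illustration only.

```python
def n50(contig_lengths):
    """Return the N50: the length of the contig at which the cumulative
    size of the contigs, sorted from longest to shortest, reaches half
    of the total assembly size. Returns 0 for an empty list."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


def contiguity_summary(contig_lengths):
    """Summarise the number and size of contigs (hypothetical helper)."""
    return {
        "n_contigs": len(contig_lengths),
        "total_length": sum(contig_lengths),
        "largest_contig": max(contig_lengths, default=0),
        "n50": n50(contig_lengths),
    }


if __name__ == "__main__":
    # Made-up contig lengths, not taken from the project data.
    print(contiguity_summary([400, 300, 200, 100]))
```

A more fragmented assembly shows up as a larger number of contigs and a smaller N50 for a similar total length.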

Below is a representation of the key steps of the bioinformatics pipeline used to obtain the de novo assemblies.

Figure 2 - Key steps from the bioinformatics pipeline leading to the hybrid assembly and annotation of the strains


Overview of the data sent

Summary tables of results:

  • Appendix 1: Statistics on Illumina data (short reads)
  • Appendix 2: Statistics on Oxford Nanopore data (long reads)
  • Appendix 3: Information and statistics on final assemblies
  • Appendix 4: Taxonomic assignment of contigs
  • Appendix 5: Annotation information

For each of the samples:

  • FASTA sequences after assembly and correction with long and short reads.
  • The annotation results for each assembly and each of its contigs. Each folder contains annotation files in different standard formats: generic feature format (.gff), GenBank (.gbk), protein sequences (.faa), nucleotide sequences (.ffn) and a simplified tabulated version (.tsv).
  • A depth-of-coverage graph along each contig.
  • Read assignment report.

Below is an example of the structure of each sub-folder corresponding to the samples:

Sample/
├── Sample_assembly.fasta
├── Sample_coverage_plots
│   ├── 500kb.png
│   ├── contig_1.pdf
│   └── …
├── Sample_k2_report.txt
└── Sample_prokka
    ├── mygenome.faa
    ├── mygenome.ffn
    ├── mygenome.gbk
    ├── mygenome.gff
    └── mygenome.tsv
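
To check that a delivered sample folder follows the structure above, a short script along these lines could be used. It is only a sketch: it assumes that the "Sample" prefix in the example tree is replaced by the name of the sample folder (for instance D46), and the function name `missing_entries` is hypothetical.

```python
from pathlib import Path

# Top-level entries expected in each sample folder, taken from the example
# tree. Assumption: "Sample" is replaced by the name of the sample folder.
EXPECTED_ENTRIES = [
    "{sample}_assembly.fasta",
    "{sample}_coverage_plots",
    "{sample}_k2_report.txt",
    "{sample}_prokka",
]


def missing_entries(sample_dir):
    """Return the expected files or folders that are absent from sample_dir."""
    sample_dir = Path(sample_dir)
    names = [entry.format(sample=sample_dir.name) for entry in EXPECTED_ENTRIES]
    return [name for name in names if not (sample_dir / name).exists()]


if __name__ == "__main__":
    import sys

    # Usage: python check_delivery.py path/to/Sample [path/to/OtherSample ...]
    for folder in sys.argv[1:]:
        missing = missing_entries(folder)
        status = "complete" if not missing else "missing: " + ", ".join(missing)
        print(f"{folder}: {status}")
```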


Results

Quality control and data cleaning

Illumina Data

Figure 3

Cleaning the Illumina data leaves reads of excellent quality (>Q30), which gives the best possible basis for downstream analyses. Adapters are removed first, and the reads are then filtered on their quality.

Details of the data are available in Appendix 1.
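
To make the Q30 threshold concrete, the sketch below decodes a FASTQ quality string into Phred scores and applies a mean-quality filter. A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q30 means about 1 error in 1,000 bases. The sketch assumes the standard Phred+33 encoding of Illumina FASTQ files; the mean-quality criterion and the function names are illustrative assumptions, since the report does not specify the tool or the exact filtering rule that was applied.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 by default)."""
    return [ord(char) - offset for char in quality_string]


def mean_quality(quality_string, offset=33):
    """Arithmetic mean of the Phred scores of a read (0.0 for an empty read)."""
    scores = phred_scores(quality_string, offset)
    return sum(scores) / len(scores) if scores else 0.0


def passes_filter(quality_string, min_mean_q=30):
    """Keep a read only if its mean Phred score reaches the threshold.
    Assumption: a simple mean-quality rule, used here for illustration."""
    return mean_quality(quality_string) >= min_mean_q


if __name__ == "__main__":
    # Made-up quality strings: 'I' encodes Q40 and '#' encodes Q2.
    for qual in ["IIIIIIII", "IIII####"]:
        print(qual, mean_quality(qual), passes_filter(qual))
```

Real trimming tools usually combine several criteria (sliding windows, minimum length, adapter matching), so this shows the principle only.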