Assembly of Sample FAM24227-i1-2

Contact: simone.oberhaensli@bioinformatics.unibe.ch

Institute: Interfaculty Bioinformatics Unit, University of Bern

Read Quality

Trimming statistics with fastp

For quality and adapter trimming, the tool fastp was used. The summary statistics are shown in the tables below.


Overall statistics of the read trimming step.
Statistics Value
passed_filter_reads 7’641’750
low_quality_reads 7’278
too_many_N_reads 656
too_short_reads 556’416
too_long_reads 0


Read statistics before and after trimming.
status            total_reads  total_bases    q20_bases      q30_bases      q20_rate  q30_rate  read1_mean_length  read2_mean_length  gc_content
before_filtering  8’206’100    1’224’527’189  1’157’004’931  1’072’956’716  0.94      0.88      149                149                0.37
after_filtering   7’641’750    1’022’957’128  1’008’624’679  967’460’953    0.99      0.95      143                123                0.37
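
The two fastp tables can be cross-checked against each other. The following is a minimal sketch, not part of the pipeline: the dictionary and function names are illustrative assumptions, and the numbers are copied from the tables above. It checks that the filter categories add up to the number of input reads and recomputes the q20/q30 rates from the base counts.

```python
# Values copied from the fastp tables in this report.
filter_result = {
    "passed_filter_reads": 7_641_750,
    "low_quality_reads": 7_278,
    "too_many_N_reads": 656,
    "too_short_reads": 556_416,
    "too_long_reads": 0,
}
before = {"total_reads": 8_206_100, "total_bases": 1_224_527_189,
          "q20_bases": 1_157_004_931, "q30_bases": 1_072_956_716}
after = {"total_reads": 7_641_750, "total_bases": 1_022_957_128,
         "q20_bases": 1_008_624_679, "q30_bases": 967_460_953}


def rate(stats, key):
    """Fraction of all bases that fall into the given quality class."""
    return stats[key] / stats["total_bases"]


# Every input read should end up in exactly one filter category.
reads_accounted_for = sum(filter_result.values())
fraction_reads_kept = after["total_reads"] / before["total_reads"]

print("all reads accounted for:", reads_accounted_for == before["total_reads"])
print(f"fraction of reads kept: {fraction_reads_kept:.3f}")
for label, stats in (("before", before), ("after", after)):
    print(f"{label}: q20_rate={rate(stats, 'q20_bases'):.2f} "
          f"q30_rate={rate(stats, 'q30_bases'):.2f}")
```

The recomputed rates should match the q20_rate and q30_rate columns after rounding to two decimals.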

FastQC

FastQC was used to check the quality of the reads after filtering. The tool checks, for example, whether platform-specific adapter sequences were removed.

Results from FastQC. Deduplicated refers to the percentage of reads remaining if duplicate reads are removed. Duplicates are reads with identical sequence; they occur, for example, when the exact same fragment has been sequenced more than once (PCR or optical duplicates).
Sample Reads Deduplicated Adapter direction
FAM24227-i1-2_1 3’820’875 68.83 pass forward
FAM24227-i1-2_2 3’820’875 76.87 pass reverse

Assembly Statistics with Quast

The assembly is evaluated using Quast. The tool computes several statistics that help to assess whether the assembly was successful.

The following metrics are particularly important:

  • Largest Contig: Length of the largest contig in the assembly.

  • N50: To compute this metric, the contigs are first sorted by size, largest first. Their sizes are then summed up until the sum is larger than half of the total size of the assembly. The N50 is the length of the last (smallest) contig that had to be added to reach this sum. Example: Given an assembly with 5 contigs of lengths 9, 6, 5, 4 and 3, the total assembly size is 27 and the N50 cutoff is 13.5. The N50 of this assembly is therefore 6, because 9 + 6 = 15 > 13.5.

  • L50: L50 is the number of contigs that are equal to or larger than the N50 contig. In the example above, the L50 value is 2.
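
The definitions of N50 and L50 above can be sketched in a few lines of Python. This is an illustration of the description in this report, not the code used by Quast; the function name n50_l50 is an assumption made for the example.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths.

    Follows the description above: sort contigs from largest to smallest
    and add them up until the running sum is larger than half of the
    total assembly size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_sum = 0
    for count, length in enumerate(lengths, start=1):
        running_sum += length
        if running_sum > half_total:
            # 'length' is the smallest contig needed (N50),
            # 'count' is the number of contigs used so far (L50).
            return length, count
    raise ValueError("contig_lengths must contain at least one positive length")


# Worked example from the text: contigs of length 9, 6, 5, 4 and 3.
n50, l50 = n50_l50([9, 6, 5, 4, 3])
print(n50, l50)
```

For the example assembly this gives an N50 of 6 and an L50 of 2, as stated in the list above.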

See the Quast manual for further explanations of the Quast metrics and plot descriptions.


Some basic metrics evaluated by Quast.
Assembly FAM24227.i1.2_assembly
# contigs (>= 0 bp) 171
# contigs (>= 1000 bp) 100
# contigs (>= 5000 bp) 76
# contigs (>= 10000 bp) 61
# contigs (>= 25000 bp) 35
# contigs (>= 50000 bp) 16
Total length (>= 0 bp) 2’522’881
Total length (>= 1000 bp) 2’495’320
Total length (>= 5000 bp) 2’436’384
Total length (>= 10000 bp) 2’319’684
Total length (>= 25000 bp) 1’878’330
Total length (>= 50000 bp) 1’238’243
# contigs 171
Largest contig 195’691
Total length 2’522’881
GC (%) 36.44
N50 49’148
N75 23’438
L50 17
L75 36
# N’s per 100 kbp 3.96
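
To make the cumulative rows of the table easier to interpret, the share of the total assembly length contained in contigs above each size threshold can be computed. The sketch below uses values copied from the table; the variable names are illustrative assumptions.

```python
# "Total length (>= X bp)" rows copied from the Quast table above.
total_length = 2_522_881
length_in_contigs_at_least = {
    1_000: 2_495_320,
    5_000: 2_436_384,
    10_000: 2_319_684,
    25_000: 1_878_330,
    50_000: 1_238_243,
}

# Share of the assembly contained in contigs of at least each threshold.
shares = {
    threshold: length / total_length
    for threshold, length in length_in_contigs_at_least.items()
}

for threshold, share in shares.items():
    print(f">= {threshold:>6} bp: {share:.1%} of the assembly")
```

The 16 contigs of at least 50'000 bp hold slightly less than half of the assembly, which is consistent with the reported N50 of 49'148 and L50 of 17.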


BUSCO

From the BUSCO website:

“BUSCO attempts to provide a quantitative assessment of the completeness in terms of expected gene content of a genome assembly, transcriptome, or annotated gene set. The results are simplified into categories of Complete and single-copy, Complete and duplicated, Fragmented, or Missing BUSCOs.

BUSCO completeness results make sense only in the context of the biology of your organism. You have to understand whether missing or duplicated genes are of biological or technical origin. For instance, a high level of duplication may be explained by a recent whole duplication event (biological) or a chimeric assembly of haplotypes (technical). Transcriptomes and protein sets that are not filtered for isoforms will lead to a high proportion of duplicates. Therefore you should filter them before a BUSCO analysis. Finally, focusing on specific tissues or specific life stages and conditions in a transcriptomic experiment is unlikely to produce a BUSCO-complete transcriptome. In this case, consistency across your samples is what you will be aiming for.”


Genome assembly assessment using BUSCO4. The figure shows how many expected genes are complete and single-copy, complete and duplicated, fragmented or missing.

ConFindr

ConFindr analyses genes that are known to be single-copy and conserved across all bacteria and evaluates whether more than one allele is present in the tested genes. The presence of multiple alleles may indicate intra-species contamination. The tool works on the raw reads, so the results are given for forward and reverse reads independently.


ConFindr results.
Sample           Genus  NumContamSNVs  ContamStatus  PercentContam  PercentContamStandardDeviation  BasesExamined  DatabaseDownloadDate
FAM24227-i1-2_2  ND     0              False         0              0                               20’937         2020-6-5

Annotation using Prokka

Prokka is used to annotate the assembly with useful features, such as CDS, genes, or rRNAs.


Summary statistics of the annotation with Prokka.
Feature Count
contigs 171
bases 2’522’881
CDS 2’363
rRNA 7
tRNA 48
tmRNA 1

Taxonomy

To obtain a taxonomic classification of the assembly, the tool GTDB-Tk compares the assembly with a reference database. The tool infers the taxonomy by computing the Average Nucleotide Identity (ANI) against the reference database and by placing the genome into a phylogenetic tree.


Taxonomic placement of the assembly. The classification method is either ANI/placement, placement or ANI. ANI/placement means that both methods agree on the taxonomic output. Otherwise the taxonomy of the better match is returned.
Sample         Domain       Phylum      Class    Order            Family         Genus        Species                  classification_method
FAM24227-i1-2  d__Bacteria  Firmicutes  Bacilli  Lactobacillales  Aerococcaceae  Facklamia_A  Facklamia_A sp003521095  ANI/Placement
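
GTDB-style classifications are commonly written as a single semicolon-separated lineage string with rank prefixes (d__, p__, c__, o__, f__, g__, s__). The sketch below shows how such a string can be split into the columns of the table above. The lineage string is reconstructed from the table values and is an assumption, as are the function and variable names; it is not copied from the tool output.

```python
# Rank prefixes used in GTDB-style lineage strings.
RANK_PREFIXES = {
    "d": "Domain", "p": "Phylum", "c": "Class", "o": "Order",
    "f": "Family", "g": "Genus", "s": "Species",
}


def parse_gtdb_lineage(lineage):
    """Split a lineage such as 'd__Bacteria;p__...' into a rank -> name dict."""
    ranks = {}
    for field in lineage.split(";"):
        prefix, _, name = field.strip().partition("__")
        ranks[RANK_PREFIXES[prefix]] = name
    return ranks


# Hypothetical lineage string assembled from the table in this report.
lineage = ("d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;"
           "f__Aerococcaceae;g__Facklamia_A;s__Facklamia_A sp003521095")
taxonomy = parse_gtdb_lineage(lineage)
print(taxonomy["Genus"], "|", taxonomy["Species"])
```

Splitting on the semicolon rather than on whitespace keeps species names that contain a space, such as the one reported here, intact.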

Software Versions

fastp 0.20.1
FastQC 0.11.7
Quast 4.6.0
SPAdes 3.14.0
ConFindr 0.7.2

Config

{
  "Project_Title": "Test Project",
  "author": "Simone Oberhaensli",
  "email": "simone.oberhaensli@bioinformatics.unibe.ch",
  "institute": "Interfaculty Bioinformatics Unit, University of Bern",
  "DataFolder" : "/data/projects/p446_Dialact_Phoenix/2_analyses/A_illumina/202204_reassemblies_emmanuelle/test/genome-assembly-pipeline/src/data/",
  "DataFolder_testing": "/data/datasets/E_coli_testdata/genome_assembly_test/",
  "extension" : ".fastq.gz",
  "mates" : {
    "mate1" : "_1",
    "mate2" : "_2"
    },
  "SampleSheet" : "sample_sheet.tsv",
  "Mode" : "Illumina",
  "fastp" : {
    "fastp_version" : "0.20.1",
    "fastp_threads" : 10,
    "fastp_hours": 4,
    "fastp_mem_mb": 5000
  },
  "fastqc" : {
    "fastqc_version" : "0.11.7",
    "fastqc_threads": 5,
    "fastqc_hours": 4,
    "fastqc_mem_mb": 4000
  },
  "spades" : {
    "spades_hours": 72,
    "spades_mem_mb": 30000,
    "spades_mem_gb":28,
    "spades_threads": 10,
    "spades_min_scaffold_length": 200
  },
  "quast" : {
    "quast_version" : "4.6.0",
    "quast_threads": 10,
    "quast_hours": 4,
    "quast_mem_mb": 5000
  },
  "busco" : {
    "busco_threads": 5,
    "busco_hours": 3,
    "busco_mem_mb": 20000
  },
  "confindr" : {
    "confindr_threads": 5,
    "confindr_hours": 4,
    "confindr_mem_mb": 30000
  },
  "prokka" : {
    "prokka_threads": 10,
    "prokka_hours": 8,
    "prokka_mem_mb": 5000
  },
  "gtdb" : {
    "gtdb_threads": 10,
    "gtdb_hours": 4,
    "gtdb_mem_mb": 300000
  },
  "short_sh_commands_threads": 16,
  "short_sh_commands_hours": 1,
  "short_commands_mb": 2000,
  "testing" : {
    "testing_threads": 8,
    "testing_hours": 1,
    "testing_mem_mb": 16000
  },
  "eof": "true"
}
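
The "extension" and "mates" entries of the config suggest how the paired read files of a sample are named. The following sketch illustrates one plausible way to combine them; the function paired_read_files is hypothetical and the exact naming logic of the pipeline is an assumption, not taken from its source code.

```python
# Subset of the config shown above (only the keys used in this example).
config = {
    "extension": ".fastq.gz",
    "mates": {"mate1": "_1", "mate2": "_2"},
}


def paired_read_files(sample, config):
    """Build forward and reverse file names as sample + mate suffix + extension."""
    return tuple(
        f"{sample}{config['mates'][mate]}{config['extension']}"
        for mate in ("mate1", "mate2")
    )


forward, reverse = paired_read_files("FAM24227-i1-2", config)
print(forward)
print(reverse)
```

Under this assumption, the names without the extension correspond to the sample labels FAM24227-i1-2_1 and FAM24227-i1-2_2 used in the FastQC table.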