Test Project

Read Quality

Trimming statistics with fastp

For quality and adapter trimming, the tool fastp was used. The summary statistics are shown in the tables below.

Overall statistics of the read trimming step.
Statistics	Value
passed_filter_reads	7’641’750
low_quality_reads	7’278
too_many_N_reads	656
too_short_reads	556’416
too_long_reads	0

Read statistics before and after trimming.
status	total_reads	total_bases	q20_bases	q30_bases	q20_rate	q30_rate	read1_mean_length	read2_mean_length	gc_content
before_filtering	8’206’100	1’224’527’189	1’157’004’931	1’072’956’716	0.94	0.88	149	149	0.37
after_filtering	7’641’750	1’022’957’128	1’008’624’679	967’460’953	0.99	0.95	143	123	0.37
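
The two tables can be cross-checked against each other with a few lines of arithmetic. The sketch below copies the values from the tables and recomputes the filter accounting and the q20/q30 rates; the dictionary and function names are illustrative and are not part of fastp.

```python
# Values copied from the two fastp tables above.
filter_result = {
    "passed_filter_reads": 7_641_750,
    "low_quality_reads": 7_278,
    "too_many_N_reads": 656,
    "too_short_reads": 556_416,
    "too_long_reads": 0,
}
before = {"total_reads": 8_206_100, "total_bases": 1_224_527_189,
          "q20_bases": 1_157_004_931, "q30_bases": 1_072_956_716}
after = {"total_reads": 7_641_750, "total_bases": 1_022_957_128,
         "q20_bases": 1_008_624_679, "q30_bases": 967_460_953}


def rates(stats):
    """Return (q20_rate, q30_rate) as fractions of the total bases."""
    return (stats["q20_bases"] / stats["total_bases"],
            stats["q30_bases"] / stats["total_bases"])


# Each input read should fall into exactly one filter category.
reads_accounted = sum(filter_result.values())
passed_fraction = filter_result["passed_filter_reads"] / before["total_reads"]

print(reads_accounted == before["total_reads"])
print(round(passed_fraction, 4))
print([round(r, 2) for r in rates(before)])
print([round(r, 2) for r in rates(after)])
```

The filter categories sum to the number of reads before filtering, and the recomputed rates match the rounded q20_rate and q30_rate columns.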

FastQC

FastQC was used to check the quality of the reads after filtering. The tool checks, for example, whether platform-specific adapter sequences were removed.

Results from FastQC. Deduplicated is the percentage of reads that would remain if duplicate reads were removed. Duplicates occur when the exact same sequence is observed more than once.
Sample	Reads	Deduplicated	Adapter	direction
FAM24227-i1-2_1	3’820’875	68.83	pass	forward
FAM24227-i1-2_2	3’820’875	76.87	pass	reverse

Assembly Statistics with Quast

The assembly is evaluated using Quast. The tool computes several statistics that help to infer whether the assembly was successful.

The following metrics are particularly important:

Largest Contig: Length of the largest of the assembled contigs.
N50: To compute this metric, the contigs are first sorted by size in descending order. The sizes of the largest contigs are then summed until the sum exceeds half of the total assembly size. The N50 is the length of the smallest contig included in that sum. Example: Given an assembly with 5 contigs of lengths 9, 6, 5, 4 and 3, the total assembly size is 27 and the N50 cutoff is 13.5. The N50 of this assembly is therefore 6, because 9 + 6 = 15 > 13.5.
L50: L50 is the number of contigs that are equal to or larger than the N50 contig. In the example above, the L50 value is 2.
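
The definitions of N50 and L50 can be sketched in a few lines of Python. This is a minimal illustration of the description above, not Quast's own implementation; the function name `n50_l50` is made up for this example.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_sum = 0
    for count, length in enumerate(lengths, start=1):
        running_sum += length
        # Follows the wording above ("larger than half"). Some definitions
        # use "at least half"; the two differ only when the running sum
        # hits exactly half of the total.
        if running_sum > half_total:
            return length, count
    raise ValueError("contig_lengths must contain at least one positive length")


print(n50_l50([9, 6, 5, 4, 3]))  # (6, 2)
```

For the five contigs of the example, the function returns an N50 of 6 and an L50 of 2.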

See the Quast manual for further explanations of Quast metrics and plot descriptions.

Some basic metrics evaluated by Quast.
Assembly	FAM24227.i1.2_assembly
# contigs (>= 0 bp)	171
# contigs (>= 1000 bp)	100
# contigs (>= 5000 bp)	76
# contigs (>= 10000 bp)	61
# contigs (>= 25000 bp)	35
# contigs (>= 50000 bp)	16
Total length (>= 0 bp)	2’522’881
Total length (>= 1000 bp)	2’495’320
Total length (>= 5000 bp)	2’436’384
Total length (>= 10000 bp)	2’319’684
Total length (>= 25000 bp)	1’878’330
Total length (>= 50000 bp)	1’238’243
# contigs	171
Largest contig	195’691
Total length	2’522’881
GC (%)	36.44
N50	49’148
N75	23’438
L50	17
L75	36
# N’s per 100 kbp	3.96

BUSCO

From the BUSCO website:

“BUSCO attempts to provide a quantitative assessment of the completeness in terms of expected gene content of a genome assembly, transcriptome, or annotated gene set. The results are simplified into categories of Complete and single-copy, Complete and duplicated, Fragmented, or Missing BUSCOs.

BUSCO completeness results make sense only in the context of the biology of your organism. You have to understand whether missing or duplicated genes are of biological or technical origin. For instance, a high level of duplication may be explained by a recent whole duplication event (biological) or a chimeric assembly of haplotypes (technical). Transcriptomes and protein sets that are not filtered for isoforms will lead to a high proportion of duplicates. Therefore you should filter them before a BUSCO analysis. Finally, focusing on specific tissues or specific life stages and conditions in a transcriptomic experiment is unlikely to produce a BUSCO-complete transcriptome. In this case, consistency across your samples is what you will be aiming for.”

Genome assembly assessment using BUSCO4. The figure shows how many expected genes are complete, complete and duplicate, fragmented or missing.

ConFindr

ConFindr analyses genes that are known to be single-copy and conserved across all bacteria and evaluates whether more than one allele is present in the tested genes, which may indicate intra-species contamination. The tool works on the raw reads, so the results are given for forward and reverse reads independently.

ConFindr results.
Sample	Genus	NumContamSNVs	ContamStatus	PercentContam	PercentContamStandardDeviation	BasesExamined	DatabaseDownloadDate
FAM24227-i1-2_2	ND	0	False	0	0	20’937	2020-6-5
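
A table like this can be screened automatically when many samples are processed. The sketch below assumes the report is available as tab-separated text with the column names shown above; the `REPORT` string is copied from the table, and the helper names are hypothetical.

```python
import csv
import io

# Header and row copied from the table above (tab-separated).
REPORT = (
    "Sample\tGenus\tNumContamSNVs\tContamStatus\tPercentContam\t"
    "PercentContamStandardDeviation\tBasesExamined\tDatabaseDownloadDate\n"
    "FAM24227-i1-2_2\tND\t0\tFalse\t0\t0\t20’937\t2020-6-5\n"
)


def parse_int(text):
    """Parse an integer that may use ’ or ' as a thousands separator."""
    return int(text.replace("’", "").replace("'", ""))


def contaminated_samples(report_text):
    """Return the names of samples whose ContamStatus column reads 'True'."""
    rows = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    return [row["Sample"] for row in rows if row["ContamStatus"] == "True"]


rows = list(csv.DictReader(io.StringIO(REPORT), delimiter="\t"))
bases_examined = parse_int(rows[0]["BasesExamined"])
print(contaminated_samples(REPORT), bases_examined)
```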

Annotation using Prokka

Prokka is used to annotate the assembly with useful features, such as CDS, genes, or rRNAs.

Summary statistics of the annotation with Prokka.
Feature	Count
contigs	171
bases	2’522’881
CDS	2’363
rRNA	7
tRNA	48
tmRNA	1

Taxonomy

To obtain a taxonomic classification of the assembly, the tool GTDB-Tk compares the assembly with a reference database. The tool infers the taxonomy by computing the Average Nucleotide Identity (ANI) against the reference database and by placing the genome into a phylogenetic tree.

Taxonomic placement of the assembly. The classification method is either ANI/placement, placement or ANI. ANI/placement means that both methods agree on the taxonomic output. Otherwise the taxonomy of the better match is returned.
Sample	Domain	Phylum	Class	Order	Family	Genus	Species	classification_method
FAM24227-i1-2	d__Bacteria	Firmicutes	Bacilli	Lactobacillales	Aerococcaceae	Facklamia_A	Facklamia_A sp003521095	ANI/Placement
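
GTDB lineages are commonly written as a single semicolon-separated string in which each taxon carries a rank prefix such as `d__` or `p__`. The sketch below shows how such a string maps onto the columns of the table; the `lineage` value is reconstructed from the table row and is an assumption, not output copied from the pipeline.

```python
RANK_PREFIXES = {"d": "Domain", "p": "Phylum", "c": "Class", "o": "Order",
                 "f": "Family", "g": "Genus", "s": "Species"}


def split_lineage(lineage):
    """Split 'd__...;p__...' into a {rank name: taxon} dictionary."""
    ranks = {}
    for field in lineage.split(";"):
        # partition() splits on the first '__' only, so taxon names that
        # contain single underscores (e.g. Facklamia_A) stay intact.
        prefix, _, taxon = field.strip().partition("__")
        ranks[RANK_PREFIXES[prefix]] = taxon
    return ranks


# Assumed lineage string, rebuilt from the table row above.
lineage = ("d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;"
           "f__Aerococcaceae;g__Facklamia_A;s__Facklamia_A sp003521095")
print(split_lineage(lineage)["Species"])
```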

Software Versions

Software	Version
fastp	0.20.1
FastQC	0.11.7
Quast	4.6.0
SPAdes	3.14.0
ConFindr	0.7.2

Config

{
  "Project_Title": "Test Project",
  "author": "Simone Oberhaensli",
  "email": "simone.oberhaensli@bioinformatics.unibe.ch",
  "institute": "Interfaculty Bioinformatics Unit, University of Bern",
  "DataFolder" : "/data/projects/p446_Dialact_Phoenix/2_analyses/A_illumina/202204_reassemblies_emmanuelle/test/genome-assembly-pipeline/src/data/",
  "DataFolder_testing": "/data/datasets/E_coli_testdata/genome_assembly_test/",
  "extension" : ".fastq.gz",
  "mates" : {
    "mate1" : "_1",
    "mate2" : "_2"
    },
  "SampleSheet" : "sample_sheet.tsv",
  "Mode" : "Illumina",
  "fastp" : {
    "fastp_version" : "0.20.1",
    "fastp_threads" : 10,
    "fastp_hours": 4,
    "fastp_mem_mb": 5000
  },
  "fastqc" : {
    "fastqc_version" : "0.11.7",
    "fastqc_threads": 5,
    "fastqc_hours": 4,
    "fastqc_mem_mb": 4000
  },
  "spades" : {
    "spades_hours": 72,
    "spades_mem_mb": 30000,
    "spades_mem_gb": 28,
    "spades_threads": 10,
    "spades_min_scaffold_length": 200
  },
  "quast" : {
    "quast_version" : "4.6.0",
    "quast_threads": 10,
    "quast_hours": 4,
    "quast_mem_mb": 5000
  },
  "busco" : {
    "busco_threads": 5,
    "busco_hours": 3,
    "busco_mem_mb": 20000
  },
  "confindr" : {
    "confindr_threads": 5,
    "confindr_hours": 4,
    "confindr_mem_mb": 30000
  },
  "prokka" : {
    "prokka_threads": 10,
    "prokka_hours": 8,
    "prokka_mem_mb": 5000
  },
  "gtdb" : {
    "gtdb_threads": 10,
    "gtdb_hours": 4,
    "gtdb_mem_mb": 300000
  },
  "short_sh_commands_threads": 16,
  "short_sh_commands_hours": 1,
  "short_commands_mb": 2000,
  "testing" : {
    "testing_threads": 8,
    "testing_hours": 1,
    "testing_mem_mb": 16000
  },
  "eof": "true"
}
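
Strict JSON parsers, such as Python's `json` module, reject a comma placed directly before a closing brace or bracket, which is an easy mistake to make when a config like this is edited by hand. The sketch below shows one tolerant way to load such a file. It is an illustration only and not necessarily how the pipeline reads its config; the function name `load_relaxed_json` and the shortened `example` text are hypothetical.

```python
import json
import re


def load_relaxed_json(text):
    """Parse JSON text after removing commas that directly precede } or ]."""
    # Naive by design: the pattern would also alter a string value that
    # happens to contain ',}' or ',]'. That is acceptable for simple
    # configs made of keys, numbers and paths.
    cleaned = re.sub(r",(\s*[}\]])", r"\1", text)
    return json.loads(cleaned)


# Shortened, hypothetical config fragment with a trailing comma.
example = """
{
  "fastqc": {
    "fastqc_threads": 5,
    "fastqc_mem_mb": 4000,
  },
  "eof": "true"
}
"""
config = load_relaxed_json(example)
print(config["fastqc"]["fastqc_mem_mb"])
```

Text that is already valid JSON passes through the function unchanged, so it can be used on both strict and hand-edited configs.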