align2subsetgbk¶
This script is a little more tricky. What does it do?
It takes a genbank file and a (Nucleotide) fasta file
It will align the fasta to the (contig) sequences in the genbank file
Then, it will filter the alignments based on the given thresholds and take the its coordinates
Finally, using these coordinates it will filter the genbank file and output only a subset of the annotation features that is found inside these alignments
CLI help message¶
A script meant to subset a genbank annotation file based on alignments against a query (Nucleotide) FASTA file
---
Copyright (C) 2020 Felipe Marques de Almeida (almeidafmarques@gmail.com)
License: Public Domain
Usage:
falmeida-py align2subsetgbk [ -h|--help ]
falmeida-py align2subsetgbk [ --gbk <in_gbk> --fasta <fasta> --out <out_gbk> --minid <int> --mincov <int> --culling_limit <int> --extension <int> ]
Options:
-h --help Show this screen.
-g --gbk=<in_gbk> Gbk file for subset
-f --fasta=<fasta> FASTA (nucl) file for querying the gbk
-o --out=<out_gbk> Gbk filtered output file [Default: out.gbk].
--extension=<int> Base pair length to extend the flank regions in the alignment [Default: 0].
--minid=<int> Min. Identity percentage for gene annotation [Default: 80].
--mincov=<int> Min. Covereage for gene annotation [Default: 80].
--culling_limit=<int> Blast culling_limit for best hit only [Default: 1].
falmeida-py: a package to the simple distribution of my custom scripts.
Copyright (C) 2020 Felipe Marques de Almeida (almeidafmarques@gmail.com)
License: Public Domain
usage:
falmeida-py [ -h|--help ] [ -v|--version ] [ --license ]
falmeida-py <command> [ -h|--help ] [ <args>... ]
options:
-h --help Show this screen
-v --version Show version information
--license Show LEGAL LICENSE information
commands:
tsv2markdown Command for rapid convertion of tsv or csv to markdown tables.
splitgbk Command to split multisequence genbank files into individual files.
align2subsetgbk Command to subset genbank files based on alignments to a FASTA file.
gbk2fasta Command to convert genbank files to fasta files.
blasts Command to execute automatized blast commands.
replace_fasta_seq Command to replace strings in a FASTA using defitinitions from a BED file
mpgap2csv Command to summarize main mpgap multiqc assembly statistics into a CSV file
bacannot2json Command to summarize main bacannot annotation results into JSON file
Use: `falmeida-py <commmand> -h` to get more help and see examples.
Usage¶
The usage is super simple:
falmeida-py align2subsetgbk \
--gbk sample.gbk \
--fasta genomic_islands.fna \
--extension 50 \
--culling_limit 2
Processing file: annotation/sample.gbk!
qseqid qstart qend qlen sseqid sstart send slen evalue length pident gaps gapopen bitscore
0 NC_016845.1_1287077_1326610 1 39533 39533 NC_016845.1 1287078 1326610 5333942 0.0 39533 100.0 0 0 73004
1 NC_016845.1_1778390_1811349 1 32959 32959 NC_016845.1 1778391 1811349 5333942 0.0 32959 100.0 0 0 60864
2 NC_016845.1_2286181_2305760 1 19579 19579 NC_016845.1 2286182 2305760 5333942 0.0 19579 100.0 0 0 36156
3 NC_016845.1_4049987_4083106 1 33119 33119 NC_016845.1 4049988 4083106 5333942 0.0 33119 100.0 0 0 61160
4 NC_016845.1_4821157_4858652 1 37495 37495 NC_016845.1 4821158 4858652 5333942 0.0 37495 100.0 0 0 69241
Note
This command will align the genomic islands to the genbank to find the annotation features that are found inside the regions that align to these genomic islands.
In the example, the alignments coordinates will be “extendend” 50nt in both directions and, the two best alignments will be used for each sequence (from fasta).