JBS ChimericSeq™ (free)
DOWNLOADS
Item | Description | File size |
JBS ChimericSeq™ Software | Main Program (Windows) | 102.526 MB |
JBS ChimericSeq™ Software | Main Program (Mac) | 9.62 MB |
hg38.fa | Human Reference Genome | 960 MB |
HBV_Complete.fa | HBV Reference Genome | 862 KB |
Sample reads | Mate pair of reads (HBV/Human Sequences) | 24.896 MB |
Human GTF | Human GTF file for gene lookup | 1.3 GB |
HumanRef.zip | Human Reference Genome Index Files | 3.8 GB |
ViralRef.zip | HBV Reference Genome Index Files | 10.2 MB |
ChimericSeq-Source.zip | ChimericSeq Source Code | 29 KB |
ChimericSeq_User_Guide_for_Windows | ChimericSeq Quick User Guide | 1.1 MB |
MANUSCRIPT
INTRODUCTION
JBS ChimericSeq™ brings an easy-to-use interface to Windows and Mac users who want to discover viral integration events in host DNA. It can also be used to identify and analyze any chimeric sequences, not just viral/host ones, as long as the user provides the reference genomes.
To use JBS ChimericSeq™ to discover virus-host integration sites, you need a few things ready. You must have a FASTA (.fa/.fasta) file for each organism (host and virus); these files hold the reference sequence, that is, the complete genome, of the organism(s) you are studying. To discover integrations from multiple viruses at once, combine the references for each virus into one .fa file. The same approach lets you screen for different genotypes of the same virus. You can download reference genomes from any server that hosts them (RefSeq, Ensembl, NCBI, UCSC, etc.). If you would like gene information about the integration sites, you must also have a GTF file for the host.
For instance: The human genome can be retrieved from several databases, and in our tests we used the Dec. 2013 (GRCh38/hg38) assembly provided by UCSC.
Human Reference
The genome for your particular virus can be found in many of these databases as well. We use the HBV complete genome (listed under ALL) viral assembly from:
HBV Reference
The human GTF file can be retrieved from several databases. We use the human GTF provided by Ensembl:
Human GTF
We also provide these genomes and the GTF file directly in our Downloads section for your convenience.
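Combining several viral references (or several genotypes of one virus) into a single .fa file, as described above, can be sketched in Python as follows. This is only an illustration and not part of ChimericSeq itself; the function name `combine_fasta` and the duplicate-name check are assumptions added here so that each sequence keeps a distinct identifier.

```python
def combine_fasta(inputs, output):
    """Concatenate FASTA files into one file and return the record count.

    Hypothetical helper: raises ValueError if two records share a name,
    because the aligner output would then be ambiguous.
    """
    seen_names = set()
    with open(output, "w") as out:
        for path in inputs:
            with open(path) as handle:
                for line in handle:
                    line = line.rstrip("\r\n")
                    if not line:
                        continue  # skip blank lines between records
                    if line.startswith(">"):
                        fields = line[1:].split()
                        name = fields[0] if fields else ""
                        if name in seen_names:
                            raise ValueError(f"duplicate sequence name: {name}")
                        seen_names.add(name)
                    out.write(line + "\n")
    return len(seen_names)
```

For example, `combine_fasta(["genotypeA.fa", "genotypeB.fa"], "viral.fa")` (file names are placeholders) would write both records to viral.fa, which could then be selected as the viral reference.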
Once you have obtained your reference sequences for your reads, and you have your NGS data, you are ready to perform first time setup. To begin, please download the main program for Windows or Mac in the Downloads section. Install to your directory of choice by unzipping the download there. To make things easier for you, although not required, place your reads in the Reads folder of the main program’s root directory and your host.fa and viral.fa files in the Host_Reference and Viral_Reference folders, respectively. Make sure all files have been uncompressed.
ChimericSeq™ requires three dependencies, all of which are included with the main program: Bowtie2, Python, and Perl. You do not need to install them yourself, because ChimericSeq™ adds their directories to the path. Python and Perl are included only because they are Bowtie2 dependencies; ChimericSeq™ does not use them directly. If you would like to update your Bowtie2 version manually, use the link below.
Latest Bowtie2 release (optional)
Latest Perl release (optional)
Latest Python release (optional)
For Windows: Download and run the self-extracting ChimericSeq_zip.exe. To begin, click ChimericSeq.exe. If you have trouble, see the note about running as administrator. The Windows version includes the Python and Perl packages, so there is no need to set those up. However, your system may require that you set the paths for these packages. Instructions to set the paths are here.
For Mac: Download and unzip ChimericSeq_Mac.zip. Then click the shell script JBS_ChimericSeq.tool to install the required tools. To begin, type python ChimericSeq.py. If you have trouble, see the note about running the Mac version. The Mac version only requires setup of Python, as Perl is already installed on Macs.
System memory: The minimum system memory requirement is 8 GB.
Hard disk free space:
(a) 10 GB for the Human Reference genome and the ChimericSeq application.
(b) 2 times the size of the test data, to hold the original test data and the alignment data. If you need to split the test data into smaller files, allow 3 times the size of the test data, to hold the original test data, the split files, and the alignment data.
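The disk space guidance above reduces to simple arithmetic. The sketch below assumes that items (a) and (b) are additive; the function name and parameters are illustrative only.

```python
def required_free_space_gb(test_data_gb, will_split=False, base_gb=10.0):
    """Estimate free disk space in GB from the guidance above.

    base_gb covers the human reference genome and the application (item a).
    The test data needs 2x its size, or 3x if it will be split (item b).
    Adding the two items together is an assumption of this sketch.
    """
    multiplier = 3 if will_split else 2
    return base_gb + multiplier * test_data_gb
```

For a 5 GB read set this gives 20 GB without splitting and 25 GB with splitting.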
FIRST TIME SETUP
First, you must tell JBS ChimericSeq™ to build the index databases that Bowtie2 aligns your reads to. Bowtie2 does not align reads directly to the FASTA files you just downloaded, but to an index that it builds from those files. Once you have built the index files for a particular reference file, you DO NOT have to perform this step again. Note that this process can take several minutes, and for the human reference it can take well over 30 minutes (even more than an hour). Do not touch the program until you see the message “Build Complete” in the output log of the program. Follow these steps to build the index for each reference FASTA file:
Steps
1) Navigate to Options->Set Locations. Make sure the Bowtie2 Directory field is pointing to either the bowtie2 folder included in the package, or one already on your computer.
2) Press the “…” buttons under Viral reference *.fa and select your viral reference file (the .fa file).
3) Type any string of characters in the Viral Bowtie2 Prefix. For example, the word “viral” without the quotes would be fine. This name does not matter, but is merely here to allow personal preference on the name of the index database for future use.
4) Press the “Build” button, and the program will automatically tell Bowtie2 to build the index in the directory specified by the Viral Index Directory field.
5) Watch the log in the main window to verify completion of the build before moving on to the next step. This step can take up to several minutes and is typically the longest step in the setup process. You will see “Build complete” once finished. Then you can move on.
6) Repeat steps 1-5 for the host files, assuming it has not been completed already. If the host is large, such as the human genome, this can take hours!
If you already have Bowtie2 indexes on your computer for the organisms you want to evaluate, just select their location using the Viral/Host Index Directory field and make sure the Viral/Host Bowtie2 Prefix matches the name of the corresponding index files.
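To check whether an index already exists for a given directory and prefix, it helps to know that Bowtie2 stores an index as six files named after the prefix (for example viral.1.bt2 through viral.rev.2.bt2, or .bt2l for large genomes). The sketch below shows such a check, together with the `bowtie2-build` command line that builds an index outside the GUI. The helper names are hypothetical, and this is not necessarily how ChimericSeq performs these steps internally.

```python
from pathlib import Path

# File name parts that follow the prefix in a complete Bowtie2 index.
INDEX_SUFFIXES = (".1", ".2", ".3", ".4", ".rev.1", ".rev.2")


def index_exists(index_dir, prefix):
    """Return True if a complete Bowtie2 index named `prefix` is in `index_dir`."""
    base = Path(index_dir)
    for ext in ("bt2", "bt2l"):  # small and large index formats
        if all((base / f"{prefix}{s}.{ext}").is_file() for s in INDEX_SUFFIXES):
            return True
    return False


def build_command(reference_fa, index_dir, prefix):
    """Return the bowtie2-build argument list: reference file, then index base name."""
    return ["bowtie2-build", str(reference_fa), str(Path(index_dir) / prefix)]
```

The list returned by `build_command` could be passed to `subprocess.run` if the bowtie2 folder is on the path.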
PLEASE READ!! NOTE:
Mate-paired read files must follow a naming convention: the name of the forward read file ends in 1.fq or 1.fastq, and the name of the reverse read file ends in 2.fq or 2.fastq. There is no naming convention for unpaired reads.
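The naming convention can be checked before loading a pair. The following sketch derives the expected mate file name from either file of a pair; the function name `mate_of` and the regular expression are illustrative and not taken from ChimericSeq.

```python
import re

# Matches names ending in 1.fq, 2.fq, 1.fastq, or 2.fastq.
_MATE_PATTERN = re.compile(r"^(?P<stem>.*)(?P<mate>[12])\.(?P<ext>fq|fastq)$")


def mate_of(filename):
    """Return the expected mate's file name, or None if the name is unpaired."""
    match = _MATE_PATTERN.match(filename)
    if match is None:
        return None
    other = "2" if match.group("mate") == "1" else "1"
    return f"{match.group('stem')}{other}.{match.group('ext')}"
```

Here `mate_of("SRR854550_1.fq")` gives `"SRR854550_2.fq"`, while a name such as `"reads.fq"` gives `None`.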
Reads limit: for better performance, the maximum number of reads is 4 million for a system with 32 GB of memory, 2 million for a system with 16 GB of memory, and 1 million for a system with 8 GB of memory. For large files, use the “Split Large File” option in the File menu to split the paired files into smaller files. Then process all of the files in the folder that is created (for example, SRR854550__split).
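To decide whether a file needs splitting, you can count its reads and compare the count with the limits above. This sketch assumes an uncompressed FASTQ file with exactly four lines per read; the function names are hypothetical, and memory sizes between the listed tiers are rounded down to the nearest tier as an assumption.

```python
import math

# Read limits quoted in the text, keyed by system memory in GB.
READ_LIMITS = {32: 4_000_000, 16: 2_000_000, 8: 1_000_000}


def count_reads(fastq_path):
    """Count reads in an uncompressed FASTQ file with four lines per record."""
    with open(fastq_path) as handle:
        lines = sum(1 for _ in handle)
    return lines // 4


def max_reads_for_memory(memory_gb):
    """Return the read limit of the largest listed tier that fits in memory."""
    tiers = [gb for gb in READ_LIMITS if gb <= memory_gb]
    if not tiers:
        raise ValueError("at least 8 GB of memory is required")
    return READ_LIMITS[max(tiers)]


def chunks_needed(n_reads, memory_gb):
    """Return how many split files keep each file within the read limit."""
    return max(1, math.ceil(n_reads / max_reads_for_memory(memory_gb)))
```

A file with 5 million reads on a 16 GB system would need 3 split files under these limits.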
Windows note: if Bowtie2 reports errors about permissions, or says that it cannot edit or find files due to permissions, run ChimericSeq as administrator by right-clicking it and selecting “Run as Administrator”.
Mac note: The Mac version requires additional setup. Please download and install Python 3.X for Mac from Anaconda. If you are getting permission errors or the .tool file is not launching correctly, you will have to change the permissions. To do this, find the ChimericSeq_Mac folder and select it, and then click Finder>Services>New Terminal at Folder in the top left corner of your screen. If you don’t have the “New Terminal at Folder” option, click “Services Preferences” instead, check the box for “New Terminal at Folder” under the “Shortcuts” tab, and close the window. Then right-click the ChimericSeq_Mac folder again and select “New Terminal at Folder” at the bottom of the menu. Once Terminal is open, type each of the following commands and press Enter after each one:
cd ../
sudo chmod -R 0777 ./ChimericSeq_Mac
sudo chmod a+x ./ChimericSeq_Mac/JBS_ChimericSeq.tool
You can now close Terminal and run the application normally by double-clicking the JBS_ChimericSeq.tool file.
OPTIONS SETUP
The options setup should be performed upon opening JBS ChimericSeq™.
1) Change any configurations you may need by navigating to Options->Configurations
a. Clipped Sequence Min Length: When the viral portion is aligned to the reference, this program looks for reads that aligned only partially to the viral reference file. This field is the threshold length of the unaligned (clipped) portion. For example, a 50 base pair read with bases 7-45 aligning to the viral reference file would have clipped sequences of lengths 6 and 5. This read would not qualify for alignment to the host under the default value (10).
b. Salt Concentration (mM): This is the value of the salt concentration for determining the salt adjusted melt temps of each DNA segment (host, viral, and overlap), as given by the DNA melt temperature formulas. This can be useful in primer design. Default value is 115 mM.
c. Trimming 5′ (bp): Takes off this number in base pairs from 5′ end for trimming reads of adapter sequences. Default 0.
d. Trimming 3′ (bp): Takes off this number in base pairs from 3′ end for trimming reads of adapter sequences. Default 0.
e. Gene Distance Threshold: The closest upstream or downstream gene to the breakpoint (if not inside a gene) will be reported. This determines how far to look in each direction in terms of base pairs.
f. Similarity Max%: When two reads contain greater than the set percentage of homologous (or identical) sequence, only 1 read will be retained. Default 95.
g. nt stretch count max: When a run of a single nucleotide is longer than the set amount, the read will be discarded. Default 8.
h. Microhomology Max%: When the identified viral sequence overlaps the host sequence by more than the set maximum percentage, the read is discarded. Default 90.
i. Microhomology nt: When the identified viral sequence overlaps the host sequence by more than the set maximum number of nucleotides, the read is discarded. Default 25.
j. Overlap TM Max (C)*: Filter value specifying maximum overlap melt temperature – helps get rid of artifact junctions.
k. Overlap Length Max*: Filter value specifying maximum overlap length – helps get rid of artifact junctions.
l. Host TM Min (C)*: Filter value specifying minimum host region melt temperature.
m. Host Length Min*: Filter value specifying minimum host region length.
n. Viral TM Min (C)*: Filter value specifying minimum virus region melt temperature.
o. Viral Length Min*: Filter value specifying minimum virus region length.
p. Filter with Basic Temperature*: Switch for filter to either use salt adjusted or basic melt temps for filtering.
* These filtering mechanisms must be turned on individually using Options>Filtering and are not enabled by default.
2) Make sure location preferences are set by navigating to Options->Set Locations
a. Bowtie2 Directory: Folder location of the Bowtie2 directory. (Default value is PATH\JBS_ChimericSeq\bowtie2-2.2.5)
b. Viral/Host Reference *.fa: Location of the viral/host reference Fasta file.
c. Viral/Host Index Directory: Location of the Bowtie2 index for the viral/host reference file. If no index exists, follow First Time Setup to create one.
d. Viral/Host Bowtie2 Prefix: Name of the Bowtie2 index built from the reference file and located in the index directory. (Default value is viralRef/hostRef)
e. Host GTF file: Location of the host GTF file. Needed if gene information is to be included.
f. Output Directory: Directory where a subfolder will be created for each run performed by JBS ChimericSeq™. (Default value is PATH\JBS_ChimericSeq)
3) Set Filtering by navigating to Options->Filtering. This will bring up a menu of the previously mentioned filters and you can enable or disable them here.
4) Set Thread Count by navigating to Options->Thread Count. This is the number of CPU cores to engage for alignment. A higher value uses more CPU power and performs faster alignments. (Default=2)
5) Set Prompting to either on or off in the Options menu. When enabled, the program asks for yes or no input before continuing with each alignment phase. (Default=On)
6) Set Load Alignments When Possible to true to skip the initial Bowtie2 alignments when alignment files generated against the reference indexes already exist in the subfolder for the run you are performing. (Default=Off)
7) Set Get gene info to true if you want to include gene information about integration sites in the output data. This requires slightly more time per run, and a host GTF file.
8) If you want to save the configuration and location settings you just changed for future quick loading, navigate to File->Save Settings to save.
9) Finally, if you have already saved your settings from a previous session, navigate to File->Load Settings to load the config file.
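Two of the read-level settings under Options->Configurations, Clipped Sequence Min Length and nt stretch count max, can be illustrated with a short sketch. The function names are hypothetical, and whether ChimericSeq compares the clipped length with "greater than or equal to" or "greater than" is not stated in this guide; the first is assumed here.

```python
from itertools import groupby


def clipped_lengths(read_length, aligned_start, aligned_end):
    """Return (5' clip, 3' clip) lengths for a 1-based inclusive aligned range."""
    return aligned_start - 1, read_length - aligned_end


def has_long_clip(read_length, aligned_start, aligned_end, min_length=10):
    """True if either clipped end reaches min_length (">=" is an assumption)."""
    return max(clipped_lengths(read_length, aligned_start, aligned_end)) >= min_length


def longest_run(sequence):
    """Return the length of the longest run of one repeated nucleotide."""
    runs = (sum(1 for _ in group) for _, group in groupby(sequence.upper()))
    return max(runs, default=0)


def exceeds_stretch_limit(sequence, max_run=8):
    """True if a single-nucleotide run is longer than max_run (default 8)."""
    return longest_run(sequence) > max_run
```

With the example from the Clipped Sequence Min Length description, `clipped_lengths(50, 7, 45)` returns `(6, 5)`, so `has_long_clip(50, 7, 45)` is `False` under the default of 10.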
PERFORMING A RUN
1) Load your Fastq read file(s), whether a single file or a mate pair, by clicking on the “…” button in the main display next to the Fastq File(s) field. If you are loading a Fastq pair, make sure the file names follow the naming convention!
2) Click the “Start Run” button to automatically take reads through the pipeline and output the data.
3) “Yes” or “No” dialog options as presented by the interactive log allow you to confirm steps in the process, namely, to continue with viral alignment, and to continue with host alignment. This feature can be turned off in the Options->Configurations panel.
4) The list of reads that had portions align to both viral and host references will be displayed in the Reads Containing Junction Sequences field. Click on one or use the arrow keys to select reads.
5) A visualization of the read will be presented in the Sequence field, while other attributes about the read will be presented in the Attributes field.
6) To save the data that has been generated, including the log’s text, navigate to File->Save. This generates a series of timestamped CSV files (which can be opened in a spreadsheet program such as Excel) containing the attribute data for each read, including the sequence, and also saves a text file containing the log’s text. The CSV file names indicate their contents; for example, the automatically extracted unique reads are in the no_duplicates and no_complements files.
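This guide does not describe how the no_duplicates and no_complements files are produced. As a rough illustration of the idea only, the sketch below keeps one copy of each read sequence and treats a read and its reverse complement as the same read. Interpreting "complements" as reverse complements is an assumption, and the function names are hypothetical.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence):
    """Return the reverse complement of an uppercase DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def unique_reads(sequences):
    """Keep the first occurrence of each read, ignoring strand orientation."""
    seen = set()
    kept = []
    for sequence in sequences:
        sequence = sequence.upper()
        # Use one canonical key for a read and its reverse complement.
        key = min(sequence, reverse_complement(sequence))
        if key not in seen:
            seen.add(key)
            kept.append(sequence)
    return kept
```

Exact matching is used here for simplicity; it does not model the Similarity Max% setting, which works on a percentage threshold.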
ATTRIBUTE TABLE
Attributes of the selected read are presented both in the attribute table and in the generated CSV file. A “*” indicates that there are both “viral” and “host” versions. The fields’ descriptions are as follows:
a) Read Name: Read identifier used by the program. This is a shortened version of the read identifier given by the Fastq source.
b) Read Length: Length in base pairs of the initial read.
c) Fastq Source: If a mate pair was loaded, will point to which mate the read came from. If just a single Fastq, then this field will just be that Fastq.
d) Index: Bookkeeping index used by the program, also present in saved CSV files for easy comparison.
e) * Component Length: The length of the read in base pairs which was mapped to the corresponding reference.
f) Local * Coordinates: The coordinate range of the queried read that mapped to the reference, where 1=first base in the sequence.
g) * Reference Coordinates: The coordinate range of the reference genome that corresponds to the mapped region on the original sequence.
h) Viral Accession: The name of the viral reference sequence accession, as defined by the reference sequence itself.
i) * Map Type: Will be Plus/Plus if the query mapped to the reference in the same orientation, will be Plus/Minus if the query mapped to the reference’s complement.
j) * TM (Basic): Melting temperature of the segment described by * as determined by standard DNA melt formulas.
k) * TM (Salt Adjusted): Melting temperature of the segment described by * as determined by standard DNA melt formulas with salt adjustment, using the salt concentration set in the configurations panel.
l) Chromosome: Host chromosome to which the query aligned, when the host is human. Otherwise, this will just be the host accession.
m) Gene: Provided a GTF file was loaded, name of the gene (if any) within threshold distance on the genome from the host alignment. Default threshold is 10kb.
n) Inside Gene: True if host alignment was inside gene reported above, false if upstream or downstream from gene but within threshold.
o) Distance To Gene: Distance in bp of alignment to gene if not inside reported gene.
p) Direction: If not inside reported gene, upstream or downstream from gene.
q) Focus Region: Sometimes when searching the GTF file, if there is a known feature that the host alignment falls on that is more specific than the gene, such as an exon, this feature will be reported.
r) Inserted regions: Coordinate ranges of the queried read that have been detected to be inserted. (Not mismatch, not reference sequence, contained in alignment)
s) Overlap: Number of bases of overlap, if any, between viral and host regions on query sequence.
t) Gene: If gene database file has been provided, closest gene from the query’s host reference coordinates within 100kbp.
u) Map Quality: Map quality as generated by Bowtie2 from the alignment files.
v) Gene Object (only included in saved .csv): Query of the GTF file which lists the gene reported, if any. This contains GTF file attributes about the gene, which include database source, gene id, and more as defined by the GTF file format.
w) Focus Object (only included in saved .csv): Query of the GTF file which lists the focus reported, if any. Same attributes as gene object.
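Two of the attributes above, Overlap and TM, are simple to compute from a read. The sketch below derives the overlap from the viral and host local coordinates and estimates melting temperatures with commonly cited approximations (the Wallace rule for short sequences and GC-content formulas for longer ones). This guide does not say which formulas ChimericSeq uses, so the constants here are an assumption and results may differ from the program's output.

```python
import math


def overlap_length(viral_range, host_range):
    """Bases shared by two 1-based inclusive local coordinate ranges."""
    start = max(viral_range[0], host_range[0])
    end = min(viral_range[1], host_range[1])
    return max(0, end - start + 1)


def _counts(sequence):
    """Return (A+T count, G+C count, total length) for a DNA sequence."""
    sequence = sequence.upper()
    at = sequence.count("A") + sequence.count("T")
    gc = sequence.count("G") + sequence.count("C")
    return at, gc, len(sequence)


def tm_basic(sequence):
    """Basic melting temperature in Celsius (assumed formulas)."""
    at, gc, n = _counts(sequence)
    if n < 14:
        return 2.0 * at + 4.0 * gc  # Wallace rule for short sequences
    return 64.9 + 41.0 * (gc - 16.4) / n


def tm_salt_adjusted(sequence, salt_mM=115.0):
    """Salt-adjusted melting temperature in Celsius (assumed formulas)."""
    at, gc, n = _counts(sequence)
    salt_term = 16.6 * math.log10(salt_mM / 1000.0)  # mM converted to molar
    if n < 14:
        return 2.0 * at + 4.0 * gc - 16.6 * math.log10(0.050) + salt_term
    return 100.5 + 41.0 * gc / n - 820.0 / n + salt_term
```

For a read whose viral portion covers local coordinates 1-30 and whose host portion covers 26-50, `overlap_length((1, 30), (26, 50))` is 5.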
HELP
If you have any questions or comments, please contact info@jbs-science.com.
Coming soon: translocation mode, which seeks to find read rearrangements in the host.