SOLiD™ System Small RNA Analysis Pipeline Tool (RNA2MAP)
Description
These procedures are outlined in the diagram below:
In the filtering step, all reads are mapped against a reference containing by-product sequences (P2 adaptor, P1 and P2 primers; reference provided). Matching is based on 15 colors of each read, with the first color and the last 19 colors masked, allowing 0 or 1 mismatch. The output of this step is saved in the byproduct directory. Reads that do not match this reference are then mapped against a reference containing rRNA, tRNA, and repeated regions of the human genome (reference provided and stored in /RNA_pipeline_v0.3/reference/). Matching is based on the first 20 colors of each read, allowing 0 or 1 mismatch. The output of this step is saved in the filtered directory. Reads that match neither reference are passed to the next step.
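The masked matching used in the by-product filter can be sketched as follows. This is a minimal illustration, not the pipeline's actual mapper: the function name `find_masked_hits` and its parameters are hypothetical, and a naive scan over the reference stands in for whatever indexing the real tool uses. Only the window (skip 1 color, compare 15, ignore the last 19) and the mismatch limit come from the description above.

```python
def find_masked_hits(read, reference, skip=1, window=15, max_mismatches=1):
    """Return (position, mismatches) for every placement of the read on the
    reference where the unmasked window differs in at most max_mismatches colors.

    read and reference are strings of colors ("0".."3") without the leading
    primer base. Colors before `skip` and after `skip + window` are masked,
    i.e. they are never compared.
    """
    core = read[skip:skip + window]
    hits = []
    # Naive scan (illustrative only): try every start position of the full read.
    for pos in range(len(reference) - len(read) + 1):
        ref_core = reference[pos + skip:pos + skip + window]
        mismatches = sum(a != b for a, b in zip(core, ref_core))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


if __name__ == "__main__":
    read = "30010310300130311122123302010003131"  # 35 colors
    reference = "2222" + read + "2222"            # toy reference for illustration
    print(find_masked_hits(read, reference))
```

Calling the same function with `skip=0, window=20` would correspond to the second filter (first 20 colors, 0 or 1 mismatch).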
A reference based on miRBase sequences (Sanger build 11) is generated for this step by concatenating the miRNA precursor sequences (reference provided and stored in /RNA_pipeline_v0.3/reference/). Reads are first matched on the first 18 colors (the remaining positions are masked), allowing 0 or 1 mismatch. If the library type parameter (-r) is set to “miRNA”, then the matching locations are used as seeds for the subsequent extension step (see Extension section). The results are stored in the miRNA_Sanger_11 directory. Reads that do not match this reference are passed to the next step.
A reference based on refseq sequences (NCBI build 36) is generated for this step by concatenating each refseq sequence (32662) (reference provided and stored in /RNA_pipeline_v0.3/reference/HUMAN_refseq_1188843346_concat_v3_validated.fasta). Reads are first matched on the first 25 colors (the remaining positions are masked), allowing up to 3 mismatches. If the library type parameter (-r) is set to “transcriptome”, then the matching locations are used as seeds for the subsequent extension step (see Extension section). The results are stored in the refseq_36 directory. Reads that do not match this reference are passed to the human genome matching step.
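The sequence of matching stages described so far can be summarized as a cascade in which each read is assigned to the first stage whose reference it matches. The sketch below is illustrative only: `Stage`, `STAGES`, `has_hit`, and `assign_stage` are hypothetical names, the references are toy strings, and the brute-force scan is an assumption rather than the tool's implementation. The window sizes, mismatch limits, and output directory names are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str            # output directory named in the text
    skip: int            # masked colors at the start of the read
    window: int          # number of colors actually compared
    max_mismatches: int  # allowed color mismatches within the window


# Parameters as stated in the description; order is the order of the pipeline.
STAGES = [
    Stage("byproduct", 1, 15, 1),
    Stage("filtered", 0, 20, 1),
    Stage("miRNA_Sanger_11", 0, 18, 1),
    Stage("refseq_36", 0, 25, 3),
]


def has_hit(read, reference, stage):
    """True if the unmasked window of the read occurs in the reference
    with at most stage.max_mismatches mismatches (naive scan)."""
    core = read[stage.skip:stage.skip + stage.window]
    return any(
        sum(a != b for a, b in zip(core, reference[pos:pos + stage.window]))
        <= stage.max_mismatches
        for pos in range(len(reference) - stage.window + 1)
    )


def assign_stage(read, references):
    """Return the name of the first stage the read matches, or None if the
    read should go on to the human genome matching step."""
    for stage in STAGES:
        if has_hit(read, references.get(stage.name, ""), stage):
            return stage.name
    return None
```

Here `references` is assumed to be a dictionary mapping each stage name to its color-space reference string; a read for which `assign_stage` returns `None` would continue to the human genome step.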
Finally, reads not matching any of the previous references are mapped against the human genome reference. A “seeding” step is used again: reads are matched against the human genome reference (NCBI build 36, with one fasta file for each chromosome; reference NOT provided) based on the first 20 colors (the remaining positions are masked), allowing 0 or 1 mismatch. The matching results are used as seeds for an “extension” step.

The matching process starts from the standard output files (.csfasta extension), which are fasta-format files containing the bead location information and the tag’s (color) sequence, for example:

>1379_8_1167_F3
T30010310300130311122123302010003131

The first letter represents the last base of the forward primer. The entire analysis is performed at the color level; conversion to base sequences is applied only to the end results.
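As an illustration of the input format and of the final color-to-base conversion, the sketch below reads csfasta-style records and decodes a color read using the standard SOLiD two-base encoding, in which (numbering A, C, G, T as 0 to 3) each color is the XOR of the two adjacent bases. The function names `parse_csfasta` and `colors_to_bases` are hypothetical, and skipping lines that start with "#" is an assumption about header comments. This is not the pipeline's own converter.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3


def parse_csfasta(lines):
    """Yield (read_id, color_read) pairs from csfasta-style lines.

    Blank lines and lines starting with '#' are skipped (assumed to be
    comments); each '>' header is paired with the sequence line after it.
    """
    read_id = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            read_id = line[1:]
        elif read_id is not None:
            yield read_id, line
            read_id = None


def colors_to_bases(color_read):
    """Decode a color read such as 'T3001...' into bases.

    The first character is the last base of the primer; each following
    color gives the next base as (previous base XOR color). The primer
    base itself is not included in the result.
    """
    prev = BASES.index(color_read[0])
    bases = []
    for color in color_read[1:]:
        prev ^= int(color)
        bases.append(BASES[prev])
    return "".join(bases)


if __name__ == "__main__":
    example = [">1379_8_1167_F3", "T30010310300130311122123302010003131"]
    for read_id, colors in parse_csfasta(example):
        print(read_id, colors_to_bases(colors))
```

For instance, `colors_to_bases("T3001")` yields `"AAAC"`. Because every decoded base depends on all preceding colors, a single color error changes every base after it, which is consistent with performing the analysis in color space and converting only the end results.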
For each seed, the adaptor starting position is estimated as follows. If the adaptor starts at position n, the read (35 colors long) is compared with a “hypothetical” sequence composed of n colors from the reference (with the same start point as the seed) followed by the first 35 – n colors of the adaptor. The read is compared over its full 35-color length with the hypothetical sequence, and the number of mismatches is recorded. The location n0 giving the smallest number of mismatches is taken as the adaptor start point, and that number of mismatches over the full length of the read is associated with the seed. The seeds of a read that achieve the smallest number of mismatches are reported as hit locations of the read. The .bc files contain a column with the explicit fragment length (read with the adaptor trimmed). The .ma and .gff files contain the color/base reads with the adaptor trimmed.
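The extension procedure can be sketched as below. This is a minimal illustration under stated assumptions: the names `estimate_adaptor_start` and `best_seeds` are hypothetical, every n from 0 to the read length is tried (the text does not give the range), ties between values of n are resolved in favor of the smallest n (also not specified in the text), and the adaptor is passed in as a color string of at least read length.

```python
def estimate_adaptor_start(read, reference, seed_start, adaptor):
    """Return (n0, mismatches) for one seed.

    For each candidate n, build the hypothetical sequence: n colors of the
    reference starting at seed_start, followed by the first len(read) - n
    colors of the adaptor. n0 is the n with the fewest mismatches against
    the full read (assumption: the smallest n wins ties).
    """
    read_len = len(read)
    best_n, best_mm = None, None
    for n in range(read_len + 1):
        ref_part = reference[seed_start:seed_start + n]
        adaptor_part = adaptor[:read_len - n]
        if len(ref_part) < n or len(adaptor_part) < read_len - n:
            continue  # not enough reference or adaptor for this n
        hypothetical = ref_part + adaptor_part
        mismatches = sum(a != b for a, b in zip(read, hypothetical))
        if best_mm is None or mismatches < best_mm:
            best_n, best_mm = n, mismatches
    return best_n, best_mm


def best_seeds(read, reference, seed_starts, adaptor):
    """Score every seed of a read and keep those with the fewest mismatches.

    Returns a list of (seed_start, n0, mismatches) tuples.
    """
    scored = []
    for seed_start in seed_starts:
        n0, mismatches = estimate_adaptor_start(read, reference, seed_start, adaptor)
        if n0 is not None:
            scored.append((seed_start, n0, mismatches))
    if not scored:
        return []
    fewest = min(mm for _, _, mm in scored)
    return [hit for hit in scored if hit[2] == fewest]
```

In this sketch, n0 plays the role of the fragment length with the adaptor trimmed: the first n0 colors of the read are attributed to the reference and the rest to the adaptor.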
Licensing
Support
Documentation
Software Download
Sample Data Download

