jueves, 26 de junio de 2014

Week 6: CIGAR and Flags

One of the key features when browsing BAM files is to be able to see the insertion, deletions, and SNPs. For that, not only displaying the base is important, but adding colours help to highlight the features. For that, I had to parse the CIGAR line to know the length of the read and the actual interoperation.

The orientation is stored in one of the flags (0x0010). To validate the orientation, I started adding a method to the sam line which tells if the orientation is forward or not. This makes the code later a bit clearer

 
 has_flag : function (f){ 
        f = parseInt(f);
        return (this.flags & f) == f ;
},
forward: function(){
        return  this.has_flag(16);
}

Then, I got a function that paints the DIV with all the bases inside and applies a CSS style to each one of the divs for each base. The CSS can be edited externally and the functionality should stay. I have to say that I spent a lot of time testing this, as for some wired reason the CSS is not applied at the right time.

build_div: function(){
              var new_div = document.createElement("div");
              new_div.style.height = container.opt.fontSize ;
              new_div.style.position = "absolute";
              n_pos = ( this.pos - 1) * container.opt.base_width;
              new_div.style.left = n_pos + "px";

              if(this.forward()){
                new_div.classList.add("bam_forward");
              }else{
                new_div.classList.add("bam_reverse");
              }
              
              var cigars = this.cigar.replace(/([SIXMND])/g, ":$1,");
              var cigars_array = cigars.split(',');
              var cig_index = 0;
              this.len = 0
              var cig_end  = -1;
              var cig ;
              var key;
              var length;
              var cig_index = 0;
              var last_div;
              changed = true;
              for ( var i = 0; i < this.seq.length; i++ ){
                if(i > cig_end || changed == true){
                  cig = cigars_array[cig_index].split(":"); 
                  key = cig[1];
                  length = parseInt(cig[0]);
                  cig_end = i + length;
                  cig_index +=1
                  changed = false;

                }
                if(key == "M" || key == "X" || key == "="){
                  display_base = this.seq[i];
                  var current_base_span = document.createElement("div");
                  new_div.appendChild(current_base_span);
                  current_base_span.className = "bam_base_" + display_base;
                  current_base_span.style.width = container.opt.base_width + "px";
                  current_base_span.style.cssFloat = "left";
                  current_base_span.appendChild(current_base_span.ownerDocument.createTextNode(display_base));
                  last_div = current_base_span;
                  this.len += 1
                  current_base_span.id = this.len
                }else if(key == "I"){
                  last_div.classList.add("bam_base_I");
                  changed = true;
                }else if(key == "D" || key == "N"){
                  for (var j  = 0; j < length; j ++ ) {
                     display_base =  "*";
                     var current_base_span = document.createElement("div");
                     current_base_span.classList.add("bam_base");
                     current_base_span.classList.add("bam_base_D");
                     current_base_span.style.width = container.opt.base_width + "px";
                     current_base_span.style.cssFloat = "left";
                     current_base_span.appendChild(current_base_span.ownerDocument.createTextNode(display_base));
                     last_div = current_base_span;
                     new_div.appendChild(current_base_span);
                     this.len += 1;
                      current_base_span.id = this.len
                  }
                  changed = true;
                  //cig_index += 1;
                  i--;
                }
              }
              new_div.style.width = container.opt.base_width * this.len + "px"; 
              this.div = new_div;
    }};

Week 5: Dragging around

In week 5 I worked more on the navigation. I ended up learning a lot of the internals of JS and exploring the jQuery library, something I had refused to do, avoiding unnecessary overheads.

For the navigation I want to be able to drag and drop on the region with reads. My naive idea was to simply calculate a div big enough to populate the whole chromosome, and then fill the divs according to whatever region I had on screen. However, I learnt that having more than 10,000 pixels of width, even when empty, just kills the browser.

My next attempt was to make my container div draggable and whenever I let the mouse go, re-render the divs with the offset. This ended up with adding the native JavaScript listeners to my container div:

 
_select_chromosome: function(full_region){
  this._container.empty();
  this.alignments = {};
  this.full_region = this.parse_region(full_region); //New object, to avoid modifying the current region unintentionally.

  var new_div = document.createElement("div");
  
  new_div.addEventListener('dragstart',this._drag_start,false); 
  new_div.addEventListener('dragover',this._drag_over,false); 
  new_div.addEventListener('drop',this._drop,false); 
  new_div.style.width = this.opt.width;
  new_div.draggable = "true";
  new_div.style.position = "absolute";
  new_div.style.overflow = "scroll";
  new_div.style.height = this.opt.height;
  new_div.bam_container = this;
  this._render_div = new_div;    
  this._container.append(new_div);  
}

And then adding the calculations of the offsets from the beginning and the end of the drag

 
_drag_start: function (event) {
    offset_data = event.clientX + ',' +  event.clientY;
    _startX = event.clientX;
    _bam_offset_data= offset_data;
    _tmp_target = event.target;
},

_move_draged: function(render_div, offset, event){

  console.log("From " +  _startX + " to " + event.clientX);
  var diff_x =  event.clientX   - _startX  ;
  var als_rend = render_div.children;
  for(var i = 0; i < als_rend.length; i++) {
    var local_left = parseInt(als_rend[i].offsetLeft , 10);
    var new_pos = local_left + diff_x;
    als_rend[i].style.left = new_pos + "px";
  }
},

_drag_over: function (event) { 
    var offset;
    offset = _bam_offset_data;
    var dm = _tmp_target ;
    var render_div =  _tmp_target ;
    event.preventDefault(); 
    return false; 
} ,
_drop: function (event) { 
    var offset;
    offset = _bam_offset_data;  
    var render_div =  _tmp_target ;
    render_div.bam_container._move_draged(render_div, offset, event);
    render_div.bam_container._move_to_top();
    event.preventDefault();
    return false;
}

However, this approach ended up not looking natural, so I ended up using jQuery instead. Personaly, I don't like the abuse of the $ sign (bad memories with php and perl!). So, I added a call to jQuery to enrich the div. And now it is now smoother.

_select_chromosome: function(full_region){
  this._container.empty();
  this.alignments = {};
  this.full_region = this.parse_region(full_region); //New object, to avoid modifying the current region unintentionally.
 
  var outter_div = document.createElement("div");
  outter_div.style.width = this.opt.width;
  outter_div.style.position = "absolute";
  outter_div.style.overflow = "scroll";
  outter_div.style.height = this.opt.height;
  var new_div = document.createElement("div");
  new_div.classList.add("ui-widget-content");
  jQuery(new_div).draggable({ axis: "x" });
  new_div.bam_container = this;
  this._render_div = new_div;    
  outter_div.appendChild(new_div);
  this._container.append(outter_div);  
}

martes, 10 de junio de 2014

Week 4: Moving around the reads

Now that I have scaffold to start filling the components, I can focus on rendering the actual components. One of my main concerns I have is to keep the rendered stack as short as possible, while not having to calculate all the time on which row the component is going to be.
The test page I got is getting a list of contains and you can select them to send the region to the SAM Viewer. The SAMViewer then generates an appropriate ajax quering the region to be displayed and a two step process is triggered. The first step is to render the floating divs in the container. The divs are only located in the X axis initially.
 
    render_visible: function(){
            var region = this.current_region;
            var start = region.start - this.opt.flanking_cache;
            var canvas = this._render_div;
            if(start < 0){
              start = 1;
            }
            //alert(JSON.stringify(this.alignments));
            for(i = start; i < region.end + this.opt.flanking_cache ; i++){
              if("undefined" !== typeof this.alignments[i]){
                var current_alignments = this.alignments[i];
                for (var j in current_alignments) {
                  aln = current_alignments[j];  
                  if("undefined" === typeof aln.div){ //We don't render  it again if it already exists
                    var new_div = document.createElement("div");
                    new_div.style.width = this.opt.base_width * aln.len + "px"; 
                    new_div.style.height = this.opt.fontSize ;
                    new_div.style.position = "absolute";
                    new_div.style.left = ( i - 1) * this.opt.base_width + "px"; 
                    new_div.style.backgroundColor =   this.opt.default_read_background; 
                    aln.div = new_div;
                    canvas.appendChild(new_div);
                  }
                  
                }
              }
            }
          }
In the second stage, the divs are pushed up as high as possible, without overlapping to each other. Originally I was trying to use some sort of CSS magic, but the floats can´t be customised to this level.Then, I thought on having a matrix with the occupied coordinates and keep it in the back, to know which space is free. Also, I tried to make a massive div, and use the scrollbars from the browser to move around, and just load from the current coordinate. However, the browser does render everything, displayed or not. So, for long chromosomes the solution ended up freezing the web browser.
Luckily, someone else is working in a similar problem and already queried in stack overflow (http://stackoverflow.com/questions/20048688/how-to-float-top). The answer relies on getting the global position, so I had to do a bit more of research on how to get the rectangle of my current div. So, in my case, I calculate the local rectangle.
 
    _move_to_top: function  (){
        var top = 1; // top value for next row
        var margin = 1; // space between rows
        var rows = []; // list of rows
          for (var c = 0; c < this._render_div.children.length; c++) {
              var ok = false;
              var child = this._render_div.children[c];
              var cr = {};
              cr.top = child.offsetTop;
              cr.left = child.offsetLeft;
              cr.right = cr.left + parseInt(child.style.width, 10);
              cr.bottom = cr.top + parseInt(child.style.height, 10);
             
              for (var i = 0; i < rows.length; i++) {
                  if (cr.left > rows[i].right) {
                      rows[i].right = cr.right;
                      child.style.top = rows[i].top + "px";
                      ok = true;
                      break;
                  }
                  if (cr.right < rows[i].left) {
                      rows[i].left = cr.left;
                      child.style.top = rows[i].top + "px";
                      ok = true;
                      break;
                  }
              }
              if (!ok) {
                  // add new row
                  rows.push({
                      top: top,
                      right: cr.right,
                      left: cr.left
                  }); 
                  child.style.top = top + "px";
                  top =  child.offsetTop + parseInt(child.style.height, 10) + margin;
              }
          }
      }
My next step is to enable scrolling around and testing that I don’t leak memory when moving around.

Week 3: Initial layout

My original plan for this week was to extract the code from TGAC’s browser rendering BAM files. I had a look at the source code, but it relied heavily on getting the position of the elements pre-calculated in the web-server. So I decided to use it just as a reference and start my rendering from the scratch. In order to get a better feeling of a real use scenario of how the component will be used, and because I need to compare several BAM files in simultaneously, I decided to start the prototype of a Tablet/IGV-like SAM viewer. Also, I set myself the task of BioJS objects and start to use the communication between them with events. From my point of view, it is similar to the AWT model to handle events, but there is only one way to be sure you understand a concept: coding it.
The BAM viewers I know only display a single alignment file at the time. That is going to be my starting point, however, I’m interested on comparing several alignment files at the same time. So, I will need to be able to have multiple BAMViewer objects in screen at the same time, and synchronised. For that, I’ll need to have some sort of callback when I select from the list of chromosomes the chromosome to be displayed in the main window. With that intent, I “register” all the windows that will have a callback in the Biojs.BAMRegionList object.
    
   add_region_callback: function(component){
     this.callbacks.push(component)
    },

    _region_click: function(region){
     arrayLength= this.callbacks.length;
     for (var i = 0; i < arrayLength; i++) {
      this.callbacks[i].setRegion(region);
     }
    }
Finally, I set up the frames for the divs that I’m going to be testing:

lunes, 9 de junio de 2014

Week 2: SAMtools server



BAM are usually used to store short read alignments. The size of BAM files is usually in the range of the gigabytes. As such, a backend server is required for any practical applications. For the development of a realistic use case of the BioJS BAMViewer. My first open source software is a binder for samtools in ruby (https://github.com/helios/bioruby-samtools). As I know pretty well the library I thought I could use it for my backend. Originally, I intended to use Rails, but at the end I settled up to Sintra(http://www.sinatrarb.com) as the web server.


As I intend to use this service to make a BAM viewer, I implemented the following functions:, I query the list of references (so, you can list all the chromosomes without having them in a database),

  • alignment: Gets the alignments in a region
  • wig: Converts the BAM file to wig, I will use it to display the coverage across the region
  • reference: Returns the string of the region. This way, you don’t need to load a full chromosome.
  • list: gets all the reference chromosome/contigs.

Each service has a function to dispatch its content, but since creating the object containing the sam file is the same across them, I have a function to wrap this behaviour

  

def get_bam(bam,reference)
        return @bam_files[bam] if @bam_files[bam] 
        
        bam_path = "#{self.settings.folder.to_s}/#{bam}.bam"
        reference_path =  "#{self.settings.reference.to_s}/#{reference}"
        return nil unless File.file?(bam_path)
        @bam_files[bam] = Bio::DB::Sam.new( 
            :fasta =>  reference_path,
            :bam => bam_path
        )
        
        return @bam_files[bam] 
      end
Then, each query has a method to dispatch it, the alignment function is:
  
         get '/*/*/alignment' do |ref, bam|
            region = params[:region]
            reg = Bio::DB::Fasta::Region.parse_region(region)
            stream do |out|
              get_bam(bam, ref).fetch(reg.entry, reg.start, reg.end) do |sam|
                out << "#{sam.sam_string}\n"
              end
            end
          end
For the wig output, I didn’t have in the library a function to convert from the SAM file, so I monkey patched the Region class, and with that I can generate wig files with arbitrary bin sizes.
  
    class Bio::DB::Fasta::Region
       def to_wig(opts={})
        step_size = opts[:step_size] ? opts[:step_size] : 1
        out = StringIO.new
        out << "fixedStep chrom=#{self.entry} span=#{step_size}\n"
        return out.string if self.pileup.size == 0
        current_start = self.pileup[0].pos
        current_acc = 0.0
        current_end = current_start + step_size
        self.pileup.each do |pile|

         if pile.pos >= current_end
          out << current_start << " " << (current_acc/step_size).to_s << "\n"
          current_start = pile.pos
          current_acc = 0.0
          current_end = current_start + step_size
         end
         current_acc += pile.coverage
        end
        out.string
       end
      end

Week 1: BAM to JSON


About a year ago, I heard about the BioJS project, I was looking for a component to display multiple sequence alignments. Back then, I could´t find a plugin suitable for me, and I thought I would design and implement mine when the time to develop the web server of my project was in place. As at that point my software wash´t ready to production, I didm´t pursued it further. However, I did read the documentation of how to design the components, so I when applying to the GSoC 2014 I was completely aware of the technicalities. So I decided to start my project as a clone of the BioJS project, making sure it compiles (I´m currently not compiling it for each test, but I test it now and then). The current branch for this protect is located in:  https://github.com/homonecloco/biojs/tree/BAMViewer



I haven´t written javascript since the time when the ajax catchy word and CSS started to change the design. I come from the age of the TD and TR. As such, I decided to start my return to .js with the basics. I´m starting by writing a parser for the SAM format, according to the specification 1.4 ( http://samtools.sourceforge.net). The current parser looks like:

       parse_sam_line: function(sam_line){
          var currentline = sam_line.split("\t");

          var cigar = currentline[5] 

          var obj = {
            qname : currentline[0] ,
            flags : parseInt(currentline[1],10),
            rname : currentline[2] ,
            pos   : parseInt(currentline[3],10) ,
            mapq  : parseInt(currentline[4],10) ,
            cigar : currentline[5] ,
            rnext : currentline[6] ,
            pnext : parseInt(currentline[7],10),
            tlen  : parseInt(currentline[8],10) ,
            seq   : currentline[9] ,
            qual  : currentline[10] ,
            len   : 100, //TODO: change this to use the cigar.  
            has_flag : function (f){return flags & f > 0 }
          };


          /* @is_paired  = (@flag & 0x0001) > 0
              @is_mapped             = @flag & 0x0002 > 0
              @query_unmapped        = @flag & 0x0004 > 0
              @mate_unmapped         = @flag & 0x0008 > 0
              @query_strand          = !(@flag & 0x0010 > 0)
              @mate_strand           = !(@flag & 0x0020 > 0)
              @first_in_pair         = @flag & 0x0040 > 0
              @second_in_pair        = @flag & 0x0080 > 0
              @primary               = !(@flag & 0x0100 > 0)
              @failed_quality        = @flag & 0x0200 > 0
              @is_duplicate          = @flag & 0x0400 > 0*/

              for(var j=12;j < currentline.length;j++){
                var tag = sam_line[j].split(":")

                if (tag[1] == "i"){
                 obj[tag[0]] = parseInt(tag[2]);
               }else if (tag[1 == "f"]){
                obj[tag[0]] = parseFloat(tag[2]);
              }
              else{ 
                obj[tag[0]] = tag[2];
              }
            }
            return obj;
          }

The parser creates a new object, by splitting each line of the SAM by it´s tabs. The numeric values are parsed accordingly and the extra fields are stored in a dictionary at the end. The length of the aligned bases is not in the format, so in the following weeks I´ll return to parse the CIGAR line, which is the field that contains the details of the alignments. Also, the has_flag function is used to validate the binary flags as stated in the documentation.

In order to avoid loading all the SAM file (usually multi-gigabyte files), I’m using ajax to load particular regions. At this point, I’m assuming I have a server which can query in the standard region used by samtools (chromosome:start-end). For testing purposes, it is possible to point to a local sam file. The loader looks like:
     
load_region: function(region){
        reference = this.reference;  
        reg = region.toString; //Format: chromsome:start-end
        jQuery.ajax({
          type: "GET",
          url: this.dataSet,
          data: { region: reg, ref: this.reference } ,
          dataType: "text",
          container: this,
          success: function (data) {
            correct = true
            reads = this.container.parse_sam(data);
            if(reads.length > 0){
              this.container.add_alignments(reads);
              this.container.render_visible();
              this.container._move_to_top();
            } else {
              alert("Unknown format detected");
            }

          },
          error: function (qXHR, textStatus, errorThrown) {
            alert(" Error loading the  SAM File! \n" + textStatus +
      "\n" + errorThrown + "\n" + qXHR );
          }
        });
}

Also, to avoid loading severalties the same coordinate, and enable asynchronous loading of subregions, a caching mechanism keeps a dictionary of positions with all the reads overlapping, and if for some region a read is loaded twice, it only keeps the first read. This is needed, because when you query a region, you get reads in the flanks that overlap to the next region.


 The current code adding to the cache is:

     add_alignments: function(alignments){
            var als = this.alignments;
            var added = 0;
            for(var i = 0; i < alignments.length; i++){
              var aln = alignments[i];
              if("undefined" === typeof als[aln.pos]){
                als[aln.pos] = {}; 
              }
              var current_alignments = als[aln.pos];
              if("undefined" ===  typeof als[aln.pos][aln.qname]){
                added ++;
                als[aln.pos][aln.qname] = aln;
              }
            }
      }

Week 0: Planning



The main goal of the BAM File Viewer for BioJS is to display the NGS alignments of regions in a reference. To be effective, it will be integrated with the BioJS FeatureViewer. The base implementation will be based on the BAM viewer widget in the TGAC Browser, but it is going to be extended to support the missing BAM flags in the current implementation. BAM files are too big to be kept in memory at once. Traditional visualisation programs only keep in memory the region to be displayed. To mimic this behavior, a reference implementation of a web server that extract manageable regions will be implemented in Java on the top of the Java-Genomics-IO API. However, the SAM parsing will be implemented in the client, enabling other server technologies to be used to query the regions in the BAM File.

The tasks to implement are:
  • SAM Parser. From the TAB file to JSON 
  • A minimal web server which queries regions, using the conventions of samtools 
  • Extract the TGAC browser BAM widget from the browser 
  • Design and implement the missing SAM flags 
  • Ιmplement BioJS events API 
  • Integrate with FeatureViewer/BigWig 

So, the overall design of the component is better understood in the following diagram. 

During the GSoC 2014 I will focus on implementing a single reference server in the initial stage.