
List:       bioc-devel
Subject:    Re: [Bioc-devel] Random access to sequences in fasta files
From:       Thomas Dybdal Pedersen <thomasp85 () gmail ! com>
Date:       2015-01-29 16:15:07
Message-ID: 5603EC1D-91BD-4D55-BB01-594CCB828F57 () gmail ! com

Thanks Martin

This was meant as a feature request/discussion for Biostrings, which is why I
posted it here. I thought Biostrings' I/O capabilities were behind most other FASTA
readers on Bioconductor...

/Thomas


> On 29/01/2015 at 15:45, Martin Morgan <mtmorgan@fredhutch.org> wrote:
> 
> > On 01/29/2015 06:41 AM, Thomas Lin Pedersen wrote:
> > Hi
> > 
> > I'm asking whether there are any plans to support random-access reading
> > of FASTA files, in the sense that the indexes of the sequences to read in
> > can be specified up front.
> > I'm working on a package for comparative microbial genomics, and it would be a
> > huge speed improvement if it were possible to quickly read in thousands of sequences
> > distributed over as many files. Currently the proper, vectorised approach requires
> > all files to be read in at once and then subsetted, but this can result in
> > XStringSets in the GB range just to access a few sequences. The slow, un-R way
> > would be to loop through each file (or each sequence, using skip and nrec to
> > read in only the relevant sequences). Ideally, I'm looking for an interface like:
> > readXStringSet(files, rec)
> > 
> > Where rec is either a vector that would index into the XStringSet as if
> > everything from files had been read in, or a list with the same length as files,
> > containing the indexes of interest for each file.
> 
> Hi Thomas -- this should really be posted to support.bioconductor.org, but see
> Rsamtools::FaFile and rtracklayer::TwoBitFile, both of which give access to
> selected sequences through getSeq.
> Martin
> 
> > with best wishes
> > 
> > Thomas
> > _______________________________________________
> > Bioc-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
> 
> 
> -- 
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
> 
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
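
For readers of the archive, the FaFile route mentioned in the quoted reply could be sketched roughly as follows. This is only an illustration under stated assumptions, not a Biostrings feature: read_selected is a hypothetical helper name, the files are assumed to hold DNA with a uniform line width within each record (which the .fai index requires), and scanFaIndex is assumed to list records in file order, which is worth checking against the sequence names.

```r
# Sketch only: assumes Rsamtools and Biostrings are installed.
library(Rsamtools)   # FaFile, indexFa, scanFaIndex, getSeq method for FaFile
library(Biostrings)  # DNAStringSet, writeXStringSet

# Hypothetical helper, not part of any package.
# `files`: character vector of FASTA paths.
# `rec`:   list of integer vectors, one per file, giving the positions of the
#          records wanted from that file.
read_selected <- function(files, rec) {
  stopifnot(length(files) == length(rec))
  seqs <- Map(function(path, idx) {
    # Build the .fai index once; later calls reuse the existing index file.
    if (!file.exists(paste0(path, ".fai"))) indexFa(path)
    fa <- FaFile(path)
    # One range per record, spanning the full sequence.
    # Assumption: the ranges follow the order of records in the file.
    ranges <- scanFaIndex(fa)
    # Only the requested records are read from disk.
    getSeq(fa, ranges[idx])
  }, files, rec)
  # Concatenate the per-file results into a single DNAStringSet.
  do.call(c, unname(seqs))
}
```

A call such as read_selected(c("a.fa", "b.fa"), list(c(1L, 3L), 2L)) (file names made up) would then return the first and third records of the first file followed by the second record of the second file, without loading the remaining sequences.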


