List: bioc-devel
Subject: Re: [Bioc-devel] findOverlaps and mclapply
From: Vincent Carey <stvjc () channing ! harvard ! edu>
Date: 2012-07-07 7:21:06
Message-ID: CAANT7+UvfqvCjZZZVuptDJ2dS6mbVqEw04hPfB-FnPRNfQ17xw () mail ! gmail ! com
Done. See http://wiki.fhcrc.org/bioc/Seattle_Dev_Meeting_2012
On Sat, Jul 7, 2012 at 9:11 AM, Vincent Carey <stvjc@channing.harvard.edu> wrote:
> Let's try to get an overview of these issues together for discussion at
> developer day.
> I will start a wiki node for developer day and drop this exchange within.
>
>
> On Fri, Jul 6, 2012 at 10:33 PM, Michael Lawrence <
> lawrence.michael@gene.com> wrote:
>
>> Hi Kasper,
>>
>> I think Herve was going to look into some optimizations to the lapply
>> behavior over the next release cycle. But I like the idea of leveraging
>> parallelism better. The user would specify options(mc.cores=), so that we
>> could avoid having to pass down more arguments.
>>
>> In the bigger picture, it would be nice if there were some general way for
>> the user to request parallelism. Setting options(mc.cores=) is a nice
>> approach, but it is specific to mclapply. I'm aware that mclapply has the
>> advantage of the slaves inheriting the session from the master, but often
>> more than one computer is available for a task. I admit that in this case
>> of overlap computation, the network overhead may not be worth it. But
>> methods that e.g. operate over a BamFileList might benefit a lot.
>>
>> I wonder if the general clusterApply mechanism could be leveraged. The
>> user would specify a global default cluster using setDefaultCluster().
>> Currently, the cluster implementations beyond mclapply forking are
>> pretty limited. One possible direction would be to implement a parallel
>> package backend using the new CRAN BatchJobs package, which supports
>> submitting jobs to LSF, SSH, etc. clusters. That API is currently
>> asynchronous, so some work would need to be done to make it behave
>> synchronously.
>>
>> Michael
>>
>> On Fri, Jul 6, 2012 at 8:51 AM, Kasper Daniel Hansen <
>> kasperdanielhansen@gmail.com> wrote:
>>
>> > This is about the findOverlaps method for (query = "GenomicRanges",
>> > subject = "GenomicRanges") in GenomicRanges.
>> >
>> > This function is really slow when the set of distinct seqnames is
>> > really big. Example
>> >
>> > library(BSgenome.Amellifera.BeeBase.assembly4)
>> > Un <- Amellifera$GroupUn
>> > gr <- GRanges(seqnames = names(Un),
>> >               ranges = IRanges(start = 1, width = width(Un)))
>> > length(gr)
>> >
>> > ## Only 9244 in length
>> >
>> > system.time(findOverlaps(gr[1], gr))
>> >    user  system elapsed
>> > 297.202   0.021 297.279
>> >
>> > Pretty slow for finding overlaps between a GRanges of length 1 and a
>> > GRanges of length roughly 10000.
>> >
>> > This is because the function essentially does an lapply over distinct
>> > seqnames. I raised this issue a while ago, and Michael said that he
>> > might consider building an IntervalTree over both seqnames and ranges.
>> > So this is really only an issue in organisms with many small contigs.
>> > However, in the meantime, I would appreciate making the function
>> > mclapply-aware, which is pretty simple and at least makes the function
>> > faster by a factor of roughly #cores.
>> >
>> > My fix (which I have been using for a while) adds the arguments
>> > mc.cores = 1, mc.preschedule = TRUE
>> > to the specific method, as well as
>> >
>> > matchMatrix <- do.call(rbind, mclapply(commonSeqnames,
>> >     function(seqnm) {
>> >         <SNIP>
>> >     }, mc.cores = mc.cores,
>> >     mc.preschedule = mc.preschedule))
>> >
>> > inside the method definition. The package also needs to depend on
>> > parallel.
>> >
>> > I am raising this issue because I am submitting a package containing
>> > this fix, and I felt it might be good to propagate. I can commit it
>> > to subversion as well, if there is any interest.
>> >
>> > Best,
>> > Kasper
>> >
>> > _______________________________________________
>> > Bioc-devel@r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
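
For reference, the per-seqname mclapply pattern described in Kasper's quoted
message can be sketched in base R roughly as follows. This is only a minimal
sketch under stated assumptions, not the GenomicRanges method: the function
name find_overlaps_by_seqname, the data-frame inputs (columns seqnames,
start, end) and the naive all-pairs overlap test are hypothetical stand-ins
for the elided method body. The default mc.cores = getOption("mc.cores", 1L)
is also an assumption, combining Kasper's mc.cores = 1 default with Michael's
suggestion of reading options(mc.cores=).

```r
library(parallel)

# Hypothetical helper, NOT the GenomicRanges implementation.
# query and subject are data frames with columns seqnames, start, end.
find_overlaps_by_seqname <- function(query, subject,
                                     mc.cores = getOption("mc.cores", 1L),
                                     mc.preschedule = TRUE) {
  # Only sequences present in both inputs can produce hits.
  commonSeqnames <- intersect(unique(as.character(query$seqnames)),
                              unique(as.character(subject$seqnames)))

  # One task per sequence name; mclapply forks the workers.
  perSeq <- mclapply(commonSeqnames, function(seqnm) {
    qIdx <- which(query$seqnames == seqnm)
    sIdx <- which(subject$seqnames == seqnm)
    # Naive stand-in for the real overlap search: test every pair.
    pairs <- expand.grid(queryHits = qIdx, subjectHits = sIdx)
    keep <- query$start[pairs$queryHits] <= subject$end[pairs$subjectHits] &
      subject$start[pairs$subjectHits] <= query$end[pairs$queryHits]
    as.matrix(pairs[keep, , drop = FALSE])
  }, mc.cores = mc.cores, mc.preschedule = mc.preschedule)

  # Stack the per-sequence hit matrices into one match matrix.
  do.call(rbind, perSeq)
}

# Small usage example with made-up ranges.
query <- data.frame(seqnames = "ctgA", start = 5L, end = 10L)
subject <- data.frame(seqnames = c("ctgA", "ctgA", "ctgB"),
                      start = c(1L, 20L, 5L),
                      end = c(6L, 30L, 10L))
hits <- find_overlaps_by_seqname(query, subject)
```

With mc.cores = 1 the call runs serially, so the same code also works on
platforms without forking; raising mc.cores only helps when there are many
distinct seqnames, which is the case discussed in this thread.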