
List:       bioc-devel
Subject:    Re: [Bioc-devel] findOverlaps and mclapply
From:       Vincent Carey <stvjc () channing ! harvard ! edu>
Date:       2012-07-07 7:21:06
Message-ID: CAANT7+UvfqvCjZZZVuptDJ2dS6mbVqEw04hPfB-FnPRNfQ17xw () mail ! gmail ! com

Done.  See http://wiki.fhcrc.org/bioc/Seattle_Dev_Meeting_2012

On Sat, Jul 7, 2012 at 9:11 AM, Vincent Carey <stvjc@channing.harvard.edu> wrote:

> Let's try to get an overview of these issues together for discussion at
> developer day.
> I will start a wiki node for developer day and drop this exchange within.
>
>
> On Fri, Jul 6, 2012 at 10:33 PM, Michael Lawrence <
> lawrence.michael@gene.com> wrote:
>
>> Hi Kasper,
>>
>> I think Herve was going to look into some optimizations to the lapply
>> behavior over the next release cycle. But I like the idea of leveraging
>> parallelism better. The user would specify options(mc.cores=), so that we
>> could avoid having to pass down more arguments.
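A minimal sketch of the options(mc.cores=) idea, assuming only base R and the parallel package. squareAll() is a hypothetical helper invented for illustration; the point is that it takes no mc.cores argument, because mclapply() falls back to getOption("mc.cores", 2L) when the argument is not supplied.

```r
library(parallel)

# squareAll() is a hypothetical helper, not part of any package.
# It deliberately has no mc.cores argument of its own.
squareAll <- function(x) {
    # mclapply() reads getOption("mc.cores", 2L) when mc.cores is not given,
    # so the user's session-wide setting is picked up here.
    unlist(mclapply(x, function(v) v^2))
}

# The user requests parallelism once, for the whole session.
# 1L keeps this sketch portable; on Unix-alikes a larger value forks workers.
options(mc.cores = 1L)

res <- squareAll(1:4)
print(res)
```

With this pattern nothing has to be threaded through intermediate function signatures, which seems to be the appeal described above.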
>>
>> In the bigger picture, it would be nice if there were some general way for
>> the user to request parallelism. Setting options(mc.cores=) is a nice
>> approach, but it is specific to mclapply. I'm aware that mclapply has the
>> advantage of the slaves inheriting the session from the master, but often
>> more than one computer is available for a task. I admit that in this case
>> of overlap computation, the network overhead may not be worth it.  But
>> methods that e.g. operate over a BamFileList might benefit a lot.
>>
>> I wonder if the general clusterApply mechanism could be leveraged. The user
>> would specify a global default cluster using setDefaultCluster().
>> Currently, the cluster implementations beyond mclapply forking are
>> pretty limited. One possible direction would be to implement a parallel
>> package backend using the new CRAN BatchJobs package, which supports
>> submitting jobs to LSF, SSH, etc, clusters. That API is currently
>> asynchronous, so some work would need to be done to make it behave
>> synchronously.
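A minimal sketch of the default-cluster idea, assuming the parallel package and two local socket workers (the worker count and the toy input are illustrative assumptions). Once a cluster is registered with setDefaultCluster(), clusterApply() and parLapply() can be called with cl = NULL and will use it. Unlike forked mclapply workers, socket workers start as fresh R sessions and do not inherit the master's workspace.

```r
library(parallel)

# Start two local worker processes; in practice these could be remote hosts.
cl <- makeCluster(2L)

# Register the cluster as the session-wide default.
setDefaultCluster(cl)

# With cl = NULL, clusterApply() uses the registered default cluster,
# so library code would not need a cluster argument passed down to it.
lens <- clusterApply(cl = NULL, x = c("a", "bb", "ccc"), fun = nchar)
print(unlist(lens))

# Shut the workers down when finished.
stopCluster(cl)
```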
>>
>> Michael
>>
>> On Fri, Jul 6, 2012 at 8:51 AM, Kasper Daniel Hansen <
>> kasperdanielhansen@gmail.com> wrote:
>>
>> > This is about the findOverlaps method for (query = "GenomicRanges",
>> > subject = "GenomicRanges") in GenomicRanges.
>> >
>> > This function is really slow when the set of distinct seqnames is
>> > really big.  Example
>> >
>> > library(BSgenome.Amellifera.BeeBase.assembly4)
>> > Un <- Amellifera$GroupUn
>> > gr <- GRanges(seqnames = names(Un),
>> >               ranges= IRanges(start = 1 , width = width(Un)))
>> > length(gr)
>> >
>> > ## Only 9244 in length
>> >
>> > system.time(findOverlaps(gr[1], gr))
>> >   user  system elapsed
>> >  297.202   0.021 297.279
>> >
>> > Pretty slow for finding overlaps between a GRanges with length 1 and a
>> > GRanges with length roughly 10000.
>> >
>> > This is because the function essentially does an lapply over distinct
>> > seqnames.  I raised this issue a while ago, and Michael said that he
>> > might consider building an IntervalTree over both seqnames and ranges.
>> >  So this is really only an issue in organisms with many small contigs.
>> >  However, in the meantime, I would appreciate making the function
>> > mclapply-aware, which is pretty simple and at least makes the function
>> > faster by a factor of roughly #cores.
>> >
>> > My fix (which I have been using for a while) adds
>> >   mc.cores = 1, mc.preschedule = TRUE
>> > to the specific method as well as
>> >
>> >         matchMatrix <- do.call(rbind, mclapply(commonSeqnames,
>> >                                                function(seqnm) {
>> > <SNIP>
>> >                                                }, mc.cores = mc.cores,
>> > mc.preschedule = mc.preschedule))
>> >
>> > inside the method definition.  Also needed is making the package
>> > depend on parallel.
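The pattern described above can be sketched as follows. This is not the GenomicRanges method (its body is snipped in the message); it is a toy stand-in that assumes intervals held in plain data frames and a brute-force overlap test per sequence name. overlapsBySeqname() and the example data are hypothetical. Only the mclapply() call, its mc.cores and mc.preschedule arguments, and the do.call(rbind, ...) step mirror the fix.

```r
library(parallel)

# Hypothetical toy data: plain data frames standing in for GRanges objects.
query <- data.frame(seqnames = c("chr1", "chr2"),
                    start = c(5L, 1L), end = c(15L, 3L),
                    stringsAsFactors = FALSE)
subject <- data.frame(seqnames = c("chr1", "chr1", "chr2"),
                      start = c(1L, 20L, 2L), end = c(10L, 30L, 8L),
                      stringsAsFactors = FALSE)

# overlapsBySeqname() is a made-up helper for illustration only.
overlapsBySeqname <- function(query, subject,
                              mc.cores = 1L, mc.preschedule = TRUE) {
    commonSeqnames <- intersect(unique(query$seqnames),
                                unique(subject$seqnames))
    # One independent task per sequence name, so the tasks can run in parallel.
    hits <- mclapply(commonSeqnames, function(seqnm) {
        qIdx <- which(query$seqnames == seqnm)
        sIdx <- which(subject$seqnames == seqnm)
        # Brute force: test every query/subject pair on this sequence.
        pairs <- expand.grid(query = qIdx, subject = sIdx)
        keep <- query$start[pairs$query] <= subject$end[pairs$subject] &
                subject$start[pairs$subject] <= query$end[pairs$query]
        cbind(query = pairs$query[keep], subject = pairs$subject[keep])
    }, mc.cores = mc.cores, mc.preschedule = mc.preschedule)
    matchMatrix <- do.call(rbind, hits)
    # do.call(rbind, list()) is NULL when no sequence names are shared.
    if (is.null(matchMatrix))
        matchMatrix <- cbind(query = integer(0), subject = integer(0))
    matchMatrix
}

matchMatrix <- overlapsBySeqname(query, subject)
print(matchMatrix)
```

Defaulting mc.cores to 1 keeps the serial behaviour unless the caller asks for more cores, and it also keeps the sketch usable on platforms where forking is unavailable.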
>> >
>> > I am raising this issue because I am submitting a package containing
>> > this fix, and I felt it might be good to propagate.  I can commit it
>> > to subversion as well, if there is any interest.
>> >
>> > Best,
>> > Kasper
>> >
>> > _______________________________________________
>> > Bioc-devel@r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> >
>
>
