List:       bioc-devel
Subject:    Re: [Bioc-devel] merging DFrames
From:       Laurent Gatto <laurent.gatto () uclouvain ! be>
Date:       2020-10-21 18:22:24
Message-ID: DBBPR03MB5272C506873330A59B6567FA911C0 () DBBPR03MB5272 ! eurprd03 ! prod ! outlook ! com

Thank you both - issue has just been opened.

Thank you, Hervé, for pointing out the direct use of the `List()` constructor.

Laurent

________________________________________
From: Michael Lawrence <lawrence.michael@gene.com>
Sent: 21 October 2020 19:13
To: Pages, Herve
Cc: Laurent Gatto; bioc-devel@r-project.org
Subject: Re: [Bioc-devel] merging DFrames

Laurent,

Thanks for bringing this up and offering to help. Yes, please raise an issue. There's
an opportunity to implement faster matching than base::merge(), using stuff like
matchIntegerQuads(), findMatches(), and grouping().

grouping() can be really fast for character vectors, since it takes advantage of
string interning. For example, let's say you're merging on three character
vector keys. Concatenate the keys of 'y' onto the keys of 'x'. Then call
grouping(k1, k2, k3) on the concatenated keys: rows that land in the same group
have identical keys, so you effectively have a matching. Should be way faster than
the paste() approach used by base::merge(). Would be interesting to see.
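
To make the idea concrete, here is a minimal sketch in base R. It is not code from S4Vectors: match_keys() is a hypothetical helper name, and it assumes the keys arrive as two lists of equal-length character vectors. It only returns, for each row of 'x', the first row of 'y' with the same keys, so a real merge() would still need to handle many-to-many matches and the 'all' arguments.

```r
# Hypothetical helper (not part of S4Vectors): match rows of 'y' to rows of 'x'
# on several key columns with base::grouping() instead of paste().
# xkeys, ykeys: lists of key vectors, one element per key column.
match_keys <- function(xkeys, ykeys) {
  nx <- length(xkeys[[1L]])
  ny <- length(ykeys[[1L]])
  # Concatenate the keys of 'y' onto the keys of 'x', column by column.
  keys <- Map(c, xkeys, ykeys)
  # grouping() returns a permutation that makes identical key combinations
  # adjacent; the "ends" attribute marks where each group stops.
  g <- do.call(grouping, unname(keys))
  ends <- attr(g, "ends")
  # Give every position the id of the group it belongs to.
  grp <- integer(nx + ny)
  grp[as.integer(g)] <- rep.int(seq_along(ends), diff(c(0L, ends)))
  # Rows with the same group id have identical keys, so matching the ids
  # is equivalent to matching the keys themselves.
  match(grp[seq_len(nx)], grp[nx + seq_len(ny)])
}

# Example with two character keys.
x_keys <- list(k1 = c("a", "b", "c"), k2 = c("u", "v", "w"))
y_keys <- list(k1 = c("c", "a", "z"), k2 = c("w", "u", "u"))
hits <- match_keys(x_keys, y_keys)
```

Here 'hits' has one entry per row of 'x', with NA where no row of 'y' shares the keys. Whether this actually beats the paste() approach would need to be benchmarked.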

Michael

On Wed, Oct 21, 2020 at 9:37 AM Pages, Herve <hpages@fredhutch.org> wrote:

Hi Laurent,

I think the current implementation was just an expedient to have
something that works (in most cases). I don't know if a proper
implementation that doesn't go thru data.frame is on the TODO list. Michael?

I suggest you open an issue on GitHub under S4Vectors.

Cheers,
H.

PS: Note that you can pass the list elements directly to the List()
constructor, no need to construct an ordinary list first:

   List(1, 1:2, 1:3)  # same as List(list(1, 1:2, 1:3))


On 10/21/20 08:35, Laurent Gatto wrote:
> When merging DFrame instances, the *List types are lost:
> 
> The following two instances have NumericList columns (y and z):
> d1 <- DataFrame(x = letters[1:3], y = List(list(1, 1:2, 1:3)))
> d2 <- DataFrame(x = letters[1:3], z = List(list(1:3, 1:2, 1)))
> 
> d1
> ## DataFrame with 3 rows and 2 columns
> ##             x             y
> ##   <character> <NumericList>
> ## 1           a             1
> ## 2           b           1,2
> ## 3           c         1,2,3
> 
> These are, however, converted to list when merged:
> 
> merge(d1, d2, by = "x")
> ## DataFrame with 3 rows and 3 columns
> ##             x      y      z
> ##   <character> <list> <list>
> ## 1           a      1  1,2,3
> ## 2           b    1,2    1,2
> ## 3           c  1,2,3      1
> 
> Looking at merge,DataTable,DataTable (from which merge,DFrame,DFrame inherits), this
> makes sense given that they are converted to data.frames, merged with
> merge,data.frame,data.frame and the result is coerced back to DFrame:
> > getMethod("merge", c("DataTable", "DataTable"))
> Method Definition:
> 
> function (x, y, ...)
> {
>     .local <- function (x, y, by, ...)
>     {
>         if (is(by, "Hits")) {
>             return(.mergeByHits(x, y, by, ...))
>         }
>         as(merge(as(x, "data.frame"), as(y, "data.frame"), by,
>             ...), class(x))
>     }
>     .local(x, y, ...)
> }
> <bytecode: 0x556dd0032ca8>
> <environment: namespace:S4Vectors>
> 
> Signatures:
>         x           y
> target  "DataTable" "DataTable"
> defined "DataTable" "DataTable"
> 
> I would like not to lose the *List classes in the individual DFrames.
> 
> Am I missing something? Is this something that is on the todo list, or that I could
> help with?
> Best wishes,
> 
> Laurent
> 
> 
> _______________________________________________
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>  

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



--
Michael Lawrence
Senior Scientist, Data Science and Statistical Computing
Genentech, A Member of the Roche Group
Office +1 (650) 225-7760
michafla@gene.com


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
