List: r-help
Subject: Re: [R] Split
From: Val <valkremk () gmail ! com>
Date: 2020-09-24 1:20:29
Message-ID: CAJOiR6b-iQfzhvBGzOGAY=ZbdArnwN=kft-JUdpFXPXQ-kFZxg () mail ! gmail ! com
Thank you again for your help and for giving me the opportunity to choose
the most efficient method. For a small data set there is no discernible
difference between the approaches. I will carry out a
comparison using the large data set.
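
For anyone repeating the comparison, here is a minimal sketch of the awk
approach quoted below, applied to the five sample rows from the original
post. It assumes a POSIX shell with awk available. The names "sample",
"split_text" and "result" are made up for illustration and are not part of
LMH's script, and index() is swapped in for the regex match as a literal
substring test.

```shell
#!/bin/sh
# Sample rows from the original post, without a header row
# (the quoted script assumes the input has none).
sample='A1 B1 NONE
A1 B1 cf_12
A1 B1 NONE
A2 B2 X2_25
A2 B3 fd_15'

# split_text: hypothetical helper using the same logic as the quoted script.
# index() tests for a literal "_" in the third field, so no regex is involved.
split_text() {
    awk '{
        if (index($3, "_") > 0) {
            split($3, F, "_")
            print $1, $2, 1, F[1], F[2]
        } else {
            print $1, $2, 0, $3, "NULL"
        }
    }'
}

# Run the sample through the helper, then print it under a header row.
result=$(printf '%s\n' "$sample" | split_text)
printf 'ID1 ID2 Y1 X1 X2\n%s\n' "$result"
```

Prefixing the pipeline with "time" on the large file gives a rough figure to
set against system.time() around the R versions.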
On Wed, Sep 23, 2020 at 11:52 AM LMH <lmh_users-groups@molconn.com> wrote:
>
> Below is a script in bash that uses the awk tokenizer to do the work.
>
> This assumes that your input and output delimiter is space. The number of
> consecutive delimiters in the input is not important. This also assumes that the
> input file does not have a header row. That is easy to modify if you want. I always
> keep header rows in my data files as I think that removing them is asking for
> trouble down the road.
> I added a NULL for cases where there is no value for the last field. You could use
> "." if you want.
> You should be able to find how to run this from inside R if you want. You will, of
> course, need a bash environment to run this, so if you are not on Linux you will
> need Cygwin or something similar.
> This should be very fast, but let me know if it needs to be faster. If the X1_X2
> variant occurs less frequently than not, then we should switch the order in which
> the logic evaluates the options.
> LMH
>
>
> #! /bin/bash
>
> # input filename
> input_file=$1
>
> # output filename
> output_file=$2
>
> # make sure the input file exists (exit non-zero so callers can detect the failure)
> if [ ! -f "$input_file" ]; then
>     echo "$input_file cannot be found"
>     exit 1
> fi
>
> # create the output file
> touch "$output_file"
>
> # make sure the output was created (exit non-zero so callers can detect the failure)
> if [ ! -f "$output_file" ]; then
>     echo "$output_file was not created"
>     exit 1
> fi
>
> # write the header row, overwriting anything left over from a previous run
> echo "ID1 ID2 Y1 X1 X2" > "$output_file"
>
> # character to find in the third token
> look_for='_'
>
> # process with awk
> # if the 3rd token contains '_'
> #    split the third token on '_' into F[1] and F[2]
> #    print the first two tokens, the indicator value of 1, and the split fields F[1] and F[2]
> # otherwise,
> #    print the first two tokens, the indicator value of 0, the 3rd token, and NULL
>
> awk -v find_char="$look_for" '{
>     if ($3 ~ find_char) {
>         split($3, F, find_char)
>         print $1, $2, "1", F[1], F[2]
>     }
>     else { print $1, $2, "0", $3, "NULL" }
> }' "$input_file" >> "$output_file"
>
>
> Val wrote:
> > Thank you all for the help!
> >
> > LMH, Yes I would like to see the alternative. I am using this for a
> > large data set and if the alternative is more efficient than this
> > then I would be happy.
> >
> > On Tue, Sep 22, 2020 at 6:25 PM Bert Gunter <bgunter.4567@gmail.com> wrote:
> > >
> > > To be clear, I think Rui's solution is perfectly fine and probably better than
> > > what I offer below. But just for fun, I wanted to do it without the lapply().
> > > Here is one way. I think my comments suffice to explain.
> > > > ## which are the non "_" indices?
> > > > wh <- grep("_",F1$text, fixed = TRUE, invert = TRUE)
> > > > ## paste "_." to these
> > > > F1[wh,"text"] <- paste(F1[wh,"text"],".",sep = "_")
> > > > ## Now strsplit() and unlist() them to get a vector
> > > > z <- unlist(strsplit(F1$text, "_"))
> > > > ## now cbind() to the data frame
> > > > F1 <- cbind(F1, matrix(z, ncol = 2, byrow = TRUE))
> > > > F1
> > >   ID1 ID2   text    1  2
> > > 1  A1  B1 NONE_. NONE  .
> > > 2  A1  B1  cf_12   cf 12
> > > 3  A1  B1 NONE_. NONE  .
> > > 4  A2  B2  X2_25   X2 25
> > > 5  A2  B3  fd_15   fd 15
> > > > ## You can change the names of the 2 columns yourself
> > >
> > > Cheers,
> > > Bert
> > >
> > > Bert Gunter
> > >
> > > "The trouble with having an open mind is that people keep coming along and
> > > sticking things into it."
> > > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> > >
> > >
> > > On Tue, Sep 22, 2020 at 12:19 PM Rui Barradas <ruipbarradas@sapo.pt> wrote:
> > > >
> > > > Hello,
> > > >
> > > > A base R solution with strsplit, like in your code.
> > > >
> > > > F1$Y1 <- +grepl("_", F1$text)
> > > >
> > > > tmp <- strsplit(as.character(F1$text), "_")
> > > > tmp <- lapply(tmp, function(x) if(length(x) == 1) c(x, ".") else x)
> > > > tmp <- do.call(rbind, tmp)
> > > > colnames(tmp) <- c("X1", "X2")
> > > > F1 <- cbind(F1[-3], tmp) # remove the original column
> > > > rm(tmp)
> > > >
> > > > F1
> > > > #  ID1 ID2 Y1   X1 X2
> > > > #1  A1  B1  0 NONE  .
> > > > #2  A1  B1  1   cf 12
> > > > #3  A1  B1  0 NONE  .
> > > > #4  A2  B2  1   X2 25
> > > > #5  A2  B3  1   fd 15
> > > >
> > > >
> > > > Note that cbind dispatches on F1, an object of class "data.frame".
> > > > Therefore it's the method cbind.data.frame that is called and the result
> > > > is also a df, though tmp is a "matrix".
> > > >
> > > >
> > > > Hope this helps,
> > > >
> > > > Rui Barradas
> > > >
> > > >
> > > > At 20:07 on 22/09/20, Rui Barradas wrote:
> > > > > Hello,
> > > > >
> > > > > Something like this?
> > > > >
> > > > >
> > > > > F1$Y1 <- +grepl("_", F1$text)
> > > > > F1 <- F1[c(1, 2, 4, 3)]
> > > > > F1 <- tidyr::separate(F1, text, into = c("X1", "X2"), sep = "_", fill =
> > > > > "right")
> > > > > F1
> > > > >
> > > > >
> > > > > Hope this helps,
> > > > >
> > > > > Rui Barradas
> > > > >
> > > > > At 19:55 on 22/09/20, Val wrote:
> > > > > > Hi All,
> > > > > >
> > > > > > I am trying to create new columns based on the string content of
> > > > > > another column. First I want to identify rows that contain a particular
> > > > > > string. If a row contains it, I want to split the string and create two
> > > > > > variables.
> > > > > >
> > > > > > Here is my sample of data.
> > > > > > F1<-read.table(text="ID1 ID2 text
> > > > > > A1 B1 NONE
> > > > > > A1 B1 cf_12
> > > > > > A1 B1 NONE
> > > > > > A2 B2 X2_25
> > > > > > A2 B3 fd_15 ",header=TRUE,stringsAsFactors=F)
> > > > > > If the variable "text" contains this "_" I want to create an indicator
> > > > > > variable as shown below
> > > > > >
> > > > > > F1$Y1 <- ifelse(grepl("_", F1$text),1,0)
> > > > > >
> > > > > >
> > > > > > Then I want to split that string into two, before "_" and after "_",
> > > > > > and create two variables as shown below
> > > > > > x1= strsplit(as.character(F1$text),'_',2)
> > > > > >
> > > > > > My problem is how to combine this with the original data frame. The
> > > > > > desired output is shown below,
> > > > > >
> > > > > >
> > > > > > ID1 ID2 Y1 X1 X2
> > > > > > A1 B1 0 NONE .
> > > > > > A1 B1 1 cf 12
> > > > > > A1 B1 0 NONE .
> > > > > > A2 B2 1 X2 25
> > > > > > A2 B3 1 fd 15
> > > > > >
> > > > > > Any help?
> > > > > > Thank you.
> > > > > >
> > > > > > ______________________________________________
> > > > > > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > > > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > > > PLEASE do read the posting guide
> > > > > > http://www.R-project.org/posting-guide.html
> > > > > > and provide commented, minimal, self-contained, reproducible code.
> > > > > >
> > > > >
> > > >
> >
> >
>