'Re: Grammars and biological data formats'


List:       perl6-language
Subject:    Re: Grammars and biological data formats
From:       "Fields, Christopher J" <cjfields () illinois ! edu>
Date:       2014-08-16 18:45:24
Message-ID: 6CE887B7-5368-499F-B9BA-0E604B451735 () illinois ! edu

Yes, that looks like an even better option.  I see that this is implemented in p5 as
File::Map, which is a nice portable option.

Chris

> On Aug 16, 2014, at 7:51 AM, "Martin D Kealey" <martin@kurahaupo.gen.nz> wrote:
> 
> 
> Hmmm, what about just implementing mmap-as-string?
> 
> Then, assuming the parsing process is somewhat stream-like, the OS will take
> care of swapping in chunks as you need them. You don't even need anything
> special to support backtracking -- it's just a memory address, after all.
> 
> -Martin
> 
> > On Thu, 14 Aug 2014, Fields, Christopher J wrote:
> > Yeah, I'm thinking of a Cat-like class that would chunkify the data and check for
> > matches.
> >
> > The main reason I would like to stick with a consistent grammar-based approach is
> > that I have seen many instances in BioPerl where a parser is essentially rewritten
> > based on its purpose (full parsing, lazy parsing, indexing of flat files, adding
> > to a persistent data store, etc.).  Having a way to both parse a full grammar and
> > subparse for a specific token/rule is very handy, and when Cat comes around
> > even more so.
> >
> > Chris
> > 
> > Sent from my iPad
> > 
> > > On Aug 14, 2014, at 6:40 AM, "Carl Mäsak" <cmasak@gmail.com> wrote:
> > > 
> > > I was going to pipe in and say that I wouldn't wait around for Cat,
> > > I'd write something that reads chunks and then parses that. It'll be a
> > > bit more code, but it'll work today. But I see you reached that
> > > conclusion already. :)
> > > 
> > > Lately I've found myself writing more and more grammars that parse
> > > just one line of some input. Provided that the same action object gets
> > > attached to the parse each time, that's an excellent place to store
> > > information that you want to persist between lines. Actually, action
> > > objects started to make a whole lot more sense to me after I found
> > > that use case, because it takes on the role of a session/lifetime
> > > object for the parse process itself.
> > > 
> > > // Carl
> > > 
> > > On Wed, Aug 13, 2014 at 3:19 PM, Fields, Christopher J
> > > <cjfields@illinois.edu> wrote:
> > > > On Aug 13, 2014, at 8:11 AM, Christopher Fields <cjfields@illinois.edu> wrote:
> > > > > > On Aug 13, 2014, at 4:50 AM, Solomon Foster <colomon@gmail.com> wrote:
> > > > > > 
> > > > > > On Sat, Aug 9, 2014 at 7:26 PM, Fields, Christopher J
> > > > > > <cjfields@illinois.edu> wrote:
> > > > > > > I have a fairly simple question regarding the feasibility of using
> > > > > > > grammars with commonly used biological data formats.
> > > > > > >
> > > > > > > My main question: if I wanted to parse() or subparse() very large files
> > > > > > > (it is not unheard of for FASTA/FASTQ or similar data files to exceed
> > > > > > > 100’s of GB), would a grammar be the best solution?  Based on what I am
> > > > > > > reading, the semantics appear to be greedy; for instance:
> > > > > > >
> > > > > > > Grammar.parsefile($file)
> > > > > > > 
> > > > > > > appears to be a convenient shorthand for:
> > > > > > > 
> > > > > > > Grammar.parse($file.slurp)
> > > > > > > 
> > > > > > > since Grammar.parse() works on a Str, not an IO::Handle or Buf.  Or am I
> > > > > > > misunderstanding how this could be accomplished?
> > > > > > 
> > > > > > My understanding is it is intended that parsing can work on Cats
> > > > > > (hypothetical lazy strings) but this hasn't been implemented yet
> > > > > > anywhere.
> > > > > > 
> > > > > > --
> > > > > > Solomon Foster: colomon@gmail.com
> > > > > > HarmonyWare, Inc: http://www.harmonyware.com
> > > > > 
> > > > > Yeah, that’s what I recall as well.  I see very little in the specs re: Cat,
> > > > > unfortunately.
> > > > >
> > > > > chris
> > > > 
> > > > Ah, never mind.  I did a search of the IRC channel and found it’s considered
> > > > to be a ‘6.1’ feature:
> > > >
> > > > http://irclog.perlgeek.de/perl6/2014-07-06#i_8978974
> > > > 
> > > > It is mentioned a few times in the specs, I’m guessing based on where it’s
> > > > thought to fit in best.  For the moment the proposal is to run grammar
> > > > parsing on sized chunks of the input data, which might be how Cat would be
> > > > implemented anyway.
> > > >
> > > > chris
> > > > 
> > 

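For reference, Carl's suggestion above (a grammar that parses a single line,
with the same action object attached to every parse) might look roughly like
the following.  This is only a sketch: the FastaLine grammar, the FastaActions
class, and the sample lines are all made up for illustration, and a real FASTA
parser would need to handle more cases (comments, blank lines, sequence data
before any header).

```raku
# Hypothetical grammar for ONE line of FASTA input; all names are illustrative.
grammar FastaLine {
    token TOP      { <header> | <sequence> }
    token header   { '>' $<id>=[\S+] [ \h+ $<desc>=[\N*] ]? }
    token sequence { <[A..Z a..z * \-]>+ }
}

# The actions object outlives each parse, so it can carry state between lines.
class FastaActions {
    has %.seqs;       # id => concatenated sequence seen so far
    has $.current;    # id of the record currently being read

    # Called whenever the 'header' token matches: start a new record.
    method header($/)   { $!current = ~$<id>; %!seqs{$!current} = '' }

    # Called whenever the 'sequence' token matches: append to the open record.
    # (Assumes a header line has already been seen.)
    method sequence($/) { %!seqs{$!current} ~= ~$/ }
}

my $actions = FastaActions.new;

# Made-up sample data; stands in for a lazy reader such as 'reads.fa'.IO.lines.
my @lines = ">seq1 first record", "ACGT", "TTGA", ">seq2", "GGCC";

for @lines -> $line {
    # Same $actions object on every call, so state persists across lines.
    FastaLine.parse($line, :$actions)
        or die "Unparseable line: $line";
}

say $actions.seqs;
```

Only one line is held as a Str at any time, so memory use should not depend on
the file size; whatever accumulates lives in the actions object, which in a
real run would hand each finished record off instead of keeping them all.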

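The "run the grammar on sized chunks" proposal from earlier in the thread could
be sketched along similar lines, assuming the chunk is one FASTA record (a '>'
header line plus the sequence lines that follow it).  The FastaRecord grammar
and the records() helper below are hypothetical names, not part of any existing
module, and the sample data is invented.

```raku
# Hypothetical grammar for one whole FASTA record (header plus sequence lines).
grammar FastaRecord {
    token TOP     { <header> \n [ <seqline> \n? ]+ }
    token header  { '>' $<id>=[\S+] \N* }
    token seqline { <[A..Z a..z * \-]>+ }
}

# Lazily group lines into records: a new record starts at each '>' line.
sub records(@lines) {
    gather {
        my @buf;
        for @lines -> $line {
            # A header line closes the record collected so far.
            if $line.starts-with('>') && @buf {
                take @buf.join("\n");
                @buf = ();
            }
            @buf.push($line) if $line.chars;    # skip empty lines
        }
        take @buf.join("\n") if @buf;           # flush the final record
    }
}

# Made-up sample data; stands in for a lazy reader such as 'reads.fa'.IO.lines.
my @lines = ">seq1 first record", "ACGT", "TTGA", ">seq2", "GGCC";

for records(@lines) -> $record {
    # Each parse sees only one record-sized Str, never the whole input.
    my $m = FastaRecord.parse($record) or die "Unparseable record";
    say ~$m<header><id>, ' ', $m<seqline>.join.chars;
}
```

Chunking on record boundaries sidesteps the problem of a fixed-size chunk
cutting a record in half, at the cost of a small format-specific splitter in
front of the grammar.
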