
List:       sas-l
Subject:    Re: Reading Web logs with SAS
From:       Savian <savian.net () GMAIL ! COM>
Date:       2009-03-31 23:18:38
Message-ID: ef73f7a8-cf3c-439d-b184-171114f27216 () v23g2000pro ! googlegroups ! com

On Mar 31, 4:40 pm, joewhitehu...@GMAIL.COM (Joe Whitehurst) wrote:
> Alan,
>
> Refusal to use cookies is simply treated as measurement error which can be
> calculated because we can count the number of cookieless records.  I forgot
> to mention that this SAS Component Language program can also be used to do
> Web Analytics in real time.  Just pipe the web logs to the program as they
> are generated.  Since everything is done without sorting, web analytics can
> be done "on the fly".
> Joe.
>
>
>
>
>
> On Tue, Mar 31, 2009 at 6:24 PM, Savian <savian....@gmail.com> wrote:
> > On Mar 31, 3:45 pm, joewhitehu...@GMAIL.COM (Joe Whitehurst) wrote:
> > > Alan,
> > > You must have forgotten the challenge I issued to all the MMMMs out
> > > there in SAS-L land a few years ago.  I have a SAS Component Language
> > > program that parses TBs of web logs from multiple servers all intermixed
> > > and determines individual session information in one pass without
> > > sorting the data.  I challenged the MMMMs to come up with a SAS macro
> > > language program that could do the same thing.  Of course all the
> > > efforts by the MMMMs failed.  You can refresh your memory by rereading:
>
> > > Macro Mavens and Innocent Bystanders,
>
> > > I fear the Macro Mavens' challenge might have gotten lost in the
> > > clutter of a thread grown too long, so I will start a new thread and
> > > summarize the challenge in one place.
>
> > > THE CHALLENGE
>
> > > Macro Mavens, listen up.
>
> > > You have an input source containing an unknown number of responses from
> > > an unknown number of individuals (each with a unique identifier known
> > > as a "cookie") arriving in an unknown order.  There could be millions
> > > of individuals who could make hundreds of responses.  Your first task
> > > is to determine the time of the first and last response for each
> > > individual and to group each individual's responses in temporal order.
> > > Your next task is to calculate the amount of time between the first
> > > response and the last response for each individual (session duration)
> > > and to reconstruct the entire path (the essence of web log analytics)
> > > the individual took through the web site and whether he made any
> > > "business" related responses.
> > > If no response has been received from an individual for 30 minutes,
> > > that individual's session is considered closed.  That individual could,
> > > however, return many times and start new sessions, so we want to count
> > > the number of sessions each person has during a 24 hour period.
> > > What makes this a difficult problem is the requirement to accomplish
> > > all the above tasks with one pass of the data.  If you allow multiple
> > > passes of the data then there is not a problem worth mentioning.
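The one-pass, no-sort sessionization described above can be sketched in a few lines. This is a minimal illustration in Python (not Joe's SAS Component Language program): it keys open sessions by cookie in a hash table, applies the 30-minute inactivity timeout, and records duration, path, and per-visitor session counts in a single pass. The tab-separated input layout and field names are assumptions for the example; piping live logs to it (as Joe suggests) just means feeding it `sys.stdin`.

```python
import sys
from datetime import datetime, timedelta

# One-pass sessionization keyed by cookie, 30-minute inactivity timeout.
# Assumed input: one record per line, "cookie<TAB>ISO-timestamp<TAB>url".
TIMEOUT = timedelta(minutes=30)

def sessionize(lines):
    open_sessions = {}   # cookie -> session in progress
    closed = []          # finished sessions
    session_counts = {}  # cookie -> number of sessions in the input

    for line in lines:
        cookie, ts, url = line.rstrip("\n").split("\t")
        t = datetime.fromisoformat(ts)
        s = open_sessions.get(cookie)
        if s is not None and t - s["last"] > TIMEOUT:
            closed.append(s)           # inactivity: close the old session
            s = None
        if s is None:                  # first response, or a returning visitor
            s = {"cookie": cookie, "start": t, "last": t, "path": []}
            open_sessions[cookie] = s
            session_counts[cookie] = session_counts.get(cookie, 0) + 1
        s["last"] = t
        s["path"].append(url)

    closed.extend(open_sessions.values())   # flush still-open sessions
    for s in closed:
        s["duration"] = s["last"] - s["start"]
    return closed, session_counts
```

For real-time use, `closed, counts = sessionize(sys.stdin)` reads records as they arrive; note this sketch assumes records for a given cookie arrive in time order within the stream.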
>
> > > That's it in a nutshell.  Now come on, you macro maven manure movers
> > > (MMMMs), put down your pitchforks and put your thinking caps on.
>
>
> > > On Tue, Mar 31, 2009 at 4:39 PM, Savian <savian....@gmail.com> wrote:
> > > > On Mar 31, 12:03 pm, ohri2...@GMAIL.COM (Ajay ohri) wrote:
> > > > > it depends on the volume of web logs to be parsed. if volume is
> > > > > high, yes SAS can be tweaked to read web logs in a certain way,
> > > > > especially since many of them use similar formats (WordPress,
> > > > > Blogger, and TypePad are three).
>
> > > > > those formats can be checked using the css, .php, and theme editor
> > > > > of the html. tweaking your code for parsing is most painful as you
> > > > > need to tweak the point at which a post begins or ends.
>
> > > > > while perl is considered standard, try using a browser macro in a
> > > > > language developed at www.iopus.com. it records a macro while
> > > > > browsing, the same way excel records a vba macro.
>
> > > > > once you have compiled your main browser file in the .iim format,
> > > > > you can use SAS (or a normal excel VBA macro) to open the iMacros
> > > > > application, run the .iim file, and download to the standard
> > > > > location,
>
> > > > > and you can use google desktop for searching the huge volume of
> > > > > text files downloaded. it uses the same algorithm as google ;)
>
> > > > > www.decisionstats.com
>
> > > > > Rodney Dangerfield - "I haven't spoken to my wife in years. I
> > > > > didn't want to interrupt her."
>
> > > > > On Tue, Mar 31, 2009 at 10:07 AM, Savian <savian....@gmail.com>
> > wrote:
> > > > > > On Mar 30, 4:35 pm, yamira...@YAHOO.COM (Richard Whitehead) wrote:
> > > > > >> someone on another forum posted that web logs can be read
> > > > > >> directly with proc import.  is this true?  btw, i am in a brief
> > > > > >> sas-less period, so i can't actually check for myself. :-)
> > > > > >> anyway, regardless of the answer to the above, is there an easy
> > > > > >> way, i.e. not having to write code in a data step, to read web
> > > > > >> logs with sas?
>
> > > > > >> thanks in advance,
>
> > > > > >> richard whitehead
>
> > > > > > I am unsure if I hit the wrong button or what with my posting on
> > > > > > this issue. Anyway, somewhat of a reprise (I hope it doesn't
> > > > > > appear twice):
>
> > > > > > In a short answer, no, proc import won't get you anywhere close
> > > > > > to what you want.  Reading web logs is bad enough; analyzing them
> > > > > > is a nightmare.  But let's skip the gory details for now.
>
> > > > > > 1. Find a program on the web that does this for you already:
> > > > > > don't reinvent the wheel.
>
> > > > > > ....... really, see # 1....
>
> > > > > > 1. Ok, if you decide to use SAS (and not their web analytics
> > > > > > product), skip the SAS functions and use regular expressions. I
> > > > > > wish I had known more about regex when I dealt with trillions of
> > > > > > bytes of these records from dozens of companies.
> > > > > > 2. Keep in mind that 90% of the records are useless info. Throw
> > > > > > them away immediately. If you have very high volume, use Perl for
> > > > > > pre-processing.
> > > > > > 3. Web logs can vary in the columns used, layout within a single
> > > > > > file, and layout from web server to web server. I have even seen
> > > > > > embedded EOF markers in a log.
> > > > > > 4. Analyzing them is plain hard and fraught with error. There are
> > > > > > so many things on the web that cause inaccuracy that you should
> > > > > > take anything you see with a large grain of salt. Actually, make
> > > > > > that a salt block...
> > > > > > 5. Know thy enemy and narrow the scope. Decide if you need to
> > > > > > read from multiple web server architectures or just one. Is the
> > > > > > layout fixed for what you have to do, or can it vary?
> > > > > > 6. As I tell people, web logs are the second hardest data source
> > > > > > I have ever dealt with (CDRs being #1).
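The regex advice in point 1 might look like this in practice. A minimal sketch in Python rather than SAS (in SAS you would use `prxparse`/`prxmatch` with a similar pattern), assuming the Apache Combined Log Format; the group names are illustrative:

```python
import re

# Regex for the Apache Combined Log Format:
# ip identity user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of named fields, or None for a malformed record."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

sample = ('10.0.0.1 - - [31/Mar/2009:23:18:38 +0000] '
          '"GET /index.html HTTP/1.1" 200 5120 '
          '"http://example.com/" "Mozilla/5.0"')
```

Returning `None` for malformed lines (rather than raising) matters here: as point 3 warns, real logs mix layouts and contain junk, so the parser must skip what it cannot match.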
>
> > > > > > Alan
> > > > > > Savian
>
>
> > > > Web logs are not the same as web blogs.
>
> > > > Web logs from a web server can be in the TBs/day, so simple methods
> > > > of reading them are not useful. 90% of the volume can be tossed, so
> > > > using a low-level language, like Perl, to get rid of that 90% is
> > > > what is required.
>
> > > > There are also interdependencies between records, so simple search
> > > > tools do not provide any significant analysis. The records need to
> > > > be parsed, joined (as best you can, since there are no common links
> > > > between records), then analyzed. This needs to be done in a parallel
> > > > fashion for large volumes.
>
> > > > Now the hard question: how do you do parallel analysis of sessions
> > > > if a user session crosses multiple logs? That's where some of the
> > > > tricks come into play, and why I recommended finding an existing
> > > > parser that handles those issues.
>
> > > > Alan
> > > > Savian
>
>
> > Joe,
>
> > How do you handle it if someone doesn't use cookies? I am curious.
>
> > BTW, this is a serious problem for higher-level languages due to
> > overhead issues and volume. There are ways to compensate, but a good
> > approach is multi-pass processing, with the first pass simply acting as
> > a garbage filter written in a low-level language. Doing everything with
> > parallel processing is another challenge.
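That garbage-filter first pass might look something like this. A sketch in Python (the Perl equivalent is a one-liner); the patterns for static assets and crawler traffic are illustrative assumptions, since what counts as "garbage" depends on the site:

```python
import re

# Hypothetical first-pass filter: drop the bulk of records that carry no
# analytic value (static assets, crawler/monitoring traffic) before the
# expensive parse-and-sessionize pass.
STATIC = re.compile(r'\.(gif|jpe?g|png|css|js|ico)(\?|\s|$)', re.IGNORECASE)
BOTS = re.compile(r'(googlebot|bingbot|health-?check)', re.IGNORECASE)

def keep(line):
    """Return True if the record is worth passing to the full parser."""
    return not (STATIC.search(line) or BOTS.search(line))

def first_pass(lines):
    # Generator: streams, never holds the full log in memory.
    return (line for line in lines if keep(line))
```

Because it streams line by line, this pass can sit directly in a pipeline (`zcat access.log.gz | filter | parser`), which is exactly the role the low-level first pass plays here.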
>
> > Alan
> > Savian
>

Consider creating a 'fake' cookie for the record using the record's IP
address and the user-agent string. You may need to throw a few more
fields in there, but it can help with the issue.
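One way to sketch that fake cookie, in Python: hash the IP and user-agent (plus any extra fields) into a stable fingerprint. The function name and field choices are illustrative, and as a fingerprint it is only approximate: proxies and shared browsers will collide, which is part of the measurement error discussed above.

```python
import hashlib

def fake_cookie(ip, user_agent, extra_fields=()):
    """Synthesize a stable pseudo-cookie for cookieless records by hashing
    the record's IP address and user-agent string (plus any extra fields).
    Identical inputs always map to the same id."""
    key = "|".join((ip, user_agent) + tuple(extra_fields))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]
```

Because the id is deterministic, cookieless records can flow through the same sessionization logic as cookied ones; adding a field such as the date to `extra_fields` keeps fingerprints from persisting forever across days.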

Alan
