
List:       flume-user
Subject:    Flume timestamp partitioning overlaps
From:       Dominik_Hübner <contact () dhuebner ! com>
Date:       2015-07-08 8:23:42
Message-ID: 68E41083-1987-4F64-9407-E273ED1E797E () dhuebner ! com

I am using Cloudera's example source to collect a sample of Twitter's stream,
partitioned by year -> month -> day -> hour:
https://github.com/cloudera/cdh-twitter-example/blob/master/flume-sources/src/main/java/com/cloudera/flume/source/TwitterSource.java


The timestamp of an event is set by:
headers.put("timestamp", String.valueOf(status.getCreatedAt().getTime()));

My agent config:
TwitterAgent.sinks.HDFS.hdfs.path=hdfs://kronos.feeb.co:8020/user/flume/tweets/%Y/%m/%d/%H/
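To make the expected partitioning concrete, here is a minimal sketch of how an epoch-millisecond timestamp header would map to a year/month/day/hour directory. This is not Flume's own code: the class HourBucket and the method bucketFor are hypothetical names, and the time zone is passed in explicitly because the resulting hour depends on it.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class HourBucket {
    // Illustrative stand-in for the %Y/%m/%d/%H escapes in hdfs.path.
    private static final DateTimeFormatter BUCKET =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    // Formats an epoch-millisecond timestamp as a directory suffix in the given zone.
    static String bucketFor(long epochMillis, ZoneId zone) {
        return BUCKET.withZone(zone).format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        long lastSecond = Instant.parse("2015-07-08T07:59:59Z").toEpochMilli();
        long nextHour = Instant.parse("2015-07-08T08:00:00Z").toEpochMilli();
        System.out.println(bucketFor(lastSecond, ZoneOffset.UTC)); // 2015/07/08/07
        System.out.println(bucketFor(nextHour, ZoneOffset.UTC));   // 2015/07/08/08
    }
}
```

Under this mapping, a tweet created at 07:59:59 belongs in the 07 directory and should never appear under 08.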

However, in almost all hours I see at least one record (more often several)
from the last second of the previous hour.

Is there any way to prevent having those overlaps in data? 
Hourly aggregation without dropping data becomes unnecessarily messy due to this.
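For reference, the overlap can be checked per directory with something like the following sketch, which lists the record timestamps that fall outside the hour a directory is supposed to cover. StrayRecords and outsideHour are hypothetical names, the timestamps are assumed to be epoch milliseconds, and the directory hour is assumed to be expressed in UTC.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class StrayRecords {
    // Returns the timestamps whose hour (truncated, UTC) differs from hourStart.
    static List<Long> outsideHour(Instant hourStart, List<Long> epochMillis) {
        return epochMillis.stream()
                .filter(ts -> !Instant.ofEpochMilli(ts)
                        .truncatedTo(ChronoUnit.HOURS)
                        .equals(hourStart))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant hour = Instant.parse("2015-07-08T08:00:00Z");
        List<Long> records = Arrays.asList(
                hour.minusSeconds(1).toEpochMilli(),  // last second of the previous hour
                hour.toEpochMilli(),                  // first instant of this hour
                hour.plusSeconds(3599).toEpochMilli() // last second of this hour
        );
        // Only the first record is reported as a stray.
        System.out.println(outsideHour(hour, records).size()); // 1
    }
}
```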




