List: pgsql-performance
Subject: [PERFORM] Greenplum MapReduce
From: Suvankar Roy <suvankar.roy () tcs ! com>
Date: 2009-07-30 12:36:46
Message-ID: OFE7F51FE9.5B525609-ON65257603.00426DC3-65257603.00442D92 () tcs ! com
Hi all,

Has anybody worked with Greenplum MapReduce?

I am getting an error when I try to execute the Greenplum MapReduce program below, which is written in YAML.

The error is reported at line 7 of the file, which is the first "- INPUT:" entry under DEFINE:

Error: YAML syntax error - found character that cannot start any token
while scanning for the next token, at line 7

Could somebody explain what causes this and suggest a possible solution? The full program follows.

%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_db1
USER: gpadmin
DEFINE:
  - INPUT:
      NAME: doc
      TABLE: documents
  - INPUT:
      NAME: kw
      TABLE: keywords
  - MAP:
      NAME: doc_map
      LANGUAGE: python
      FUNCTION: |
        i = 0
        terms = {}
        # Record the 1-based positions of every term in the document text.
        for term in data.lower().split():
          i = i + 1
          if term in terms:
            terms[term] += ','+str(i)
          else:
            terms[term] = str(i)
        for term in terms:
          yield([doc_id, term, terms[term]])
      OPTIMIZE: STRICT IMMUTABLE
      PARAMETERS:
        - doc_id integer
        - data text
      RETURNS:
        - doc_id integer
        - term text
        - positions text
  - MAP:
      NAME: kw_map
      LANGUAGE: python
      FUNCTION: |
        i = 0
        terms = {}
        for term in keyword.lower().split():
          i = i + 1
          if term in terms:
            terms[term] += ','+str(i)
          else:
            terms[term] = str(i)
        # Second pass, mirroring doc_map, so that i holds the
        # final term count (nterms) for the keyword.
        for term in terms:
          yield([keyword_id, i, term, terms[term]])
      OPTIMIZE: STRICT IMMUTABLE
      PARAMETERS:
        - keyword_id integer
        - keyword text
      RETURNS:
        - keyword_id integer
        - nterms integer
        - term text
        - positions text
  - TASK:
      NAME: doc_prep
      SOURCE: doc
      MAP: doc_map
  - TASK:
      NAME: kw_prep
      SOURCE: kw
      MAP: kw_map
  - INPUT:
      NAME: term_join
      QUERY: |
        SELECT doc.doc_id, kw.keyword_id, kw.term, kw.nterms,
               doc.positions as doc_positions,
               kw.positions as kw_positions
          FROM doc_prep doc INNER JOIN kw_prep kw ON (doc.term = kw.term)
  - REDUCE:
      NAME: term_reducer
      TRANSITION: term_transition
      FINALIZE: term_finalizer
  - TRANSITION:
      NAME: term_transition
      LANGUAGE: python
      PARAMETERS:
        - state text
        - term text
        - nterms integer
        - doc_positions text
        - kw_positions text
      FUNCTION: |
        # state is a ':'-separated list with one slot per keyword term.
        if state:
          kw_split = state.split(':')
        else:
          kw_split = []
          for i in range(0,nterms):
            kw_split.append('')
        for kw_p in kw_positions.split(','):
          kw_split[int(kw_p)-1] = doc_positions
        outstate = kw_split[0]
        for s in kw_split[1:]:
          outstate = outstate + ':' + s
        return outstate
  - FINALIZE:
      NAME: term_finalizer
      LANGUAGE: python
      RETURNS:
        - count integer
      MODE: MULTI
      FUNCTION: |
        if not state:
          return 0
        kw_split = state.split(':')
        previous = None
        for i in range(0,len(kw_split)):
          isplit = kw_split[i].split(',')
          # An empty slot means this keyword term never matched.
          if any(map(lambda x: x == '', isplit)):
            return 0
          adjusted = set(map(lambda x: int(x)-i, isplit))
          if (previous):
            previous = adjusted.intersection(previous)
          else:
            previous = adjusted
        if previous:
          return len(previous)
        return 0
  - TASK:
      NAME: term_match
      SOURCE: term_join
      REDUCE: term_reducer
  - INPUT:
      NAME: final_output
      QUERY: |
        SELECT doc.*, kw.*, tm.count
          FROM documents doc, keywords kw, term_match tm
         WHERE doc.doc_id = tm.doc_id
           AND kw.keyword_id = tm.keyword_id
           AND tm.count > 0
EXECUTE:
  - RUN:
      SOURCE: final_output
      TARGET: STDOUT

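
One thing that may be worth checking, offered as a guess rather than a confirmed diagnosis: YAML scanners commonly raise "found character that cannot start any token" when a line is indented with a tab, because YAML does not allow tab characters for indentation. Other invisible characters picked up by copy and paste (for example a non-breaking space) can cause similar trouble. Below is a minimal Python sketch that lists such characters by line and column. The helper name find_suspect_chars and the sample text are hypothetical and not part of Greenplum or of the program above.

```python
def find_suspect_chars(text):
    """Return (line, column, repr(char)) for tabs, control and non-ASCII characters."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            # Tabs and other control characters are below 32; anything
            # above 126 is DEL or non-ASCII (e.g. a non-breaking space).
            if ord(ch) < 32 or ord(ch) > 126:
                hits.append((lineno, col, repr(ch)))
    return hits


# Hypothetical sample: the second line is indented with a tab.
sample = "DEFINE:\n\t- INPUT:\n      NAME: doc\n"
for lineno, col, ch in find_suspect_chars(sample):
    print("line %d, column %d: %s" % (lineno, col, ch))
```

To check a real file, read its contents into a string and pass that string to the helper. If it reports anything at line 7, replacing the offending character with plain spaces would be the first thing to try.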
Regards,
Suvankar Roy