<?xml version="1.0" encoding="iso-8859-1"?>
<feed xmlns="http://www.w3.org/2005/Atom"
	xml:lang="en">
	<title>PostgreSQL</title>
	<subtitle>pg codings - Frakkle.com</subtitle>
        <link rel="alternate" type="text/html" href="http://frakkle.com/postgresql.php"/>
        <link rel="self" type="application/atom+xml" href="http://frakkle.com/postgresql-atom.xml"/>
	<updated>2008-06-29T02:12:21-07:00</updated>
	<author>
	<name>admin</name>
	<uri>http://frakkle.com/postgresql.php</uri>
	<email>mathew@lifeart.net.au</email>
	</author>
	<id>tag:frakklefrakshome,2008:postgresql</id>
	<generator uri="http://www.pivotlog.net" version="Pivot - 1.40.5: 'Dreadwind'">Pivot</generator>
	<rights>Copyright (c) 2008, Authors of PostgreSQL</rights>
	
	
	
	<entry>
		<title>in response to: Need Help Reducing View Calculations</title>
		<link rel="alternate" type="text/html" href="http://frakkle.com/entry/106/in_response_to_need_help_reduc/postgresql" />
		<updated>2007-11-07T18:21:00-07:00</updated>
		<published>2007-11-07T18:18:00-07:00</published>
		<id>tag:frakklefrakshome,2008:postgresql.106</id>
		<link rel="related" type="text/html" href=""  />
		<summary type="text">I'm posting this via trackback because of a strange posting problem: the layout is strange and the email field is missing. (Yes, I am using a crud browser from here... that could be it.)


I have to agree with Hernan. 


The simplest saving I can see is an extra column to save on the "run time" calculation.  This does not have to be dangerous (i.e. potentially different from the date column you already have): just set up a trigger that updates the extra column whenever the existing date column is inserted or updated.


Another alternative is to use a materialised view.  On Postgres this is a manual process involving setting up a view and rules and, if you want it always up to date, triggers again on the source table to update the materialised view whenever the data changes.


This latter idea would be the fastest - faster than your original view by quite a margin, in fact.  You may then be able to get rid of some (all?) indexes you have already on the existing table - because they would no longer be needed. 


One thing that would be handy, and that I have on my "when I get time" list, is to write some reasonably automatic code to do materialised views without the manual coding (as Oracle does).  People might use these things more if they were a bit simpler to do.


Link to Need Help Reducing View Calculations page: http://www.justatheory.com/computers/databases/postgresql/reducing_view_calculations.html</summary>
        <content type="html" xml:lang="en" xml:base="http://frakkle.com/entry/106/in_response_to_need_help_reduc/postgresql"><![CDATA[
                <p>
I&#39;m posting this via trackback because of a strange posting problem: the layout is strange and the email field is missing. (Yes, I am using a crud browser from here... that could be it.)
</p>
<p>
I have to agree with Hernan. 
</p>
<p>
The simplest saving I can see is an extra column to save on the &quot;run time&quot; calculation.  This does not have to be dangerous (i.e. potentially different from the date column you already have): just set up a trigger that updates the extra column whenever the existing date column is inserted or updated.
</p>
<p>
Another alternative is to use a materialised view.  On Postgres this is a manual process involving setting up a view and rules and, if you want it always up to date, triggers again on the source table to update the materialised view whenever the data changes.
</p>
<p>
This latter idea would be the fastest - faster than your original view by quite a margin, in fact.  You may then be able to get rid of some (all?) indexes you have already on the existing table - because they would no longer be needed. 
</p>
<p>
One thing that would be handy, and that I have on my &quot;when I get time&quot; list, is to write some reasonably automatic code to do materialised views without the manual coding (as Oracle does).  People might use these things more if they were a bit simpler to do.
</p>
<p>
Link to <em>Need Help Reducing View Calculations</em> page: <a target="_blank" href="http://www.justatheory.com/computers/databases/postgresql/reducing_view_calculations.html">http://www.justatheory.com/computers/databases/postgresql/reducing_view_calculations.html</a></p>
		]]></content>
		<author>
			<name>frak</name>
		</author>
	</entry>
	
	
	
	<entry>
		<title>Data Warehouse / Business Intelligence - PostgreSQL or Oracle</title>
		<link rel="alternate" type="text/html" href="http://frakkle.com/entry/105/data_warehouse__business_intel/postgresql" />
		<updated>2007-06-17T16:14:00-07:00</updated>
		<published>2007-06-15T01:48:00-07:00</published>
		<id>tag:frakklefrakshome,2008:postgresql.105</id>
		<link rel="related" type="text/html" href=""  />
		<summary type="text">I've just started a new contract for a Federal department where an old grants management system is being replaced with a PostgreSQL/Java based version.  The sister project (sub project really) is a data warehouse.  The choices were Oracle or PostgreSQL (PostgreSQL was what attracted me to the contract, actually).


The argument that I have been unable to win in putting together a warehouse on PostgreSQL comes down to tool maturity - ie the risk involved in something "not proven" so it is almost certainly going to be built using Oracle Warehouse Builder and a BI tool tbd. 


I have a long history with databases, and experience with PostgreSQL that dates to the beginning of version 7; however, I've never been involved in formal data warehousing or ETL, so I can't speak with the authority I would prefer ;-)


From my research (mostly involving asking Liam at Fujitsu), the PostgreSQL options are Bizgres, which still seems to be at the "not finished" stage, and Jasper.  I have experience with Jasper products from some time back (they looked very good then), so I can't wait to try out JasperETL.


In any case I am likely to do a parallel run of the work with OWB using Jasper/Postgres to see how they go in my own time.  I would welcome any comments on all this - particularly if they can be backed up with real-world datawarehousing projects. 


For those wondering about this blog, family illness has kept me from this for some time, unfortunately... something that is now coming to an end in a good way. 


Cheers, 


Mathew Frank (http://frakkle.com)</summary>
        <content type="html" xml:lang="en" xml:base="http://frakkle.com/entry/105/data_warehouse__business_intel/postgresql"><![CDATA[
                <p>
I&#39;ve just started a new contract for a Federal department where an old grants management system is being replaced with a PostgreSQL/Java based version.  The sister project (sub project really) is a data warehouse.  The choices were Oracle or PostgreSQL (PostgreSQL was what attracted me to the contract actually) 
</p>
<p>
The argument that I have been unable to win in putting together a warehouse on PostgreSQL comes down to tool maturity - ie the risk involved in something &quot;not proven&quot; so it is almost certainly going to be built using Oracle Warehouse Builder and a BI tool tbd. 
</p>
<p>
I have a long history with databases, and experience with PostgreSQL that dates to the beginning of version 7; however, I&#39;ve never been involved in formal data warehousing or ETL, so I can&#39;t speak with the authority I would prefer ;-)
</p>
<p>
From my research (mostly involving asking Liam at Fujitsu), the PostgreSQL options are Bizgres, which still seems to be at the &quot;not finished&quot; stage, and Jasper.  I have experience with Jasper products from some time back (they looked very good then), so I can&#39;t wait to try out JasperETL.
</p>
<p>
In any case I am likely to do a parallel run of the work with OWB using Jasper/Postgres to see how they go in my own time.  I would welcome any comments on all this - particularly if they can be backed up with real-world datawarehousing projects. 
</p>
<p>
For those wondering about this blog, family illness has kept me from this for some time, unfortunately... something that is now coming to an end in a good way. 
</p>
<p>
Cheers, 
</p>
<p>
Mathew Frank (<a target="_blank" href="http://frakkle.com/">http://frakkle.com</a>)</p>
		]]></content>
		<author>
			<name>frak</name>
		</author>
	</entry>
	
	
	
	<entry>
		<title>Working with moving data sets(eg for moving averages) efficiently in PostgreSQL - Pt3</title>
		<link rel="alternate" type="text/html" href="http://frakkle.com/entry/94/working_with_moving_data_setse/postgresql" />
		<updated>2006-04-22T01:51:00-07:00</updated>
		<published>2006-04-22T01:51:00-07:00</published>
		<id>tag:frakklefrakshome,2008:postgresql.94</id>
		<link rel="related" type="text/html" href=""  />
		<summary type="text">So how do I create/return a "rolling set" of data efficiently, instead
of freshly linking 30-100 rows of data to every row of data?  The
first part of the answer is arrays.  These are part of the SQL
standard, and supported in PostgreSQL since last decade.

The second part is how do we do operations on these arrays - there is
no SUM/COUNT/AVG etc for array elements.  And how do we work with
different time-slices of data.  (ie if we have a moving period of
30 days, how do we return a moving average of 21 days or similar?)

If we remember from the first post, 41 seconds is the time to beat
(btw, I forgot to mention before: I am dealing with about 10
years of data for NAB (National Australia Bank, listed on the
ASX)).  The following function, returning the same rows,
takes 3.4 seconds.

Since building the original function, I have changed it to
give me arrays for every column (trading day, open, close, low, high, volume),
and the final runtime is 4.1 seconds.  Note that this function is
specific to my table; however, it is very easy to adapt to different
tables/column sets for your own use.  I could have
made this more generic at the cost of performance, but there did not
seem to be much point for my use, and certainly not just to provide this
example.

So this is the function - at least the new version:
CREATE TYPE dt_stock_set AS
   (stock varchar(10),
    day_id int4,
    day_ids int4[],
    closing numeric[],
    open numeric[],
    low numeric[],
    high numeric[],
    volume numeric[],
    period int4);

CREATE OR REPLACE FUNCTION last_set(code "varchar", period int4, day_max int4, day_min int4)
  RETURNS SETOF dt_stock_set AS
$BODY$
declare
    rtn dt_stock_set%rowtype;
    rs record;
    last_stock varchar(10);
    ar_day_id int4[];
    ar_close numeric[];
    ar_open numeric[];
    ar_high numeric[];
    ar_low numeric[];
    ar_volume numeric[];
    ar_pointer int4;
BEGIN
    last_stock = '';
    FOR rs IN 
        SELECT s.stock, s.day_id, s.closing, s.open, s.high, s.low, s.volume
        FROM stocks AS s
        WHERE (code IS NULL OR s.stock=code) 
            AND (day_max IS NULL OR s.day_id &lt;= day_max)
            AND (day_min IS NULL OR s.day_id &gt;= day_min)
        ORDER BY s.stock ASC, day_id ASC
    LOOP
        IF last_stock != rs.stock THEN
            last_stock := rs.stock;
            ar_day_id := Array[rs.day_id];
            ar_close := Array[rs.closing];
            ar_open := Array[rs.open];
            ar_low := Array[rs.low];
            ar_high := Array[rs.high];
            ar_volume := Array[rs.volume];
        ELSE
        --ar_close := rs.closing || ar_close[1:29] ; --NOTE this does not work in PL hence the loop
            FOR ar_pointer IN REVERSE period .. 2 LOOP
                ar_day_id[ar_pointer] := ar_day_id[ar_pointer-1];
                ar_close[ar_pointer] := ar_close[ar_pointer-1];
                ar_open[ar_pointer] := ar_open[ar_pointer-1];
                ar_low[ar_pointer] := ar_low[ar_pointer-1];
                ar_high[ar_pointer] := ar_high[ar_pointer-1];
                ar_volume[ar_pointer] := ar_volume[ar_pointer-1];
            END LOOP;
            ar_day_id[1] := rs.day_id;
            ar_close[1] := rs.closing;
            ar_open[1] := rs.open;
            ar_low[1] := rs.low;
            ar_high[1] := rs.high;
            ar_volume[1] := rs.volume;
        END IF;
        rtn := (rs.stock, rs.day_id,
ar_day_id, ar_close, ar_open, ar_low, ar_high, ar_volume , period);
        return next rtn;
    END LOOP;
END
$BODY$
  LANGUAGE 'plpgsql' STABLE;

The last part of this is to produce functions that will do my aggregates
for me on the arrays.  The easiest way to do this is to convert the
arrays to SETs (not the most efficient way, but a tradeoff I am happy
with for now, given that the efficiency of the total package is great).
This SET will be returned cached for efficiency and will be
used by all array aggregate functions (for arrays of the same size) for the row:
CREATE OR REPLACE FUNCTION array_to_set(ar_in numeric[]) 
  RETURNS SETOF numeric AS
$BODY$
declare
    rtn numeric;
    ar_pointer int4 = 1;
BEGIN
      --ar_upper := array_upper( ar_in );
    WHILE ( ar_in[ar_pointer] IS NOT NULL ) LOOP
        return next ar_in[ar_pointer];
        ar_pointer := ar_pointer + 1;
    END LOOP;
END
$BODY$
  LANGUAGE 'plpgsql' STRICT IMMUTABLE;

The next part is to create the functions that take the arrays and return the aggregate values (using the SET data internally):
CREATE OR REPLACE FUNCTION sum(numeric[]) 
    RETURNS numeric AS
$BODY$
    SELECT sum( array_to_set )
    FROM array_to_set( $1 )
$BODY$
  LANGUAGE 'sql' STRICT IMMUTABLE;

CREATE  FUNCTION count_array(numeric[]) 
    RETURNS bigint AS
$BODY$
    SELECT count( array_to_set )
    FROM array_to_set( $1 )
$BODY$
  LANGUAGE 'sql' STRICT IMMUTABLE;

CREATE OR REPLACE FUNCTION avg(numeric[]) 
    RETURNS numeric AS
$BODY$
    SELECT avg( array_to_set )
    FROM array_to_set( $1 )
$BODY$
  LANGUAGE 'sql' STRICT IMMUTABLE;

CREATE OR REPLACE FUNCTION stddev(numeric[]) 
    RETURNS numeric AS
$BODY$
    SELECT stddev( array_to_set )
    FROM array_to_set( $1 )
$BODY$
  LANGUAGE 'sql' STRICT IMMUTABLE;
All very simple.  As you can see, they all have the same names as the standard aggregate functions, with the exception of count_array():
I did not want to interfere with count(), given that it is
the only one of these functions that already accepts an array in the first place.

So how do I work with smaller moving sets (i.e. smaller array
sizes)?  The moving arrays are indexed such that index 1 is the
newest value (it corresponds to the current row value).  So
slicing the array to its first 10 elements gives the last 10 rows of
data.  This is done with this notation:

colname[1:10]            --example of returning the array itself
  avg( last_set.closing[1:10] )  --example of using the avg function
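
Because index 1 is always the newest value, taking the front of the array always means "the most recent n rows". The following is a minimal sketch of the same idea outside the database, written in Python purely for illustration; the bollinger helper and the sample prices are hypothetical and are not part of the PostgreSQL functions above. Python lists are 0-based, so closing[1:10] in PostgreSQL corresponds to closing[:10] here, and statistics.stdev is the sample standard deviation, as is PostgreSQL's stddev.

```python
from statistics import mean, stdev


def bollinger(newest_first, days, width=2):
    """Return (low band, moving average, high band) over the most recent `days` values.

    `newest_first` is a list whose first element is the newest value,
    mirroring the array layout described above. Needs at least two values,
    because a sample standard deviation is undefined for a single point.
    """
    recent = newest_first[:days]      # same role as closing[1:days] in PostgreSQL
    middle = mean(recent)
    spread = width * stdev(recent)
    return middle - spread, middle, middle + spread


# Hypothetical closing prices, newest first (made-up numbers).
closing = [14.0, 13.0, 12.0, 11.0, 10.0]

print(bollinger(closing, days=3))     # bands over the 3 most recent rows
print(mean(closing[:5]))              # a longer moving average from the same list
```

Both results come from slices of the one list, which is the point of keeping the longest window and cutting the shorter periods out of it.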

To put this all together, what follows is an
example query that uses the components I have discussed.
It returns the current trading day, the stock's closing price, a 10 day
moving average, Bollinger bands on the 10 day period, a 21 day moving
average, the movement of the stock that day, and the day of the week of
the current trading day.

select last_set.day_ids[1] AS trading_day, last_set.closing[1] AS closing,  
    avg( last_set.closing[1:10] ) moving_average_10,
      avg( last_set.closing[1:10] ) + 2*stddev( last_set.closing[1:10] ) AS bollinger_high_10,
      avg( last_set.closing[1:10] ) - 2*stddev( last_set.closing[1:10] ) AS bollinger_low_10,
      avg( last_set.closing[1:21] ) AS moving_average_21, 
      last_set.closing[1]-last_set.open[1] AS movement,
      extract( dow from date(last_set.day_ids[1]) ) AS day_of_week
  from last_set('NAB', 30, NULL, NULL)
  order by last_set.day_ids[1]


So that's it.  I do hope you find this of value - if not directly, then at least as another way of thinking about things.

http://frakkle.com</summary>
        <content type="html" xml:lang="en" xml:base="http://frakkle.com/entry/94/working_with_moving_data_setse/postgresql"><![CDATA[
                So how do I create/return a "rolling set" of data efficiently, instead
of freshly linking 30-100 rows of data to every row of data?  The
first part of the answer is arrays.  These are part of the SQL
standard, and supported in PostgreSQL since last decade.</p>
<p>
The second part is how do we do operations on these arrays - there is
no SUM/COUNT/AVG etc for array elements.  And how do we work with
different time-slices of data.  (ie if we have a moving period of
30 days, how do we return a moving average of 21 days or similar?)</p>
<p>
If we remember from the first post, 41 seconds is the time to beat
(btw, I forgot to mention before: I am dealing with about 10
years of data for NAB (National Australia Bank, listed on the
ASX)).  The following function, returning the same rows,
takes 3.4 seconds.</p>
<p>
Since building the original function, I have changed it to
give me arrays for every column (trading day, open, close, low, high, volume),
and the final runtime is 4.1 seconds.  Note that this function is
specific to my table; however, it is very easy to adapt to different
tables/column sets for your own use.  I could have
made this more generic at the cost of performance, but there did not
seem to be much point for my use, and certainly not just to provide this
example.</p>
<p>
So this is the function - at least the new version:<br  />
<blockquote><font color="Navy">CREATE TYPE dt_stock_set AS<br  />
   (stock varchar(10),<br  />
    day_id int4,<br  />
    day_ids int4[],<br  />
    closing numeric[],<br  />
    open numeric[],<br  />
    low numeric[],<br  />
    high numeric[],<br  />
    volume numeric[],<br  />
    period int4);</p>
<p>
CREATE OR REPLACE FUNCTION last_set(code "varchar", period int4, day_max int4, day_min int4)<br  />
  RETURNS SETOF dt_stock_set AS<br  />
$BODY$<br  />
declare<br  />
    rtn dt_stock_set%rowtype;<br  />
    rs record;<br  />
    last_stock varchar(10);<br  />
    ar_day_id int4[];<br  />
    ar_close numeric[];<br  />
    ar_open numeric[];<br  />
    ar_high numeric[];<br  />
    ar_low numeric[];<br  />
    ar_volume numeric[];<br  />
    ar_pointer int4;<br  />
BEGIN<br  />
    last_stock = '';<br  />
    FOR rs IN <br  />
        SELECT s.stock, s.day_id, s.closing, s.open, s.high, s.low, s.volume<br  />
        FROM stocks AS s<br  />
        WHERE (code IS NULL OR s.stock=code) <br  />
            AND (day_max IS NULL OR s.day_id &lt;= day_max)<br  />
            AND (day_min IS NULL OR s.day_id &gt;= day_min)<br  />
        ORDER BY s.stock ASC, day_id ASC<br  />
    LOOP<br  />
        IF last_stock != rs.stock THEN<br  />
            last_stock := rs.stock;<br  />
            ar_day_id := Array[rs.day_id];<br  />
            ar_close := Array[rs.closing];<br  />
            ar_open := Array[rs.open];<br  />
            ar_low := Array[rs.low];<br  />
            ar_high := Array[rs.high];<br  />
            ar_volume := Array[rs.volume];<br  />
        ELSE<br  />
        <font color="DarkGreen">--ar_close := rs.closing || ar_close[1:29] ; --NOTE this does not work in PL hence the loop</font><br  />
            FOR ar_pointer IN REVERSE period .. 2 LOOP<br  />
                ar_day_id[ar_pointer] := ar_day_id[ar_pointer-1];<br  />
                ar_close[ar_pointer] := ar_close[ar_pointer-1];<br  />
                ar_open[ar_pointer] := ar_open[ar_pointer-1];<br  />
                ar_low[ar_pointer] := ar_low[ar_pointer-1];<br  />
                ar_high[ar_pointer] := ar_high[ar_pointer-1];<br  />
                ar_volume[ar_pointer] := ar_volume[ar_pointer-1];<br  />
            END LOOP;<br  />
            ar_day_id[1] := rs.day_id;<br  />
            ar_close[1] := rs.closing;<br  />
            ar_open[1] := rs.open;<br  />
            ar_low[1] := rs.low;<br  />
            ar_high[1] := rs.high;<br  />
            ar_volume[1] := rs.volume;<br  />
        END IF;<br  />
        rtn := (rs.stock, rs.day_id,
ar_day_id, ar_close, ar_open, ar_low, ar_high, ar_volume , period);<br  />
        return next rtn;<br  />
    END LOOP;<br  />
END<br  />
$BODY$<br  />
  LANGUAGE 'plpgsql' STABLE;</font><br  />
</blockquote>
The last part of this is to produce functions that will do my aggregates
for me on the arrays.  The easiest way to do this is to convert the
arrays to SETs (not the most efficient way, but a tradeoff I am happy
with for now, given that the efficiency of the total package is great).
This SET will be returned cached for efficiency and will be
used by all array aggregate functions (for arrays of the same size) for the row:<br  />
<blockquote><font color="Navy">CREATE OR REPLACE FUNCTION array_to_set(ar_in numeric[]) <br  />
  RETURNS SETOF numeric AS<br  />
$BODY$<br  />
declare<br  />
    rtn numeric;<br  />
    ar_pointer int4 = 1;<br  />
BEGIN<br  />
  <font color="DarkGreen">    --ar_upper := array_upper( ar_in );</font><br  />
    WHILE ( ar_in[ar_pointer] IS NOT NULL ) LOOP<br  />
        return next ar_in[ar_pointer];<br  />
        ar_pointer := ar_pointer + 1;<br  />
    END LOOP;<br  />
END<br  />
$BODY$<br  />
  LANGUAGE 'plpgsql' STRICT IMMUTABLE;</font><br  />
</blockquote>
The next part is to create the functions that take the arrays and return the aggregate values (using the SET data internally):<br  />
<blockquote><font color="Navy">CREATE OR REPLACE FUNCTION sum(numeric[]) <br  />
    RETURNS numeric AS<br  />
$BODY$<br  />
    SELECT sum( array_to_set )<br  />
    FROM array_to_set( $1 )<br  />
$BODY$<br  />
  LANGUAGE 'sql' STRICT IMMUTABLE;</p>
<p>
CREATE  FUNCTION count_array(numeric[]) <br  />
    RETURNS bigint AS<br  />
$BODY$<br  />
    SELECT count( array_to_set )<br  />
    FROM array_to_set( $1 )<br  />
$BODY$<br  />
  LANGUAGE 'sql' STRICT IMMUTABLE;</p>
<p>
CREATE OR REPLACE FUNCTION avg(numeric[]) <br  />
    RETURNS numeric AS<br  />
$BODY$<br  />
    SELECT avg( array_to_set )<br  />
    FROM array_to_set( $1 )<br  />
$BODY$<br  />
  LANGUAGE 'sql' STRICT IMMUTABLE;</p>
<p>
CREATE OR REPLACE FUNCTION stddev(numeric[]) <br  />
    RETURNS numeric AS<br  />
$BODY$<br  />
    SELECT stddev( array_to_set )<br  />
    FROM array_to_set( $1 )<br  />
$BODY$<br  />
  LANGUAGE 'sql' STRICT IMMUTABLE;</font>
</blockquote><font color="Black">All very simple.  As you can see, they all have the same names as the standard aggregate functions, with the exception of </font><font color="Black">count_array():
I did not want to interfere with count(), given that it is
the only one of these functions that already accepts an array in the first place.</p>
<p>
So how do I work with smaller moving sets (i.e. smaller array
sizes)?  The moving arrays are indexed such that index 1 is the
newest value (it corresponds to the current row value).  So
slicing the array to its first 10 elements gives the last 10 rows of
data.  This is done with this notation:<br  />
</font>
<blockquote><font color="Navy">colname[1:10]            <font color="DarkGreen">--example of returning the array itself</font><br  />
  </font><font color="Navy">avg( last_set.closing[1:10] )  <font color="DarkGreen">--example of using the avg function</font></font><br  />
</blockquote>
<font color="Black">To put this all together, what follows is an
example query that uses the components I have discussed.
It returns the current trading day, the stock's closing price, a 10 day
moving average, Bollinger bands on the 10 day period, a 21 day moving
average, the movement of the stock that day, and the day of the week of
the current trading day.<br  />
</font>
<blockquote><font color="Navy">select last_set.day_ids[1] AS trading_day, last_set.closing[1] AS closing,  <br  />
    avg( last_set.closing[1:10] ) moving_average_10,<br  />
  </font><font color="Navy">    avg( last_set.closing[1:10] ) + 2*stddev( last_set.closing[1:10] ) AS bollinger_high_10,<br  />
  </font><font color="Navy">    avg( last_set.closing[1:10] ) - 2*stddev( last_set.closing[1:10] ) AS bollinger_low_10,<br  />
  </font><font color="Navy">    avg( last_set.closing[1:21] ) AS moving_average_21, <br  />
  </font><font color="Navy">    last_set.closing[1]-last_set.open[1] AS movement,<br  />
  </font><font color="Navy">    extract( dow from date(last_set.day_ids[1]) ) AS day_of_week<br  />
  </font><font color="Navy">from last_set('NAB', 30, NULL, NULL)<br  />
  </font><font color="Navy">order by last_set.day_ids[1]</font><br  />
</blockquote>
<font color="Black"><br  />
So that's it.  I do hope you find this of value - if not directly, then at least as another way of thinking about things.</p>
<p>
<a target="_blank" href="http://frakkle.com">http://frakkle.com</a><br  />
</font></p>
		]]></content>
		<author>
			<name>frak</name>
		</author>
	</entry>
	
	
	
	<entry>
		<title>Working with moving data sets(eg for moving averages) efficiently in PostgreSQL - Pt2</title>
		<link rel="alternate" type="text/html" href="http://frakkle.com/entry/93/working_with_moving_data_setse/postgresql" />
		<updated>2006-04-18T23:10:00-07:00</updated>
		<published>2006-04-18T23:10:00-07:00</published>
		<id>tag:frakklefrakshome,2008:postgresql.93</id>
		<link rel="related" type="text/html" href=""  />
		<summary type="text">So how did I take a 40 second operation - that was still in the simple
stage - and make it efficient?  Before I start, I should give you
some info that you may have guessed from the last post.

Stocks table definition:
CREATE TABLE stocks(
  stock varchar(10) NOT NULL,
  day_id int4 NOT NULL,
  open numeric,
  high numeric,
  low numeric,
  closing numeric,
  volume int4,
  stock_index int4,
  CONSTRAINT pk PRIMARY KEY (stock, day_id)
) WITHOUT OIDS;

The stock_index, stock and day_id columns are indexed.  The
stock_index column is what I used in the 40 second query I showed you
last time.  (Without this it was REALLY slow!  I had to do a
separate query using LIMIT to get the day_id values required to look up
the required rows.)

A picture of what I mean when I say "moving data set":
 |------------15 row set (row 15)-----------|
 |  |------------15 row set (row 16)-----------|
 |  |  |------------15 row set (row 17)-----------|
| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|
 |----10 row set (row 10)----|  |  |  |  |  |  |  |
    |----10 row set (row 11)----|  |  |  |  |  |  |
       |----10 row set (row 12)----|  |  |  |  |  |
          |----10 row set (row 13)----|  |  |  |  |
             |----10 row set (row 14)----|  |  |  |
                |----10 row set (row 15)----|  |  |
                   |----10 row set (row 16)----|  |
                      |----10 row set (row 17)----|
As you can see from the above diagram, the moving set of 10
is part of the moving set of 15 (as indeed is any data set that
involves the last 15 rows or fewer).  So how do we "roll through" the
rows to create a moving data set just
once, instead of linking in another set of rows every time a new period
is needed for another moving average?

Instead of linking 15 rows to every row, then linking 10 rows to every row, how about a function that returns the last n
(15 above) rows in a way that can be sliced up into smaller chunks
(like the last 10 rows): a function that simply adds the current
row values to the beginning of the set and drops the last value off the
end?  That would mean the data itself is only looked up once
(stocks table row-count), and not:

stocks table row-count  *  set1 row-count  *  set2 row-count  * setn row-count...  ( =   a lot of work for the db! )

That is, not 15 times and then 10 times for every row of raw data in the stocks table.
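
A minimal sketch of that add-to-the-front, drop-off-the-end idea, written in Python purely for illustration rather than in PL/pgSQL. The rolling_windows function and the sample prices are hypothetical names and made-up data, not anything taken from the stocks table.

```python
from collections import deque
from statistics import mean


def rolling_windows(values, period):
    """Yield, for each incoming value, the newest-first window of at most `period` values."""
    window = deque(maxlen=period)  # a full deque discards from the far end on appendleft
    for value in values:
        window.appendleft(value)   # the current row goes to the front of the set
        yield list(window)


# Hypothetical closing prices, oldest first (made-up numbers).
closes = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]

for window in rolling_windows(closes, period=4):
    # One pass builds the 4-row set; the 2-row set is just its first two elements.
    print(window, mean(window), mean(window[:2]))
```

Each source row is read once, and every shorter period is cut from the front of the longest window instead of being looked up again.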

This post is getting long again, so I'll tell you exactly how tomorrow - with code.

http://frakkle.com</summary>
        <content type="html" xml:lang="en" xml:base="http://frakkle.com/entry/93/working_with_moving_data_setse/postgresql"><![CDATA[
                So how did I take a 40 second operation - that was still in the simple
stage - and make it efficient?  Before I start, I should give you
some info that you may have guessed from the last post.</p>
<p>
<b>Stocks table definition:</b></p><blockquote><font color="Navy">CREATE TABLE stocks(<br  />
  stock varchar(10) NOT NULL,<br  />
  day_id int4 NOT NULL,<br  />
  open numeric,<br  />
  high numeric,<br  />
  low numeric,<br  />
  closing numeric,<br  />
  volume int4,<br  />
  stock_index int4,<br  />
  CONSTRAINT pk PRIMARY KEY (stock, day_id)<br  />
) WITHOUT OIDS;</font><br  />
</blockquote>
The stock_index, stock and day_id columns are indexed.  The
stock_index column is what I used in the 40 second query I showed you
last time.  (Without this it was REALLY slow!  I had to do a
separate query using LIMIT to get the day_id values required to look up
the required rows.)</p>
<p>
<b>A picture of what I mean when I say "moving data set":</b></p><pre><font color="Brown"> |------------15 row set (row 15)-----------|<br  /> |  |------------15 row set (row 16)-----------|<br  /> |  |  |------------15 row set (row 17)-----------|<br  />| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|<br  /> |----10 row set (row 10)----|  |  |  |  |  |  |  |<br  />    |----10 row set (row 11)----|  |  |  |  |  |  |<br  />       |----10 row set (row 12)----|  |  |  |  |  |<br  />          |----10 row set (row 13)----|  |  |  |  |<br  />             |----10 row set (row 14)----|  |  |  |<br  />                |----10 row set (row 15)----|  |  |<br  />                   |----10 row set (row 16)----|  |<br  />                      |----10 row set (row 17)----|</font><br  /></pre>
As you can see from the above diagram, the moving set of 10
is part of the moving set of 15 (as indeed is any data set that
involves the last 15 rows or fewer).  So how do we "roll through" the
rows to create a moving data set just
once, instead of linking in another set of rows every time a new period
is needed for another moving average?</p>
<p>
Instead of linking 15 rows to every row, then linking 10 rows to every row, how about a function that returns the last <i>n</i>
(15 above) rows in a way that can be sliced up into smaller chunks
(like the last 10 rows): a function that simply adds the current
row values to the beginning of the set and drops the last value off the
end?  That would mean the data itself is only looked up once
(stocks table row-count), and not - </p><blockquote><font color="Navy">stocks table row-count  *  </font><font color="Navy">set1 row-count  *  set2 row-count  * setn row-count...  ( =   a lot of work for the db! )</font><br  />
</blockquote>
 - that is, not 15 times and then 10 times for every row of raw data in the stocks table.</p>
<p>
This post is getting long again, so I'll tell you exactly how tomorrow - with code.</p>
<p>
<a target="_blank" href="http://frakkle.com">http://frakkle.com</a></p>
		]]></content>
		<author>
			<name>frak</name>
		</author>
	</entry>
	
	
	
	<entry>
		<title>Working with moving data sets(eg for moving averages) efficiently in PostgreSQL - Pt1</title>
		<link rel="alternate" type="text/html" href="http://frakkle.com/entry/92/working_with_moving_data_setse/postgresql" />
		<updated>2006-04-18T11:21:00-07:00</updated>
		<published>2006-04-18T11:21:00-07:00</published>
		<id>tag:frakklefrakshome,2008:postgresql.92</id>
		<link rel="related" type="text/html" href=""  />
		<summary type="text">I have recently been working on stock market technical analysis using
PostgreSQL.  One of the hassles is that you want to create
several moving sets of data, for moving averages, standard deviations
(for creating Bollinger bands) and the like.  It is common
to have a short-term moving average and a long-term moving average to
see where they cross.  Using pure SQL, you need to join
each row of stock data to its 10 most recent rows (itself included) from the same
table, and then again to its 21 most recent rows (giving a 10 day moving average with Bollinger bands, and a 21 day moving average).

The following is an example with those two moving data sets - 21 and
10.   Even though indexes are being used, it takes 40 seconds
to run!
SELECT s.stock, s.day_id, s.closing,
    avg(s10.closing) AS ma_10,
    avg(s10.closing) + 2*stddev(s10.closing) AS bollinger_high,
    avg(s10.closing) - 2*stddev(s10.closing) AS bollinger_low,
    avg(s21.closing) AS ma_21
FROM stocks AS s
    INNER JOIN stocks s10
        ON s10.stock = s.stock
        AND s10.stock_index &lt;= s.stock_index
        AND s10.stock_index &gt; s.stock_index - 10
    INNER JOIN stocks s21
        ON s21.stock = s.stock
        AND s21.stock_index &lt;= s.stock_index
        AND s21.stock_index &gt; s.stock_index - 21
GROUP BY s.stock, s.day_id, s.closing;
  
Remember that this is only the beginning.  Each extra period
multiplies the number of joined rows again, slowing things down even
further (a single moving average/dataset was extremely quick).  This is the
query plan:
GroupAggregate  (cost=76987947.62..80822192.41 rows=2316 width=44) (actual time=31699.194..38996.966 rows=2316 loops=1)
  ->  Sort  (cost=76987947.62..77371364.00 rows=153366549 width=44) (actual time=31672.742..32523.684 rows=484095 loops=1)
        Sort Key: s.stock, s.day_id, s.closing
        ->  Nested Loop  (cost=0.00..18700393.29 rows=153366549 width=44) (actual time=82.308..16989.105 rows=484095 loops=1)
              Join Filter: (("inner".stock)::text = ("outer".stock)::text)
              ->  Nested Loop  (cost=0.00..194391.82 rows=595984 width=44) (actual time=82.277..13726.831 rows=48426 loops=1)
                    Join Filter: ((("outer".stock)::text = ("inner".stock)::text) AND ("outer".stock_index > ("inner".stock_index - 21)))
                    ->  Index Scan using pk on stocks s21  (cost=0.00..439.98 rows=2316 width=22) (actual time=38.341..2949.632 rows=2316 loops=1)
                    ->  Index Scan using idx_stock_index on stocks s  (cost=0.00..68.30 rows=772 width=26) (actual time=0.034..2.765 rows=1159 loops=2316)
                          Index Cond: ("outer".stock_index &lt;= s.stock_index)
              ->  Index Scan using idx_stock_index on stocks s10  (cost=0.00..25.91 rows=257 width=22) (actual time=0.007..0.033 rows=10 loops=48426)
                    Index Cond: ((s10.stock_index &lt;= "outer".stock_index) AND (s10.stock_index > ("outer".stock_index - 10)))
Total runtime: 39006.705 ms


So then the answer came to me:  create a PL function that will
return a single moving set of data for each row in a way that will not
blow the query out of the water in terms of required index scans, joins
and sorts.   Why look up the same set of data every time a new
period is added?

Does not a 21 day moving average include the figures required to do a 10 day moving average?

For the details as to how, you will need to read the next installment. ;-)

http://frakkle.com</summary>
        <content type="html" xml:lang="en" xml:base="http://frakkle.com/entry/92/working_with_moving_data_setse/postgresql"><![CDATA[
                I have recently been working on stock market technical analysis using
PostgreSQL.   One of the hassles is that you want to create
several moving sets of data - for moving averages, standard deviations
(for creating Bollinger Bands) and the like.   It is common
to have a short term moving average and a long term moving average to
see where they cross.  In pure SQL this means joining each
row of stock data to its 10 most recent rows from the same
table, and then again to its 21 most recent rows (hence creating a <i>10 day moving average</i> with <i>Bollinger Bands</i>, and a <i>21 day moving average</i>).</p>
<p>
The following is an example with those two moving data sets - 21 and
10.   Even though indexes are being used, it takes about 39 seconds
to run!<br  />
<blockquote><font color="Navy">SELECT s.stock, s.day_id, s.closing,<br  />
    avg(s10.closing) AS ma_10,<br  />
    avg(s10.closing) + 2*stddev(s10.closing) AS bollinger_high,<br  />
    avg(s10.closing) - 2*stddev(s10.closing) AS bollinger_low,<br  />
    avg(s21.closing) AS ma_21<br  />
FROM stocks AS s<br  />
    INNER JOIN stocks s10<br  />
        ON s10.stock = s.stock<br  />
        AND s10.stock_index &lt;= s.stock_index<br  />
        AND s10.stock_index &gt; s.stock_index - 10<br  />
    INNER JOIN stocks s21<br  />
        ON s21.stock = s.stock<br  />
        AND s21.stock_index &lt;= s.stock_index<br  />
        AND s21.stock_index &gt; s.stock_index - 21<br  />
GROUP BY s.stock, s.day_id, s.closing;<br  />
  </font></blockquote>
Remember that this is only the beginning.  Each extra period
multiplies the number of joined rows again, slowing things down even
further (a single moving average/dataset was extremely quick).  This is the
query plan:<br  />
<blockquote><font color="DarkGreen">GroupAggregate  (cost=76987947.62..80822192.41 rows=2316 width=44) (actual time=31699.194..38996.966 rows=2316 loops=1)<br  />
  -&gt;  Sort  (cost=76987947.62..77371364.00 rows=153366549 width=44) (actual time=31672.742..32523.684 rows=484095 loops=1)<br  />
        Sort Key: s.stock, s.day_id, s.closing<br  />
        -&gt;  Nested Loop  (cost=0.00..18700393.29 rows=153366549 width=44) (actual time=82.308..16989.105 rows=484095 loops=1)<br  />
              Join Filter: (("inner".stock)::text = ("outer".stock)::text)<br  />
              -&gt;  Nested Loop  (cost=0.00..194391.82 rows=595984 width=44) (actual time=82.277..13726.831 rows=48426 loops=1)<br  />
                    Join Filter: ((("outer".stock)::text = ("inner".stock)::text) AND ("outer".stock_index &gt; ("inner".stock_index - 21)))<br  />
                    -&gt;  Index Scan using pk on stocks s21  (cost=0.00..439.98 rows=2316 width=22) (actual time=38.341..2949.632 rows=2316 loops=1)<br  />
                    -&gt;  Index Scan using idx_stock_index on stocks s  (cost=0.00..68.30 rows=772 width=26) (actual time=0.034..2.765 rows=1159 loops=2316)<br  />
                          Index Cond: ("outer".stock_index &lt;= s.stock_index)<br  />
              -&gt;  Index Scan using idx_stock_index on stocks s10  (cost=0.00..25.91 rows=257 width=22) (actual time=0.007..0.033 rows=10 loops=48426)<br  />
                    Index Cond: ((s10.stock_index &lt;= "outer".stock_index) AND (s10.stock_index &gt; ("outer".stock_index - 10)))<br  />
Total runtime: 39006.705 ms</font><br  />
</blockquote>
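<p>
The row counts in the plan can be checked with a little arithmetic.  Assuming stock_index runs from 1 to 2316 without gaps (an assumption, since the table definition is not shown here), a row at position <i>i</i> joins to least(i, 21) rows of s21 and to least(i, 10) rows of s10, so the totals are simple sums:</p>
<pre>
```sql
-- Assumes stock_index is contiguous from 1 to 2316 (not stated in the post).
SELECT sum(least(i, 21))                AS rows_after_s21_join,
       sum(least(i, 10) * least(i, 21)) AS rows_after_s10_join
FROM generate_series(1, 2316) AS g(i);
```
</pre>
<p>
These come out to 48426 and 484095, the actual row counts reported for the inner and outer nested loops.  So the 2316 source rows are multiplied by up to 210 (10 x 21) before the sort and the grouping, and a third period would multiply that again by its own length.</p>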
<br  />
So then the answer came to me:  create a PL function that will
return a single moving set of data for each row in a way that will not
blow the query out of the water in terms of required index scans, joins
and sorts.   Why look up the same set of data every time a new
period is added?</p>
<p>
Does not a 21 day moving average include the figures required to do a 10 day moving average?</p>
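<p>
As an aside for anyone on a newer server: PostgreSQL 9.0 and later can express both windows with window functions, which read the table once.  This is a different technique from the PL function described in the next installment.  The sketch below assumes stock_index has no gaps within a stock, so that a frame of 10 rows matches the range test used in the join:</p>
<pre>
```sql
-- Alternative sketch using window functions (PostgreSQL 9.0+), not the PL function approach.
-- Assumes stock_index is contiguous within each stock.
SELECT stock, day_id, closing,
       avg(closing) OVER w10                                AS ma_10,
       avg(closing) OVER w10 + 2 * stddev(closing) OVER w10 AS bollinger_high,
       avg(closing) OVER w10 - 2 * stddev(closing) OVER w10 AS bollinger_low,
       avg(closing) OVER w21                                AS ma_21
FROM stocks
WINDOW w10 AS (PARTITION BY stock ORDER BY stock_index
               ROWS BETWEEN 9 PRECEDING AND CURRENT ROW),
       w21 AS (PARTITION BY stock ORDER BY stock_index
               ROWS BETWEEN 20 PRECEDING AND CURRENT ROW);
```
</pre>
<p>
The band values can differ slightly from the join query above.  There, each of the 10 closing values is repeated once per s21 row, which changes the sample size that stddev() sees; the averages are unaffected.</p>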
<p>
For the details as to how, you will need to read the next installment. ;-)</p>
<p>
<a target="_blank" href="http://frakkle.com">http://frakkle.com</a></p>
		]]></content>
		<author>
			<name>frak</name>
		</author>
	</entry>
	
	
	
</feed>
