22 April 06 - 01:51Working with moving data sets(eg for moving averages) efficiently in PostgreSQL - Pt3
So how to I create/return a "rolling set" of data efficiently, instead of freshly linking 30-100 rows of data to every row of data? The first part of the answer is arrays. These are part of the SQL standard, and supported in PostgreSQL since last decade.The second part is how do we do operations on these arrays - there is no SUM/COUNT/AVG etc for array elements. And how do we work with different time-slices of data. (ie if we have a moving period of 30 days how to we return a moving average of 21 days or similar)
If we remember from the first post - 41 seconds is the time to beat (btw - I forget to mention before: I am dealing with about 10 years of data for NAB (National Australia Bank - listed on the ASX)). The following function - running the same resultant rows - takes 3.4 seconds.
Since building the original function, I have actually changed it to give me arrays for every column (trading day, open, closed, low, high) and the final runtime is 4.1 seconds. It should be noted that this is a specific function, however for different tables/column sets it is very easy to change for your own use. I could have made this more generic at the cost of performance - but there did not seem to be much point for my use - certainly not to provide this example.
So this is the function - at least the new version:
CREATE TYPE dt_stock_set ASThe last part of this is to produce functions that will do my agregates for me on the arrays. The easiest way to do this (though not the most efficient it is a tradeoff I am happy with for now given that the efficiency of the total package is great) is to convert the arrays to SETs. This SET will be returned cached for effiency and will be used by all array (of same size) agregate functions for the row
(stock varchar(10),
day_id int4,
day_ids int4[],
closing numeric[],
open numeric[],
low numeric[],
high numeric[],
volume numeric[],
period int4);CREATE OR REPLACE FUNCTION last_set(code "varchar", period int4, day_max int4, day_min int4)
RETURNS SETOF dt_stock_set AS
$BODY$
declare
rtn dt_stock_set%rowtype;
rs record;
last_stock varchar(10);
ar_day_id int4[];
ar_close numeric[];
ar_open numeric[];
ar_high numeric[];
ar_low numeric[];
ar_volume numeric[];
ar_pointer int4;
BEGIN
last_stock = '';
FOR rs IN
SELECT s.stock, s.day_id, s.closing, s.open, s.high, s.low, s.volume
FROM stocks AS s
WHERE (code IS NULL OR s.stock=code)
AND (day_max IS NULL OR s.day_id <= day_max)
AND (day_min IS NULL OR s.day_id >= day_min)
ORDER BY s.stock ASC, day_id ASC
LOOP
IF last_stock != rs.stock THEN
last_stock := rs.stock;
ar_day_id := Array[rs.day_id];
ar_close := Array[rs.closing];
ar_open := Array[rs.open];
ar_low := Array[rs.low];
ar_high := Array[rs.high];
ar_volume := Array[rs.volume];
ELSE
--ar_close := rs.closing || ar_close[1:29] ; --NOTE this does not work in PL hence the loop
FOR ar_pointer IN REVERSE period .. 2 LOOP
ar_day_id[ar_pointer] := ar_day_id[ar_pointer-1];
ar_close[ar_pointer] := ar_close[ar_pointer-1];
ar_open[ar_pointer] := ar_open[ar_pointer-1];
ar_low[ar_pointer] := ar_low[ar_pointer-1];
ar_high[ar_pointer] := ar_high[ar_pointer-1];
ar_volume[ar_pointer] := ar_volume[ar_pointer-1];
END LOOP;
ar_day_id[1] := rs.day_id;
ar_close[1] := rs.closing;
ar_open[1] := rs.open;
ar_low[1] := rs.low;
ar_high[1] := rs.high;
ar_volume[1] := rs.volume;
END IF;
rtn := (rs.stock, rs.day_id, ar_day_id, ar_close, ar_open, ar_low, ar_high, ar_volume , period);
return next rtn;
END LOOP;
END
$BODY$
LANGUAGE 'plpgsql' STABLE;
CREATE OR REPLACE FUNCTION array_to_set(ar_in numeric[])The next part is to create the functions that take the arrays and return the agregate values (using the SET data internally):
RETURNS SETOF numeric AS
$BODY$
declare
rtn numeric;
ar_pointer int4 = 1;
BEGIN
--ar_upper := array_upper( ar_in );
WHILE ( ar_in[ar_pointer] IS NOT NULL ) LOOP
return next ar_in[ar_pointer];
ar_pointer := ar_pointer + 1;
END LOOP;
END
$BODY$
LANGUAGE 'plpgsql' STRICT IMMUTABLE;
CREATE OR REPLACE FUNCTION sum(numeric[])All very simple. As you can see they all have the same name as the standard agregate functions with the exception of count_array() given that I did not want to intefere with count() - given that it is the only function that already takes an array in the first place.
RETURNS numeric AS
$BODY$
SELECT sum( array_to_set )
FROM array_to_set( $1 )
$BODY$
LANGUAGE 'sql' STRICT IMMUTABLE;CREATE FUNCTION count_array(numeric[])
RETURNS bigint AS
$BODY$
SELECT count( array_to_set )
FROM array_to_set( $1 )
$BODY$
LANGUAGE 'sql' STRICT IMMUTABLE;CREATE OR REPLACE FUNCTION avg(numeric[])
RETURNS numeric AS
$BODY$
SELECT avg( array_to_set )
FROM array_to_set( $1 )
$BODY$
LANGUAGE 'sql' STRICT IMMUTABLE;CREATE OR REPLACE FUNCTION stddev(numeric[])
RETURNS numeric AS
$BODY$
SELECT stddev( array_to_set )
FROM array_to_set( $1 )
$BODY$
LANGUAGE 'sql' STRICT IMMUTABLE;
So how do I work with smaller moving sets (ie smaller array
sizes). The moving arrays are indexed such that index 1 is the
newest value (it corrosponds to the current row value). So
slicing the array to the first 10 slices will give the last 10 rows of
data. This is done by this notation:
colname[1:10] --example of returning the array itselfTo put this all together then, what follows is an example query for you that use the components I have discussed. It returns the last current trading day, stock close price, a 10 day moving average, bollinger bands on the 10 day period, 1 21 day moving average, the movement of the stock that day, and the day of the week of the current trading day.
avg( last_set.closing[1:10] ) --example of using the avg function
select last_set.day_ids[1] AS trading_day, last_set.closing[1] AS closing,
avg( last_set.closing[1:10] ) moving_average_10,
avg( last_set.closing[1:10] ) + 2*stddev( last_set.closing[1:10] ) AS bollinger_high_10,
avg( last_set.closing[1:10] ) - 2*stddev( last_set.closing[1:10] ) AS bollinger_low_10,
avg( last_set.closing[1:21] ) AS moving_average_21,
last_set.closing[1]-last_set.open[1] AS movement,
extract( dow from date(last_set.day_ids[1]) ) AS day_of_week
from last_set('NAB', 30, NULL, NULL)
order by last_set.day_ids[1]
So that's it. I do hope you find this of value - if not directly, then at least as another way of thinking about things.
- postgresql - No comments / No trackback - § ¶