About

My ramblings on SEO hacking and internet marketing, and general techie stuff

Calendar

Archives

01 Apr - 30 Apr 2006

Stuart Pearson
Flights to Geneva
Outback Steakhouse Recipes
Simon Zach
Swimming Pool Builder Sussex

Miscellany

Powered by Pivot - 1.40.1: 'Dreadwind' 
XML: RSS Feed 
XML: Atom Feed 

22 April 06 - 01:51Working with moving data sets(eg for moving averages) efficiently in PostgreSQL - Pt3

So how to I create/return a "rolling set" of data efficiently, instead of freshly linking 30-100 rows of data to every row of data?  The first part of the answer is arrays.  These are part of the SQL standard, and supported in PostgreSQL since last decade.

The second part is how do we do operations on these arrays - there is no SUM/COUNT/AVG etc for array elements.  And how do we work with different time-slices of data.  (ie if we have a moving period of 30 days how to we return a moving average of 21 days or similar)

If we remember from the first post - 41 seconds is the time to beat (btw - I forget to mention before:  I am dealing with about 10 years of data for NAB (National Australia Bank - listed on the ASX)).  The following function - running the same resultant rows - takes 3.4 seconds.

Since building the original function, I have actually changed it to give me arrays for every column (trading day, open, closed, low, high) and the final runtime is 4.1 seconds.   It should be noted that this is a specific function, however for different tables/column sets it is very easy to change for your own use.  I could have made this more generic at the cost of performance - but there did not seem to be much point for my use - certainly not to provide this example.

So this is the function - at least the new version:

CREATE TYPE dt_stock_set AS
   (stock varchar(10),
    day_id int4,
    day_ids int4[],
    closing numeric[],
    open numeric[],
    low numeric[],
    high numeric[],
    volume numeric[],
    period int4);

CREATE OR REPLACE FUNCTION last_set(code "varchar", period int4, day_max int4, day_min int4)
  RETURNS SETOF dt_stock_set AS
$BODY$
declare
    rtn dt_stock_set%rowtype;
    rs record;
    last_stock varchar(10);
    ar_day_id int4[];
    ar_close numeric[];
    ar_open numeric[];
    ar_high numeric[];
    ar_low numeric[];
    ar_volume numeric[];
    ar_pointer int4;
BEGIN
    last_stock = '';
    FOR rs IN
        SELECT s.stock, s.day_id, s.closing, s.open, s.high, s.low, s.volume
        FROM stocks AS s
        WHERE (code IS NULL OR s.stock=code)
            AND (day_max IS NULL OR s.day_id <= day_max)
            AND (day_min IS NULL OR s.day_id >= day_min)
        ORDER BY s.stock ASC, day_id ASC
    LOOP
        IF last_stock != rs.stock THEN
            last_stock := rs.stock;
            ar_day_id := Array[rs.day_id];
            ar_close := Array[rs.closing];
            ar_open := Array[rs.open];
            ar_low := Array[rs.low];
            ar_high := Array[rs.high];
            ar_volume := Array[rs.volume];
        ELSE
        --ar_close := rs.closing || ar_close[1:29] ; --NOTE this does not work in PL hence the loop
            FOR ar_pointer IN REVERSE period .. 2 LOOP
                ar_day_id[ar_pointer] := ar_day_id[ar_pointer-1];
                ar_close[ar_pointer] := ar_close[ar_pointer-1];
                ar_open[ar_pointer] := ar_open[ar_pointer-1];
                ar_low[ar_pointer] := ar_low[ar_pointer-1];
                ar_high[ar_pointer] := ar_high[ar_pointer-1];
                ar_volume[ar_pointer] := ar_volume[ar_pointer-1];
            END LOOP;
            ar_day_id[1] := rs.day_id;
            ar_close[1] := rs.closing;
            ar_open[1] := rs.open;
            ar_low[1] := rs.low;
            ar_high[1] := rs.high;
            ar_volume[1] := rs.volume;
        END IF;
        rtn := (rs.stock, rs.day_id, ar_day_id, ar_close, ar_open, ar_low, ar_high, ar_volume , period);
        return next rtn;
    END LOOP;
END
$BODY$
  LANGUAGE 'plpgsql' STABLE;

The last part of this is to produce functions that will do my agregates for me on the arrays.  The easiest way to do this (though not the most efficient it is a tradeoff I am happy with for now given that the efficiency of the total package is great) is to convert the arrays to SETs.  This SET will be returned cached for effiency and will be used by all array (of same size) agregate functions for the row
CREATE OR REPLACE FUNCTION array_to_set(ar_in numeric[])
  RETURNS SETOF numeric AS
$BODY$
declare
    rtn numeric;
    ar_pointer int4 = 1;
BEGIN
    --ar_upper := array_upper( ar_in );
    WHILE ( ar_in[ar_pointer] IS NOT NULL ) LOOP
        return next ar_in[ar_pointer];
        ar_pointer := ar_pointer + 1;
    END LOOP;
END
$BODY$
  LANGUAGE 'plpgsql' STRICT IMMUTABLE;

The next part is to create the functions that take the arrays and return the agregate values (using the SET data internally):
CREATE OR REPLACE FUNCTION sum(numeric[])
    RETURNS numeric AS
$BODY$
    SELECT sum( array_to_set )
    FROM array_to_set( $1 )
$BODY$
  LANGUAGE 'sql' STRICT IMMUTABLE;

CREATE  FUNCTION count_array(numeric[])
    RETURNS bigint AS
$BODY$
    SELECT count( array_to_set )
    FROM array_to_set( $1 )
$BODY$
  LANGUAGE 'sql' STRICT IMMUTABLE;

CREATE OR REPLACE FUNCTION avg(numeric[])
    RETURNS numeric AS
$BODY$
    SELECT avg( array_to_set )
    FROM array_to_set( $1 )
$BODY$
  LANGUAGE 'sql' STRICT IMMUTABLE;

CREATE OR REPLACE FUNCTION stddev(numeric[])
    RETURNS numeric AS
$BODY$
    SELECT stddev( array_to_set )
    FROM array_to_set( $1 )
$BODY$
  LANGUAGE 'sql' STRICT IMMUTABLE;

All very simple.  As you can see they all have the same name as the standard agregate functions with the exception of count_array() given that I did not want to intefere with count() - given that it is the only function that already takes an array in the first place.

So how do I work with smaller moving sets (ie smaller array sizes).  The moving arrays are indexed such that index 1 is the newest value (it corrosponds to the current row value).  So slicing the array to the first 10 slices will give the last 10 rows of data.  This is done by this notation:

colname[1:10]         --example of returning the array itself
avg( last_set.closing[1:10] )  --example of using the avg function
To put this all together then, what follows is an example query for you that use the components I have discussed.  It returns the last current trading day, stock close price, a 10 day moving average, bollinger bands on the 10 day period, 1 21 day moving average, the movement of the stock that day, and the day of the week of the current trading day.
select last_set.day_ids[1] AS trading_day, last_set.closing[1] AS closing, 
    avg( last_set.closing[1:10] ) moving_average_10,
    avg( last_set.closing[1:10] ) + 2*stddev( last_set.closing[1:10] ) AS bollinger_high_10,
    avg( last_set.closing[1:10] ) - 2*stddev( last_set.closing[1:10] ) AS bollinger_low_10,
    avg( last_set.closing[1:21] ) AS moving_average_21,
    last_set.closing[1]-last_set.open[1] AS movement,
    extract( dow from date(last_set.day_ids[1]) ) AS day_of_week
from last_set('NAB', 30, NULL, NULL)
order by last_set.day_ids[1]

So that's it.  I do hope you find this of value - if not directly, then at least as another way of thinking about things.

http://frakkle.com

- postgresql - No comments / No trackback - §

18 April 06 - 23:10Working with moving data sets(eg for moving averages) efficiently in PostgreSQL - Pt2

So how did I take a 40 second operation - that was still in the simple stage - and make it efficient?  Before I start, I should give you some info that you may have guessed from the last post.

Stocks table definition:

CREATE TABLE stocks(
  stock varchar(10) NOT NULL,
  day_id int4 NOT NULL,
  open numeric,
  high numeric,
  low numeric,
  closing numeric,
  volume int4,
  stock_index int4,
  CONSTRAINT pk PRIMARY KEY (stock, day_id)
) WITHOUT OIDS;

The stock_index, stock and day_id columns are indexed.  The stock_index column is what I used in my 40 second query I showed you last time.  (withouth this it was REALLY slow!  I had to do a seperate query using LIMIT to get the day_id values required to look up the required rows)

A picture of what I mean when I say "moving data set":

 |------------15 row set (row 15)-----------|
| |------------15 row set (row 16)-----------|
| | |------------15 row set (row 17)-----------|
| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|
|----10 row set (row 10)----| | | | | | | |
|----10 row set (row 11)----| | | | | | |
|----10 row set (row 12)----| | | | | |
|----10 row set (row 13)----| | | | |
|----10 row set (row 14)----| | | |
|----10 row set (row 15)----| | |
|----10 row set (row 16)----| |
|----10 row set (row 17)----|

As you can see more easily from the above diagram, the moving set of 10 is part of the moving set of 15.  (as indeed is any data set that involves the last 15 rows or less).  So how to "roll through" the rows to create a moving data set just the once - instead of linking in another setof rows every time a new period is needed for another moving average?

Instead of linking 15 rows to every row, then linking 10 rows to every row, how about a function that returns the last n (15 above) rows in a way that can be sliced up into smaller chunks (like the last 10 rows).  A function that simply adds the current row values to the beginning of the set and drops the last value off the end?  The latter would mean the data itself is only looked up (stocks table row-count) once and not -

stocks table row-count  *  set1 row-count  *  set2 row-count  * setn row-count...  ( =   a lot of work for the db! )
 - 15 times and then 10 times for every row of raw data in the stocks table.

This post is getting long again, so I'll tell you exactly how tomorrow - with code.

http://frakkle.com

- postgresql - No comments / No trackback - §

18 April 06 - 11:21Working with moving data sets(eg for moving averages) efficiently in PostgreSQL - Pt1

I have been recently working on stock market technical analysis using PostgreSQL.   One of the hassles is that you want to create several moving sets of data - for moving averages, standard deviations (for creating bollinger bands) and the like.   It is common to have a short term moving average and a long term moving average to see where they cross.  Using pure SQL then you will need to link in an instance of the stock data to 10 rows of stock data from the same table, and then again for the 21 rows of data (hence creating a 10 day moving average with bollinger bands, and a 21 day moving average)

The following is an example with those two moving data sets - 21 and 10.   Even though indexes are being used, it takes 40 seconds to run!

SELECT s.stock, s.day_id, s.closing, avg(s10.closing), avg(s10.closing) + 2*stddev(s10.closing) AS bollinger_high,
    avg(s10.closing) - 2*stddev(s10.closing) AS bollinger_low,
    avg(s21.closing) AS ma_21
FROM stocks AS s
    INNER JOIN stocks s10
        ON s10.stock = s.stock
        AND s10.stock_index <= s.stock_index
        AND s10.stock_index > s.stock_index - 10
    INNER JOIN stocks s21
        ON s21.stock = s.stock
        AND s21.stock_index <= s.stock_index
        AND s21.stock_index > s.stock_index - 21
GROUP BY s.stock, s.day_id, s.closing
Remember that this is the beginning.  Adding an extra period would exponentially slow things down even further.  ( a single moving average/dataset was extremely quick.    This is the query plan:
GroupAggregate  (cost=76987947.62..80822192.41 rows=2316 width=44) (actual time=31699.194..38996.966 rows=2316 loops=1)
  ->  Sort  (cost=76987947.62..77371364.00 rows=153366549 width=44) (actual time=31672.742..32523.684 rows=484095 loops=1)
        Sort Key: s.stock, s.day_id, s.closing
        ->  Nested Loop  (cost=0.00..18700393.29 rows=153366549 width=44) (actual time=82.308..16989.105 rows=484095 loops=1)
              Join Filter: (("inner".stock)::text = ("outer".stock)::text)
              ->  Nested Loop  (cost=0.00..194391.82 rows=595984 width=44) (actual time=82.277..13726.831 rows=48426 loops=1)
                    Join Filter: ((("outer".stock)::text = ("inner".stock)::text) AND ("outer".stock_index > ("inner".stock_index - 21)))
                    ->  Index Scan using pk on stocks s21  (cost=0.00..439.98 rows=2316 width=22) (actual time=38.341..2949.632 rows=2316 loops=1)
                    ->  Index Scan using idx_stock_index on stocks s  (cost=0.00..68.30 rows=772 width=26) (actual time=0.034..2.765 rows=1159 loops=2316)
                          Index Cond: ("outer".stock_index <= s.stock_index)
              ->  Index Scan using idx_stock_index on stocks s10  (cost=0.00..25.91 rows=257 width=22) (actual time=0.007..0.033 rows=10 loops=48426)
                    Index Cond: ((s10.stock_index <= "outer".stock_index) AND (s10.stock_index > ("outer".stock_index - 10)))
Total runtime: 39006.705 ms


So then the answer came to me:  Create a PL function that will return a single moving set of data for each row in a way that will not blow the query out of the water in terms of required index scans, joins and sorts.   Why lookup the same set of data every time a new period is added?

Does not a 21 day moving average include the figures required to do a 10 day moving average?

For the details as to how, you will need to read the next installment. ;-)

http://frakkle.com

- postgresql - No comments / No trackback - §

Linkdump