Working with moving data sets(eg for moving averages) efficiently in PostgreSQL - Pt1
I have been recently working on stock market technical analysis using PostgreSQL. One of the hassles is that you want to create several moving sets of data - for moving averages, standard deviations (for creating bollinger bands) and the like. It is common to have a short term moving average and a long term moving average to see where they cross. Using pure SQL then you will need to link in an instance of the stock data to 10 rows of stock data from the same table, and then again for the 21 rows of data (hence creating a 10 day moving average with bollinger bands, and a 21 day moving average)
The following is an example with those two moving data sets - 21 and
10. Even though indexes are being used, it takes 40 seconds
to run!
SELECT s.stock, s.day_id, s.closing, avg(s10.closing), avg(s10.closing) + 2*stddev(s10.closing) AS bollinger_high,Remember that this is the beginning. Adding an extra period would exponentially slow things down even further. ( a single moving average/dataset was extremely quick. This is the query plan:
avg(s10.closing) - 2*stddev(s10.closing) AS bollinger_low,
avg(s21.closing) AS ma_21
FROM stocks AS s
INNER JOIN stocks s10
ON s10.stock = s.stock
AND s10.stock_index <= s.stock_index
AND s10.stock_index > s.stock_index - 10
INNER JOIN stocks s21
ON s21.stock = s.stock
AND s21.stock_index <= s.stock_index
AND s21.stock_index > s.stock_index - 21
GROUP BY s.stock, s.day_id, s.closing
GroupAggregate (cost=76987947.62..80822192.41 rows=2316 width=44) (actual time=31699.194..38996.966 rows=2316 loops=1)
-> Sort (cost=76987947.62..77371364.00 rows=153366549 width=44) (actual time=31672.742..32523.684 rows=484095 loops=1)
Sort Key: s.stock, s.day_id, s.closing
-> Nested Loop (cost=0.00..18700393.29 rows=153366549 width=44) (actual time=82.308..16989.105 rows=484095 loops=1)
Join Filter: (("inner".stock)::text = ("outer".stock)::text)
-> Nested Loop (cost=0.00..194391.82 rows=595984 width=44) (actual time=82.277..13726.831 rows=48426 loops=1)
Join Filter: ((("outer".stock)::text = ("inner".stock)::text) AND ("outer".stock_index > ("inner".stock_index - 21)))
-> Index Scan using pk on stocks s21 (cost=0.00..439.98 rows=2316 width=22) (actual time=38.341..2949.632 rows=2316 loops=1)
-> Index Scan using idx_stock_index on stocks s (cost=0.00..68.30 rows=772 width=26) (actual time=0.034..2.765 rows=1159 loops=2316)
Index Cond: ("outer".stock_index <= s.stock_index)
-> Index Scan using idx_stock_index on stocks s10 (cost=0.00..25.91 rows=257 width=22) (actual time=0.007..0.033 rows=10 loops=48426)
Index Cond: ((s10.stock_index <= "outer".stock_index) AND (s10.stock_index > ("outer".stock_index - 10)))
Total runtime: 39006.705 ms
So then the answer came to me: Create a PL function that will return a single moving set of data for each row in a way that will not blow the query out of the water in terms of required index scans, joins and sorts. Why lookup the same set of data every time a new period is added?
Does not a 21 day moving average include the figures required to do a 10 day moving average?
For the details as to how, you will need to read the next installment. ;-)
No comments:
No trackbacks:
Trackback link: