Although magical beliefs (such as belief in luck and precognition) are presumably universal, the extent to which such beliefs are embraced likely varies across cultures. We assessed the effect of culture on luck and precognition beliefs in two large-scale multinational studies (Study 1: k = 16, N = 17,664; Study 2: k = 25, N = 4,024). Over and above the effects of demographic factors, culture was a significant predictor of luck and precognition beliefs in both studies. Indeed, when culture was added to demographic models, the variance accounted for in luck and precognition beliefs approximately doubled. Belief in luck and precognition was highest in Latvia and Russia (Study 1) and South Asia (Study 2), and lowest in Protestant Europe (Studies 1 and 2). Thus, beyond the effects of age, gender, education, and religiosity, culture is a significant factor in explaining variance in people’s belief in luck and precognition. Follow-up analyses found a relatively consistent effect of socio-economic development, such that belief in luck and precognition were more prevalent in countries with lower scores on the Human Development Index. There was also some evidence that these beliefs were stronger in more collectivist cultures, but this effect was inconsistent. We discuss the possibility that there are culturally specific historical factors that contribute to relative openness to such beliefs in Russia, Latvia, and South Asia.
That is from a new paper by Emily A. Harris, Taciano L. Milfont, and Matthew J. Hornsey, via the excellent Kevin Lewis.
FMZ previously released a perpetual contract grid strategy that was popular among users, and those who followed along and ran it live on TRX have earned good profits over the past year with controllable risk. However, the perpetual grid strategy also has some problems:
It requires setting parameters such as the initial price, grid spacing, grid value, and long-short mode. The settings are cumbersome, have a large impact on profits, and are difficult for novices to get right.
The short side of the perpetual grid carries high risk, while the long side is relatively safe. Even setting a small grid value does little to improve the price at which the short side gets into trouble.
The perpetual grid can be set to go long only to avoid the shorting risk, which seems fine at first. However, it still has to deal with the situation where the current price rises above the initial price, leaving the strategy with no position, so the initial price needs to be reset.
I wrote an article earlier on the principle of the balance strategy and how it compares with the grid strategy, and it is still worth referring to: https://www.fmz.com/digest-topic/5930. The balance strategy always holds a position with a fixed value (or a fixed value ratio), selling a little as the price rises and buying a little as it falls. It can be run with very simple settings, and even if the price rises a lot there is no risk of being caught short. The problem with the spot balance strategy is low capital utilization, with no easy way to add leverage; perpetual contracts solve that. If the total capital is 1000, a position worth 2000 can be held at all times, which exceeds the original capital and improves capital utilization. The other parameter is the adjustment ratio, which controls how much to scale in or reduce the position. If it is set to 0.01, the position is reduced once for every 1% rise and increased once for every 1% fall.
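As a rough illustration of this rebalancing rule (a minimal sketch, not the FMZ strategy's actual code; the names target_value, ratio and adjustment are mine), the trade needed after each price move could be computed like this:
target_value = 2000   # fixed position value to hold (may exceed the 1000 capital on perpetuals)
ratio = 0.01          # adjustment ratio: act on every 1% deviation from the target

def adjustment(price, amount):
    """Coins to buy (>0) or sell (<0) so the position value returns to target_value."""
    diff = target_value - amount * price          # positive after a fall, negative after a rise
    if abs(diff) >= target_value * ratio:         # only trade once the deviation reaches the ratio
        return diff / price
    return 0.0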
For beginners, the balance strategy is highly recommended: the operation is simple, you only set the position value (or holding ratio), and you can run it without worrying about continued price rises. Users with some experience can choose the grid strategy, deciding the upper and lower limits of the expected range and the funds per grid, so as to improve capital utilization and maximize profits.
To make it easy to backtest more trading pairs, this article shows the complete backtesting process, and users can adjust parameters and trading pairs for comparison. (The code is Python 3; a proxy may be required to download market data. Users can install Anaconda3 themselves or run it in Google Colab.)
import requests
from datetime import date,datetime
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests, zipfile, io
%matplotlib inline
## Current trading pairs
Info = requests.get('https://fapi.binance.com/fapi/v1/exchangeInfo')
symbols = [s['symbol'] for s in Info.json()['symbols']]
symbols = list(set(filter(lambda x: x[-4:] == 'USDT', [s.split('_')[0] for s in symbols]))-
set(['1000SHIBUSDT','1000XECUSDT','BTCDOMUSDT','DEFIUSDT','BTCSTUSDT'])) + ['SHIBUSDT','XECUSDT']
print(symbols)
#Get the function of the K-line of any period
def GetKlines(symbol='BTCUSDT',start='2020-8-10',end='2021-8-10',period='1h',base='fapi',v = 'v1'):
    Klines = []
    start_time = int(time.mktime(datetime.strptime(start, "%Y-%m-%d").timetuple()))*1000 + 8*60*60*1000
    end_time = min(int(time.mktime(datetime.strptime(end, "%Y-%m-%d").timetuple()))*1000 + 8*60*60*1000,time.time()*1000)
    intervel_map = {'m':60*1000,'h':60*60*1000,'d':24*60*60*1000}
    while start_time < end_time:
        mid_time = start_time+1000*int(period[:-1])*intervel_map[period[-1]]
        url = 'https://'+base+'.binance.com/'+base+'/'+v+'/klines?symbol=%s&interval=%s&startTime=%s&endTime=%s&limit=1000'%(symbol,period,start_time,mid_time)
        #print(url)
        res = requests.get(url)
        res_list = res.json()
        if type(res_list) == list and len(res_list) > 0:
            start_time = res_list[-1][0]+int(period[:-1])*intervel_map[period[-1]]
            Klines += res_list
        if type(res_list) == list and len(res_list) == 0:
            start_time = start_time+1000*int(period[:-1])*intervel_map[period[-1]]
        if mid_time >= end_time:
            break
    df = pd.DataFrame(Klines,columns=['time','open','high','low','close','amount','end_time','volume','count','buy_amount','buy_volume','null']).astype('float')
    df.index = pd.to_datetime(df.time,unit='ms')
    return df
By downloading the closing prices of all trading pairs from 2021 to the present, we can observe the changes in the overall market index (a simple construction of this index is sketched after the download code below): 2021 to 2022 was undoubtedly a bull market, and the index at one point rose 14-fold; it could be said that gold was everywhere, and many coins rose a hundredfold. In 2022, however, a bear market began that has now lasted half a year, with the index plunging 80% and dozens of coins drawing down more than 90%. Such pumps and dumps reflect the enormous risk of grid strategies. The index currently sits around 3, still a 200% gain compared with the beginning of 2021, and considering how the market has developed, it should be near a relative bottom at the moment.
Currencies whose highest price has increased more than 10 times since the beginning of the year:
#Download closing prices for all trading pairs
start_date = '2021-1-1'
end_date = '2022-05-30'
period = '1d'
df_all = pd.DataFrame(index=pd.date_range(start=start_date, end=end_date, freq=period),columns=symbols)
for i in range(len(symbols)):
    #print(symbols[i])
    symbol = symbols[i]
    df_s = GetKlines(symbol=symbol,start=start_date,end=end_date,period=period,base='api',v='v3')
    df_all[symbol] = df_s[~df_s.index.duplicated(keep='first')].close
#The highest increase over the beginning of the year
max_up = df_all.max()/df_all.fillna(method='bfill').iloc[0]
print(max_up.map(lambda x:round(x,3)).sort_values().to_dict())
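The index calculation itself is not shown in this excerpt; one simple construction consistent with the description (an assumption: an equal-weighted average of all prices normalised to their first quote, so the index starts at 1) would be:
df_norm = df_all / df_all.fillna(method='bfill').iloc[0]   # each pair normalised to its first price
index = df_norm.mean(axis=1)                               # equal-weighted market index
index.plot(figsize=(15,6), grid=True);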
First, we use the simplest possible code to simulate a price that falls all the way down and see the liquidation price for different position values. Since the strategy always holds a long position, there is no risk on the way up. The initial capital is 1000, the starting price is 1, and the adjustment ratio is 0.01. The results are as follows: the risk of a long liquidation is not negligible; with 1.5x leverage the strategy can withstand roughly a 50% decline. Given that the market is currently near a relative bottom, this is an acceptable risk.
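The simulation code did not survive in this excerpt; a rough sketch consistent with the description (an assumed reconstruction that ignores fees) is:
def liquidation_price(position_value, initial_balance=1000, start_price=1.0, ratio=0.01):
    price = start_price
    amount = position_value / price                 # coins held to keep the target value
    hold_price = price                              # average entry price
    while price > start_price * 0.001:
        price *= (1 - ratio)                        # the price falls by one adjustment step
        equity = initial_balance + (price - hold_price) * amount   # cash + unrealised PnL
        if equity <= 0:
            return price                            # the long position is wiped out here
        new_amount = position_value / price         # buy back up to the fixed position value
        hold_price = (hold_price * amount + price * (new_amount - amount)) / new_amount
        amount = new_amount
    return None                                     # never liquidated in the simulated range

for v in [1500, 2000, 3000]:
    print(v, liquidation_price(v))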
#Still using the original backtesting engine
class Exchange:
    def __init__(self, trade_symbols, fee=0.0004, initial_balance=10000):
        self.initial_balance = initial_balance #Initial assets
        self.fee = fee
        self.trade_symbols = trade_symbols
        self.account = {'USDT':{'realised_profit':0, 'unrealised_profit':0, 'total':initial_balance, 'fee':0}}
        for symbol in trade_symbols:
            self.account[symbol] = {'amount':0, 'hold_price':0, 'value':0, 'price':0, 'realised_profit':0,'unrealised_profit':0,'fee':0}

    def Trade(self, symbol, direction, price, amount):
        cover_amount = 0 if direction*self.account[symbol]['amount'] >= 0 else min(abs(self.account[symbol]['amount']), amount)
        open_amount = amount - cover_amount
        self.account['USDT']['realised_profit'] -= price*amount*self.fee #Deduct the handling fees
        self.account['USDT']['fee'] += price*amount*self.fee
        self.account[symbol]['fee'] += price*amount*self.fee
        if cover_amount > 0: #Close the position first
            self.account['USDT']['realised_profit'] += -direction*(price - self.account[symbol]['hold_price'])*cover_amount #Profits
            self.account[symbol]['realised_profit'] += -direction*(price - self.account[symbol]['hold_price'])*cover_amount
            self.account[symbol]['amount'] -= -direction*cover_amount
            self.account[symbol]['hold_price'] = 0 if self.account[symbol]['amount'] == 0 else self.account[symbol]['hold_price']
        if open_amount > 0:
            total_cost = self.account[symbol]['hold_price']*direction*self.account[symbol]['amount'] + price*open_amount
            total_amount = direction*self.account[symbol]['amount'] + open_amount
            self.account[symbol]['hold_price'] = total_cost/total_amount
            self.account[symbol]['amount'] += direction*open_amount

    def Buy(self, symbol, price, amount):
        self.Trade(symbol, 1, price, amount)

    def Sell(self, symbol, price, amount):
        self.Trade(symbol, -1, price, amount)

    def Update(self, close_price): #Update of assets
        self.account['USDT']['unrealised_profit'] = 0
        for symbol in self.trade_symbols:
            self.account[symbol]['unrealised_profit'] = (close_price[symbol] - self.account[symbol]['hold_price'])*self.account[symbol]['amount']
            self.account[symbol]['price'] = close_price[symbol]
            self.account[symbol]['value'] = abs(self.account[symbol]['amount'])*close_price[symbol]
            self.account['USDT']['unrealised_profit'] += self.account[symbol]['unrealised_profit']
        self.account['USDT']['total'] = round(self.account['USDT']['realised_profit'] + self.initial_balance + self.account['USDT']['unrealised_profit'],6)
First we backtest the performance of the TRX balance strategy. TRX's maximum drawdown in this round of the bear market has been relatively small, so it is somewhat of a special case. The data is the 5-minute K-line from 2021 to the present, the initial capital is 1000, the adjustment ratio is 0.01, the position value is 2000, and the fee is 0.0002.
The initial price of TRX was 0.02676 U, the highest price during the period reached 0.18 U, and it is currently around 0.08 U; the swings have been violent. Running the long-short grid strategy from the start, it would have been hard to avoid the short side being blown out, while the balance strategy has far fewer problems.
The final return of the backtest is 4524 U, very close to the return of TRX at 0.18. The leverage is below 2x at the beginning and ends below 0.4, so the possibility of liquidation keeps getting lower, and there are opportunities along the way to increase the position value. But as long as the position value stays at 2000 U, the income it generates stays the same even as equity grows; this is one of the disadvantages of the balance strategy.
symbol = 'TRXUSDT'
df_trx = GetKlines(symbol=symbol,start='2021-1-1',end='2022-5-30',period='5m')
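The balance-strategy loop itself is not shown in this excerpt; the sketch below is an assumed reconstruction (the function name balance_backtest is mine) that follows the stated parameters (position value 2000, adjustment ratio 0.01, fee 0.0002) and produces the res_trx frame used in the plot that follows:
def balance_backtest(df, symbol, value=2000, pct=0.01, fee=0.0002, initial_balance=1000):
    e = Exchange([symbol], fee=fee, initial_balance=initial_balance)
    res_list = []
    for row in df.itertuples():
        price = row.close
        diff = value - e.account[symbol]['amount']*price   # gap between target and current position value
        if abs(diff) > value*pct:                          # only act beyond the adjustment ratio
            if diff > 0:
                e.Buy(symbol, price, diff/price)
            else:
                e.Sell(symbol, price, -diff/price)
        e.Update({symbol: price})
        if int(row.time) % (60*60*1000) == 0:              # keep one record per hour
            res_list.append([row.time, price, e.account[symbol]['amount'],
                             e.account[symbol]['value'], e.account['USDT']['total']-e.initial_balance])
    res = pd.DataFrame(data=res_list, columns=['time','price','amount','value','profit'])
    res.index = pd.to_datetime(res.time, unit='ms')
    return res

res_trx = balance_backtest(df_trx, symbol)
res_trx.profit.plot(figsize=(15,6), grid=True);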
#Actual leverage in use
(res_trx.value/(res_trx.profit+1000)).plot(figsize=(15,6),grid=True);
Let's backtest WAVES next. This coin is quite special: it rose from 6 U to 60 U at the start and has since fallen back to around 8 U. The final profit is 4945 U, far more than the profit of simply holding the coin.
symbol = 'WAVESUSDT'
df_waves = GetKlines(symbol=symbol,start='2021-1-1',end='2022-5-30',period='5m')
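Applying the same sketch as for TRX (again an assumed reconstruction, not the article's code):
res_waves = balance_backtest(df_waves, symbol)
res_waves.profit.plot(figsize=(15,6), grid=True);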
For comparison, we also backtest the grid strategy, with a grid spacing of 0.01 and a grid value of 10. With prices rising nearly 10-fold, both WAVES and TRX experienced huge drawdowns: WAVES drew down 5000 U and TRX more than 3000 U. With a small initial capital, the position would almost certainly have been liquidated.
#Grid strategy
pct = 0.01
value = 10*pct/0.01
e = Exchange([symbol], fee=0.0002, initial_balance=1000)
init_price = df_waves.iloc[0].open
res_list = [] #For storing intermediate results
for row in df_waves.itertuples():
    buy_price = (value / pct - value) / (value / (pct * init_price) + e.account[symbol]['amount'])
    sell_price = (value / pct + value) / (value / (pct * init_price) + e.account[symbol]['amount'])
    while row.low < buy_price:
        e.Buy(symbol,buy_price,value/buy_price)
        e.Update({symbol:row.close})
        buy_price = (value / pct - value) / (value / (pct * init_price) + e.account[symbol]['amount']) #The buy order price; since it fills as a resting order, it is also the final matched price
    while row.high > sell_price:
        e.Sell(symbol,sell_price,value/sell_price)
        e.Update({symbol:row.close})
        sell_price = (value / pct + value) / (value / (pct * init_price) + e.account[symbol]['amount'])
    if int(row.time)%(60*60*1000) == 0:
        e.Update({symbol:row.close})
        res_list.append([row.time, row.close, e.account[symbol]['amount'],e.account[symbol]['amount']*row.close, e.account['USDT']['total']-e.initial_balance])
res_waves_net = pd.DataFrame(data=res_list, columns=['time','price','amount','value','profit'])
res_waves_net.index = pd.to_datetime(res_waves_net.time,unit='ms')
print(pct,e.account['USDT']['realised_profit']+e.account['USDT']['unrealised_profit'] ,round(e.account['USDT']['fee'],0))
This backtest used the 5-minute K-line, so intra-bar fluctuations are not fully simulated and actual profits should be slightly higher. Overall, the balance strategy carries relatively low risk: it is not afraid of a sharp rise, there are no parameters to tune, and it is easy to use, which makes it suitable for novice users. The grid strategy is very sensitive to the initial price setting and requires some judgment of the market, and in the long run the risk on the short side is high. The current bear market has been stable near the bottom for some time, and many currencies are down more than 90% from their highs. If you are optimistic about some of them, this is a good time to enter the market: you may want to run a balance strategy to buy the bottom, add a little leverage, and profit from both volatility and price appreciation.
The Binance Thousand League Battle will provide free access to the perpetual balance strategy, and everyone is welcome to experience it.
This month the TerraUSD stablecoin and the associated Luna reserve cryptocurrency crashed and lost most of their value. About $45 billion in market capitalization vanished within a week.
Apparently there was some sort of sell-off, but then TerraUSD was supposed to be a stablecoin, being pegged to the US dollar via a complex algorithmic relationship. That algorithm must have failed.
How did it fail? How fragile was this pegging of a cryptocurrency to the USD?
I guess that different currencies can ultimately never be pegged perfectly and a stablecoin is a bit of a misnomer, but still I wonder how easy it might have been to break this pegging.
Tools such as foreign exchange intervention can ease the effects of shocks but need to be carefully weighed against potential longer-term costs.
Sub-Saharan African countries face important monetary policy challenges. The pandemic dented economic growth, and even now the recovery is likely to leave output below the pre-crisis trend this year. Several countries in the region have also seen inflation increase, a challenge that is in some cases compounded by fiscal dominance emanating from high public debt levels.
Many of these economies may also face capital outflows as the major central banks in advanced economies withdraw policy stimulus and raise interest rates in the period ahead. The economic impact of the conflict raging in Ukraine—including the attendant sharp rise in energy and food prices—is likely to further intensify the challenges.
How should countries in sub-Saharan Africa manage this volatile environment?
Exchange rate considerations
Countries with managed or free-floating exchange rate regimes generally benefit from allowing currencies to adjust, while focusing monetary policy on domestic objectives.
That said, many countries in sub-Saharan Africa with floating exchange rate regimes have characteristics and vulnerabilities that can limit the benefits from fully flexible rates. For instance, dominant currency pricing (i.e., rigid export prices in US dollar terms) can weaken the beneficial trade adjustments associated with flexible rates.
Moreover, shallow markets (i.e., markets with limited liquidity) can amplify exchange rate movements and yield excessive volatility. Foreign exchange markets tend to be shallow in many countries in the region, as evidenced by wide spreads between bid and ask prices.
High foreign-currency denominated liabilities are also a key vulnerability in several economies. In the presence of large currency mismatches on balance sheets, exchange rate depreciations can undermine the financial health of corporates and households. And weak central bank credibility can cause exchange rate changes to have a bigger effect on inflation (high passthrough). Such currency mismatches and high passthrough can cause output and inflation to move in opposite directions following shocks, thereby worsening the tradeoffs that policymakers face.
There is also evidence that the exchange rate passthrough in low-income countries is substantially higher than it is in more advanced economies, which poses a particular problem given the often heavy dependence on food and energy imports.
How should countries that exhibit such vulnerabilities manage their policy responses?
First, it remains important to reduce the vulnerabilities over time. This includes reducing balance sheet mismatches; developing money and foreign currency markets; and reducing exchange rate passthrough by building monetary policy credibility. Many of these are areas where IMF technical assistance can help.
But in the near-term—while vulnerabilities remain high—the IMF’s work toward an Integrated Policy Framework suggests that using additional tools may help ease short-term policy trade-offs when certain shocks hit. In particular, where reserves are adequate and these tools are available, foreign exchange intervention, macroprudential policy measures and capital flow measures can help enhance monetary policy autonomy, improve financial and price stability, and reduce output volatility.
For instance, simulations with the framework’s models suggest that in response to a sharp tightening of global financial conditions or other negative external financial shock, a country exhibiting such vulnerabilities could improve immediate economic outcomes by using foreign exchange intervention to reduce exchange rate depreciation and thereby limit the inflationary impact and reduce negative balance sheet effects. This results in higher output and lower inflation than would have been feasible without the use of the additional policy instrument.
For central banks considering such policies, however, a few important qualifiers are in order. Importantly, the tools should not be used to maintain an over- or undervalued exchange rate. Moreover, while additional tools can help alleviate short-term tradeoffs, this benefit needs to be carefully weighed against potential longer-term costs. Such costs may include, for instance, reduced incentives for market development and appropriate risk management in the private sector.
Communicating about the joint use of multiple tools in a more complex framework can be very challenging, too, and expanding the set of policy options may subject central banks to political pressures. Central banks will thus need to weigh the benefits against potential negative impacts on their own transparency and credibility, especially in circumstances where policy frameworks are not yet well established.
Inflation is plaguing consumer markets, putting pressure on the Federal Reserve to raise interest rates to tighten the money supply. But as Rex Nutting writes in a MarketWatch column titled “Why Interest Rates Aren’t Really the Right Tool to Control Inflation”: It may be heresy to those who think the Fed is all-powerful, but the honest answer […]
This post shows a hybrid approach between the "optimize for one use case" idea of document databases (where a single table holds all information) and the "one database for many use cases" idea of relational data modeling (which allows multiple access patterns). I'm basing this post on a question by @Manish in our Slack forum, and will show a demo on YugabyteDB, but all code is valid with PostgreSQL.
The idea is to store some post content, by user id, along with a list of tags and groups (friend circles). From the user's point of view, the critical use cases are inserting a new post and retrieving posts by user; here is the table that fits this main case:
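The CREATE TABLE statement itself did not survive in this excerpt; based on the description that follows, it would be along these lines (column types are assumptions consistent with the rest of the post):
create table posts_by_user (
    user_id      bigint not null,
    post_id      bigint not null,
    created_date timestamptz not null default now(),
    content      text,
    group_ids    bigint[],
    tag_ids      bigint[],
    primary key (user_id, created_date, post_id),
    unique (user_id, post_id)
);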
This follows a single-table data model, each row being a document with its content and its lists of group_ids and tag_ids. In PostgreSQL, this creates a heap table and a secondary index for the primary key. In YugabyteDB, a table is stored clustered on the primary key, which allows fast point and range access without a secondary index. I didn't mention the sharding method in the CREATE statement, to keep the code compatible with PostgreSQL. YugabyteDB's default is hash sharding on the first column and range on the following ones, so this is equivalent to PRIMARY KEY (user_id HASH, created_date ASC, post_id ASC).
The business key is (user_id, post_id) and I enforce it with a UNIQUE constraint. But for the primary key, I'm adding the date. The drawback of adding created_date to the primary key is a full rewrite of the whole document if created_date is ever updated, which is probably not the case here. The advantage is fast access to a time range when looking at one user's posts. This is something to decide once all access patterns are known. For YugabyteDB, it would be better to set the sharding options explicitly, with a descending order: PRIMARY KEY (user_id HASH, created_date DESC, post_id ASC).
GIN indexes
Other use cases involve getting posts by tag_id or by group_id to fill the feed list. With this single-table design, I can create GIN indexes on those arrays:
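The index definitions are missing from this excerpt; they would look like this (posts_by_user_tag_ids is the name that appears in the execution plan further down, the other name is assumed):
create index posts_by_user_tag_ids   on posts_by_user using gin (tag_ids);
create index posts_by_user_group_ids on posts_by_user using gin (group_ids);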
GIN indexes are supported in YugabyteDB as of version 2.11 and the feature roadmap is tracked in #7850
However, I cannot add created_date to them. This is a limitation inherited from PostgreSQL: trying to add it in the INCLUDE clause raises ERROR: access method "gin" does not support included columns, and trying to add it to the indexed columns raises ERROR: data type timestamp with time zone has no default operator class for access method "gin".
Because of this limitation, I'll not use those GIN indexes here for the queries that are based on a range of created_date. However, there is still a place where a GIN index can help: text search on the post content. I described this in a previous post.
Normalization
So, what is the solution in a SQL database? This is the reason relational databases were invented: to allow efficient queries for multiple access paths without compromising data integrity. Rather than storing the post's tags and groups as arrays within the posts table, I can store each in its own table:
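The DDL for the two tables is missing here; a sketch consistent with the trigger and execution plans shown later (the exact key layout is an assumption) is:
create table posts_by_tag (
    tag_id       bigint not null,
    user_id      bigint not null,
    post_id      bigint not null,
    created_date timestamptz not null,
    primary key (tag_id, created_date, user_id, post_id),
    unique (tag_id, user_id, post_id),
    foreign key (user_id, created_date, post_id)
     references posts_by_user (user_id, created_date, post_id) on delete cascade
);
create table posts_by_group (
    group_id     bigint not null,
    user_id      bigint not null,
    post_id      bigint not null,
    created_date timestamptz not null,
    primary key (group_id, created_date, user_id, post_id),
    unique (group_id, user_id, post_id),
    foreign key (user_id, created_date, post_id)
     references posts_by_user (user_id, created_date, post_id) on delete cascade
);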
I've declared the foreign key to guarantee the consistency of data, and it contains the created_date. This is on purpose, to be able to filter on a time range from this table before going to read the main table to get the content; this is why I didn't create the GIN indexes, which didn't allow it. I've also declared the unicity of the columns without created_date to enforce data integrity. My goal is to treat these posts_by_group and posts_by_tag tables like a secondary index: exactly like the GIN index I wanted to build, but with additional columns to match the selectivity of the use cases. I will not update those additional tables directly; they will be maintained automatically when updating the main posts_by_user table. And this is where I need a trigger.
Trigger and Atomic Procedure
Ideally, the application code focuses on business logic. Data integrity should be fully implemented in the database. The best is with declarative constraints like the FOREIGN KEY I defined above. Or, when declarative is not possible, with triggers and procedures. And this is what I'm showing here.
The application will simply insert (or update or delete) on the main table posts_by_user.
I'm keeping the tag_ids and group_ids arrays there to get everything in one document when accessing a post by its primary key. This also eases the logic for maintaining the secondary tables posts_by_tag and posts_by_group, because any DML on the main table knows both the old and the new values. Thanks to this, there is no need for other indexes on those secondary tables. The logic is simple and can be optimized if needed: here I delete the entries for the old values and insert those for the new ones. This is exactly how a secondary index works, but here I'm coding it in a procedure:
CREATE OR REPLACE FUNCTION posts_by_user_dml() RETURNS TRIGGER AS $$
declare
 loop_tag_id bigint;
 loop_group_id bigint;
begin
 if old.tag_ids is not null then -- delete entries for old values
  foreach loop_tag_id in array old.tag_ids loop
   delete from posts_by_tag t
    where t.tag_id=loop_tag_id and t.user_id=old.user_id and t.post_id=old.post_id;
  end loop;
 end if;
 if new.tag_ids is not null then -- insert entries for new values
  foreach loop_tag_id in array new.tag_ids loop
   insert into posts_by_tag(tag_id, user_id, post_id, created_date)
    values (loop_tag_id, new.user_id, new.post_id, new.created_date);
  end loop;
 end if;
 if old.group_ids is not null then -- delete entries for old values
  foreach loop_group_id in array old.group_ids loop
   delete from posts_by_group t
    where t.group_id=loop_group_id and t.user_id=old.user_id and t.post_id=old.post_id;
  end loop;
 end if;
 if new.group_ids is not null then -- insert entries for new values
  foreach loop_group_id in array new.group_ids loop
   insert into posts_by_group(group_id, user_id, post_id, created_date)
    values (loop_group_id, new.user_id, new.post_id, new.created_date);
  end loop;
 end if;
 return new;
end;
$$ LANGUAGE plpgsql;
This procedure operates with the old and new records from a trigger. Here is the declaration of the trigger:
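The trigger statement itself is missing from this excerpt; a declaration consistent with the description (the trigger name is assumed, and DELETE is handled by the ON DELETE CASCADE foreign keys as noted below) would be:
create trigger posts_by_user_dml
 after insert or update on posts_by_user
 for each row execute function posts_by_user_dml();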
The ON DELETE is not necessary because this one is automated by the ON DELETE CASCADE foreign key constraint.
I usually prefer calling a procedure rather than a DML statement with triggers behind it, to avoid side effects hidden behind a SQL statement. But here a trigger is fine because the INSERT statement keeps its semantics; what happens behind it is purely technical, to maintain the secondary tables.
Unit Tests
Deploying code in the database does not exempt you from unit testing it. This is easy: there are 3 DML operations (DELETE, INSERT, UPDATE), and the values can be null, an empty array, or an array of integers.
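The tests themselves are not shown in the post; the kind of statements to exercise, using the example tables sketched above with arbitrary values, would be:
insert into posts_by_user (user_id, post_id, created_date, content, group_ids, tag_ids)
 values (1, 1, now(), 'first post', '{10,20}', null);
update posts_by_user set tag_ids = '{100,200}', group_ids = '{10}'
 where user_id = 1 and post_id = 1;
delete from posts_by_user where user_id = 1 and post_id = 1;
-- after each statement, check that posts_by_tag and posts_by_group match the arrays
select * from posts_by_tag;
select * from posts_by_group;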
Of course, maintaining the secondary tables has a cost on inserts, like any additional index, and it is important to understand how it scales on a distributed database. Here is an example inserting 100 posts per user, for 10 users, with 20 tags and 20 groups (chosen at random among 1000):
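The load statement is missing from this excerpt; an assumed equivalent (the sequence name and the random generation are mine), inserting 1000 posts per execution, could look like:
create sequence if not exists post_id_seq;
insert into posts_by_user (user_id, post_id, created_date, content, group_ids, tag_ids)
select u, nextval('post_id_seq'), now(), 'some content',
       -- ~20 random ids among 1000 (referencing u makes the subquery re-evaluate per row)
       array(select distinct (1 + floor(random()*1000))::bigint
             from generate_series(1, 20) where u > 0),
       array(select distinct (1 + floor(random()*1000))::bigint
             from generate_series(1, 20) where u > 0)
from generate_series(1, 10) u, generate_series(1, 100) p;   -- 10 users x 100 posts = 1000 rows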
I repeat this with \watch, and on a small server this is about 8 seconds for 1000 posts, which means about 8 milliseconds per post:
Time: 10604.948 ms (00:10.605)
Mon 20 Dec 2021 11:29:17 AM GMT (every 0.1s)
INSERT 0 1000
Time: 8533.703 ms (00:08.534)
Mon 20 Dec 2021 11:29:25 AM GMT (every 0.1s)
INSERT 0 1000
Time: 8320.473 ms (00:08.320)
Mon 20 Dec 2021 11:29:34 AM GMT (every 0.1s)
INSERT 0 1000
Note that in my lab I still have the GIN indexes created. This doesn't change the latency, because everything is distributed across many nodes that are not saturated; the same goes for the indexes that enforce the unique constraints. I've run this from 6 sessions connected to the 6 nodes of an RF=3 Multi-AZ YugabyteDB cluster (meaning high availability, with 3-way replication to be resilient to the failure of one availability zone). The nodes are AWS c5.2xlarge. With these 6 threads I loaded 1000 new posts per second, which is displayed as 3000 ops/sec because each insert is actually 3 inserts, with my trigger maintaining the secondary tables:
Now the question is about how it scales. Assessing the scalability means understanding the time complexity.
I've traced the calls between the YSQL layer (the PostgreSQL code) and the DocDB layer (the YugabyteDB distributed storage and transaction):
I'm not going into the details here. The important thing to see is that all PGSQL_WRITE operations to the main and secondary tables are buffered together. They are distributed to each node (depending on the hash code of the primary key), but this is not a per-table fan-out. This is what is great about using stored procedures in distributed databases: the whole transaction can be processed without waiting for the acknowledgement of each individual statement.
Queries
I have 43 million posts loaded (the ANALYZE ran for 4 minutes)
Getting the last 2 days' posts for one user has to read only the main table, thanks to everything being stored in it:
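The query is missing from this excerpt; reconstructed from the description (the exact predicate values are an assumption), it would be:
explain analyze
select *
  from posts_by_user
 where user_id = 1
   and created_date > now() - interval '2 days'
 order by created_date desc;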
This is the fastest execution plan you can have: read a range from one table, in the primary key order, requiring no additional sorting. It could even be a bit better if I had declared created_date as DESC instead of ASC (I didn't, to keep PostgreSQL compatibility for this blog post). On PostgreSQL the same execution plan will have to read rows from the heap table, probably scattered, as one user doesn't post all at the same time.
So this returned 2173 posts in 85 milliseconds from a table that has 40 million posts. But the size of the table doesn't matter because it is a range scan.
Getting the posts for a list of tags requires a join with the secondary table:
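The query is missing here as well; reconstructed from the execution plan below:
explain analyze
select posts_by_user.*
  from posts_by_tag
  join posts_by_user using (user_id, created_date, post_id)
 where posts_by_tag.tag_id = 1
   and posts_by_tag.created_date > now() - interval '1 month'
 order by created_date desc
 limit 100;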
Even if there is a join, the time complexity of the index access and nested loop depends on the result set rather than on the size of the table. This query is still in the same 80-millisecond range:
                                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..28.40 rows=100 width=120) (actual time=2.691..70.550 rows=100 loops=1)
   ->  Nested Loop  (cost=0.00..28.40 rows=100 width=120) (actual time=2.689..70.513 rows=100 loops=1)
         ->  Index Scan Backward using posts_by_tag_pkey on posts_by_tag  (cost=0.00..17.00 rows=100 width=24) (actual time=1.830..1.971 rows=100 loops=1)
               Index Cond: ((tag_id = 1) AND (created_date > (now() - '1 mon'::interval)))
         ->  Index Scan using posts_by_user_pkey on posts_by_user  (cost=0.00..0.11 rows=1 width=120) (actual time=0.670..0.670 rows=1 loops=100)
               Index Cond: ((user_id = posts_by_tag.user_id) AND (created_date = posts_by_tag.created_date) AND (post_id = posts_by_tag.post_id))
 Planning Time: 0.269 ms
 Execution Time: 71.185 ms
This execution plan gives a response time proportional to the number of posts filtered from the secondary table (here the latest 100 rows with tag 1 in the past month), in 71 milliseconds. I've run the same query fetching 1000 rows in 634 milliseconds. With a Nested Loop, the join stays under control as long as you bound the number of rows it has to loop on. And this is why I wanted all the filtering criteria in the secondary tables.
When tracing the calls to the storage I see two PGSQL_READ operations:
Clearly, in a distributed SQL database, this should be used to gather only a few posts (as filtered on the secondary table columns), because those calls go to all nodes to read the content from the main table and have to seek in many places. I'm using this secondary table as a secondary index, and it is important not to add another level of fan-out: this table should be scanned on its primary key and filtered as much as possible.
Now, since I kept the GIN indexes, let's see what would happen:
yugabyte=# explain analyze
select *
from posts_by_user
where created_date > now() - interval '1 month'
and tag_ids @>'{1}'
order by created_date desc limit 100;
                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=52.11..52.36 rows=100 width=120) (actual time=289549.350..289549.380 rows=100 loops=1)
   ->  Sort  (cost=52.11..54.61 rows=1000 width=120) (actual time=289549.348..289549.363 rows=100 loops=1)
         Sort Key: created_date DESC
         Sort Method: top-N heapsort  Memory: 128kB
         ->  Index Scan using posts_by_user_tag_ids on posts_by_user  (cost=4.00..13.89 rows=1000 width=120) (actual time=8720.517..289510.325 rows=250376 loops=1)
               Index Cond: (tag_ids @> '{1}'::bigint[])
               Filter: (created_date > (now() - '1 mon'::interval))
               Rows Removed by Filter: 2845624
 Planning Time: 0.093 ms
 Execution Time: 289550.383 ms
(10 rows)
You see the problem: because the index does not contain created_date, the date has to be filtered out afterwards (Rows Removed by Filter). Does it matter? Test it with your use case and your data, and also with your YugabyteDB version. I'm on 2.11 here, and maybe one day the GIN index defined above will be able to filter on the date: because it is part of the primary key, it is available in the index entry. This is an advantage of YugabyteDB storing tables in the primary key's LSM tree: a secondary index references the row by its primary key. The possibility to use it for filtering may be related to #10169. On standard PostgreSQL you will see another join method for this query, but let's keep that for the next post.
Distributed SQL database considerations
Joining the two tables (reading the one for tag or group first, then fanning out reads to all nodes to get the related posts) is still scalable when calls to the distributed storage are optimized by the SQL query layer. Here YugabyteDB provides fully consistent global indexes, tables stored in their primary index, triggers to guarantee data integrity beyond what declarative constraints can provide, and procedures to batch many statements in one call. The same technique can be used with PostgreSQL, with additional heap fetches but no cross-server calls. But maybe you don't even need it, because a Bitmap Scan on the GIN index may provide an acceptable response time. I'll run the same data and queries on PostgreSQL in a next post.
I hope this use case illustrates a reason for triggers and stored procedures, even when you don't want any business logic in the database. This is only data logic, for data integrity and performance reasons, and that belongs in the SQL database.