In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
In [2]:
plt.style.use('seaborn-poster')
plt.style.use('seaborn-whitegrid')
plt.rc('figure', figsize=(12, 8))
In [3]:
df_users = pd.read_csv('users.csv')
df = pd.read_csv('submissions.csv')
df['SubmissionTime'] = pd.to_datetime(df['SubmissionTime'])

Financial Forecasting Challenge Results

The competition is over and we have a winner: LiDer! The competition was fierce and the quality of the entries was extremely high. Even more pleasing was the friendly and supportive atmosphere on the discussion forums. This is the first such competition we have run as a company. We wanted to create a fun challenge that gave competitors the opportunity to work with real financial data. There were 406 entries in total, of which 263 beat our example models. Here's a quick recap of the final leaderboard:

In [4]:
df_users.sort_values('PrivateMark')[:10]
Out[4]:
Name PublicMark PrivateMark PublicRank PrivateRank NumberOfSubmissions
0 LiDer 0.288640 0.282720 1 1 166
1 Humberto Brandão 0.289052 0.283326 2 2 111
2 DeePBluE 0.291568 0.284367 6 3 37
3 JohnHarris 0.290807 0.284400 5 4 58
4 Thistle Singer 0.293624 0.285060 8 5 46
5 Bull in a China Shop 0.290138 0.285204 3 6 74
6 nabrac 0.290444 0.285588 4 7 10
7 CanadianLover 0.294592 0.285831 9 8 58
8 ShakeShakeShakeIt 0.293268 0.286484 7 9 35
9 Kacper Madej 0.294972 0.286920 11 10 48

There was some jostling for position amongst the top 10, but the top two competitors managed to hold on to their positions on both the public and private leaderboards.

Submissions became more frequent in the final days of the competition.

In [5]:
# Count submissions per calendar day and plot them as a bar chart
y = df.groupby(df['SubmissionTime'].dt.floor('D')).size()
plt.bar(y.index, y, width=1.0)
plt.xlabel('SubmissionTime')
plt.ylabel('Submissions per day');

Here is a selection of the leading scores over time, for competitors with more than 50 submissions. The leaders made rapid progress towards a score of 0.300, but squeezing out the last small improvements in wMSE took much longer.

In [6]:
def f(g):
    # For each competitor with more than 50 submissions (and at least one mark
    # below 0.35), plot the running best public mark against submission time.
    if (len(g) > 50) and (g['PublicMark'].min() < 0.35):
        g = g.sort_values('SubmissionTime')
        plt.step(g['SubmissionTime'], g['PublicMark'].cummin(), where='post')
df.groupby('UserId').apply(f)
plt.ylim(0.28,.35)
plt.xlabel('SubmissionTime')
plt.ylabel('PublicMark');
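
The marks quoted throughout are wMSE (weighted mean squared error) values, so lower is better. As a rough reminder of what such a score means, a weighted MSE can be computed as below; the actual per-row weights used in the competition are not reproduced here, so this is only an illustrative sketch.

def wmse(y_true, y_pred, weights):
    # Weighted mean squared error: sum(w * (y - yhat)^2) / sum(w).
    # The competition's actual weights are not shown here; illustrative only.
    y_true, y_pred, weights = map(np.asarray, (y_true, y_pred, weights))
    return np.sum(weights * (y_true - y_pred) ** 2) / np.sum(weights)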

Looking at the histogram of all the PublicMarks, there is a clear leading group with marks below about 0.350. For comparison, a LightGBM model with stock and market as categorical features ( https://lightgbm.readthedocs.io/en/latest/index.html ) achieves a public score of 0.306, and many competitors mentioned that they used this model as a starting point.
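
For anyone curious what such a baseline might look like, here is a minimal sketch. The file name, column names ('Stock', 'Market', 'y') and hyperparameters are illustrative assumptions, not the exact setup behind the 0.306 score.

import lightgbm as lgb

# Hypothetical training file with a target column 'y' and identifier columns
# 'Stock' and 'Market' (names assumed for illustration).
train = pd.read_csv('train.csv')
for col in ['Stock', 'Market']:
    # LightGBM treats pandas 'category' columns as categorical features by default
    train[col] = train[col].astype('category')

features = [c for c in train.columns if c != 'y']
model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
model.fit(train[features], train['y'])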

Some users chose to publish their models in very interesting blog posts:

One particular feature of this competition, which was mentioned in the discussion forums, is the inclusion of contemporaneous and even future data. Of course, in real financial forecasting this is not possible: one is restricted to making predictions using past data only. This is perhaps why the best models are able to explain such a large fraction (>40%) of the variance of the data.
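
By contrast, a strictly causal setup would only allow features built from past observations, for example by lagging values within each stock. A small illustrative sketch (with made-up column names) of what that looks like:

# Illustrative only: lag a feature so each row sees the previous day's value,
# never the same day's or a future value.
prices = pd.DataFrame({
    'Stock': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Day':   [1, 2, 3, 1, 2, 3],
    'x':     [1.0, 1.1, 0.9, 2.0, 2.2, 2.1],
})
prices = prices.sort_values(['Stock', 'Day'])
prices['x_lag1'] = prices.groupby('Stock')['x'].shift(1)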

In [7]:
# Histogram of users' public marks, with three baseline models marked for reference
df_users['PublicMark'].hist(bins=np.arange(0.25, 0.55, .01))
prop_cycle = plt.rcParams['axes.prop_cycle']
colors = prop_cycle.by_key()['color']
plt.axvline(0.467223, label='Linear regression', color=colors[1])
plt.axvline(0.412500, label='Stockwise mean', color=colors[2])
plt.axvline(0.305600, label='LightGBM', color=colors[3])
plt.legend()
plt.xlabel('PublicMark')
plt.ylabel('Number of people');

When designing the competition, one issue that we felt was important was to ensure that the PublicMark was a reliable indicator of the PrivateMark. Since the scores are based on random data, there will be random variation in the public and private marks. Financial data is naturally very noisy, so we tried to make the dataset big enough to minimize this problem. Here is a plot of the two marks for 1800 submissions. It is clear that the public and private marks are extremely well correlated with one another.

In [8]:
# Scatter of public vs private mark for each submission
df.plot('PublicMark', 'PrivateMark', xlim=(.25, .45), ylim=(.25, .45), kind='scatter', alpha=0.5);
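
The strength of that relationship can also be checked directly on the submission data, for example with a simple correlation:

# Correlation between public and private marks across all submissions
df[['PublicMark', 'PrivateMark']].corr()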

In addition to random variation, there could also be a systematic difference between the public and private marks. For example, since the public mark was visible throughout the duration of the contest, it may be possible for competitors to optimize for models with a better public mark but a worse private mark. If this were true, then the public mark would become lower relative to the private mark as the 'model complexity' increased. Model complexity is a general concept that depends on the type of model, but it could be a measure such as the number of training iterations or the number of features included. Unfortunately, the true model complexity is only known to the competitor who trained the model. Instead, we use the number of submissions as a proxy for it, the idea being that more submissions are more likely to lead to overfitting.

To measure this, we plot the difference between PublicMark and PrivateMark as a function of the number of submissions. There appears to be a slight trend for models with many submissions to have a better-than-average public score. However, the difference is smaller than the random variation in the scores, which suggests that overfitting to the public dataset was not a big problem in this competition.

In [9]:
def f(g):
    # Number each user's submissions in time order and keep the two marks
    return g.sort_values('SubmissionTime').reset_index(drop=True).reset_index()[['index', 'PublicMark', 'PrivateMark']]

g = df.groupby(df['UserId']).apply(f)
# Clip the differences to [0, 0.02] to limit the influence of outliers, then fit a trend line
y = (g['PublicMark'] - g['PrivateMark']).clip(0, .02)
x = g['index']
sns.regplot(x=x, y=y, marker='.')
plt.xlabel('Number of submissions')
plt.ylabel('PublicMark - PrivateMark');

That's all for now. Congratulations to the winner, and to everyone who participated. We hope to be back soon with another challenge!