TRANSFORMING RAW DATA INTO INSIGHT & ACTIONABLE INFORMATION
After reading the book Moneyball for the first time, I built a factor model in hopes of finding a way to finally be competitive in my fantasy baseball league - which I had consistently been terrible at. It worked immediately. By taking raw data and turning it into actionable information, I was able to solve a problem that had long perplexed me. It was like discovering a new power. What else could I do with this?
Today, I build models for everything and have come a long way since that first simple spreadsheet but still use a lot of the same concepts.
To build a traditional factor model, you would regress a dependent variable against a series of independent variables and use the resulting beta coefficients as the factor weights... assuming your resulting R-squared and t-tests showed a meaningful relationship, of course.
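A minimal sketch of that idea, using a single independent variable and made-up numbers (the data and variable names below are purely illustrative, not from my actual model):

```python
from statistics import mean

# Hypothetical data: regress a dependent variable y (say, weekly fantasy
# points) against one independent variable x. All figures are made up.
x = [1.0, 2.0, 3.0, 4.0, 5.0]   # independent variable
y = [2.1, 3.9, 6.2, 8.1, 9.8]   # dependent variable

# Ordinary least squares for one variable: the slope is the beta
# coefficient that would serve as the factor weight.
x_bar, y_bar = mean(x), mean(y)
beta = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        / sum((xi - x_bar) ** 2 for xi in x))
alpha = y_bar - beta * x_bar

# R-squared: the share of variance in y explained by the regression.
ss_res = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(f"beta={beta:.3f}, alpha={alpha:.3f}, R^2={r_squared:.3f}")
```

With more independent variables you would solve the full multiple-regression system instead, but the interpretation is the same: each beta becomes a factor weight, and R-squared and the t-statistics tell you whether the relationship is worth trusting.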
Variables typically fall into one of two categories... continuous or dichotomous. Dichotomous variables have only two possible outcomes, and when the dependent variable is dichotomous, it has to be modeled using a statistical method called logistic regression. Continuous variables, on the other hand, are not constrained. Think of daily high temperatures or the time of your daily commute as examples of independent continuous variables.
I use a principal component analysis (PCA) model - a first cousin of the factor model - to recognize technical trading patterns in stocks as they form. PCA modeling is mostly empirical as something like future stock prices can never be fully explained (as the market has recently reminded me).
My PCA uses a combination of continuous and dichotomous variables to recognize patterns as they form.
The continuous variables include things like normalized trends and variance to moving averages. However, because these are empirical metrics, I can't use beta coefficients as factor multiples to produce any type of meaningful information. Rather, I have to rely on the size of my universe of stocks (currently 400 names) to convert these continuous measurements into discrete variables. By doing this, I am able to evaluate individual measurements and occurrences in the context of a large population.
For example, let's say we have a universe of 5 stocks and want to identify which one is most likely to experience a mean reversion. Assuming we're only using one variable, variance to the 20-day moving average, we would collect all of their measurements and sort them high to low. The highest measurement would be assigned a discrete value of '5', the next would be assigned a discrete value of '4', and so on. Using these discrete values, we gain immediate insight into each stock in relation to the overall population.
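That ranking step can be sketched in a few lines. The tickers and variance figures below are made up for illustration:

```python
# Hypothetical 5-stock universe: variance to the 20-day moving average.
# Tickers and figures are made up.
universe = {"AAA": 0.042, "BBB": 0.118, "CCC": 0.007, "DDD": 0.095, "EEE": 0.061}

# Sort high to low, then assign discrete ranks: the largest measurement
# gets 5, the next gets 4, and so on down to 1.
ranked = sorted(universe.items(), key=lambda kv: kv[1], reverse=True)
discrete = {ticker: len(universe) - i for i, (ticker, _) in enumerate(ranked)}

print(discrete)  # e.g. the stock with the highest variance maps to 5
```

The same transform scales directly to a 400-name universe; only the range of discrete values grows.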
Again, dichotomous variables can only assume one of two possible values. Like flipping a coin, there are only two possible outcomes... except when a coin lands perfectly on its side which is amazing to witness, but I digress. However, unlike flipping a coin where the outcomes are equally likely if you're using a fair coin, dichotomous variables don't have to have equally probable outcomes.
In my PCA equity model, I use dichotomous variables to identify absolutes... like if a stock is trading above its 50-day moving average, or if the relationship between the 20-day and 50-day moving averages is positive or negative.
Here again, the size of my population comes into play because each outcome has to be assigned a value. Using our previous example of a 5-stock universe, we could assign a value of '1' to a positive observation and a value of '0' to a negative observation (this would constitute a Bernoulli variable, by the way). But if we have a population of 400 names and are using multiple variables, a simple '1' and '0' would probably not be effective in expressing the observation. Therefore, I typically use larger values scaled to the aggregate of the discrete variables.
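Here is a minimal sketch of that scoring idea. The prices, moving-average values, and weights below are all illustrative assumptions, not my actual parameters:

```python
# Hypothetical observations for one stock; all figures are made up.
price, ma20, ma50 = 104.2, 101.8, 99.5

# Dichotomous (Bernoulli) observations: each is 1 or 0.
above_50d = 1 if price > ma50 else 0       # trading above its 50-day MA?
positive_cross = 1 if ma20 > ma50 else 0   # 20-day above the 50-day?

# In a large universe with many variables, scale each '1' by a weight so
# the dichotomous signals aren't drowned out by the discrete rank values.
# These weights are illustrative only.
weights = {"above_50d": 10, "positive_cross": 15}
score = (above_50d * weights["above_50d"]
         + positive_cross * weights["positive_cross"])
print(score)
```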
Another type of variable that I have used in predictive modeling is what I refer to as a 'dispersion variable.' Again, by utilizing a population, we can compare individual observations and metrics relative to a larger population to produce insights.
Dispersion variables compare a measurement relative to the population and test whether that observation falls outside of a defined dispersion metric, like 1 standard deviation above the population mean or 1.65 standard deviations below the population mean. This differs from a dichotomous variable in that different values are assigned depending on which dispersion thresholds the observation falls above or below.
For example, when using dispersion variables to rank baseball players, I would take a population of 650 OPS (on base percentage + slugging percentage) measurements and determine which players were more than 2 standard deviations above and below the mean. Players above the +2 standard deviation metric would be assigned a positive value (somewhat extemporaneously, I'll admit)... let's say of '+5' and players below the -2 standard deviation metric would be assigned a value of '-5'. Next, players who were above the +1 standard deviation metric would receive a '+3' and players below -1 standard deviation would be assigned a '-3'. Measurements that fell inside of the +/- 1 standard deviation would be assigned a '0' value.
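The scoring scheme above can be sketched as a simple threshold function. To keep it self-contained, the example uses a small sample of made-up OPS figures rather than a full 650-player population:

```python
from statistics import mean, stdev

# Hypothetical OPS sample (made up); a real run would use the full
# population of ~650 measurements.
ops = [0.620, 0.700, 0.710, 0.725, 0.740, 0.750, 0.760, 0.775, 0.790, 0.980]
mu, sigma = mean(ops), stdev(ops)

def dispersion_score(x, mu, sigma):
    """+5/-5 beyond 2 standard deviations, +3/-3 beyond 1, else 0."""
    z = (x - mu) / sigma
    if z > 2:
        return 5
    if z < -2:
        return -5
    if z > 1:
        return 3
    if z < -1:
        return -3
    return 0

scores = [dispersion_score(x, mu, sigma) for x in ops]
print(list(zip(ops, scores)))
```

Note that with a sample this small the standard deviation is noisy; the technique only really pays off at population scale, where the tails are well defined.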
Doing this across a series of statistics and with a large enough population creates another layer of insight that was sometimes missed using other variable types.