Variable Types for Principal Component & Factor Modeling


TRANSFORMING RAW DATA INTO INSIGHT & ACTIONABLE INFORMATION

After reading the book Moneyball for the first time, I built a factor model in hopes of finding a way to finally be competitive in my fantasy baseball league - which I had consistently been terrible at.  It worked immediately.  By taking raw data and turning it into actionable information, I was able to solve a problem that had long perplexed me.  It was like discovering a new power.  What else could I do with this?

Today, I build models for everything and have come a long way since that first simple spreadsheet but still use a lot of the same concepts.

To build a traditional factor model, you would regress a dependent variable against a series of independent variables and use the resulting beta coefficients as the factor weights... assuming, of course, that the resulting R-squared and t-statistics showed a meaningful relationship.
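As a rough sketch of that regression step (the data below are simulated purely for illustration, and statsmodels is just one of several ways to do it), the workflow might look something like this in Python:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: 100 observations of a dependent variable and three
# candidate independent variables (the "factors").
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))                 # independent variables
y = X @ np.array([0.8, -0.3, 0.5]) + rng.normal(scale=0.5, size=100)

# Regress the dependent variable against the factors (with an intercept).
model = sm.OLS(y, sm.add_constant(X)).fit()

print(model.params)    # beta coefficients -> candidate factor weights
print(model.rsquared)  # overall fit: is the relationship meaningful?
print(model.tvalues)   # t-statistics for each coefficient
```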

Variables typically fall into one of two categories... continuous or dichotomous.  Dichotomous variables have only two possible outcomes and, when they serve as the dependent variable, have to be modeled using a statistical method called logistic regression.  Continuous variables, on the other hand, are not constrained.  Think of daily high temperatures or the length of your daily commute as examples of independent continuous variables.
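Here's a minimal sketch of that distinction, using scikit-learn and simulated data (both the library choice and the numbers are assumptions for illustration, not part of the model described in this post), fitting a dichotomous outcome with logistic regression:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated example: predict a dichotomous outcome (1 = up day, 0 = down day)
# from two continuous predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
print(clf.coef_, clf.intercept_)     # fitted log-odds coefficients
print(clf.predict_proba(X[:5]))      # predicted probabilities for a few rows
```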

I use a principal component analysis (PCA) model - a first cousin of the factor model - to recognize technical trading patterns in stocks as they form.  PCA modeling is mostly empirical as something like future stock prices can never be fully explained (as the market has recently reminded me).

My PCA uses a combination of continuous and dichotomous variables to recognize patterns as they form.
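To make that concrete, here's a minimal sketch of how a feature matrix for a universe of stocks could be pushed through a PCA. The feature values are random placeholders, and scikit-learn is simply an assumed tool, not necessarily the one used here:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder feature matrix: one row per stock in a 400-name universe, one
# column per variable (normalized trend, variance to a moving average,
# above/below moving-average flags, etc.).
rng = np.random.default_rng(1)
features = rng.normal(size=(400, 6))

# Standardize the columns, then project onto the leading principal components.
scores = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(features))
print(scores.shape)   # (400, 3): each stock summarized by three component scores
```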

The continuous variables include things like normalized trends and variance to moving averages.  However, because these are empirical metrics, I can't use beta coefficients as factor multipliers to produce any type of meaningful information.  Rather, I have to rely on the size of my universe of stocks (currently 400 names) to convert these continuous measurements into discrete variables.  By doing this, I am able to evaluate individual measurements and occurrences in the context of a large population.

For example, let's say we have a universe of 5 stocks and want to identify which one is most likely to experience a mean reversion.  Assuming we're only using one variable, variance to the 20-day moving average, we would collect all of their measurements and sort them high to low.  The highest measurement would be assigned a discrete value of '5', the next would be assigned a discrete value of '4', and so on.  Using these discrete values, we gain immediate insight into each stock in relation to the overall population.
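A quick sketch of that ranking step (the tickers and variance readings below are invented for illustration):

```python
# Hypothetical variance-to-20-day-moving-average readings for a 5-stock universe.
variance_to_20dma = {"AAA": 2.1, "BBB": 0.4, "CCC": 1.7, "DDD": 3.0, "EEE": 0.9}

# Sort high to low and assign discrete values: the highest reading gets 5,
# the next gets 4, and so on down to 1.
ranked = sorted(variance_to_20dma.items(), key=lambda kv: kv[1], reverse=True)
discrete = {ticker: len(ranked) - i for i, (ticker, _) in enumerate(ranked)}

print(discrete)   # {'DDD': 5, 'AAA': 4, 'CCC': 3, 'EEE': 2, 'BBB': 1}
```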

Again, dichotomous variables can only assume one of two possible values.  Like flipping a coin, there are only two possible outcomes... except when a coin lands perfectly on its side, which is amazing to witness, but I digress.  Unlike a coin flip, however, where the outcomes are equally likely (assuming a fair coin), dichotomous variables don't have to have equally probable outcomes.

In my PCA equity model, I use dichotomous variables to identify absolutes... like whether a stock is trading above its 50-day moving average, or whether the relationship between the 20-day and 50-day moving averages is positive or negative.

Here again, the size of my population comes into play because each outcome has to be assigned a value.  Using our previous example of a 5-stock universe, we could assign a value of '1' to a positive observation and a value of '0' to a negative observation (this would constitute a Bernoulli variable, by the way).  But if we have a population of 400 names and are using multiple variables, a simple '1' and '0' would probably not be effective in expressing the observation.  Therefore, I typically use larger values depending on the aggregate of the discrete variables.
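Here's a small sketch of that scoring idea; the prices, tickers, and the weight of 5 are all placeholders assumed for illustration:

```python
# Hypothetical closing prices and 50-day moving averages for a small universe.
close = {"AAA": 101.2, "BBB": 48.7, "CCC": 66.3}
ma_50 = {"AAA": 98.5, "BBB": 50.1, "CCC": 61.0}

# Dichotomous observation: 1 if trading above the 50-day moving average, else 0.
above_50dma = {t: int(close[t] > ma_50[t]) for t in close}

# In a large universe with many variables, a bare 1/0 can get lost, so the
# positive outcome is scaled up (the weight of 5 is purely illustrative).
WEIGHT = 5
scores = {t: above_50dma[t] * WEIGHT for t in close}
print(scores)   # {'AAA': 5, 'BBB': 0, 'CCC': 5}
```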

Another type of variable that I have used in predictive modeling is what I refer to as a 'dispersion variable.'  Here again, by utilizing a population, we can compare individual observations and metrics against the group as a whole to produce insights.

Dispersion variables compare a measurement relative to the population and test whether that observation falls outside of a defined dispersion metric, like 1 standard deviation above the population mean or 1.65 standard deviations below it.  This differs from a dichotomous variable in that specific values are assigned to metrics that fall above or below various thresholds.

For example, when using dispersion variables to rank baseball players, I would take a population of 650 OPS (on-base percentage + slugging percentage) measurements and determine which players were more than 2 standard deviations above or below the mean.  Players above the +2 standard deviation threshold would be assigned a positive value (chosen somewhat arbitrarily, I'll admit)... let's say '+5'... and players below the -2 standard deviation threshold would be assigned a value of '-5'.  Next, players between the +1 and +2 standard deviation thresholds would receive a '+3' and players between -1 and -2 standard deviations would be assigned a '-3'.  Measurements that fell inside of the +/- 1 standard deviation band would be assigned a '0' value.
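A minimal sketch of that scoring scheme (the OPS readings are simulated; only the thresholds and assigned values come from the description above):

```python
import numpy as np

def dispersion_score(value, mean, std):
    # Assign a value based on how many standard deviations from the mean a reading falls.
    z = (value - mean) / std
    if z > 2:
        return 5      # more than 2 standard deviations above the mean
    if z > 1:
        return 3      # between +1 and +2 standard deviations
    if z < -2:
        return -5     # more than 2 standard deviations below the mean
    if z < -1:
        return -3     # between -1 and -2 standard deviations
    return 0          # inside the +/- 1 standard deviation band

# Simulated population of 650 OPS readings.
rng = np.random.default_rng(7)
ops = rng.normal(loc=0.750, scale=0.080, size=650)
mean, std = ops.mean(), ops.std()

scores = [dispersion_score(x, mean, std) for x in ops]
print(dispersion_score(0.950, mean, std))   # well above +2 standard deviations -> 5
```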

Doing this across a series of statistics and with a large enough population creates another layer of insight that can be missed when using other variable types.
