Small-Sample Bias of the Two-Stage Least Squares IVEstimator
Instrumental variables (IV) is one of the first econometric methods most students learn. There are many ways to teach IV, but we usually introduce it using a two-state least squares (2SLS) procedure to help build students’ intuition about what’s going on.
In practice, there are a variety of practical limitations of the 2SLS estimator. For one, it’s biased in small samples. And since applied econometricians live in a world of small sample sizes this can be important in practice. Second, if your instruments aren’t sufficiently correlated with your regressors — a failure of instrument “relevance” — the small-sample bias can be magnified dramatically. In the extreme case of zero instrument relevance the math completely falls apart and the 2SLS estimator isn’t even identified.
Below are some notes summarizing the problem. The first 3 pages show the math for the simple case of one regressor one one instrument. The last 3 pages show the general case for K regressors and M > K instruments:
Small Sample Bias of the 2SLS IV Estimator
Posted by Andrew on Thursday June 13, 2013 | Feedback?
An Easy Derivation of the Volume of SpheresFormula
Working 2,000 years before the development of calculus, the Greek mathematician Archimedes worked out a simple formula for the volume of a sphere:
Of his many mathematical contributions, Archimedes was most proud of this result, even going so far as to ask that the method he used to work out the formula — a diagram circumscribing a cylinder inside a sphere — be imprinted on his gravestone.
Archimedes’ formula may have been a stroke of scientific genius in 250 B.C., but with the help of modern calculus the derivation is extremely simple. In this post I’ll explain one way to derive the famous formula, and explain how it can be done in dimensions other than the usual three.
Consider the diagram below. It’s a sphere with radius r. The goal is to find the volume, and here’s how we do that.
Notice that one thing we can easily find is the area of a single horizontal slice of the ball. This is the shaded disk at the top of the diagram, which is drawn at height z. The disk has a radius of x, which we’ll need to find the area of the disk. To find x, we can form a right triangle with sides z and x, and hypotenuse r. This is drawn in the figure. Then we can easily solve for x.
By the Pythagorean theorem, we know that , so solving for x we have . Then the area of the shaded disk is simply pi times the radius squared, or .
Now that we have the area of one horizontal disk, we want to find the area of all horizontal disks inside the ball summed together. That will give us the volume of the sphere. To do this, we simply take the definite integral of the disk area formula from above for all possible heights z, which are between -r (at the bottom of the ball) and r (at the top of the ball). That is, our volume is given by
Which is the volume formula we were looking for.
This same logic can be used to derive formulas for the volume of a “ball” in 4, 5, and higher dimensions as well. Doing so, you can show that the volume of a unit ball in one dimension (a line) is just 2; the volume in two dimensions (a disk) is , and — as we’ve just shown — the volume in three dimensions (a sphere) is . Continuing on to four, five, and ultimately n dimensions, a surprising result appears.
It turns out the volume of a unit ball peaks at five dimensions, and then proceeds to shrink thereafter, ultimately approaching zero as the dimension n goes to infinity. You can read more about this beautiful mathematical result here.
Addendum: You can find a clear explanation for how the volume-of-spheres formula generalizes to n dimensions on page 888-889 here.
A Simpler Proof of the Bolzano-WeierstrassTheorem
A while back I posted a long proof of the Bolzano-Weierstrass theorem — also known as the “sequential compactness theorem” — which basically says every sequence that’s bounded has a subsequence within it that converges. Here’s a much shorter and simpler version of it.
First we’ll prove a lemma that shows for any sequence we can always find a monotone subsequence — that is, a subsequence that’s always increasing or decreasing.
Lemma. Every sequence has a monotone subsequence.
Proof. Let be a sequence. Define a “peak” of as an element such that for all . That is is a peak if forever after that point going forward, there is no other element of the sequence that is greater than . Intuitively, think of shining a flashlight from the right onto the “mountain range” of a sequence’s plotted elements. If the light hits an element, that element is a peak.
If has infinitely many peaks, then collect those peaks into a subsequence . This is a monotone decreasing subsequence, as required.
If has finitely many peaks, take n to be the position of the last peak. Then we know is not a peak. So there exists an such that . Call this point . We also know that is not a peak either. So there also exists an such that . Call this point .
Continuing, we can create a subsequence that is monotone increasing. In either case — if our sequence has infinite or finitely many peaks — we can always find a monotone subsequence, as required.
Now that we’ve proved the above lemma, the proof of the Bolzano-Weierstrass theorem follows easily.
Theorem (Bolzano-Weierstrass).Every bounded sequence has a convergent subsequence.
Proof. By the previous lemma, every sequence has a monotone subsequence. Call this . Since is bounded by assumption, then the subsequence is also bounded. So by the monotone convergence theorem, since is monotone and bounded, it converges. So every bounded sequence has a convergent subsequence, completing the proof.
The Linear Algebra View of Calculus: Taking a Derivative with aMatrix
Most people think of linear algebra as a tool for solving systems of linear equations. While it definitely helps with that, the theory of linear algebra goes much deeper, providing powerful insights into many other areas of math.
In this post I’ll explain a powerful and surprising application of linear algebra to another field of mathematics — calculus. I’ll explain how the fundamental calculus operations of differentiation and integration can be understood instead as a linear transformation. This is the “linear algebra” view of basic calculus.
Taking Derivatives as a Linear Transformation
In linear algebra, the concept of a vector space is very general. Anything can be a vector space as long as it follows two rules.
The first rule is that if u and v are in the space, then u + v must also be in the space. Mathematicians call this “closed under addition.” Second, if u is in the space and c is a constant, then cu must also be in the space. This is known as “closed under scalar multiplication.” Any collection of objects that follows those two rules — they can be vectors, functions, matrices and more — qualifies as a vector space.
One of the more interesting vector spaces is the set of polynomials of degree less than or equal to n. This is the set of all functions that have the following form:
where a0…an are constants.
Is this really a vector space? To check, we can verify that it follows our two rules from above. First, if p(t) and q(t) are both polynomials, then p(t) + q(t) is also a polynomial. That shows it’s closed under addition. Second, if p(t) is a polynomial, so is c times p(t), where c is a constant. That shows it’s closed under scalar multiplication. So the set of polynomials of degree at most n is indeed a vector space.
Now let’s think about calculus. One of the first methods we learn is taking derivatives of polynomials. It’s easy. If our polynomial is ax^2 + 3x, then our first derivative is 2ax + 3. This is true for all polynomials. So the general first derivative of an nth degree polynomial is given by:
The question is: is this also a vector space? To answer that, we check to see that it follows our two rules above. First, if we add any two derivatives together, the result will still be the derivative of some polynomial. Second, if we multiply any derivative by a constant c, this will still be the derivative of some polynomial. So the set of first derivatives of polynomials is also a vector space.
Now that we know polynomials and their first derivatives are both vector spaces, we can think of the operation “take the derivative” as a rule that maps “things in the first vector space” to “things in the second vector space.” That is, taking the derivative of a polynomial is a “linear transformation” that maps one vector space (the set of all polynomials of degree at most n) into another vector space (the set of all first derivatives of polynomials of degree at most n).
If we call the set of polynomials , then the set of derivatives of this is , since taking the first derivative will reduce the degree of each polynomial term by 1. Thus, the operation “take the derivative” is just a function that maps . A similar argument shows that “taking the integral” is also a linear transformation in the opposite direction, from .
Once we realize differentiation and integration from calculus is really just a linear transformation, we can describe them using the tools of linear algebra.
Here’s how we do that. To fully describe any linear transformation as a matrix multiplication in linear algebra, we follow three steps.
First, we find a basis for the subspace in the domain of the transformation. That is, if our transformation is from , we first write down a basis for .
Next, we feed each element of this basis through the linear transformation, and see what comes out the other side. That is, we apply the transformation to each element of the basis, which gives the “image” of each element under the transformation. Since every element of the domain is some combination of those basis elements, by running them through the transformation we can see the impact the transformation will have on any element in the domain.
Finally, we collect each of those resulting images into the columns of a matrix. That is, each time we run an element of the basis through the linear transformation, the output will be a vector (the “image” of the basis element). We then place these vectors into a matrix D, one in each column from left to right. That matrix D will fully represent our linear transformation.
An Example for Third-Degree Polynomials
Here’s an example of how to do this for , the set of all polynomials of at most degree 3. This is the set of all functions of the following form:
where at a0…a3 are constants. When we apply our transformation, “take the derivative of this polynomial,” it will reduce the degree of each term in our polynomial by one. Thus, the transformation D will be a linear mapping from to , which we write as .
To find the matrix representation for our transformation, we follow our three steps above: find a basis for the domain, apply the transformation to each basis element, and compile the resulting images into columns of a matrix.
First we find a basis for . The simplest basis is the following: 1, t, t^2, and t^3. All third-degree polynomials will be some linear combination of these four elements. In vector notation, we say that a basis for is given by:
Now that we have a basis for our domain , the next step is to feed the elements of it into the linear transformation to see what it does to them. Our linear transformation is, “take the first derivative of the element.” So to find the “image” of each element, we just take the first derivative.
The first element of the basis is 1. The derivative of this is just zero. That is, the transformation D maps the vector (1, 0, 0, 0) to (0, 0, 0). Our second element is t. The derivative of this is just one. So the transformation D maps our second basis vector (0, t, 0, 0) to (1, 0, 0). Similarly for our third and fourth basis vectors, the transformation maps (0, 0, t^2, 0) to (0, 2t, 0), and it maps (0, 0, 0, t^3) to (0, 0, 3t^2).
Applying our transformation to the four basis vectors, we get the following four images under D:
Now that we’ve applied our linear transformation to each of our four basis vectors, we next collect the resulting images into the columns of a matrix. This is the matrix we’re looking for — it fully describes the action of differentiation for any third-degree polynomial in one simple matrix.
Collecting our four image vectors into a matrix, we have:
This matrix gives the linear algebra view of differentiation from calculus. Using it, we can find the derivative of any polynomial of degree three by expressing it as a vector and multiplying by this matrix.
For example, consider the polynomial . Note that the first derivative of this polynomial is ; we’ll use this in a minute. In vector form, this polynomial can be written as:
To find its derivative, we simply multiply this vector by our D matrix from above:
which is exactly the first derivative of our polynomial function!
This is a powerful tool. By recognizing that differentiation is just a linear transformation — as is integration, which follows a similar argument that I’ll leave as an exercise — we can see it’s really just a rule that linearly maps functions in to functions in .
In fact, all m x n matrices can be understood in this way. That is, an m x n matrix is just a linear mapping that sends vectors in into . In the case of the example above, we have a 3 x 4 matrix that sends polynomials in (such as ax^3 + bx^2 + cx +d, which has four elements) into the space of first derivatives in (in this case, 3ax^2 + 2bx +c, which has three elements).
For more on linear transformations, here’s a useful lecture from MIT’s Gilbert Strang.
A Simple Way of Solving Exact Ordinary DifferentialEquations
Exact differential equations are interesting and easy to solve. But you wouldn’t know it from the way they’re taught in most textbooks. Many authors stumble through pages of algebra trying to explain the method, leaving students baffled.
Thankfully, there’s an easier way to understand exact differential equations. Years ago, I tried to come up with the simplest possible way of explaining the method. Here’s what I came up with.
The entire method of solving exact differential equations can be boiled down to the diagram below: “Exact ODEs in a Nutshell.”
Recall that exact ODEs are ones that we can write as M(x.y) + N(x,y)*y’ = 0, where M and N are continuous functions, and y’ is dy/dx. Here is how to read the diagram.
Starting with an exact ODE, we’re on the second line labeled “starting point.” We have functions M and N, and our goal is to move upward toward the top line labeled “goal.” That is, given an exact ODE, we want to find a solution F(x,y) = c whose first partial derivatives are Fx (which is just the function M) and Fy (which is the function N).
Before we do anything, we check that our equation is really exact. To do this, we move to the bottom line labeled “test for exactness.” That is, we take the derivative of Fx = M with respect to y (giving us Fxy = My). And we take the derivative of Fy = N with respect to x (which gives us Fyx = Nx). Set these equal to each other. A basic theorem from calculus says that the mixed partial derivatives Fxy and Fyx will be the same for any function F(x,y). If they’re equal, F(x,y) on the top line is guaranteed to exist.
Now we can solve for the function F(x.y). The diagram makes it easy to see how. We know M(x,y) is just the first partial derivative of F with respect to x. So we can move upward toward F(x,y) by integrating M with respect to x. Similarly, we know the function N(x,y) is just the first partial derivative of F(x,y) with respect to y, so we can find another candidate for F by integrating N with respect to y.
In the end, we’ll have two candidates for F(x,y). Sometimes they’re the same, in which case we’re done. Sometimes they’re different, as one will have a term the other won’t have — a term that got dropped to zero as we differentiated from F(x,y) to either Fx or Fy, since it’s a function of only one of x or y. This is easy to solve: just combine all the terms from both candidates for F(x,y), omitting any duplicate terms. This will be our solution F(x,y) = c.
Try using this method on a few examples here. I think you’ll find it’s much simpler — and easier to remember years later — than the round-about method used in most textbooks.
A More Elegant View of the CorrelationCoefficient
One of the first things students learn in statistics is the “correlation coefficient” r, which measures the strength of the relationship between two variables. The formula given in most textbooks is something like the following:
where x and y are the data sets we’re trying to measure the correlation of.
This formula can be useful, but also has some major disadvantages. It’s complex, hard to remember, and gives students almost no insight into what the correlation coefficient is really measuring. In this post I’ll explain an alternative way of thinking about “r” as the cosine of the angle between two vectors. This is the “linear algebra view” of the correlation coefficient.
A Different View of Correlation
The idea behind the correlation coefficient is that we want a standard measure of how “related” two data sets x and y are. Rather than thinking about data sets, imagine instead that we place our x and y data into two vectors u and v. These will be two n-dimensional arrows pointing through space. The question is: how “similar” are these arrows to each other? As we’ll see below, the answer is given by the correlation coefficient between them.
The figure below illustrates the idea of measuring the “similarity” of two vectors v1 and v2. In the figure, the vectors are separated by an angle theta. A pretty good measure of how “similar” they are is the cosine of theta. Think about what cosine is doing. If both v1 and v2 point in roughly the same direction, the cosine of theta will be about 1. If they point in opposite directions, it will be -1. And if they are perpendicular or orthogonal, it will be 0. In this way, the cosine of theta fits our intuition pretty well about what it means for two vectors to be “correlated” with each other.
What is the cosine of theta in the figure? From the geometry of right triangles, recall that the cosine of an angle is the ratio of the length of the adjacent side to the length of the hypotenuse. In the figure, we form a right triangle by projecting the vector v1 down onto v2. This gives us a new vector p. The cosine of theta is then given by:
Now suppose we’re interested in the correlation between two data sets x and y. Imagine we normalize x and y by subtracting from each data point the mean of the data set. Let’s call these new normalized data sets u and v. So we have:
The question is, how “correlated” or “similar” are these vectors u and v to each other in space? That is, what is the cosine of the angle between u and v? This is simple: from the formula derived above, the cosine is given by:
But since and , this means the cosine of theta is just the correlation coefficient between the two vectors u and v, or:
From this perspective, the correlation coefficient has an elegant geometric interpretation. If two data sets are positively correlated, they should roughly “point in the same direction” when placed into n-dimensional vectors. If they’re uncorrelated, they should point in directions that are orthogonal to each other. And if they’re negatively correlated, they should point in roughly opposite directions.
The cosine of the angle between two vectors nicely fits that intuition about correlation. So it’s no surprise the two ideas are ultimately the same thing — a much simpler interpretation of “r” than the usual textbook formulas.
Deriving the Logistic Population GrowthModel
Developing a good model for population growth is a pretty common problem faced by economists doing applied work. In this post, I’ll walk through the derivation of a simple but flexible model I’ve used in the past known as the “logistic” model of population growth.
The Naïve Model: Exponential Growth
The simplest way of modeling population is to assume “exponential” growth. That is, just assume population grows by some annual rate, forever. If we let “y” be a city’s population and “k” be the annual growth rate, the exponential growth model is given by
This is a simple first-order differential equation. We can solve this for “y” by using a technique called “separation of variables”. First, we separate variables like this:
Then we integrate both sides and solve for y, as follows:
Since C is just an arbitrary constant, we can let e^C just equal C, which gives us
where k is the annual growth rate, t is the number of years from today, and C is the population at time t=0. This is the famous “exponential growth” model.
While the exponential model is useful for short-term forecasts, it gives unrealistic estimates for long time periods. After just a few decades, population would rapidly grow toward infinity in this model. A more realistic model should capture the idea that population does not grow forever, but instead levels off around some long-term level. This leads us to our second model.
A Better Model: Logistic Growth
We can improve the above model by making a simple adjustment. Let “A” be the maximum long-term population a city can reasonably sustain. Then multiply the model above by a factor (1 – y/A), giving us
In this model, the population starts out growing exponentially. But as “y” approaches the maximum level “A”, the term (1 – y/A) approaches zero, slowing down the growth rate. In the long run, growth will slow to a crawl as cities approach their maximum sustainable size — a much more reasonable way to model population growth. This is known as the “logistic” model.
To solve for “y,” we can again use separation of variables. However, we’ll first need to use a trick from algebra known as the “partial fractions decomposition.”
An Aside: The Partial Fractions Decomposition
The partial fractions decomposition is a theorem about rational functions of the form P(x)/Q(x). Here is what it says. If P(x) and Q(x) are polynomials, and P(x)/Q(x) is “proper” — that is, the order of P(x) is less than the order of Q(x) — then we can “decompose” P(x)/Q(x) as follows:
where a1…an are the n roots of the polynomial Q(x), and C1…Cn are constants. Using this theorem, we can decompose hard-to-handle rational functions into much simpler pieces — something we’ll need to do to solve the logistic population model above.
Back to the Model: Solving the Logistic Equation
Recall that the logistic population model is given by:
Separating variables, we have:
The term on the left-hand side is hard to integrate as written. Since it’s a proper rational polynomial function, we can now use the partial fractions decomposition to simplify it. By the theorem above, we can rewrite it as:
To solve for C1 and C2, first multiply both sides by y(1 – y/A) to clear the denominators, like this:
This equation is true for all values of y. To solve for C1 and C2, simply plug in values for y that allow us to solve for them. To solve for C1, let y = 0. This “zeros out” C2 in the equation and lets us solve for C1, as follows:
To solve for C2 we repeat the process, plugging in a value for y that “zeros out” C1. To do this, Let y = A, and solve for C2 as follows:
Using these constants, now we can rewrite our original function using the partial fractions decomposition as follows:
This simpler function can then be plugged into our integration problem above, allowing us to integrate the logistic model and solve for y. Returning to our problem, we have:
Integrating both sides and solving for y, we have:
To solve for “C” in the equation, note that if we let t=0, C = y0/(1 – y0/A) where y0 is the beginning population. Plugging in and solving for y, we then have,
This is the famous “logistic model” of population growth. This basic model can then be used to develop pretty reasonable long-term forecasts for city populations.
Converting Quarterly Data into Monthly Using Cubic SplineInterpolation
(See this post at Columbia Economics, LLC for the related comment thread.)
A common problem with time-series data is getting them into the right time interval. Some data are daily or weekly, while others are in monthly, quarterly or annual intervals. Since most regression models require consistent time intervals, an econometrician’s first job is usually getting the data into the same frequency.
In this post I’ll explain how to solve a common problem I’ve run into: how to divide quarterly data into monthly data. To do so, we’ll use a method known as “cubic spline interpolation.” In the example below we use Matlab and Excel. For Stata users, I’ve posted a Stata do file that illustrates how to work through the below example in Stata.
Cubic Spline Interpolation
One of the most widely used data sources in economics is the National Income and Product Accounts (NIPAs) from the U.S. Bureau of Economic Analysis. They’re the official source for U.S. GDP, personal income, trade flows and more. Unfortunately, most data are published only quarterly or annually. So if you’re hoping to run a regression using monthly observations — for example, this simple estimate of the price elasticity of demand for gasoline — you’ll need to split these quarterly data into monthly ones.
A common way to do this is by “cubic spline interpolation.” Here’s how it works. We start with n quarterly data points. That means we have n-1 spaces between them. Across each space, we draw a unique 3rd-degree (or “cubic”) polynomial connecting the two points. This is called a “piecewise polynomial” function.
To make sure our connecting lines form a smooth line, we force all our first and second derivatives to be continuous; that is, at each connecting point we make them equal to the derivitive on either side. When all these requirements are met — along with a couple end-point conditions you can read about here — we have a (4n-4) x (4n-4) linear system that can be solved for the coefficients of all n-1 cubic polynomials.
Once we have these n-1 piecewise polynomials, we can plug in x values for whatever time intervals we want: monthly, weekly or even daily. The polynomials will give us a pretty good interpolation between our known quarterly data points.
An Example Using MATLAB
While the above method seems simple, doing cubic splines by hand is not. A spline for just four data points requires setting up and solving a 12 x 12 linear system, then manually evaluating three different polynomials at the desired x values. That’s a lot of work. To get a sense of how hard this is, here’s my own Excel file showing what’s involved in fitting a cubic spline to four data points by hand.
In practice, the best way to do a cubic spline is to use MATLAB. It takes about five minutes. Here’s how to do it.
MATLAB has a built-in “spline()” function that does the dirty work of cubic spline interpolation for you. It requires three inputs: a list of x values from the quarterly data you want to split; a list of y values from the quarterly data; and a list of x values for the monthly time intervals you want. The spline() function formulates the n-1 cubic polynomials, evaluates them at your desired x values, and gives you a list of interpolated monthly y values.
Here’s an Excel file showing how to use MATLAB to split quarterly data into monthly. In the file, the first two columns are quarterly values from BEA’s Personal Income series. Our goal is to convert these into monthly values. The next three columns (highlighted in yellow) are the three inputs MATLAB needs: the original quarterly x values (x); the original quarterly y values (y); and the desired monthly x values (xx).
In the Excel file, note that the first quarter is listed as month 2, the second quarter as month 5, and so on. Why is this? BEA’s quarterly data represent an average value over the three-month quarter. That means they should be treated as a mid-point of the quarter. For Q1 that’s month 2, for Q2 that’s month 5, and so on.
The next step is to open MATLAB and paste in these three columns of data. In MATLAB, type ” x = [ “, cut and paste the column of x values in from Excel, type ” ] ” and hit return. This creates an n x 1 vector with the x values. Repeat this for the y, and xx values in the Excel file.
Once you have x, y, and xx defined in MATLAB, type “yy = spline(x,y,xx)” and hit return. This will create a new vector yy with the interpolated monthly y values we’re looking for. Each entry in yy will correspond to one of the x values you specified in the xx vector.
Copy these yy values from MATLAB, paste them into Excel, and we’re done. We now have an estimated monthly Personal Income series.
Here’s an Excel file summarizing the above example for splitting quarterly Personal Income data into monthly using MATLAB. Also, here’s a MATLAB file with the x, y, xx, and yy vectors from the above exercise.
Note: For Stata users, here’s a “do” file with an example that performs the above cubic spline interpolation in mata.
The Linear Algebra View of the FibonacciSequence
The Fibonacci sequence is a beautiful mathematical concept, making surprise appearances in everything from seashell patterns to the Parthenon. It’s easy to write down the first few terms — it starts with 0, 1, 1, 2, 3, 5, 8, 11, with each term equal to the sum of the previous two. But what about the 100th or 100,000th term? Can we find them without doing thousands of calculations?
In a previous post, I explained how simple linear models can help us understand where systems like this are headed in the long run. In this post, I’ll explain how the same tools can help us see the “long run” values of the Fibonacci sequence. The result is an elegant model that illustrates the connection between the Fibonacci sequence and the so-called “golden ratio”, an aesthetic principle appearing throughout art, architecture, music, book design and more.
Setting Up the Model
Each term in the Fibonacci sequence equals the sum of the previous two. That is, if we let Fn denote the nth term in the sequence we can write . To make this into a linear system, we need at least one more equation. Let’s use an easy one: . Putting these together, here’s our system of linear equations for the Fibonacci sequence:
We can write this in matrix notation as the following:
Here’s what this is saying. If we start with the vector on the right and multiply it by this 2 x 2 matrix, the result is the vector on the left. That is, multiplying our starting vector by the matrix above gives us the next element in the sequence. The n-1 element from the right becomes the n element on the left, and the n-2 element becomes the n-1 element. Each time we multiply by this matrix we’re moving the system from one term of the Fibonacci sequence to the next.
As explained here, the general form for dynamic linear models like this is:
where u0 is the initial state, u1 is the final state, and A is the transformation matrix that moves us from one state to the next. As explained in the earlier post, we can use eigenvalues and eigenvectors to solve for the long run values of systems like this.
Here’s how we do that. Let S be a 2 x 2 matrix where the columns are the two eigenvectors of our transformation matrix A. And let be a 2 x 2 matrix with the two eigenvalues of A along the diagonal and zeros elsewhere. By the definition of eigenvalues and eigenvectors, we have the following identity:
Solving for A, we have:
Plugging this into our above equation relating u0 and u1, we have the following system:
This equation relates the initial state vector u0 to the next state u1. But each time we multiply by A it moves us to the next state, and the next state, and so on. Imagine we multiply k times. We get the following general form:
This equation is what we’re looking for. It allows us to quickly find the kth term in the Fibonacci sequence with a simple calculation. It relies only on the initial state vector u0 and the eigenvalues and eigenvectors of the transformation matrix A. That’s our linear model.
Putting the Pieces Together
Now that we have a model, let’s find the S, and other parts that make it up.
The first step is to find the eigenvalues of the A matrix. This is given by:
We find the eigenvalues by solving Ax = lambda x for the vector lambda that makes that equation true (that’s the definition of an eigenvalue). That’s the same as solving Ax – lambda x = 0, or (A – lambda)x = 0. This implies the matrix (A – lambda) is singular or not invertable. So the determinant of (A – lambda) must be zero. That means we can find the eigenvalues by setting the determinant of (A – lambda) equal to zero and solving for lambda (this is called the “characteristic polynomial”). Here’s how we do that:
Plugging this into the quadratic formula with a = 1, b = -1 and c = -1, you’ll get the following two solutions, which are our two eigenvalues:
For a quick check that these are right, note that the trace of A is 1. The sum of the two eigenvalues is also 1. Since the eigenvalues must always sum to the trace, we’ve got the right values here.
The next step is to find the two eigenvectors that correspond to the eigenvalues. Let’s start with . To do this, we write the following:
This implies that . Our “free” variable here is x2, so we’ll let that equal 1. Then we can solve for x1. We get the following:
Using the old algebra trick for the difference of squares — that is, — we can simplify this by multiplying both numerator and denominator by as follows:
So our first eigenvector v1 is equal to:
Following the same process for give us the other eigenvector:
Note that the vectors v1 and v2 are orthogonal to each other. That is, the dot product between them is zero. This comes from a basic theorem of linear algebra: every symmetric matrix will have orthogonal eigenvectors. Since A is symmetric, our two eigenvectors are perpendicular to each other.
Now that we have the eigenvalues and eigenvectors, we can write the S and matrices as follows:
To complete our model, we also need to know the inverse of the S matrix. Thankfully, there’s a simple formula for the inverse of a 2 x 2 matrix. If A is given by:
then the inverse of A is found by transposing the diagonal terms, putting a negative sign on the off-diagonal terms, and multiplying by 1/(determinant of A), or
Using that formula, we get:
As explained in our earlier post, our final step is to write our initial state vector u0 as a combination of the columns of S. That is, we can write:
where c is a 2 x 1 vector of scalars. Solving for c, we get:
For this example, let’s let our initial state vector u0 be (1,1). These are the second and third terms of the Fibonacci sequence. Note that you can use any two subsequent terms for this step — I’m just using (1,1) because I like the way the math works for it. So we have:
Since , that means
Putting everything together, we can write our final model for the Fibonacci sequence:
Multiplying this out, we get the following extensive form of the model:
This equation gives the k+3 and k+2 terms of the Fibonacci sequence as a function of just one variable: k. This allows us to easily find any term we’d like — just plug in k. For example, imagine we want the 100th term. Simply let k = 98, and solve the above for F101 and F100. This is a huge number, by the way — 218,922,995,834,555,000,000 — something you can easily verify in this Excel spreadsheet. (Note that whether this is the 99th or 100th term depends on whether you label 0 or 1 to be the first term of the sequence; here I’ve made zero the first term, but many others use 1.)
So what happens to the above system as k grows very large? Do the terms in the Fibonacci sequence display some regular pattern as we move outward?
All changes from one term to the next are determined by k in the above model. Imagine k grows very large. In the second term in the equation, we can see that
That means that as k gets bigger, the second term in the equation goes to zero. That leaves only the first term. That means as
So as k grows large, the ratio of the k+1 to the k term in the Fibonacci sequence approaches a constant, or
This is a pretty amazing result. As we move far out in the Fibonacci sequence, the ratio of two subsequent terms approaches a constant. And that constant is equal to the first eigenvalue of our linear system above.
More importantly, this value is also equal to the famous “golden ratio”, which appears in myriad forms throughout Western art, architecture, music and more. In the limit, the ratio of subsequent terms in the Fibonacci sequence equals to the golden ratio — something that’s not obvious at all, but which we can verify analytically using our model above.
If you’d like to see how this works in a spreadsheet, here’s an Excel file where you can plug in values for k and find the k+2 and k+1 terms in the sequence.
Deriving the Poisson Distribution from the BinomialDistribution
At first glance, the binomial distribution and the Poisson distribution seem unrelated. But a closer look reveals a pretty interesting relationship. It turns out the Poisson distribution is just a special case of the binomial — where the number of trials is large, and the probability of success in any given one is small.
In this post I’ll walk through a simple proof showing that the Poisson distribution is really just the binomial with n approaching infinity and p approaching zero.
The binomial distribution works when we have a fixed number of events n, each with a constant probability of success p. Imagine we don’t know the number of trials that will happen. Instead, we only know the average number of successes per time period. So we know the rate of successes per day, but not the number of trials n or the probability of success p that led to that rate.
Define a number . Let this be the rate of successes per day. It’s equal to np. That’s the number of trials n — however many there are — times the chance of success p for each of those trials. Think of it like this: if the chance of success is p and we run n trials per day, we’ll observe np successes per day on average. That’s our observed success rate lambda.
Recall that the binomial distribution looks like this:
As mentioned above, let’s define lambda as follows:
Solving for p, we get:
What we’re going to do here is substitute this expression for p into the binomial distribution above, and take the limit as n goes to infinity, and try to come up with something useful. That is,
Pulling out the constants and and splitting the term on the right that’s to the power of (n-k) into a term to the power of n and one to the power of -k, we get
Now let’s take the limit of this right-hand side one term at a time. We’ll do this in three steps. The first step is to find the limit of
In the numerator, we can expand n! into n terms of (n)(n-1)(n-2)…(1). And in the denominator, we can expand (n-k) into n-k terms of (n-k)(n-k-1)(n-k-2)…(1). That is,
Written this way, it’s clear that many of terms on the top and bottom cancel out. The (n-k)(n-k-1)…(1) terms cancel from both the numerator and denominator, leaving the following:
Since we canceled out n-k terms, the numerator here is left with k terms, from n to n-k+1. So this has k terms in the numerator, and k terms in the denominator since n is to the power of k. Expanding out the numerator and denominator we can rewrite this as:
This has k terms. Clearly, every one of these k terms approaches 1 as n approaches infinity. So we know this portion of the problem just simplifies to one. So we’re done with the first step.
The second step is to find the limit of the term in the middle of our equation, which is
Recall that the definition of e = 2.718… is given by the following:
Our goal here is to find a way to manipulate our expression to look more like the definition of e, which we know the limit of. Let’s define a number x as . Now let’s substitute this into our expression and take the limit as follows:
This terms just simplifies to e^(-lambda). So we’re done with our second step. That leaves only one more term for us to find the limit of. Our third and final step is to find the limit of the last term on the right, which is
This is pretty simple. As n approaches infinity, this term becomes 1^(-k) which is equal to one. And that takes care of our last term.
Putting these three results together, we can rewrite our original limit as
This just simplifies to the following:
This is equal to the familiar probability density function for the Poisson distribution, which gives us the probability of k successes per period given our parameter lambda. So we’ve shown that the Poisson distribution is just a special case of the binomial, in which the number of n trials grows to infinity and the chance of success in any particular trial approaches zero. And that completes the proof.
A Simple Explanation of Why Lagrange MultipliersWorks
The method of Lagrange multipliers is the economist’s workhorse for solving optimization problems. The technique is a centerpiece of economic theory, but unfortunately it’s usually taught poorly. Most textbooks focus on mechanically cranking out formulas, leaving students mystified about why it actually works to begin with.
In this post, I’ll explain a simple way of seeing why Lagrange multipliers actually do what they do — that is, solve constrained optimization problems through the use of a semi-mysterious Lagrangian function.
Before you can see why the method works, you’ve got to know something about gradients. For functions of one variable there is — usually — one first derivative. For functions of n variables, there are n first derivatives. A gradient is just a vector that collects all the function’s partial first derivatives in one place.
Each element in the gradient is one of the function’s partial first derivatives. An easy way to think of a gradient is that if we pick a point on some function, it gives us the “direction” the function is heading. If our function is labeled , the notation for the gradient of f is .
The most important thing to know about gradients is that they always point in the direction of a function’s steepest slope at a given point. To help illustrate this, take a look at the drawing below. It illustrates how gradients work for a two-variable function of x1 and x2.
The function f in the drawing forms a hill. Toward the peak I’ve drawn two regions where we hold the height of f constant at some level a. These are called level curves of f, and they’re marked f = a1, and f = a2.
Imagine yourself standing on one of those level curves. Think of a hiking trail on a mountainside. Standing on the trail, in what direction is the mountain steepest? Clearly the steepest direction is straight up the hill, perpendicular to the trail. In the drawing, these paths of steepest ascent are marked with arrows. These are the gradients at various points along the level curves. Just as the steepest hike is always perpendicular to our trail, the gradients of f are always perpendicular to its level curves.
That’s the key idea here: level curves are where , and .
How the Method Works
To see how Lagrange multipliers work, take a look at the drawing below. I’ve redrawn the function f from above, along with a constraint g = c. In the drawing, the constraint is a plane that cuts through our hillside. I’ve also drawn a couple level curves of f.
Our goal here is to climb as high on the hill as we can, given that we can’t move any higher than where the constraint g = c cuts the hill. In the drawing, the boundary where the constraint cuts the function is marked with a heavy line. Along that line are the highest points we can reach without stepping over our constraint. That’s an obvious place to start looking for a constrained maximum.
Imagine hiking from left to right on the constraint line. As we gain elevation, we walk through various level curves of f. I’ve marked two in the picture. At each level curve, imagine checking its slope — that is, the slope of a tangent line to it — and comparing that to the slope on the constraint where we’re standing. If our slope is greater than the level curve, we can reach a higher point on the hill if we keep moving right. If our slope is less than the level curve — say, toward the right where our constraint line is declining — we need to move backward to the left to reach a higher point.
When we reach a point where the slope of the constraint line just equals the slope of the level curve, we’ve moved as high as we can. That is, we’ve reached our constrained maximum. Any movement from that point will take us downhill. In the figure, this point is marked with a large arrow pointing toward the peak.
At that point, the level curve f = a2 and the constraint have the same slope. That means they’re parallel and point in the same direction. But as we saw above, gradients are always perpendicular to level curves. So if these two curves are parallel, their gradients must also be parallel. That means the gradients of f and g both point in the same direction, and differ at most by a scalar. Let’s call that scalar . That is,
Solving for zero, we get
This is the condition that must hold when we’ve reached the maximum of f subject to the constraint g = c.
Now, if we’re clever we can write a single equation that will capture this idea. This is where the familiar Lagrangian equation comes in:
or more explicitly,
To see how this equation works, watch what happens when follow the usual Lagrangian procedure.
First, we find the three partial first derivatives of L,
and set them equal to zero. That is, we need to set the gradient equal to zero.
To find , we take the three partial derivatives of L with respect to x1, x2 and lambda. Then we place each as an element in a 3 x 1 vector. That gives us the following:
Recall that we have two “rules” to follow here. First, the gradients of f and g must point in the same direction, or . And second, we have to satisfy our constraint, or .
The first and second elements of make sure the first rule is followed. That is, they force , assuring that the gradients of f and g both point in the same direction. The third element of is simply a trick to make sure g = c, which is our constraint. In the Lagrangian function, when we take the partial derivative with respect to , it simply returns back to us our original constraint equation.
At this point, we have three equations in three unknowns. So we can solve this for the optimal values of x1 and x2 that maximize f subject to our constraint. And we’re done. So the bottom line is that Lagrange multipliers is really just an algorithm that finds where the gradient of a function points in the same direction as the gradients of its constraints, while also satisfying those constraints.
As with most areas of mathematics, once you see to the bottom of things — in this case, that optimization is really just hill-climbing, which everyone understands — things are a lot simpler than most economists make them out to be.
Deriving the Trigonometric Functions WithoutGeometry
Most people first meet sine and cosine in a basic geometry class. The lesson starts with a right triangle, and derives sine and cosine diagramtically using angles, lines and ratios. This is the “triangle view” of the trigonometric functions.
Most people are suprised to learn that’s not the only way to derive them. While the triangle view of sine and cosine is intuitive, it actually conceals many interesting properties. But more importantly, the focus on lines and angles gives many people the impression that sine and cosine are “something from geometry” — not objects with their own interesting properties and deep connections to other branches of math.
In this post, I’ll explain a more elegant way of deriving sine and cosine that uses no geometry. This is the “real analysis” view. We’ll start with a single assumption, and develop them analytically based on our earlier discussion of the Taylor series expansion (read here).
Imagine a function with the following property. The function’s second derivative is equal to the negative of the function itself. That is,
This is our starting assumption. It’s just another way of saying that sine and cosine — whatever they are — need to have this one property. If we start with sine or cosine and take their derivative twice, we should end up back at the original function with a negative sign on it. Our goal is to show how to get from this assumption to actual algebraic functions for sine and cosine.
Let’s take a closer look at our assumption. It relates the function to its second derivative — a simple differential equation. But what about the other derivitives? To find them, let’s take the next few derivatives of our assumption and see what happens.
Differentiating both sides and solving each time for f”, f”’ and so on, we get the following
We can already see the pattern here. As we take more and more derivatives, we always end up back at either f or f’. The only thing that changes is the positive or negative sign in front.
Now that we have n derivitives of f, we’re ready to use our earlier finding about Taylor series expansions and write an expression for f.
Recall the Taylor expansion of f around the point c is given by
Letting c = 0 — which is technically a Maclaurin rather than a Taylor — we get the following expansion
But as we’ve seen above, we already have expressions for the second, third and so on derivitives of f. They’re all either f, -f, f’ or -f’. Substituting these in, we get the following
Notice the pattern in the signs. The expansion starts with two positive terms, then has two negative terms, then has two positive terms, and so on. The pattern repeats forever, and each term involves only f or f’.
Since the function is really based two parameters — f(0) and f'(0) — we can rewrite this in a more useful way as follows. Label our two parameters c1 and c2. That is, let f(0) = c1, and let f'(0) = c2. Rewriting with this notation we get
Now let’s pick two sets of values for our parameters c1 and c2. These will define two distinct functions, both of which will satisfy our initial assumption.
First, let c1 = 0, and c2 = 1. Plugging into the above, we get the following function for sine
This is the sine function we’re looking for. When x = 0, the above function is zero. When x = pi/2 the function is one. And it works for all real values of x.
For our second case, let c1 = 1, and c2 = 0. Plugging this into our expansion, we get the following for cosine:
This is the cosine function we’re looking for. When x = 0, the function is one. When x = pi/2, it’s equal to zero. And like the sine function above, it works for all .
We can easily check that these both satisfy our initial assumption that f” = -f. And we can also verify another basic relationship between sine and cosine from calculus by taking the first derivative of each, which is that
Now that we’ve developed functions for sine and cosine, we can go on to derive all the other trigonometric functions — tangent, cosecent, and so on — from these.
For more on the “real analysis” view of sine and cosine, see this famous primer, especially Chapter 7, Section 4.
A Sewing-Circle Method for EstimatingPi
Here’s another method for estimating pi that basically requires a hardwood floor, a bag of sewing needles and plenty of time.
Imagine standing over a hardwood floor. Notice the parallel lines between the boards. Now imagine dropping a single sewing needle onto it. It bounces, and comes to rest. Question: What is the probability the needle will come to rest on a line?
The picture below illustrates the problem. The distance between the lines is marked d, and the length of the needle is marked k. Let’s assume k is less than or equal to d. The illustration shows two possible resting points for the needle.
Looking at the picture, it’s easy to see how to approach the problem. All we care about is the horizontal “width” of the needle when it comes to rest, relative to the total distance between the lines d. The ratio of the former to the latter will be the probability of landing on a line.
In the picture, the horizontal width of the needle is given by . We use the absolute value since physical length is always positive. This is marked in the picture.
Here’s how we reason this out. First, the probability that the needle will land at any particular angle is just . And given the angle it happens to land at, the probability it will cross a line is , which is just the horizontal width of the needle divided by the distance between lines. (Note this only works for k <= d.)
Putting these two probabilities together — since both events must occur at the same time — we get the probability that a particular needle at angle theta will cross a line as
But this is only for one theta. What about the others? To find the probability that a dropped needle will hit a line for every possible theta, we have the following integral
Pulling out the constants, we can re-write this as
The key to simplifying this is to note the following relationship between ordinary cosine and the absolute value of cosine. For ordinary cosine, the integral between 0 and 2pi is just zero — half the time it’s positive over that range, and half the time it’s negative. But we want the absolute value of cosine, which is always positive.
The picture below illustrates how to translate between the two.
The picture shows regular cosine plotted between 0 and 2pi. In absolute value terms, the areas under the curve marked 1, 2, 3, and 4 are all equal. The two positive areas exactly equal the two negative areas. That’s another way of saying that the integral of from 0 to 2pi is equal to four times the integral of from 0 to pi/2.
That means we can re-write our integral as
So the probability that a needle will land on a line is just , as long as k is less than or equal to d.
Now imagine we’re clever and choose sewing needles exactly the length of the width of boards in our floor. In that case, k = d. Then the probability it will hit a line becomes
Now we have an expression we can use to estimate pi. The process is simple.
Start tossing needles on the floor. Count how many land on the lines, as a percentage of total throws. That’s P(k,d). Then divide this number into 2. The result will be pretty close to pi. The more needles you throw, the better the estimate will get.
I won’t bore you with the enormous literature of mathematicians actually trying to estimate pi using this awkward method. But for those who want to see how this works with a Java-based simulation, see here.
An Easy Way to Remember the Taylor SeriesExpansion
The Taylor expansion is one of the most beautiful ideas in mathematics. The intuition is simple: most functions are smooth over ranges we’re interested in. And polynomials are also smooth. So for every smooth function, we should be able to write down a polynomial that approximates it pretty well. And in fact, if our polynomial contains enough terms it will exactly equal the original function. Since polynomials are easier to work with than almost any other kind of function, this usually turns hard problems into easy ones.
The Taylor formula is the key. It gives us an equation for the polynomial expansion for every smooth function f. However, while the intuition behind it is simple, the actual formula is not. It can be pretty daunting for beginners, and even experts have a hard time remembering if they haven’t seen it for a while.
In this post, I’ll explain a quick and easy trick I use to re-derive the Taylor formula from scratch whenever I have trouble remembering it.
Starting from Scratch
The idea behind the Taylor expansion is that we can re-write every smooth function as an infinite sum of polynomial terms. The first step is therefore to write down a general nth-degree polynomial. Here it is:
Where a0, a1, … are coefficients on each polynomial term, and c is a constant that represents where along the x-axis we want to start our approximation (if we don’t care where we start, just let c = 0, which is technically known as a Maclaurin rather than a Taylor). This series — known as a “power series” — can be written in closed form as the following:
The goal here is to find a clever way to find the coefficients a0, a1, … in that equation, given some function f and an initial value of c. Here is the logic for doing that.
Polynomials are smooth, so that guarantees they’re differentiable. That is, we can calculate the first, second, third and so on derivatives of them. So starting with our polynomial above, let’s take the first few derivatives of it, like this:
Clearly we’re seeing a pattern already. We’ll use that in a minute. Now that we have n derivatives of f, let’s evaluate them for some number that will cause most of their terms to drop away. This is the key step. If we’re clever, we’ll notice that if we evaluate them at x = c, most of their terms will go to zero. That will leave behind only the coefficients a1, a2, … multiplied by some constant. So here’s that step:
Now we have a set of simple equations we can solve for a1, a2, … Simply divide both sides by n!. That gives us the following:
The pattern here is beautiful. The nth coefficient is just the nth derivative of the original function, evaluated at c, divided by n factorial. Now we have our n coefficients. The next step is to plug them back into our beginning expression for a general nth-degree polynomial, like this:
This equation is what we’re looking for. It gives a polynomial expansion for every smooth function f. We just need to calculate the first n derivatives of f, evaluate them at c, divide each one by n!, and sum up these terms. The result will be a good approximation to our original function. The more terms we add on, the more accurate the polynomial approximation will be.
The Result: the Taylor Formula
The final step is to write this infinite series in closed form. This is the last step in the trick for remembering the formula. Writing the above as a summation, we get our final result:
The lesson here is simple: don’t waste your time learning formulas. Learn methods. If you can remember the basic logic of where the Taylor expansion comes from, you can quickly and easily re-derive the formula from scratch.
To see how this works in a spreadsheet, here’s an Excel file that gives a step-by-step process for the Taylor expansion of f(x) = 2^x at x = 3, using 6 polynomial terms.
Some Interesting Mathematics of the Rule of 72
Most people first encounter the “rule of 72” in a basic finance course. The idea is simple: divide 72 by your growth rate times 100, and the result is the number of years it takes to double the investment. However, there’s some interesting math going on behind this rule that most people never learn.
In this post I’ll explain the hidden math of the rule of 72, show you how it’s derived algebraically, and point out some important limitations.
The key idea behind the rule of 72 is that it’s designed for quick mental calculations. Seventy-two gives a decent ballpark approximation for doubling times. But more importantly, 72 is evenly divisible by lots of common growth rates — 2, 3, 4, 6, 8, 9, 12, 18, and so on. That makes it ideal for back-of-the-envelope approximations.
However, 72 has some major drawbacks. It’s inaccurate in many cases, sometimes giving answers that are off by years. Recognizing this, some textbooks offer more accurate (but harder to use) alternatives. Some recommend a “rule of 70,” while others suggest a “rule of 69.” While these are both harder to use, they’re considerably more accurate in many cases. As we’ll see in a minute, the rule of 69 is probably the most mathematically defensible of all, despite being pretty useless for mental calculation.
Deriving the Optimal Rule of Thumb
So where do these rules come from? In this section, I’ll show you how they’re derived algebraically from basic compound interest formulas.
When we say we’re looking for a rule of thumb for doubling time, here’s what we really mean. We want some constant M (like 72, 70, 69, etc.) that we can divide by our growth rate times 100 and get a good estimate of the time to double an investment. The first step is to write an expression for the value of the investment at time t, given some initial value a0. Then we set this expression equal to twice the initial value, or 2a0. That is
The a0s cancel out on both sides of the above, leaving us with
Taking the natural log of both sides and solving for t, we get
This expression gives the time t it takes to double the investment, given some r and k. What we want is a number M such that when we divide by 100r, it gives us t. That is, we want an M such that M/(100r) = t. Solving for M, we see that M = t(100r). So our rule of thumb M is simply given by 100r times our equation for t above, or
As you can see, the optimal rule of thumb M is a function of both r and k. So for different growth rates and frequency of compounding, you’ll want different rules of thumb. The ubiquitous M=72 is really only a good rule of thumb for a very small number of cases.
The graph below is a plot of the function M(r,k) for r between 1 percent and 30 percent and k between 1 and 30. R and k are shown on the horizontal axes, and the optimal rule M is given on the vertical axis. The colored bands show areas where each rule is optimal. For example, the light green band near the center shows where the rule of 70 is optimal, and the flat red plane shows where the rule of 69 is best, and so on.
As you can see, the optimal rules are all over the map for small values of k. In the extreme case of k=1, the usual rules of 70 and 72 are only close for growth rates between 1 percent and 10 percent. Beyond that they quickly break down. When k=1, the optimal rule M grows linearly as r increases, so no rule holds up well in that range. Overall, 72 looks pretty lousy from this perspective — it’s the small aqua-blue band in the center-left. Seventy looks slightly better — the green band corresponding to 70 covers much more area — but it also only works over a pretty small range.
A Special Case: The Rule for Continuous Compounding
The rule that seems to hold up best is actually the awkward rule of 69, which corresponds to the broad, flat plane of red on the right-hand side of the graph. It works for a huge range of values as the function flattens out to the right and roughly approaches 69. Beyond k=15, the graph is essentially flat regardless of the growth rate r. This is no surprise, as the rule of 69 turns out to be a special case for continuously compounded growth rates.
With continuously compounded growth, k is very large. To see how this affects our optimal rule M, let’s return to our first equation for the time to double our investment.
If growth is continuously compounded, we have the following equation for the value at time t
Again, the a0s cancel out on both sides, giving us
Solving for t we get
Since we’re looking for an M such that M/(100r)=t, or M=(100r)t, we get the following optimal rule M
So with continuously compounded growth, the optimal rule becomes pretty simple. Just divide (100)ln 2 — or roughly 69.3 — by your growth rate times 100. In the graph, that’s the value our function M is approaching as it flattens out to the right. Put differently, the limit of M as k approaches infinity is (100)ln 2 or about 69.3.
We can easily see this result analytically by taking the limit of the function M as k grows to infinity. That is, we can find the following limit
To work out this limit, we need to do some tricky algebra to the denominator. First, let’s move the k from the left side up into the exponent on the log to its right. Then let’s define a new number x such that x = k/r. This trick will let us collapse most of the denominator into the constant e, using the fact that e equals the limit of (1 + 1/x)^x as x grows to infinity.
Substituting in x = k/r, we get
Since x = k/r, letting k grow to infinity is equivalent to letting x grow to infinity. But we know that the term in the denominator (1+1/x)^x just becomes e as x grows to infinity. So we have
And that’s the result we alluded to above. As k grows, the function M converges very quickly to 69.314… So that’s the optimal rule for a huge number of values for r and k — infinitely many. So in some long-run mathematical sense, the most defensible rule actually turns out to be the awkward and practically useless rule of 69.
To see how this argument works in a spreadsheet, here’s an Excel file with the equations, data and graph from above.
A Barroom Method for EstimatingPi
Here’s a method for estimating pi that basically requires a dartboard, faith in the relative frequency interpretation of probability, and plenty of time. Step one, build yourself a dartboard that looks like this:
The area of the circular dart board is . And the area of the square behind the board is .
Start throwing darts at this board. What’s the probability it will land inside the circle? Easy enough — it’s the area of the circle divided by the area of the square, or
Now we have an expression we can use to estimate pi. The process is simple. Throw darts at the board. Be sure to throw randomly — you’ll need a uniform distribution of throws. Count the percentage that fall inside the circle. Then multiply this percentage by four. The result will be pretty close to pi. The more darts you throw, the better the estimate will get.
Here’s a simple piece of Matlab code that will do this simulation for you. For those who want to see how this works in a spreadsheet, here’s an Excel file with a simulation for n = 500 and n = 10,000 using random numbers from random.org for the x and y coordinates of the dart throws.
Update: A reader via email suggests another interesting method known as “Buffon’s Needle Problem,” which you can read about here.
Walking Through the Bolzano-WeierstrassTheorem
(Note: see here for a simpler version of this proof.)
Economic theory relies pretty heavily on a handful of mathematical proofs. One of the most important for proving the existence of equilibria is the Bolzano-Weierstrass theorem.
In this post I’ll give a basic explanation of the Bolzano-Weierstrass theorem, and walk you through a simple proof of it.
Bolzano-Weierstrass is a theorem about sequences of real numbers. Sequences can either be convergent — that is, they can approach some long-run value — or they can be divergent and blow up to infinity. For many equilibrium concepts, it’s necessary to show that a given sequence is convergent and stays within some reasonable set of values as we move far out along the sequence in the long run.
Bolzano-Weierstrass basically says that we’ll always end up with a convergent subsequence if we start with the right kind of sequence to begin with. Specifically, it says that if we start with a closed and bounded sequence — that is, if our sequence is finite and contains its own endpoints — then it’s guaranteed to have a subsequence that converges. Often that means the equilibrium we’re interested in actually exists, which is a good thing.
In the rest of this post, I’ll walk through a typical proof of the theorem. Different variations on this basic proof are used pretty widely in the more mathematical branches of economics such as game theory and general equibrium theory.
Proving the Theorem
Here’s the formal statement of the Bolzano-Weierstrass theorem: Every closed and bounded real sequence has a convergent subsequence.
Proof. Let be a bounded real sequence. Then there exists a closed interval [c, d] such that for all positive integers n — that is, for all .
Now divide the interval [c,d] into the following two intervals:
Since the sequence is bounded, infinitely many terms of it are contained in the original interval [c,d]. Now that we’ve split that interval in two, one of those two intervals must contain for infinitely many positive integers n. Call this interval .
Now repeat this step, dividing into the following:
Similarly, one of these intervals must also contain for infinitely many positive integers n. Call this . Continuing this process, we get a sequence of intervals such that
where each of these intervals contains for infinitely many positive integers n, and the width of each of the k intervals is given by
Now pick a positive integer such that . Since the next interval contains for infinitely many positive integers n, we can now choose an such that . Continuing this process, we can obtain elements such that
This means the newly created sequence is a subsequence of . So we’ve established that every bounded real sequence has a subsequence. The next step is to go further and show that this subsequence actually converges.
To see why this subsequence converges, consider the illustration below. It shows the original interval [c, d] at the top, with the subsequent intervals [c1, d1], [c2, d2], …] we created by repeatedly dividing it below.
Note that each interval is half the length of the previous interval, and that the boundaries of each interval are always moving inward or staying the same — they’re never moving outward. That is, the sequence of lower bounds is always moving inward as k grows. Similarly, the sequence of upper bounds is always pushing inward as k grows. That means both the lower and upper bounds are monotone and bounded. And that guarantees they are both convergent.
Let’s label the value that the lower bounds converge to as . Similarly, let’s label the value the upper bounds converge to as . Using our equation for the width of each interval from above, we know that
Taking the limit of both sides, we see that
So this shows that both the upper and lower bounds converge to the same value of L = M.
We saw above that , so that means for all k. But since and converge to the same limit of L = M, by the so-called “squeeze theorem” we know that must also converge to the same limit of L = M.
That shows that every closed and bounded real sequence — that is, every sequence defined on a compact set — has a subsequence , and that this subsequence is convergent. And that completes the proof.
For more on Bolzano-Weierstrass, check out the Wiki article here.
Using Eigenvectors to Solve Dynamical LinearSystems
Each year, some fraction of Seattle’s population moves to Portland. And some fraction of Portland’s population moves back to Seattle. If this keeps up year after year, what happens in the long run? Will the populations reach a steady state, or will one city empty out into the other?
These are questions we can answer with simple linear algebraic models. In this post I’ll explain how to set up models for problems like this, and how to solve them for their long-run values.
Setting Up the Problem
Here are some basic facts. Imagine Seattle and Portland both have populations of 500,000. Each year, 5 percent of Seattle’s population moves to Portland and 20 percent of Portland’s population moves to Seattle. The rest don’t move. This pattern repeats every year, forever.
The first step is to write this as a linear system. The idea is that we have a system — in this case, the population in both cities — that starts at an initial state u0 at time t = 0. At time t = 1, some transformation happens, taking us to state u1. Then the transformation is repeated at time t = 2, 3, … bringing us to states u2, u3, and so on. The question we’re after is what does the system look like after k steps, and what happens in the long run when k gets large?
In our problem, we can write the initial state u0 as a vector of populations in the two cities. The first entry is Seattle’s population and the second entry is Portland’s:
The population shifts between the two cities — that is, the transformation that happens at time t = 1, 2, … — can be described by this matrix:
Here’s how to read this. The first column and first row describe Seattle’s population, and the second column and second row describe Portland’s. Reading down the first column, it says 95 percent of Seattle’s population stays in Seattle, while 5 percent moves to Portland. Similarly, the second column tells us that 20 percent of Portland’s population moves to Seattle, while 80 percent stays at home. Since the columns of A all sum to 1, this is what’s known as a Markov matrix.
Putting these together, our model works like this. Start with the vector describing the initial state, u0. To move to state u1, we multiply by the matrix A. That is,
To move to state u2, u3, … we again multiply by A, or
That’s our basic model describing the population in the two cites after k steps. In the rest of this post, I’ll explain how to solve it for the long-run equilibrium.
For any linear system, the key to understanding how it behaves over time is through eigenvalues and eigenvectors. We need to get to the bottom of what the matrix A is doing to the system each time it is multiplied. The way to see that is by examining A’s eigenvalues and eigenvectors.
The first step is to find the eigenvalues of A. Since we’ve got a 2×2 matrix, we’ll normally expect to find two of them.
Here’s how to find them. Start with the definition of eigenvalues as,
To solve for we first rewrite this as
Looking at this expression, we definitely know that the matrix isn’t invertible for any non-zero vector x. If isn’t invertible, that means its determinant must be zero. Using that fact, we can solve for the eigenvalues.
In our case, we have a 2×2 matrix which has a pretty simple determinant,
That last equation is called the “characteristic polynomial” of A. It’s what we solve to find the eigenvalues. In this case, it’s a quadratic since we’re working with a 2×2 matrix, which will have two solutions. If the problem was for n cities instead, we’d end up with a characteristic polynomial of degree n.
In this case, we have an easy solution to the polynomial:
These two s are the two eigenvalues of A.
It’s no coincidence that one of our eigenvalues is 1. That’s always the case with Markov matrices. And that makes the characteristic polynomial easy to solve. We always know that the trace of A — the sum of A’s elements down the diagonal — is equal to the sum of the eigenvalues. If one of them is 1, then the other must be 1.75 minus 1 or .75. Those are our two solutions, so we can immediately factor and solve.
Now that we have the two eigenvalues, the next step is to find the eigenvectors x1 and x2. Here’s how we do that. For each eigenvalue and , we find the null space of the matrix . That is, we find the vectors x that make .
We do this by using row reduction on until we get it in row echelon form. Then we can pick off the pivots and free variables, and work out the eigenvectors.
First take the case of . Then the matrix becomes
Adding row one to row two, we get
So our equation becomes,
Since -.05 is attached to our “pivot” and .20 is attached to our “free” variable, we set our free variable x2 equal to 1 and solve for the pivot variable x1, or
So our first eigenvector x1 is
Following the same procedure for the other eigenvalue , we find the second eigenvector x2 is
Creating S and Λ Matrices
The next step is to form a new matrix S which has the eigenvectors found above as its columns. That is, let
Also, let’s form a new matrix which has the eigenvalues we found above along the diagonal and zeros elsewhere. That is, let
We know from the definition of eigenvalues and eigenvectors, . Now that we’ve got our eigenvectors in the columns of the matrix S, and our eigenvalues are along the diagonal of , we can rewrite our basic relationship as
Solving this for A, we can see an important relationship as the system moves from one state to the next:
Note what happens to this equation if we multiply by A over and over, moving from the first state u1 to the second, third, and so on:
Our original model was in the form . But now we’ve found that just equals . So we can rewrite our model as
But we can compress this model even further. It’s possible to write the initial state vector u0 as a linear combination of the two eigenvectors x1 and x2. That is, we can write
which can be solved for c as
This expression for c can now be substituted back into the model above to get
This is the model we’ve been working to find. It says the kth state of our model is equal to the matrix of eigenvectors S times the matrix of eigenvalues raised to the power of k, times some vector c that gives combinations of them.
Interpreting the Model
Expanding out the expression , we can see better why this is a useful way to model our population problem,
What this is saying is that the kth state of our model is just the sum of each of the eigenvectors, times their corresponding eigenvalues raised to the power of k, times some constant c. Written this way, it’s easy to see where this model is headed in the long run. All changes as we move forward in time are determined by the eigenvalues and .
If the absolute value of an eigenvalue is less than one — that is, if -1 < < 1 — it will rapidly decay and go to zero as k grows. If the absolute value is > 1, it will explode and grow without bound as k grows. And if it equals one, it’s what we call a “steady state” of the model since it remains unchanged as k grows.
In our case, because we set up our problem using population proportions that always sum to one, our matrix A was a Markov matrix. It always has an eigenvalue that equals one, and the rest have absolute values less than one. That means our model is guaranteed to have a steady state, as any other factors influencing the model will decay and disappear over time as k grows.
Plugging in the values for our population problem above, here’s our actual solution,
At this point, it’s easy to see what will happen to this model in the long run. Let k grow very large. The term on the right will rapidly shrink to zero — this is the “transitory” component of the model. The term on the left just gets multiplied over and over by 1. So that’s our steady state, or
This is the long-run equilibrium for our model. What it means is that starting with populations of 500,000 each in Seattle and Portland, migration slowly eroded Portland and expanded Seattle. And in the long run, Seattle’s population levels off at 800,000 and Portland’s shrinks to 200,000.
For those who want to see how this works in an Excel spreadsheet, here’s a file with the above model.
[Here’s a post where I explain how to use this method to generate the (k+1)th term of the Fibonacci sequence given some input k.]
The Linear Algebra View of Least-SquaresRegression
Linear regression is the most important statistical tool most people ever learn. However, the way it’s usually taught makes it hard to see the essence of what regression is really doing.
Most courses focus on the “calculus” view. In this view, regression starts with a large algebraic expression for the sum of the squared distances between each observed point and a hypothetical line. The expression is then minimized by taking the first derivative, setting it equal to zero, and doing a ton of algebra until we arrive at our regression coefficients.
Most textbooks walk students through one painful calculation of this, and thereafter rely on statistical packages like Stata or Excel — practically inviting students to become dependent on software and never develop deep intuition about what’s going on.
This is the way people who don’t understand math teach regression. In this post I’ll illustrate a more elegant view of least-squares regression — the so-called “linear algebra” view.
The goal of regression is to fit a mathematical model to a set of observed points. Say we’re collecting data on the number of machine failures per day in some factory. Imagine we’ve got three data points:
(day, number of failures)
The goal is to find a linear equation that fits these points. We believe there’s an underlying mathematical relationship that maps “days” uniquely to “number of machine failures,” or in the form , where b is the number of failures per day, x is the day, and C and D are the regression coefficients we’re looking for.
We can write these three data points as a simple linear system like this:
For the first two points the model is a perfect linear system. When x = 1, b = 1; and when x = 2, b = 2. But things go wrong when we reach the third point. When x = 3, b = 2 again, so we already know the three points don’t sit on a line and our model will be an approximation at best.
Now that we have a linear system we’re in the world of linear algebra. That’s good news, since it helps us step back and see the big picture. Rather than hundreds of numbers and algebraic terms, we only have to deal with a few vectors and matrices.
Here’s our linear system in the matrix form Ax = b:
What this is saying is that we hope the vector b lies in the column space of A, C(A). That is, we’re hoping there’s some linear combination of the columns of A that gives us our vector of observed b values. Unfortunately, we already know b doesn’t fit our model perfectly. That means it’s outside the column space of A. So we can’t simply solve that equation for the vector x.
Let’s look at a picture of what’s going on. In the drawing below the column space of A is marked C(A). It forms a flat plane in three-space. If we think of the columns of A as vectors a1 and a2, the plane is all possible linear combinations of a1 and a2. These are marked in the picture.
By contrast, the vector of observed values b doesn’t lie in the plane. It sticks up in some direction, marked “b” in the drawing. The plane C(A) is really just our hoped-for mathematical model. And the errant vector b is our observed data that unfortunately doesn’t fit the model.
So what should we do? The linear regression answer is that we should forget about finding a model that perfectly fits b, and instead swap out b for another vector that’s pretty close to it but that fits our model. Specifically, we want to pick a vector p that’s in the column space of A, but is also as close as possible to b.
The picture below illustrates the process. Think of shining a flashlight down onto b from above. This casts a shadow onto C(A). This is the projection of the vector b onto the column space of A. This projection is labeled p in the drawing.
The line marked e is the “error” between our observed vector b and the projected vector p that we’re planning to use instead. The goal is to choose the vector p to make e as small as possible. That is, we want to minimize the error between the vector p used in the model and the observed vector b.
In the drawing, e is just the observed vector b minus the projection p, or b – p. And the projection itself is just a combination of the columns of A — that’s why it’s in the column space after all — so it’s equal to A times some vector x-hat.
To minimize e, we want to choose a p that’s perpendicular to the error vector e, but points in the same direction as b. In the figure, the intersection between e and p is marked with a 90-degree angle. The geometry makes it pretty obvious what’s going on. We started with b, which doesn’t fit the model, and then switched to p, which is a pretty good approximation and has the virtue of sitting in the column space of A.
Solving for Regression Coefficients
Since the vector e is perpendicular to the plane of A’s column space, that means the dot product between them must be zero. That is,
But since e = b – p, and p = A times x-hat, we get,
Solving for x-hat, we get
The elements of the vector X-hat are the estimated regression coefficients C and D we’re looking for. They minimize the distance e between the model and the observed data in an elegant way that uses no calculus or explicit algebraic sums.
Here’s an easy way to remember how this works. Doing linear regression is just trying to solve Ax = b. But if any of the observed points in b deviate from the model, A won’t be an invertible matrix. So instead we force it to become invertible by multiplying both sides by the transpose of A. The transpose of A times A will always be square and symmetric, so it’s always invertible. Then we just solve for x-hat.
There are other good things about this view as well. For one, it’s a lot easier to interpret the correlation coefficient r. If our x and y data points are normalized about their means — that is, if we subtract their mean from each observed value — r is just the cosine of the angle between b and the flat plane in the drawing.
Cosine ranges from -1 to 1, just like r. If the regression is perfect, r = 1, which means b lies in the plane. If b lies in the plane, the angle between them is zero, which makes sense since cos 0 = 1. If the regression is terrible, r = 0, and b points perpendicular to the plane. In that case, the angle between them is 90 degrees or pi/2 radians. This makes sense also, since the cos (pi/2) = 0 as well.
For those who want to try this at home, here’s a handy Excel file summarizing the argument here, along with the cut-and-paste formulas you’ll need in Excel.
Deriving Tax Incidence from PriceElasticities
(See this post at Chamberlain Economics, LLC.)
All good economics starts with theory. The world is a complicated place—far too complex to make sense of directly. Economic theory helps collapse that complexity into a few key relationships we can work out mathematically and check against the facts. The first step in every analysis is to sit down with pencil and pad to work out the theory.
To help our clients better understand the economic theory underlying our work, well be posting an ongoing series of articles titled Core Concepts at the Chamberlain Economics, L.L.C. site. The goal is to provide a collection of simple and brief introductions to the core theoretical concepts used in our work.
As the first in the series weve posted Core Concepts: The Economics of Tax Incidence. The piece is designed as a refresher on the basics of tax incidence, and how its derived analytically from elasticities of supply and demand in the marketplace. This idea serves as the foundation for nearly all of our work on tax modeling and policy analysis.