I recently started going to a new specialist gym that runs 3 classes per
day during the working week and is closed the rest of the time. I have been at
a few different times on a few different days, and I was already seeing many of
the same people from the first class. It occurred to me that the chance of
seeing the same faces should somehow depend on the number of people going to
the gym, so it may be possible to estimate the total number of gym members
from the number of people I repeatedly see. I then realised that this sounds
very much like a mark and recapture (MR) experiment from ecology, so I did a
little bit of research to see how difficult it would be to estimate the number
of gym members using MR techniques.
In the most elementary version of an MR experiment the scientist randomly
samples from a population and marks a certain number of individuals. He then
resamples the same population and counts the number of tagged individuals which
have been “recaptured”. The population is assumed to be the same in both cases,
and the samples are assumed uniformly random. A property of random samples is
that they are proportion preserving, which implies that the fraction of tagged
individuals in the sample should be the same as that in the population. Let
\(N, n, T, t\) be the population size, the sample size, the number of tagged
individuals and the number of tagged individuals recaptured respectively.
Proportion preserving implies that \(\frac{n}{N} \approx \frac{t}{T}\) and thus
\(N \approx \frac{nT}{t}\). This is known as the Lincoln–Petersen estimator.
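
As a minimal sketch of the arithmetic, the estimator can be written as a small R function. The function name, the argument names and the numbers below are hypothetical and chosen only for illustration; they are not measurements from the gym.

```r
# Lincoln-Petersen estimate: N is approximately n * T / t.
# Argument names are illustrative, not from any package.
lincoln_petersen <- function(n_sample, n_tagged, n_recaptured) {
  # The estimator is undefined when nothing is recaptured.
  stopifnot(n_recaptured > 0)
  n_sample * n_tagged / n_recaptured
}

# Hypothetical numbers: 20 individuals tagged, 32 seen in the
# second sample, 5 of those already tagged.
N_hat <- lincoln_petersen(n_sample = 32, n_tagged = 20, n_recaptured = 5)
print(N_hat)  # 32 * 20 / 5 = 128
```

Note how sensitive the estimate is to the recapture count: with only a handful of recaptures, one more or one fewer changes the answer considerably.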
Applying this estimator directly to my problem would require several optimistic
assumptions, namely that the gym goer population is static, that people mostly
go to the gym only once a week, and that they pick the day and the time at
random. In my case, it is likely that the gym population over a period of 2-3
weeks is close to static since it is a specialist gym, but it is unlikely that
the members pick when they’ll go at random. I think people typically follow a
routine. I suspect most will go more than once a week, very few will go more
than once a day, and the frequency of visits will likely differ between members.
In order to take into account a more realistic set of assumptions I will need
a more flexible model. One way of doing it is to cast the problem into a linear
model, and then borrow from the bag of tricks available from the design of
experiments literature to make adjustments. To that effect, recall that a
log-linear model has the underlying form \(y = e^{\beta X + \epsilon}\), where
\(y, \beta, X, \epsilon\) are the response, coefficients, design matrix and error
term respectively. Using log as a link function I get
\(\log(y) = \beta X + \epsilon\): a log-linear model. A Poisson regression assumes
that \(y\) is a count variable and that \(e^{\beta X + \epsilon}\) (rather than
the linear term itself) is interpreted as its rate of occurrence. I can also
model a rate such as \(\frac{y}{n}\) directly with a Poisson regression, since if
\(\frac{y}{n} = e^{\beta X + \epsilon}\) then \(\log(y) = \log(n) + \beta X + \epsilon\)
and I am back to a log-linear model. This trick is called “offsetting”, and
it is directly supported by the R glm function using the offset
function within a formula.
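
To make the offset trick concrete, here is a small sketch on made-up data. The data frame `d` and its values are hypothetical; the only calls used are base R's `glm`, `offset` and `coef`. It shows the offset written inside the formula and, equivalently, passed through `glm`'s `offset` argument, and checks that the fitted intercept is the log of the pooled rate.

```r
# Hypothetical counts: y events observed out of n exposures in three samples.
d <- data.frame(y = c(5, 8, 3), n = c(32, 40, 25))

# Offset written inside the formula ...
fit_formula <- glm(y ~ 1 + offset(log(n)), data = d, family = poisson)
# ... or passed through glm's `offset` argument; both describe the same model.
fit_argument <- glm(y ~ 1, offset = log(n), data = d, family = poisson)

# With only an intercept, exp(intercept) is the pooled rate sum(y) / sum(n).
rate_hat <- exp(unname(coef(fit_formula)))
print(all.equal(unname(coef(fit_formula)), unname(coef(fit_argument))))
print(all.equal(rate_hat, sum(d$y) / sum(d$n), tolerance = 1e-6))
```

The offset enters the linear predictor with its coefficient fixed at 1, which is why no coefficient for `log(n)` appears in the fit.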
I can express an equivalent Poisson regression to the Lincoln–Petersen
estimator above as follows. Let \(n=32\) (sample size) and \(t=5\) (recaptures),
then in R:
> library(magrittr)  # provides the %>% pipe
> data.frame(t=5, n=32) %>%
>   glm(t ~ offset(log(n)), data=., family="poisson") %>%
>   summary
Call:
glm(formula=t~offset(log(n)), family="poisson", data=.)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.8563     0.4472  -4.151 3.31e-05 ***
...
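
The intercept in this output is on the log scale. As a rough sketch of how to turn it back into a population estimate, the snippet below refits the same one-row model, exponentiates the intercept and divides a tagged count by it. The value of `T_tagged` is hypothetical, since the number of tagged individuals is not given here.

```r
# Refit the one-row model from the text (t = 5 recaptures, n = 32 sampled).
fit <- glm(t ~ offset(log(n)), data = data.frame(t = 5, n = 32),
           family = "poisson")

beta0 <- unname(coef(fit))  # log of the estimated recapture fraction
p_hat <- exp(beta0)         # back on the natural scale, about 5 / 32

# Hypothetical number of tagged individuals, for illustration only.
T_tagged <- 20
N_hat <- T_tagged / p_hat   # Lincoln-Petersen: N ~ T / (t / n)

print(p_hat)
print(N_hat)
```

Dividing by `exp(beta0)` rather than by `beta0` itself matters: the intercept is negative whenever fewer than all sampled individuals are recaptures.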
The log of the estimated recapture probability is given by the intercept
\(\beta_0\). Because this is a saturated model, the fitted value reproduces the
observed response exactly. That is, \(\exp(\beta_0) \approx \frac{t}{n}\). I can
get a population estimate by putting \(\exp(\beta_0)\) back into the
Lincoln–Petersen equation:
\(N \approx \frac{nT}{t} = \frac{T}{\exp(\beta_0)}\). The reason this is
interesting is that I can now adjust the intercept by adding variables to take
other factors into account. For example, say I memorised faces during my first
week at the gym but then counted recaptures over the following two weeks.
That is, \(T\) stays the same but \(t_i, n_i\) become time indexed. I couldn’t jus