Analyzing overlaps between our lists of R users
But our original idea was to estimate the number of R users around the world and not to focus on a few minor segments, right? Now that we have multiple data sources, we can start building models that combine them to estimate the global number of R users.
The basic idea behind this approach is the capture-recapture method, which is well known in ecology: we first estimate the probability of capturing a unit from the population, and then use this probability to estimate the number of units that were never captured.
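To make the idea concrete, here is a minimal sketch of the simplest two-sample version, usually called the Lincoln-Petersen estimator. It is only an illustration: the counts are made up, the variable names (n1, n2, m, p_capture, N_hat, unseen) are hypothetical, and it assumes a closed population sampled independently twice, which is not necessarily the model used later with our three lists.

```r
# Two-sample capture-recapture on made-up counts (not our R user data)
n1 <- 200   # units captured and marked in the first sample
n2 <- 150   # units captured in the second sample
m  <- 30    # marked units found again in the second sample

# Share of the second sample that was already marked; this estimates the
# probability that any given unit was captured in the first sample
p_capture <- m / n2                  # 30 / 150 = 0.2

# If the first sample caught a fraction p_capture of the population,
# the population size is estimated as n1 / p_capture
N_hat <- n1 / p_capture              # 200 / 0.2 = 1000

# Distinct units seen at least once, and the estimated number never captured
seen   <- n1 + n2 - m                # 320
unseen <- N_hat - seen               # 680
```

The same reasoning carries over to name lists: the more names two lists share, the higher the estimated capture probability and the smaller the estimated number of unseen users.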
In our current study, the units are R users, and the samples are the name lists we collected earlier:
- Supporters of the R Foundation
- R package maintainers who submitted at least one package to CRAN
- R-help mailing list e-mail senders
Let's merge these lists with a tag referencing the data source:
> lists <- rbindlist(list(
+     data.frame(name = unique(supporterlist), list = 'supporter'),
+     ...