penppml
is an R package that enables users to fit
penalized Poisson Pseudo Maximum Likelihood (PPML) regressions with
highdimensional fixed effects (HDFE). Supported penalties in the
current version are ridge and lasso. The original application that
motivated the development of penppml
was the estimation of
threeway gravity models of trade with a large number of PTA provision
dummies (Breinlich, Corradi, Rocha, Ruta, Santos Silva and Zylkin,
2021).
To install and load penppml
, you can use the following
commands:
install.packages("penppml")
library(penppml)
::load_all()
devtools#> ℹ Loading penppml
The penppml
package features the trade
data
set, which integrates panel data on bilateral trade flows with
information about specific provisions in trade agreements for 220
exporters and 270 importers. The provisions included in the package are
a subset of 16 out of 305 “essential” provisions featured in the full
data set. More information about the data set and the variables is
included in the corresponding help file, accessible via
?trade
.
exp  imp  time  export  id  agreement  ad_prov_14  cp_prov_23  tbt_prov_07  tbt_prov_33 

ABW  AIA  1988  0  0  0  0  0  0  
ABW  AIA  1996  0  0  0  0  0  0  
ABW  AIA  2012  0  0  0  0  0  0  
ABW  AIA  2016  0  0  0  0  0  0  
ABW  ANT  1988  13087119  0  0  0  0  0  
ABW  ANT  1996  2371576  0  0  0  0  0  
ABW  ARE  1988  0  0  0  0  0  0  
ABW  ARE  1996  0  0  0  0  0  0  
ABW  ARE  2012  0  0  0  0  0  0  
ABW  ARE  2016  0  0  0  0  0  0 
Along with the trade
data set, the package includes an
auxiliary data frame, countries
, which contains basic
information about the country ISO codes included in the main data
set.
iso  name  region  sub.region 

AFG  Afghanistan  Asia  Southern Asia 
ALA  0c5land Islands  Europe  Northern Europe 
ALB  Albania  Europe  Southern Europe 
DZA  Algeria  Africa  Northern Africa 
ASM  American Samoa  Oceania  Polynesia 
AND  Andorra  Europe  Southern Europe 
AGO  Angola  Africa  SubSaharan Africa 
AIA  Anguilla  Americas  Latin America and the Caribbean 
ATA  Antarctica  
ATG  Antigua and Barbuda  Americas  Latin America and the Caribbean 
This enables users to easily filter by region or subregion. For instance, if we want to restrict our analysis to countries in the Americas, we can do the following:
< countries$iso[countries$region %in% c("Americas")]
selected < trade[(trade$exp %in% selected) & (trade$imp %in% selected), (5:6)] # We remove columns 5 and
trade2 # 6 because these variables are not needed in our regressions.
We will show the capabilities of this package using the filtered
trade2
data frame.
The package enables users to run unpenalized PPML regressions with
HDFE, using the hdfeppml
function as follows:
< hdfeppml(data = trade2,
reg1 dep = "export",
fixed = list(c("exp", "time"),
c("imp", "time"),
c("exp", "imp")))
#> User did not specify independent variables. By default, the function takes all variables
#> not included in 'dep' or 'fixed' as regressors.
#> The following variables are dropped because their variation is equal zero: ipr_prov_15 ipr_prov_44
#> The following variables have been dropped, due to collinearity: ipr_prov_15 ipr_prov_44
As we can see, the function is designed to be dataframefriendly:
the user sets a data frame of reference in the data
argument and then picks the dependent, independent and fixed effect
variables either by name or by column number within the provided data
frame. Also, note the following points about the syntax:
The fixed
argument can take either a single vector
(just like the dep
argument; each element will be used as a
separate fixed effect) or a list of vectors. The latter option is useful
in cases where the desired fixed effect is the result of the interaction
of two or more variables, as in the gravity model of trade. In those
cases, you can specify any number of separate fixed effects in distinct
elements of the list and, inside each element, which variables in the
data set you want to interact.
As explained in the note, if indep
is empty, the
function uses all remaining columns in the data frame by
default.
Internally, the function will do the necessary transformations of the columns into vectors and matrices, as needed by the algorithm.
Alternatively, more advanced users who prefer to handle data
transformations themselves can use the internal function
hdfeppml_int
. For more information and examples on this
issue, run ?hdfeppml
or ?hdfeppml_int
. Note
also that this applies to all of the main functions in our package: both
the dataframefriendly wrapper and the internal function are available
for use.
The results of our PPML model are:
< data.frame(prov = rownames(reg1$coefficients), b = reg1$coefficients, se = 0)
results $se[!is.na(reg1$coefficients)] < reg1$se
results results


(Note that the functions is automatically dropping perfectly
collinear variables from the estimation algorithm and reporting
NA
as the coefficient.)
If we want to specify the independent variable, we can specify indep as pointing to columns 5 to 7 with the following code.
< hdfeppml(data = trade2, indep=5:7,
reg1_indep dep = "export",
fixed = list(c("exp", "time"),
c("imp", "time"),
c("exp", "imp")))
The mlfitppml
function is a flexible tool for computing
penalized PPML regressions with HDFE. For instance, if we want to fit a
PPML regression with a lasso penalty for several values of the penalty
parameter (lambda) at once, we can run:
< c(0.05, 0.025, 0.01, 0.0075, 0.005, 0.0025, 0.001, 0.00075, 0.0005, 0.00025, 0.0001, 0)
lambdas
< mlfitppml(data = trade2,
reg2 dep = "export",
fixed = list(c("exp", "time"),
c("imp", "time"),
c("exp", "imp")),
penalty = "lasso",
lambdas = lambdas)
#> The following variables are dropped because their variation is equal zero: ipr_prov_15 ipr_prov_44
The function returns a list that contains two sets of coefficients
for each value of lambda: the penalized coefficients (in the
beta_pre
element) and the postpenalty or unpenalized
coefficients (in the beta
element; these are calculated by
estimating a postpenalty regression with just the selected variables).
We can plot the regularization path of the penalized coefficients as
follows:
< as.data.frame(reg2$beta_pre)
results names(results) < lambdas
$provision < row.names(results)
results< reshape2::melt(results, id.vars = "provision", variable.name = "lambda",
results value.name = "coefficient")
$lambda < as.numeric(as.character(results$lambda))
results
::ggplot(data = results, mapping = ggplot2::aes(x = lambda, y = coefficient, col = provision)) +
ggplot2::geom_line(show.legend = FALSE) +
ggplot2::scale_x_reverse(expand = ggplot2::expansion(add = c(0, 0.015))) +
ggplot2::theme_classic() +
ggplot2::geom_dl(ggplot2::aes(label = provision),
directlabelsmethod = list(directlabels::dl.trans(x = x + 0.5), "last.bumpup")) +
::labs(x = "Penalty parameter (lambda)", y = "Coefficient",
ggplot2title = "Figure 1: Regularization path for lasso")
The user can also get a penalized PPML regression without highdimensional fixed effects. To do so, simply set the option fixed to NULL.
< c(0.025, 0.01, 0.0075, 0.005, 0.0025, 0.001, 0.00075, 0.0005, 0.00025, 0.0001, 0)
lambdas
< mlfitppml(data = trade2,
reg2_nofe dep = "export",
fixed = NULL,
penalty = "lasso",
lambdas = lambdas)
#> The following variables are dropped because their variation is equal zero: ipr_prov_15 ipr_prov_44
#> Lasso exceeded 50 iterations. Break loop and return last model.
#> Lasso exceeded 50 iterations. Break loop and return last model.
#> Lasso exceeded 50 iterations. Break loop and return last model.
#> Lasso exceeded 50 iterations. Break loop and return last model.
#> Lasso exceeded 50 iterations. Break loop and return last model.
#> Lasso exceeded 50 iterations. Break loop and return last model.
If the user wishes to obtain the penalized estimates for a single
value of lambda, they can either use mlfitppml
as described
above (just setting lambdas == x
, where x
is a
number) or use the penhdfeppml
function, upon which
mlfitppml
is built, directly. For instance:
< penhdfeppml(data = trade2,
reg3 dep = "export",
fixed = list(c("exp", "time"),
c("imp", "time"),
c("exp", "imp")),
penalty = "lasso",
lambda = 0.005)
#> The following variables are dropped because their variation is equal zero: ipr_prov_15 ipr_prov_44
We can easily check that the penalized coefficient estimates of
penhdfeppml
and mlfitppml
are equal for a
lambda value of 0.005 (within a numerical tolerance):
all.equal(as.vector(reg3$beta[!is.na(reg3$beta)]), as.vector(reg2$beta_pre[, 5]), tol = 1e05)
#> [1] TRUE
For more details on penhdfeppml
, run
?penhdfeppml
.
mlfitppml
also allows user to use the ridge penalty in
their PPML regressions. Simply run:
< seq(0.0001, 0, length.out = 10)
lambdas
< mlfitppml(data = trade2,
reg4 dep = "export",
fixed = list(c("exp", "time"),
c("imp", "time"),
c("exp", "imp")),
penalty = "ridge",
lambdas = lambdas)
#> The following variables are dropped because their variation is equal zero: ipr_prov_15 ipr_prov_44
Note that this feature is still in development and may contain bugs.
The mlfitppml
function enables users to carry out
crossvalidation of their models via the xval
and
IDs
arguments. When xval
is set to
TRUE
, the function performs crossvalidation using a
userprovided vector of IDs. For instance, if we want to do kfold cross
validation with k = 20, splitting the data set by agreement (not by
observation), we can do the following:
< unique(trade[(trade$exp %in% selected) & (trade$imp %in% selected), 5])
id < 20
nfolds < data.frame(id = id, fold = sample(1:nfolds, size = length(id), replace = TRUE))
unique_ids
< merge(trade[(trade$exp %in% selected) & (trade$imp %in% selected), 5, drop = FALSE],
cross_ids by = "id", all.x = TRUE) unique_ids,
Now we activate the crossvalidation option in mlfitppml
and input the ID vector:
< mlfitppml(data = trade2,
reg5 dep = "export",
fixed = list(c("exp", "time"),
c("imp", "time"),
c("exp", "imp")),
penalty = "lasso",
lambdas = c(seq(0.5, 0.1, by = 0.1), 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0),
xval = TRUE,
IDs = cross_ids$fold)
Now the function also returns a rmse
element that
includes the crossvalidation results (the average RMSE or root mean
squared error for each value of lambda). Users can employ this tool to
choose the value of the penalty parameter that minimizes the RMSE:
$rmse reg5
< mlfitppml(data = trade2,
reg5_nofe dep = "export",
fixed = NULL,
penalty = "lasso",
lambdas = c(seq(0.5, 0.1, by = 0.1), 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0),
xval = TRUE,
IDs = cross_ids$fold)
Penalty (lambda)  RMSE 

5e01  108.5813 
4e01  108.5813 
3e01  108.5813 
2e01  108.5813 
1e01  108.5813 
5e02  108.5813 
1e02  108.5750 
5e03  108.5743 
1e03  108.5740 
5e04  108.5739 
1e04  108.5739 
0e+00  108.5739 
In this case, the RMSE criterion suggest that the penalty should be set at 0. In other words, the results in Table 3 are the ones that minimize the mean squared error of the regression, according to 20fold crossvalidation.
This package also enables the use of the plugin lasso method,
incorporating coefficientspecific penalty weights calculated
automatically by the package. The most convenient way to do this is to
set method = "plugin"
in mlfitppml
. Note that
the plugin algorithm requires a clustering variable  we are using the
interaction of the exporter and importer variables in the
trade
data set:
< mlfitppml(data = trade2,
reg6 dep = "export",
fixed = list(c("exp", "time"),
c("imp", "time"),
c("exp", "imp")),
penalty = "lasso",
method = "plugin",
cluster = c("exp", "imp"))
#> The following variables are dropped because their variation is equal zero: ipr_prov_15 ipr_prov_44
#> No of variables: 3
#> BIC: 9787471.95031103
< data.frame(prov = rownames(reg6$beta), b_pre = reg6$beta_pre, b = reg6$beta, se = 0)
results $se[!is.na(reg6$beta)] < reg6$ses
results results


We can see that the plugin lasso is very strict: only three provisions have nonzero coefficients when the penalties are set at the level suggested by the plugin algorithm. Compare this to the crossvalidation results, which didn’t remove any of the provisions.
We can also set the gamma value as defined in Belloni et al. to something less strict, e.g. 0.3.
< mlfitppml(data = trade2,
reg6_gamma dep = "export",
fixed = list(c("exp", "time"),
c("imp", "time"),
c("exp", "imp")),
penalty = "lasso",
method = "plugin",
cluster = c("exp", "imp"), gamma_val=0.3)
#> The following variables are dropped because their variation is equal zero: ipr_prov_15 ipr_prov_44
#> No of variables: 5
#> BIC: 9458290.22374599
rownames(reg6$beta)[which(reg6_gamma$beta != 0)]
< data.frame(prov = rownames(reg6_gamma$beta), b_pre = reg6_gamma$beta_pre, b = reg6_gamma$beta, se = 0)
results_gamma $se[!is.na(reg6_gamma$beta)] < reg6_gamma$ses
results_gamma results_gamma


When doing this, we see that more provisions than before are selected.
This package also allows users to implement the twostep lasso
described in Breinlich et al. (2021). This method consists in, first,
running a plugin lasso estimation and second, running individual lasso
regressions on each of the variables selected in the first stage. The
iceberg
function is designed precisely with this second
step in mind. It takes a vector of dependent variables and returns a
lasso regression for each one of them:
< iceberg(data = trade2[, (1:4)],
iceberg_results dep = results$prov[results$b != 0],
selectobs = (trade2$time == "2016"))
Currently, the function returns a matrix with coefficient estimates for each of the selected provisions. Support for standard errors (including clustered) is in development as of the current version of the package. The iceberg lasso coefficients are:
iceberg_results
ad_prov_14  tf_prov_45  inv_prov_22  

cp_prov_23  0.0000  0.6143  0.0000 
tbt_prov_07  0.0000  0.1002  0.0000 
tbt_prov_33  0.0042  0.0000  0.0000 
tf_prov_41  0.0010  0.1793  0.0000 
env_prov_18  0.0000  0.0000  0.0000 
et_prov_38  0.0000  0.0000  0.3399 
ipr_prov_15  0.0000  0.0000  0.0000 
ipr_prov_44  0.0000  0.0000  0.0000 
lm_prov_10  0.0000  0.0000  0.0000 
moc_prov_21  0.0000  0.0000  0.0000 
ser_prov_47  0.0000  0.0000  0.5074 
ste_prov_30  0.0000  0.0489  0.0729 
cp_prov_26  0.0047  0.0491  0.0191 
Since PPML coefficients can’t be easily interpreted, you may find it useful to see the raw correlations between the variables in the iceberg lasso step:
< cor(trade2[, results$prov])
provcorr < provcorr[, results$prov[results$b != 0]]) (provcorr
ad_prov_14  tf_prov_45  inv_prov_22  

ad_prov_14  1.0000  0.1471  0.2265 
cp_prov_23  0.1466  0.9744  0.1593 
tbt_prov_07  0.1096  0.7372  0.3075 
tbt_prov_33  0.0121  0.1790  0.1497 
tf_prov_41  0.0044  0.3072  0.0196 
tf_prov_45  0.1471  1.0000  0.1464 
env_prov_18  0.0007  0.0048  0.0754 
et_prov_38  0.3522  0.0792  0.5964 
inv_prov_22  0.2265  0.1464  1.0000 
lm_prov_10  0.0007  0.0221  0.0754 
moc_prov_21  0.3547  0.0804  0.6010 
ser_prov_47  0.3154  0.0652  0.6676 
ste_prov_30  0.1349  0.8065  0.5273 
cp_prov_26  0.0149  0.8290  0.2387 
To illustrate this, we select the trade data, but now not dropping id and agreement variable.
We want to cluster by agreement, that means that we draw observations with the same agreement ID. For this, we need to create a new agreement ID in the following way:
< trade[(trade$exp %in% selected) & (trade$imp %in% selected), ] # Now, we need id and agreement variable
trade3 # Let's cluster by agreement
$alt_id < trade3$id # ID refers to agreement ID
trade3$alt_id[is.na(trade3$alt_id)] < 0 # We set this to zero when the ID is missing, interpreting this as the country pair not being part of any agreement.
trade3
# Create pair ID
< do.call(paste, as.data.frame(t(apply(trade3[1:2], 1, sort))))
v1 $pair < match(v1, unique(v1))
trade3$pair < trade3$pair + 500
trade3
# Create maximal ID from the preexisting IDvariable inside each pair
< within(trade3, {alt_id2 = ave(alt_id,pair,FUN=max)} ) # This creates the maximum of the ID for each pair. This adjusts for the fact that some pairs might have been part of different agreements and we want to take get a unique agreement ID for each pair.
trade3
$alt_id2[trade3$alt_id2==0] < trade3$pair[trade3$alt_id2==0]
trade3unique(trade3$alt_id2) # This cluster variable collects pairs for pairs that are in agreement, uses the pair ID for those that are not.
# Thus, it allows errors to be clustered within agreements.
< factor(trade3$alt_id2)
alt_id2 $clus < alt_id2 #Add the ID to the data trade3
Next, we run Bootstrap Lasso, which means drawing agreement clusters with replacement using the ID just created. In the bootstrap() command, first hdfeppml() is run, using a dummy which is one when any provision was in place and zero otherwise to get initial values of mu. These are then used in plugin Lasso. This is repeated B times (as specified in bootreps) and the command selects those provisions selected by plugin in at least a fraction of cases. This fraction can be specified by the option boot_threshold (Default is 1 pc).
set.seed(123)
< bootstrap(data=trade3, dep="export", cluster_id="clus", fixed=list(c("exp", "time"), c("imp", "time"), c("exp", "imp")), indep=7:22, bootreps=10, colcheck_x = TRUE, colcheck_x_fes = TRUE, boot_threshold = 0.01, post=TRUE, gamma_val=0.01, verbose=FALSE) bs1
1  2  3  4  5  6  7  8  9  10  

ad_prov_14  0  0.63  0.00  0  NA  NA  0  NA  NA  NA 
cp_prov_23  0  0.00  0.00  0  0  0  0  0  0  0 
tbt_prov_07  0  0.00  0.19  0  0  0  0  0  0  0 
tbt_prov_33  0  0.00  0.00  0  0  0  0  0  NA  NA 
tf_prov_41  0  0.00  0.00  0  0  0  NA  0  0  0 
tf_prov_45  0  0.00  0.21  0  0  0  0  0  0  0 
env_prov_18  0  0.00  0.00  0  0  0  NA  NA  NA  0 
et_prov_38  0  0.00  0.00  0  0  0  0  0  0  0 
inv_prov_22  0  0.00  0.00  0  0  0  0  0  0  0 
ipr_prov_15  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA 
ipr_prov_44  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA 
lm_prov_10  NA  0.00  0.00  0  0  0  0  0  0  NA 
moc_prov_21  0  0.00  0.00  0  0  0  0  0  0  0 
ser_prov_47  0  0.00  0.00  0  0  0  0  0  0  0 
ste_prov_30  0  0.00  0.31  0  0  0  0  0  0  0 
cp_prov_26  0  0.00  0.00  0  0  0  0  0  0  0 
Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin, T. (2021). “Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements”, Policy Research Working Paper; No. 9629. World Bank, Washington, DC.
Correia, S., P. Guimaraes and T. Zylkin (2020). “Fast Poisson estimation with high dimensional fixed effects”, STATA Journal, 20, 90115.
Gaure, S (2013). “OLS with multiple high dimensional category variables”, Computational Statistics & Data Analysis, 66, 818.
Friedman, J., T. Hastie, and R. Tibshirani (2010). “Regularization paths for generalized linear models via coordinate descent”, Journal of Statistical Software, 33, 122.
Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). “Inference in high dimensional panel models with an application to gun control”, Journal of Business & Economic Statistics, 34, 590605.