# 1 Introduction

penppml is an R package that enables users to fit penalized Poisson Pseudo Maximum Likelihood (PPML) regressions with high-dimensional fixed effects (HDFE). Supported penalties in the current version are ridge and lasso. The original application that motivated the development of penppml was the estimation of three-way gravity models of trade with a large number of PTA provision dummies (Breinlich, Corradi, Rocha, Ruta, Santos Silva and Zylkin, 2021).

To install and load penppml, you can use the following commands:

install.packages("penppml")
library(penppml)
devtools::load_all()
#> ℹ Loading penppml

# 2 Data Sets

The penppml package features the trade data set, which integrates panel data on bilateral trade flows with information about specific provisions in trade agreements for 220 exporters and 270 importers. The provisions included in the package are a subset of 16 out of 305 “essential” provisions featured in the full data set. More information about the data set and the variables is included in the corresponding help file, accessible via ?trade.

Table 1: International Trade Data Set
exp imp time export id agreement ad_prov_14 cp_prov_23 tbt_prov_07 tbt_prov_33
ABW AIA 1988 0 0 0 0 0 0
ABW AIA 1996 0 0 0 0 0 0
ABW AIA 2012 0 0 0 0 0 0
ABW AIA 2016 0 0 0 0 0 0
ABW ANT 1988 13087119 0 0 0 0 0
ABW ANT 1996 2371576 0 0 0 0 0
ABW ARE 1988 0 0 0 0 0 0
ABW ARE 1996 0 0 0 0 0 0
ABW ARE 2012 0 0 0 0 0 0
ABW ARE 2016 0 0 0 0 0 0

Along with the trade data set, the package includes an auxiliary data frame, countries, which contains basic information about the country ISO codes included in the main data set.

Table 2: Country Data Set
iso name region sub.region
AFG Afghanistan Asia Southern Asia
ALA 0c5land Islands Europe Northern Europe
ALB Albania Europe Southern Europe
DZA Algeria Africa Northern Africa
ASM American Samoa Oceania Polynesia
AND Andorra Europe Southern Europe
AGO Angola Africa Sub-Saharan Africa
AIA Anguilla Americas Latin America and the Caribbean
ATA Antarctica
ATG Antigua and Barbuda Americas Latin America and the Caribbean

This enables users to easily filter by region or subregion. For instance, if we want to restrict our analysis to countries in the Americas, we can do the following:

selected <- countries$iso[countries$region %in% c("Americas")]
trade2 <- trade[(trade$exp %in% selected) & (trade$imp %in% selected), -(5:6)] # We remove columns 5 and
# 6 because these variables are not needed in our regressions.

We will show the capabilities of this package using the filtered trade2 data frame.

# 3 Unpenalized PPML estimation with HDFE

The package enables users to run unpenalized PPML regressions with HDFE, using the hdfeppml function as follows:

reg1 <- hdfeppml(data = trade2,
dep = "export",
fixed = list(c("exp", "time"),
c("imp", "time"),
c("exp", "imp")))
#> User did not specify independent variables. By default, the function takes all variables
#>         not included in 'dep' or 'fixed' as regressors.
#> The following variables are dropped because their variation is equal zero: ipr_prov_15 ipr_prov_44
#> The following variables have been dropped, due to collinearity:  ipr_prov_15 ipr_prov_44

As we can see, the function is designed to be data-frame-friendly: the user sets a data frame of reference in the data argument and then picks the dependent, independent and fixed effect variables either by name or by column number within the provided data frame. Also, note the following points about the syntax:

• The fixed argument can take either a single vector (just like the dep argument; each element will be used as a separate fixed effect) or a list of vectors. The latter option is useful in cases where the desired fixed effect is the result of the interaction of two or more variables, as in the gravity model of trade. In those cases, you can specify any number of separate fixed effects in distinct elements of the list and, inside each element, which variables in the data set you want to interact.

• As explained in the note, if indep is empty, the function uses all remaining columns in the data frame by default.

Internally, the function will do the necessary transformations of the columns into vectors and matrices, as needed by the algorithm.

Alternatively, more advanced users who prefer to handle data transformations themselves can use the internal function hdfeppml_int. For more information and examples on this issue, run ?hdfeppml or ?hdfeppml_int. Note also that this applies to all of the main functions in our package: both the data-frame-friendly wrapper and the internal function are available for use.

The results of our PPML model are:

results <- data.frame(prov = rownames(reg1$coefficients), b = reg1$coefficients, se = 0)
results$se[!is.na(reg1$coefficients)] <- reg1$se results Table 3: Unpenalized PPML results Provision Coefficient SE ad_prov_14 0.0392 0.3253 cp_prov_23 -0.0317 0.0514 tbt_prov_07 0.3924 0.3171 tbt_prov_33 -0.0484 0.3303 tf_prov_41 0.6611 0.1829 tf_prov_45 0.1357 0.0618 env_prov_18 -0.2564 0.1340 et_prov_38 -0.0104 0.1276 Provision Coefficient SE inv_prov_22 0.3549 0.0617 ipr_prov_15 NA 0.0000 ipr_prov_44 NA 0.0000 lm_prov_10 -0.5864 0.1742 moc_prov_21 0.0930 0.1492 ser_prov_47 -0.2656 0.1132 ste_prov_30 0.0263 0.0636 cp_prov_26 -0.1628 0.0854 (Note that the functions is automatically dropping perfectly collinear variables from the estimation algorithm and reporting NA as the coefficient.) If we want to specify the independent variable, we can specify indep as pointing to columns 5 to 7 with the following code. reg1_indep <- hdfeppml(data = trade2, indep=5:7, dep = "export", fixed = list(c("exp", "time"), c("imp", "time"), c("exp", "imp"))) # 4 Penalized PPML estimation with HDFE ## 4.1 Lasso regression The mlfitppml function is a flexible tool for computing penalized PPML regressions with HDFE. For instance, if we want to fit a PPML regression with a lasso penalty for several values of the penalty parameter (lambda) at once, we can run: lambdas <- c(0.05, 0.025, 0.01, 0.0075, 0.005, 0.0025, 0.001, 0.00075, 0.0005, 0.00025, 0.0001, 0) reg2 <- mlfitppml(data = trade2, dep = "export", fixed = list(c("exp", "time"), c("imp", "time"), c("exp", "imp")), penalty = "lasso", lambdas = lambdas) #> The following variables are dropped because their variation is equal zero: ipr_prov_15 ipr_prov_44 The function returns a list that contains two sets of coefficients for each value of lambda: the penalized coefficients (in the beta_pre element) and the post-penalty or unpenalized coefficients (in the beta element; these are calculated by estimating a post-penalty regression with just the selected variables). We can plot the regularization path of the penalized coefficients as follows: results <- as.data.frame(reg2$beta_pre)
names(results) <- lambdas
results$provision <- row.names(results) results <- reshape2::melt(results, id.vars = "provision", variable.name = "lambda", value.name = "coefficient") results$lambda <- as.numeric(as.character(results$lambda)) ggplot2::ggplot(data = results, mapping = ggplot2::aes(x = lambda, y = coefficient, col = provision)) + ggplot2::geom_line(show.legend = FALSE) + ggplot2::scale_x_reverse(expand = ggplot2::expansion(add = c(0, 0.015))) + ggplot2::theme_classic() + directlabels::geom_dl(ggplot2::aes(label = provision), method = list(directlabels::dl.trans(x = x + 0.5), "last.bumpup")) + ggplot2::labs(x = "Penalty parameter (lambda)", y = "Coefficient", title = "Figure 1: Regularization path for lasso") The user can also get a penalized PPML regression without high-dimensional fixed effects. To do so, simply set the option fixed to NULL. lambdas <- c(0.025, 0.01, 0.0075, 0.005, 0.0025, 0.001, 0.00075, 0.0005, 0.00025, 0.0001, 0) reg2_nofe <- mlfitppml(data = trade2, dep = "export", fixed = NULL, penalty = "lasso", lambdas = lambdas) #> The following variables are dropped because their variation is equal zero: ipr_prov_15 ipr_prov_44 #> Lasso exceeded 50 iterations. Break loop and return last model. #> Lasso exceeded 50 iterations. Break loop and return last model. #> Lasso exceeded 50 iterations. Break loop and return last model. #> Lasso exceeded 50 iterations. Break loop and return last model. #> Lasso exceeded 50 iterations. Break loop and return last model. #> Lasso exceeded 50 iterations. Break loop and return last model. If the user wishes to obtain the penalized estimates for a single value of lambda, they can either use mlfitppml as described above (just setting lambdas == x, where x is a number) or use the penhdfeppml function, upon which mlfitppml is built, directly. For instance: reg3 <- penhdfeppml(data = trade2, dep = "export", fixed = list(c("exp", "time"), c("imp", "time"), c("exp", "imp")), penalty = "lasso", lambda = 0.005) #> The following variables are dropped because their variation is equal zero: ipr_prov_15 ipr_prov_44 We can easily check that the penalized coefficient estimates of penhdfeppml and mlfitppml are equal for a lambda value of 0.005 (within a numerical tolerance): all.equal(as.vector(reg3$beta[!is.na(reg3$beta)]), as.vector(reg2$beta_pre[, 5]), tol = 1e-05)
#> [1] TRUE

For more details on penhdfeppml, run ?penhdfeppml.

## 4.2 Ridge regression (in development)

mlfitppml also allows user to use the ridge penalty in their PPML regressions. Simply run:

lambdas <- seq(0.0001, 0, length.out = 10)

dep = "export",
fixed = list(c("exp", "time"),
c("imp", "time"),
c("exp", "imp")),
penalty = "ridge",
lambdas = lambdas)
#> The following variables are dropped because their variation is equal zero: ipr_prov_15 ipr_prov_44

Note that this feature is still in development and may contain bugs.

## 4.3 Penalty selection

### 4.3.1 Cross-validation

The mlfitppml function enables users to carry out cross-validation of their models via the xval and IDs arguments. When xval is set to TRUE, the function performs cross-validation using a user-provided vector of IDs. For instance, if we want to do k-fold cross validation with k = 20, splitting the data set by agreement (not by observation), we can do the following:

id <- unique(trade[(trade$exp %in% selected) & (trade$imp %in% selected), 5])
nfolds <- 20
unique_ids <- data.frame(id = id, fold = sample(1:nfolds, size = length(id), replace = TRUE))

cross_ids <- merge(trade[(trade$exp %in% selected) & (trade$imp %in% selected), 5, drop = FALSE],
unique_ids, by = "id", all.x = TRUE)

Now we activate the cross-validation option in mlfitppml and input the ID vector:

reg5 <- mlfitppml(data = trade2,
dep = "export",
fixed = list(c("exp", "time"),
c("imp", "time"),
c("exp", "imp")),
penalty = "lasso",
lambdas = c(seq(0.5, 0.1, by = -0.1), 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0),
xval = TRUE,
IDs =  cross_ids$fold) Now the function also returns a rmse element that includes the cross-validation results (the average RMSE or root mean squared error for each value of lambda). Users can employ this tool to choose the value of the penalty parameter that minimizes the RMSE: reg5$rmse
reg5_nofe <- mlfitppml(data = trade2,
dep = "export",
fixed = NULL,
penalty = "lasso",
lambdas = c(seq(0.5, 0.1, by = -0.1), 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0),
xval = TRUE,
IDs =  cross_ids$fold) Table 4: Cross-validation results Penalty (lambda) RMSE 5e-01 108.5813 4e-01 108.5813 3e-01 108.5813 2e-01 108.5813 1e-01 108.5813 5e-02 108.5813 1e-02 108.5750 5e-03 108.5743 1e-03 108.5740 5e-04 108.5739 1e-04 108.5739 0e+00 108.5739 In this case, the RMSE criterion suggest that the penalty should be set at 0. In other words, the results in Table 3 are the ones that minimize the mean squared error of the regression, according to 20-fold cross-validation. ### 4.3.2 Plugin lasso This package also enables the use of the plugin lasso method, incorporating coefficient-specific penalty weights calculated automatically by the package. The most convenient way to do this is to set method = "plugin" in mlfitppml. Note that the plugin algorithm requires a clustering variable - we are using the interaction of the exporter and importer variables in the trade data set: reg6 <- mlfitppml(data = trade2, dep = "export", fixed = list(c("exp", "time"), c("imp", "time"), c("exp", "imp")), penalty = "lasso", method = "plugin", cluster = c("exp", "imp")) #> The following variables are dropped because their variation is equal zero: ipr_prov_15 ipr_prov_44 #> No of variables: 3 #> BIC: 9787471.95031103 results <- data.frame(prov = rownames(reg6$beta), b_pre = reg6$beta_pre, b = reg6$beta, se = 0)
results$se[!is.na(reg6$beta)] <- reg6$ses results Table 5: Plugin Lasso results Provision Lasso Coefficient Post-Lasso Coefficient SE ad_prov_14 0.0204 0.4593 0.1397 cp_prov_23 0.0000 0.0000 NA tbt_prov_07 0.0000 0.0000 NA tbt_prov_33 0.0000 0.0000 NA tf_prov_41 0.0000 0.0000 NA tf_prov_45 0.1403 0.2362 0.0760 env_prov_18 0.0000 0.0000 NA Provision Lasso Coefficient Post-Lasso Coefficient SE et_prov_38 0.0000 0.000 NA inv_prov_22 0.0082 0.161 0.0616 lm_prov_10 0.0000 0.000 NA moc_prov_21 0.0000 0.000 NA ser_prov_47 0.0000 0.000 NA ste_prov_30 0.0000 0.000 NA cp_prov_26 0.0000 0.000 NA We can see that the plugin lasso is very strict: only three provisions have non-zero coefficients when the penalties are set at the level suggested by the plugin algorithm. Compare this to the cross-validation results, which didn’t remove any of the provisions. We can also set the gamma value as defined in Belloni et al. to something less strict, e.g. 0.3. reg6_gamma <- mlfitppml(data = trade2, dep = "export", fixed = list(c("exp", "time"), c("imp", "time"), c("exp", "imp")), penalty = "lasso", method = "plugin", cluster = c("exp", "imp"), gamma_val=0.3) #> The following variables are dropped because their variation is equal zero: ipr_prov_15 ipr_prov_44 #> No of variables: 5 #> BIC: 9458290.22374599 rownames(reg6$beta)[which(reg6_gamma$beta != 0)] results_gamma <- data.frame(prov = rownames(reg6_gamma$beta), b_pre = reg6_gamma$beta_pre, b = reg6_gamma$beta, se = 0)
results_gamma$se[!is.na(reg6_gamma$beta)] <- reg6_gamma$ses results_gamma Table 6: Plugin Lasso results, different gamma value Provision Lasso Coefficient Post-Lasso Coefficient SE ad_prov_14 0.1761 0.0551 0.1721 cp_prov_23 0.0000 0.0000 NA tbt_prov_07 0.0340 0.3870 0.1716 tbt_prov_33 0.0000 0.0000 NA tf_prov_41 0.0436 0.4284 0.1845 tf_prov_45 0.1752 0.1622 0.0828 env_prov_18 0.0000 0.0000 NA Provision Lasso Coefficient Post-Lasso Coefficient SE et_prov_38 0.0000 0.0000 NA inv_prov_22 0.0652 0.1716 0.064 lm_prov_10 0.0000 0.0000 NA moc_prov_21 0.0000 0.0000 NA ser_prov_47 0.0000 0.0000 NA ste_prov_30 0.0000 0.0000 NA cp_prov_26 0.0000 0.0000 NA When doing this, we see that more provisions than before are selected. ### 4.3.3 Iceberg lasso (in development) This package also allows users to implement the two-step lasso described in Breinlich et al. (2021). This method consists in, first, running a plugin lasso estimation and second, running individual lasso regressions on each of the variables selected in the first stage. The iceberg function is designed precisely with this second step in mind. It takes a vector of dependent variables and returns a lasso regression for each one of them: iceberg_results <- iceberg(data = trade2[, -(1:4)], dep = results$prov[results$b != 0], selectobs = (trade2$time == "2016"))

Currently, the function returns a matrix with coefficient estimates for each of the selected provisions. Support for standard errors (including clustered) is in development as of the current version of the package. The iceberg lasso coefficients are:

iceberg_results
Table 7: Iceberg Lasso coefficients
cp_prov_23 0.0000 0.6143 0.0000
tbt_prov_07 0.0000 0.1002 0.0000
tbt_prov_33 -0.0042 0.0000 0.0000
tf_prov_41 0.0010 0.1793 0.0000
env_prov_18 0.0000 0.0000 0.0000
et_prov_38 0.0000 0.0000 0.3399
ipr_prov_15 0.0000 0.0000 0.0000
ipr_prov_44 0.0000 0.0000 0.0000
lm_prov_10 0.0000 0.0000 0.0000
moc_prov_21 0.0000 0.0000 0.0000
ser_prov_47 0.0000 0.0000 0.5074
ste_prov_30 0.0000 0.0489 0.0729
cp_prov_26 -0.0047 0.0491 -0.0191

Since PPML coefficients can’t be easily interpreted, you may find it useful to see the raw correlations between the variables in the iceberg lasso step:

provcorr <- cor(trade2[, results$prov]) (provcorr <- provcorr[, results$prov[results$b != 0]]) Table 7: Iceberg Lasso correlations ad_prov_14 tf_prov_45 inv_prov_22 ad_prov_14 1.0000 0.1471 0.2265 cp_prov_23 0.1466 0.9744 0.1593 tbt_prov_07 0.1096 0.7372 0.3075 tbt_prov_33 -0.0121 0.1790 0.1497 tf_prov_41 -0.0044 0.3072 -0.0196 tf_prov_45 0.1471 1.0000 0.1464 env_prov_18 -0.0007 -0.0048 0.0754 et_prov_38 0.3522 0.0792 0.5964 inv_prov_22 0.2265 0.1464 1.0000 lm_prov_10 -0.0007 0.0221 0.0754 moc_prov_21 0.3547 0.0804 0.6010 ser_prov_47 0.3154 0.0652 0.6676 ste_prov_30 0.1349 0.8065 0.5273 cp_prov_26 -0.0149 0.8290 0.2387 # 5 Bootstrap Lasso To illustrate this, we select the trade data, but now not dropping id and agreement variable. We want to cluster by agreement, that means that we draw observations with the same agreement ID. For this, we need to create a new agreement ID in the following way: trade3 <- trade[(trade$exp %in% selected) & (trade$imp %in% selected), ] # Now, we need id and agreement variable # Let's cluster by agreement trade3$alt_id <- trade3$id # ID refers to agreement ID trade3$alt_id[is.na(trade3$alt_id)] <- 0 # We set this to zero when the ID is missing, interpreting this as the country pair not being part of any agreement. # Create pair ID v1 <- do.call(paste, as.data.frame(t(apply(trade3[1:2], 1, sort)))) trade3$pair <-  match(v1, unique(v1))
trade3$pair <- trade3$pair + 500

# Create maximal ID from the preexisting ID-variable inside each pair
trade3 <- within(trade3, {alt_id2 = ave(alt_id,pair,FUN=max)} ) # This creates the maximum of the ID for each pair. This adjusts for the fact that some pairs might have been part of different agreements and we want to take get a unique agreement ID for each pair.

trade3$alt_id2[trade3$alt_id2==0] <- trade3$pair[trade3$alt_id2==0]
unique(trade3$alt_id2) # This cluster variable collects pairs for pairs that are in agreement, uses the pair ID for those that are not. # Thus, it allows errors to be clustered within agreements. alt_id2 <- factor(trade3$alt_id2)
trade3\$clus <- alt_id2 #Add the ID to the data

Next, we run Bootstrap Lasso, which means drawing agreement clusters with replacement using the ID just created. In the bootstrap() command, first hdfeppml() is run, using a dummy which is one when any provision was in place and zero otherwise to get initial values of mu. These are then used in plugin Lasso. This is repeated B times (as specified in bootreps) and the command selects those provisions selected by plugin in at least a fraction of cases. This fraction can be specified by the option boot_threshold (Default is 1 pc).

set.seed(123)
bs1 <- bootstrap(data=trade3, dep="export", cluster_id="clus", fixed=list(c("exp", "time"), c("imp", "time"), c("exp", "imp")), indep=7:22, bootreps=10, colcheck_x = TRUE, colcheck_x_fes = TRUE, boot_threshold = 0.01, post=TRUE, gamma_val=0.01, verbose=FALSE)
Table 8: Coefficients of bootstrap repetitions
1 2 3 4 5 6 7 8 9 10
ad_prov_14 0 0.63 0.00 0 NA NA 0 NA NA NA
cp_prov_23 0 0.00 0.00 0 0 0 0 0 0 0
tbt_prov_07 0 0.00 0.19 0 0 0 0 0 0 0
tbt_prov_33 0 0.00 0.00 0 0 0 0 0 NA NA
tf_prov_41 0 0.00 0.00 0 0 0 NA 0 0 0
tf_prov_45 0 0.00 0.21 0 0 0 0 0 0 0
env_prov_18 0 0.00 0.00 0 0 0 NA NA NA 0
et_prov_38 0 0.00 0.00 0 0 0 0 0 0 0
inv_prov_22 0 0.00 0.00 0 0 0 0 0 0 0
ipr_prov_15 NA NA NA NA NA NA NA NA NA NA
ipr_prov_44 NA NA NA NA NA NA NA NA NA NA
lm_prov_10 NA 0.00 0.00 0 0 0 0 0 0 NA
moc_prov_21 0 0.00 0.00 0 0 0 0 0 0 0
ser_prov_47 0 0.00 0.00 0 0 0 0 0 0 0
ste_prov_30 0 0.00 0.31 0 0 0 0 0 0 0
cp_prov_26 0 0.00 0.00 0 0 0 0 0 0 0

# 6 References

Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin, T. (2021). “Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements”, Policy Research Working Paper; No. 9629. World Bank, Washington, DC.

Correia, S., P. Guimaraes and T. Zylkin (2020). “Fast Poisson estimation with high dimensional fixed effects”, STATA Journal, 20, 90-115.

Gaure, S (2013). “OLS with multiple high dimensional category variables”, Computational Statistics & Data Analysis, 66, 8-18.

Friedman, J., T. Hastie, and R. Tibshirani (2010). “Regularization paths for generalized linear models via coordinate descent”, Journal of Statistical Software, 33, 1-22.

Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). “Inference in high dimensional panel models with an application to gun control”, Journal of Business & Economic Statistics, 34, 590-605.