Tutorial package calidad

The package aims to implement in a simple way the methodologies of INE Chile and ECLAC for the quality assessment of estimates from household surveys.

This tutorial shows the basic use of the package and includes the main functions to create the necessary inputs to implement both quality standards.

Data editing

We will use two datasets:

  • Encuesta Nacional de Empleo (efm 2020)
  • VIII Encuesta de Presupuestos Familiares

Both datasets ship with the package and are available as soon as it is loaded in the session. For the ENE, the data editing creates some subpopulations (labour force, employed and unemployed).

library(survey)
library(calidad)
library(dplyr)
ene <- ene %>% 
  mutate(fdt = if_else(cae_especifico >= 1 & cae_especifico <= 9, 1, 0), # labour force
         ocupado = if_else(cae_especifico >= 1 & cae_especifico <= 7, 1, 0), # employed
         desocupado = if_else(cae_especifico >= 8 & cae_especifico <= 9, 1, 0)) # unemployed

# One row per household
epf <- epf_personas %>% 
  group_by(folio) %>% 
  slice(1) %>% 
  ungroup()
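
As a quick sanity check on the recoding above, the following sketch applies the same cut-offs to a toy vector of cae_especifico codes and verifies that employed and unemployed together make up the labour force. The toy values are hypothetical and are not taken from the ENE.

```r
# Toy data: hypothetical cae_especifico codes, not real ENE records
toy <- data.frame(cae_especifico = c(0, 1, 5, 7, 8, 9, 12, 28))

# Same cut-offs as in the tutorial, written with base R
toy$fdt        <- ifelse(toy$cae_especifico >= 1 & toy$cae_especifico <= 9, 1, 0)
toy$ocupado    <- ifelse(toy$cae_especifico >= 1 & toy$cae_especifico <= 7, 1, 0)
toy$desocupado <- ifelse(toy$cae_especifico >= 8 & toy$cae_especifico <= 9, 1, 0)

# Employed and unemployed should partition the labour force
stopifnot(all(toy$fdt == toy$ocupado + toy$desocupado))
print(toy)
```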

Sample design

Before using the package, we need to declare the sample design of the survey with the survey package. The primary sampling unit (PSU), the stratum and the weights must be declared. A design with only weights is also possible, but in that case the variance is estimated under a simple random sampling assumption. Here we declare a complex design for both surveys (EPF and ENE). It may also be useful to set an option for strata that have only one PSU.

# Store original options
old_options <-  options()
# Complex sample design for ENE
dc_ene <- svydesign(ids = ~conglomerado , strata = ~estrato_unico, data = ene, weights = ~fact_cal)
# Complex sample design for EPF
dc_epf <- svydesign(ids = ~varunit, strata = ~varstrat, data = epf, weights = ~fe)
# Strata with a single PSU are treated as certainty units and add nothing to the variance
options(survey.lonely.psu = "certainty")
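
To see why declaring the PSU matters, here is a minimal sketch that uses the apiclus1 example data shipped with the survey package (not the ENE or EPF). It compares a design that declares the cluster with a weights-only design: the point estimate is the same, while the standard error is not.

```r
library(survey)
data(api)  # example datasets shipped with the survey package

# Design that declares the PSU (school district) and the weights
d_cluster <- svydesign(ids = ~dnum, weights = ~pw, data = apiclus1)
# Weights-only design: every row is treated as its own sampling unit
d_weights <- svydesign(ids = ~1, weights = ~pw, data = apiclus1)

m_cluster <- svymean(~api00, d_cluster)
m_weights <- svymean(~api00, d_weights)

# Same point estimate, different standard errors
c(estimate        = as.numeric(coef(m_cluster)),
  se_cluster      = as.numeric(SE(m_cluster)),
  se_weights_only = as.numeric(SE(m_weights)))
```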

Inputs creation

National Labour Survey (part 1)

To assess the quality of an estimate, the INE methodology establishes differentiated criteria for estimates of proportion (or ratio), on the one hand, and estimates of mean, size and total, on the other. In the case of proportion estimation, it is necessary to have the sample size, the degrees of freedom and the standard error. The other estimates require the sample size, the degrees of freedom, and the coefficient of variation.
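
As a small illustration of one of these inputs, the coefficient of variation is the standard error divided by the point estimate. The sketch below reproduces it with the values of the region 1 row from the assess output printed at the end of this tutorial.

```r
# Values copied from the region 1 row of the assess() output in this tutorial
stat <- 13830.218   # point estimate
se   <- 2105.8402   # standard error

cv <- se / stat     # coefficient of variation
round(cv, 4)
# 0.1523
```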

The package includes separate functions to create the inputs for estimates of mean, proportion, totals and size. The following example shows how the proportion and size functions are used.

insumos_prop <- create_prop(var = "desocupado", domains = "sexo", subpop = "fdt", design =  dc_ene) # proportion of unemployed people
insumos_total <-  create_size(var = "desocupado", domains = "sexo", subpop = "fdt", design =  dc_ene) # number of unemployed people
  • var: variable to be estimated. Must be a dummy variable
  • domains: required domains.
  • subpop: reference subpopulation. It is optional and works as a filter (must be a dummy variable)
  • design: sample design
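
To make the role of subpop concrete, the following base R sketch computes a weighted proportion with and without the filter on a toy data frame. The data and the column w are hypothetical, and the sketch only illustrates the point estimate; it is not meant to describe how the package computes it internally, and the standard errors still have to come from the declared design.

```r
# Toy records with hypothetical weights (not real ENE data)
toy <- data.frame(
  fdt        = c(1, 1, 1, 1, 0, 0),
  desocupado = c(1, 0, 0, 0, 0, 0),
  w          = c(10, 20, 30, 40, 50, 60)
)

# With the filter: unemployed as a share of the labour force
in_subpop <- toy$fdt == 1
prop_fdt <- sum(toy$w[in_subpop] * toy$desocupado[in_subpop]) / sum(toy$w[in_subpop])

# Without the filter: the denominator is the whole population
prop_all <- sum(toy$w * toy$desocupado) / sum(toy$w)

c(within_labour_force = prop_fdt, whole_population = prop_all)
```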

The function returns all the necessary inputs to implement the standard.

To get more domains, we can use the “+” symbol as follows:

desagregar <- create_prop(var = "desocupado", domains = "sexo+region", subpop = "fdt", design =  dc_ene)
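
The "+" notation mirrors the right-hand side of an R formula. The sketch below shows one way such a string could be split into individual domain names or turned into a one-sided formula; this is an assumption for illustration and not necessarily how the package parses the argument.

```r
domains <- "sexo+region"

# Split the string into individual domain names
domain_vars <- trimws(strsplit(domains, "+", fixed = TRUE)[[1]])
domain_vars

# Equivalent one-sided formula, the form that survey::svyby() expects for `by`
domain_formula <- as.formula(paste0("~", domains))
all.vars(domain_formula)
```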

A useful parameter is eclac_input, which makes the function return the ECLAC inputs. It is FALSE by default; set it to TRUE to activate it.


eclac_inputs <-  create_prop(var = "desocupado", domains = "sexo+region", subpop = "fdt", design =  dc_ene, eclac_input = TRUE)

Household Budget Survey (part 2)

In some cases it may be of interest to assess the quality of a sum, for example the sum of all household expenditure in the EPF by geographical area (Gran Santiago and other regional capitals). The create_total function is intended for this. It receives a continuous variable such as hours, expenditure or income and generates totals at the requested level.

insumos_suma <-  create_total(var = "gastot_hd", domains = "zona", design =  dc_epf)

If we want to assess the estimate of a mean, we have the function create_mean. In this case, we will calculate the average expenditure of households, according to geographical area.

insumos_media <-  create_mean(var = "gastot_hd", domains = "zona", design =  dc_epf)
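
For intuition about what is being estimated, the sketch below computes a weighted total and a weighted mean by zone in base R. The toy data frame and its columns (zona, gasto, w) are hypothetical; the package functions additionally return the standard error, degrees of freedom and the other inputs derived from the sample design.

```r
# Toy household records with hypothetical expenditure and weights
toy <- data.frame(
  zona  = c("A", "A", "B", "B"),
  gasto = c(100, 300, 200, 400),
  w     = c(2, 2, 1, 3)
)

# Weighted total per zone: sum of weight * value
total_by_zone <- tapply(toy$w * toy$gasto, toy$zona, sum)
# Weighted mean per zone: weighted total divided by the sum of weights
mean_by_zone <- total_by_zone / tapply(toy$w, toy$zona, sum)

total_by_zone
mean_by_zone
```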

By default the functions do not disaggregate. To obtain estimates for the whole population, omit the domains argument:

# ENE dataset
insumos_prop_nacional <- create_prop("desocupado", subpop = "fdt", design = dc_ene)
insumos_total_nacional <- create_size("desocupado", subpop = "fdt", design = dc_ene)
# EPF dataset
insumos_suma_nacional <- create_total("gastot_hd", design = dc_epf)
insumos_media_nacional <-  create_mean("gastot_hd", design = dc_epf)

Assessment

Once the inputs have been generated, we can carry out the assessment. To do this, we use the assess function.

evaluacion_prop <- assess(insumos_prop)
evaluacion_tot <- assess(insumos_total)
evaluacion_suma <- assess(insumos_suma)
evaluacion_media <- assess(insumos_media)

The output is a data frame that, in addition to the information already generated, includes a column that labels each estimate as non-reliable, weakly reliable or reliable.
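
The labelling logic for estimates of size, mean and total can be sketched as follows. The cv cut-offs 0.15 and 0.3 are the ones that appear in the assess output printed later in this tutorial. The minimum sample size (n_min = 60), the minimum degrees of freedom (df_min = 9) and the function name classify_cv are assumptions made for this illustration; they are consistent with that output but are not taken from the package code.

```r
# Hypothetical helper: label an estimate from its sample size, df and cv
classify_cv <- function(n, df, cv, n_min = 60, df_min = 9) {
  ok_size <- n >= n_min & df >= df_min   # assumed thresholds, see lead-in
  ifelse(!ok_size | cv > 0.30, "non-reliable",
         ifelse(cv <= 0.15, "reliable", "weakly reliable"))
}

# Regions 1, 2 and 11 from the output printed later in this tutorial
labels <- classify_cv(n  = c(79, 153, 59),
                      df = c(40, 72, 37),
                      cv = c(0.1523, 0.1020, 0.1588))
labels
# "weakly reliable" "reliable" "non-reliable"
```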

The assess function has a parameter, publish, that indicates whether the table should be published. Following the criteria of the standard, a table should not be published if more than 50% of its estimates are not reliable.

# Unemployment by region
desagregar <- create_size(var = "desocupado", domains = "region", subpop = "fdt", design =  dc_ene)

# assess output
evaluacion_tot_desagreg <- assess(desagregar, publish = TRUE)
evaluacion_tot_desagreg
#>    region       stat         se  df   n         cv                   eval_n
#> 1       1  13830.218  2105.8402  40  79 0.15226370   sufficient sample size
#> 2       2  33899.555  3457.3783  72 153 0.10198890   sufficient sample size
#> 3       3  13181.473  1288.0572  67 125 0.09771725   sufficient sample size
#> 4       4  38572.642  3502.4280  97 195 0.09080083   sufficient sample size
#> 5       5  87670.219  5253.3214 229 473 0.05992139   sufficient sample size
#> 6       6  41307.341  4015.1321  94 193 0.09720142   sufficient sample size
#> 7       7  35142.800  3171.1718  85 189 0.09023674   sufficient sample size
#> 8       8  65956.207  4167.7798 204 401 0.06319011   sufficient sample size
#> 9       9  34223.136  3409.5040  81 149 0.09962570   sufficient sample size
#> 10     10  20843.453  2212.3500  68 127 0.10614124   sufficient sample size
#> 11     11   3128.436   496.9203  37  59 0.15883985 insufficient sample size
#> 12     12   4560.165   806.3695  28  44 0.17682901 insufficient sample size
#> 13     13 369743.430 19615.0415 318 681 0.05305041   sufficient sample size
#> 14     14  12991.098  1533.0491  61 112 0.11800766   sufficient sample size
#> 15     15   7869.593  1081.7453  52  85 0.13745886   sufficient sample size
#> 16     16  18885.030  2039.7704  66 128 0.10800991   sufficient sample size
#>          eval_df                 eval_cv           label publication
#> 1  sufficient df cv between 0.15 and 0.3 weakly reliable     publish
#> 2  sufficient df              cv <= 0.15        reliable     publish
#> 3  sufficient df              cv <= 0.15        reliable     publish
#> 4  sufficient df              cv <= 0.15        reliable     publish
#> 5  sufficient df              cv <= 0.15        reliable     publish
#> 6  sufficient df              cv <= 0.15        reliable     publish
#> 7  sufficient df              cv <= 0.15        reliable     publish
#> 8  sufficient df              cv <= 0.15        reliable     publish
#> 9  sufficient df              cv <= 0.15        reliable     publish
#> 10 sufficient df              cv <= 0.15        reliable     publish
#> 11 sufficient df cv between 0.15 and 0.3    non-reliable     publish
#> 12 sufficient df cv between 0.15 and 0.3    non-reliable     publish
#> 13 sufficient df              cv <= 0.15        reliable     publish
#> 14 sufficient df              cv <= 0.15        reliable     publish
#> 15 sufficient df              cv <= 0.15        reliable     publish
#> 16 sufficient df              cv <= 0.15        reliable     publish
#>                         pass
#> 1  81.25% reliable estimates
#> 2  81.25% reliable estimates
#> 3  81.25% reliable estimates
#> 4  81.25% reliable estimates
#> 5  81.25% reliable estimates
#> 6  81.25% reliable estimates
#> 7  81.25% reliable estimates
#> 8  81.25% reliable estimates
#> 9  81.25% reliable estimates
#> 10 81.25% reliable estimates
#> 11 81.25% reliable estimates
#> 12 81.25% reliable estimates
#> 13 81.25% reliable estimates
#> 14 81.25% reliable estimates
#> 15 81.25% reliable estimates
#> 16 81.25% reliable estimates
# Reset original options
options(old_options)
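
The publication rule can be reproduced by hand from the label column of the table above. The sketch below rebuilds the 16 labels (13 reliable, 1 weakly reliable, 2 non-reliable) and assumes that "not reliable" means any label other than "reliable", which matches the 81.25% reported in the pass column.

```r
# Labels as they appear in the table above
labels <- c(rep("reliable", 13), "weakly reliable", rep("non-reliable", 2))

# Share of estimates labelled "reliable"
share_reliable <- mean(labels == "reliable")
sprintf("%.2f%% reliable estimates", 100 * share_reliable)

# Do not publish when more than half of the estimates are not reliable
decision <- if (share_reliable >= 0.5) "publish" else "do not publish"
decision
```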