statCanR

library(statcanR)

Overview

statcanR provides the R user with a consistent process to identify and collect data from Statistics Canada’s data portal. It allows users to search for and access all of Statistics Canada’ open economic data (formerly known as CANSIM tables) now identified by product IDs (PID) by the new Statistics Canada’s Web Data Service: https://www150.statcan.gc.ca/n1/en/type/data.

Quick start

First, install statcanR:

devtools::install_github("warint/statcanR")

Next, call statcanR to make sure everything is installed correctly.

library(statcanR)

Practical usage

This section presents an example of how to use the statcanR R package and its functions: statcan_search() and statcan_data().

The following example is provided to illustrate how to use these functions. It consists of collecting some descriptive statistics about the Canadian labour force at the federal, provincial and industrial levels, on a monthly basis.

To identify a relevant table, the statcan_search() function can be used by using a keyword or set of keywords, depending on the language the user wishes to search in (between English and French). Below is an example that reveals the data tables we could be interested in:

# Identify with statcan_search() function
statcan_search(c("federal","expenditures"),"eng")

Notice that for each corresponding table, the unique table number identifier is also presented. Let’s focus on the first table out of the two that appear, which contains data on Federal expenditures on science and technology, by socio-economic objectives. Once this table number is identified (‘27-10-0014-01’), the statcan_data() function is easy to use in order to collect the data, as following:

# Get data with statcan_data function
mydata <- statcan_download_data("27-10-0014-01", "eng")

Should you want the same data in French, just replace the argument at the end of the function with “fra”.

Code architecture

This section describes the code architecture of the statcanR package.

The statcan_search() function has 2 arguments:

keywords
lang

The ‘keywords’ argument is useful for identifying what kind of open economic data Statistics Canada has available for users. It can take as inputs either a single word (i.e., statcan_search(“expenditures”)) or a vector of multiple keywords (such as: statcan_search(c(“expenditures”,“federal”,“objectives”))).

The second argument, ‘lang’, refers to the language. As Canada is a bilingual country, Statistics Canada displays all the economic data in both languages. Therefore, users can see what data tables are available in French or English by setting the lang argument with c(“fra”, “eng”).

For all data tables that contain these words in either their title or description, the resulting output will include four pieces of information: The name of the data table, its unique table number identifier, the description, and the release date. In details, the statcan_download_data() function has 2 arguments:

tableNumber
lang

The ‘tableNumber’ argument simply refers to the table number of the Statistics Canada data table a user wants to collect, such as ‘27-10-0014-01’ for the Federal expenditures on science and technology, by socio-economic objectives, as an example.

Just as in the statcan_search() command, ‘lang’, refers to the language, and can be set to “eng” for English or “fra” for French. Therefore, users can choose to collect satistics data tables in French or English by setting the lang argument with c(“fra”, “eng”).

The code architecture of the statcan_download_data() function is as follows. The first step if to clean the table number in order to align it with the official typology of Statistics Canada’s Web Data Service. The second step is to create a temporary folder where all the data wrangling operations are performed. The third step is to check and select the correct language. The fourth step is to define the right URL where the actual data table is stored and then to download the .Zip container. The fifth step is to unzip the previously downloaded .Zip file to extract the data and metadata from the .csv files. The final step is to load the statistics data into a data frame called ‘data’ and to add the official table indicator name in the new ‘INDICATOR’ column.

To be more precise about the functions, below is some further code description:

statcan_search()

keyword_regex <- paste0("(", paste(keywords, collapse = "|"), ")", collapse = ".*"): The first step is to paste (either individually or in vector format) the keywords the user is interested in
matches <- apply(statcan_data, 1, function(row) { all(sapply(keywords, function(x) { grepl(x, paste(as.character(row), collapse = " ")) })) }): Next, the matching relies on the sapply function, which is applied to any character string in a row of a given observation. Therefore, the “keyword” could be found in the title column or the description column (technically, it could even be in the “release date” or “id” columns).
filtered_data <- statcan_data[matches, ] The third step takes only the data where matches were found.
datatable(filtered_data, options = list(pageLength = 10)) Finally, the datatable command presents the filtered data in a clean table.

statcan_download_data()