library(statcanR)

Overview

statcanR provides the R user with a consistent process to collect data from Statistics Canada’s data portal. It provides access to all of Statistics Canada’s open economic data (formerly known as CANSIM tables), now identified by product IDs (PID) in Statistics Canada’s new Web Data Service: https://www150.statcan.gc.ca/n1/en/type/data.

Quick start

First, install statcanR:

devtools::install_github("warint/statcanR")

Next, load statcanR to make sure everything is installed correctly.

library(statcanR)

Practical usage

This section presents an example of how to use the statcanR package and its function statcan_data().

The following example illustrates how to use the function. It consists of collecting some descriptive statistics about the Canadian labour force at the federal, provincial and industrial levels, on a monthly basis.

A simple web search, such as ‘statistics canada wages by industry metropolitan area monthly’, makes it easy to find the table number on Statistics Canada’s webpage. The figure below illustrates this example with the table number ‘27-10-0014-01’, which identifies the Federal expenditures on science and technology, by socio-economic objectives.

Once the table number is identified, the statcan_data() function makes it easy to collect the data.

Should you want the same data in French, just replace the argument at the end of the function with “fra”.
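
As a minimal sketch, the calls described above might look like the following. It assumes that statcanR is installed and that network access is available. The variable names (table_number, data_eng, data_fra) are illustrative only, and the download is guarded so that nothing is fetched outside an interactive session.

```r
# Table number taken from the example above.
table_number <- "27-10-0014-01"

# The download needs the statcanR package and network access, so it only
# runs in an interactive session where the package is available.
if (interactive() && requireNamespace("statcanR", quietly = TRUE)) {
  data_eng <- statcanR::statcan_data(table_number, "eng")  # English version
  data_fra <- statcanR::statcan_data(table_number, "fra")  # French version
  head(data_eng)
}
```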

Code architecture

This section describes the code architecture of the statcanR package.

In detail, the statcan_data() function has two arguments:

  1. tableNumber
  2. lang

The ‘tableNumber’ argument refers to the table number of the Statistics Canada data table a user wants to collect, such as ‘27-10-0014-01’ for the Federal expenditures on science and technology, by socio-economic objectives.

The second argument, ‘lang’, refers to the language. As Canada is a bilingual country, Statistics Canada publishes all the economic data in both languages. Therefore, users can choose to collect statistics data tables in French or English by setting the lang argument to either “fra” or “eng”.
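
To make the two arguments concrete, here is a small sketch of how they could be checked before any download is attempted. The helper validate_args() is hypothetical and not part of statcanR, and the table number pattern is an assumption inferred from the example ‘27-10-0014-01’.

```r
# Hypothetical helper (not part of statcanR): check both arguments up front.
validate_args <- function(tableNumber, lang) {
  # Assumed format, inferred from the example "27-10-0014-01".
  if (!grepl("^[0-9]{2}-[0-9]{2}-[0-9]{4}-[0-9]{2}$", tableNumber)) {
    stop("tableNumber should look like '27-10-0014-01'")
  }
  # Only the two language codes named in the text are accepted.
  if (!lang %in% c("eng", "fra")) {
    stop("lang should be either 'eng' or 'fra'")
  }
  invisible(TRUE)
}

validate_args("27-10-0014-01", "fra")  # passes silently
```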

The code architecture of the statcan_data() function is as follows. The first step is to clean the table number in order to align it with the official typology of Statistics Canada’s Web Data Service. The second step is to create a temporary folder where all the data wrangling operations are performed. The third step is to check and select the correct language. The fourth step is to define the right URL where the actual data table is stored and then to download the .Zip container. The fifth step is to unzip the previously downloaded .Zip file to extract the data and metadata from the .csv files. The final step is to load the statistics data into a data frame called ‘sqs_data’ and to add the official table indicator name in the new ‘INDICATOR’ column.
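
The cleaning and URL steps can be sketched without any network access. The helpers clean_table_number() and build_zip_url() are hypothetical names used for illustration. The French URL pattern is the one given in this document, while the English pattern is assumed by analogy.

```r
# Hypothetical helper: drop the two-digit excerpt suffix, then the dashes,
# so that "27-10-0014-01" becomes "27100014".
clean_table_number <- function(tableNumber) {
  gsub("-", "", substr(tableNumber, 1, nchar(tableNumber) - 2))
}

# Hypothetical helper: build the .Zip URL for a cleaned table number.
# The "fr" pattern comes from this document; the "en" one is an assumption.
build_zip_url <- function(tableNumber, lang) {
  site <- switch(lang,
                 eng = "en",
                 fra = "fr",
                 stop("lang should be either 'eng' or 'fra'"))
  paste0("https://www150.statcan.gc.ca/n1/", site, "/tbl/csv/",
         tableNumber, "-", lang, ".zip")
}

pid <- clean_table_number("27-10-0014-01")
url_fra <- build_zip_url(pid, "fra")
```

Keeping the two steps as separate functions makes each one easy to check on its own before anything is downloaded.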

To be more precise about the statcan_data() function, here is a more detailed description of the code:

  • tableNumber <- gsub("-", "", substr(tableNumber, 1, nchar(tableNumber)-2)): The first step is to clean the table number provided by the user in order to collect the overall data table related to the specific indicator the user is interested in. In fact, each indicator is an excerpt of the overall table. In addition, under Statistics Canada’s new Web Data Service, the REST API URLs identify the csv files by table numbers without a ‘-’. Also, the last 2 digits after the last ‘-’ refer to the specific excerpt of the original table. Therefore, following the Statistics Canada Web Data Service’s typology, the function first removes the ‘-’ and the last 2 digits from the user’s selection.

  • if(lang == "eng") | if(lang == "fra"): The second step is an ‘if statement’ that selects the code branch for the requested language.

  • urlFra <- paste0("https://www150.statcan.gc.ca/n1/fr/tbl/csv/", tableNumber, "-fra.zip"): The third step is to create the correct URL in order to download the respective .Zip file from the Statistics Canada Web Data Service.

  • download(urlFra, download_dir, mode = "wb"): The fourth step is a simple downloading function that retrieves the .Zip file and saves it into a temporary folder.

  • unzip(zipfile = download_dir, exdir = unzip_dir): The fifth step consists of unzipping the .Zip file. The unzipping process gives access to two different .csv files: the overall data table and the metadata table.

  • data.table::fread(csv_file): The sixth step consists of loading the data table into a single data frame. The fread() function from the data.table package is used for its higher performance.

  • data$INDICATOR <- as.character(0) and data$INDICATOR <- as.character(read.csv(paste0(path,"/temp/", tableNumber, "_MetaData.csv"))[1,1]): The seventh step of the statcan_data() function consists of adding the name of the table, taken from the metadata table, in the ‘INDICATOR’ column.

  • unlink(tempdir()): The eighth step deletes the temporary folder used to download, unzip, extract and load the data.

  • return(sqs_data): Finally, the last step of the statcan_data() function returns the data frame to the user’s environment.
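
The loading, labelling and clean-up steps listed above can be tried offline with a short sketch. It is not the package’s own code: the two .csv files and their contents are invented stand-ins for the files found in the .Zip container, the folder name is arbitrary, and base read.csv() is swapped in for data.table::fread() so that no extra package is needed.

```r
# Dedicated working folder inside the session's temporary directory.
work_dir <- file.path(tempdir(), "statcan_sketch")
dir.create(work_dir, showWarnings = FALSE)

# Invented stand-ins for the data table and the metadata table.
data_file <- file.path(work_dir, "27100014.csv")
meta_file <- file.path(work_dir, "27100014_MetaData.csv")
write.csv(data.frame(REF_DATE = c("2019/2020", "2020/2021"),
                     VALUE = c(1.5, 2.5)),
          data_file, row.names = FALSE)
write.csv(data.frame(Title = "Example table title", ProductId = 27100014),
          meta_file, row.names = FALSE)

# Load the data table (read.csv stands in for data.table::fread here).
sqs_data <- read.csv(data_file)

# The table name is read from the first cell of the metadata file.
sqs_data$INDICATOR <- as.character(read.csv(meta_file)[1, 1])

# Remove only the dedicated folder; recursive = TRUE is needed for directories.
unlink(work_dir, recursive = TRUE)
```

The sketch uses a dedicated subfolder because unlink() does not delete a directory unless recursive = TRUE is set, and removing only that subfolder leaves the rest of the session’s temporary directory untouched.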