SAS Programming Tips: Combining variables from different datasets without joining the data tables
It's been a standard practice to use look-up tables when we need to bring some variable from a different table to an analysis SAS table which we work with. It is quite simple, especially if we have a common key variable on both tables. We just need to merge/join these two tables by the key, of course making sure that both tables are sorted by that key (well, if proc sql is used then sorting is not explicitly written in the code, but it is still done behind the scene.)
Suppose a dataset A has variables ID_VAR and VAR1; dataset B has (among others) variables ID_VAR and VAR2. We need to bring variable VAR2 to the data table A. Here is the standard SAS code:
proc sort data=A; by ID_VAR; run; proc sort data=B; by ID_VAR; run; data A; merge A B (keep=ID_VAR VAR2); by ID_VAR; run;
Simple, but tedious; and it requires 2 procs sort and 1 merge.
Better SAS Solution
All these sorts and merges can be avoided by creating a dynamic format out of the table B values and then using it to re-create variable VAR2 on the data set A as in the following SAS code example:
data C (keep=fmtname start label); set B (keep = keep=ID_VAR VAR2); retain fmtname 'fmtvar'; start = ID_VAR; label = VAR2; run; proc format cntlin = C; run; data A; set A; VAR2 = put(ID_VAR,fmtvar.); run;
The first two steps of creating the dynamic format can be packaged into a SAS macro as follows:
%macro make_format(dsname=,fmtname=,keyvar=,value=); data _data_ (keep=fmtname start label); set &dsname (keep = &start &label); retain fmtname "&fmtname"; start = &keyvar; label = &value; run; proc format cntlin = _last_; run; %mend make_format;
In this macro, _data_ means that the created data set is named automatically according to the DATAn convention: the first such data set is named data1, the second is named data2, and so on depending on availability. That is a numeric suffix of the data name created is assigned as a minimum number available; it will not overwrite an existing name.
_last_ means the last created data table. It is an automatic data table name that was created the latest. In this case, it will be whatever name was created in the data _data_ step.
Using the _data_ and _last_ in the macro ensures that the macro can be invoked from any SAS code without the risk of inadvertently overwriting a SAS data table used by the calling program.
Then adding a new variable VAR2 from data set B to data set A will result in the following simple SAS code:
%make_format(dsname=B, fmtname=fmtvar, keyvar=ID_VAR, value=VAR2); data A; set A; var2 = put(ID_VAR,fmtvar.); run;
Assumption here is that ID_VAR is a numeric variable. If it is of a character type than the format name should start with $, such as $fmtvar.
Another assumption is that VAR2 is a character variable. If it is numeric then the last data step could be written as:
data A; set A; var2 = input(put(ID_VAR,fmtvar.),best.); run;
No code changes are required in the macro itself regardless of the data types.
In this post, we use "dataset", "data set", "table" and "data table" interchangeably as synonyms.