/*MACRO THAT SIMULATES DATA PEEKING;
Syntax: %peeking(simtot,n0,n1,every,d,seed,mk);
simtot: number of simulations (typically 100000);
n0: the sample size per cell when peeking begins, e.g., simulating collecting n=20 per cell and adding 5 at a time would lead to n0=20
n1: the sample size per cell when peeking ends if no p<.05 has been found so far (e.g., n1=40 if the researcher drops the study when it is still not significant after 40 subjects per cell).
every: how many observations per cell are added between peeks, e.g., simulating collecting n=20 per cell and adding 5 at a time would lead to every=5
d: effect size as Cohen's d, d=(m1-m0)/sigma
seed: where the random draws begin; set it so that the same result is obtained every time the same macro call is run. The number itself has no intrinsic meaning.
mk: a label that identifies the final file with p-values so it can be referenced later; if mk=1 the resulting file will be called p1 (and the significant results are kept in sig1).
* Approach to the computation:
Generate a dataset in which
rows are simulated studies
columns are the values of the d.v. for each of N observations
e.g., if user sets n1=40, then there are 40 randomly drawn N(0,1) followed by 40 N(d,1)
Compute the sample means and SDs for the first n0 observations
Then compute them for n0+every
Then compute them for n0+2*every and so on
This will result in [(n1-n0)/every + 1] averages and SDs per sample, for example if you set n0=20, n1=40, every=1, then there will be 21 means of sample 1, and 21 means of sample 2 computed (with n=20, n=21,...n=40)
Compute the t-test and p-value for each of those means and SD comparisons, so there will also be [(n1-n0)/every + 1] p-values per row
Take the first p-value in that set, for a given row, that is p<.05 with the difference in the predicted direction (y2>y1), if any.
*/
%macro peeking(simtot,n0,n1,every,d,seed,mk);
*timestamp;
%let a=%sysfunc(time(),time8.) ;
*(1) Generate empty file with simtot rows;
data p&mk;
do i=1 to &simtot;
output;
end;
run;
*(2) n1 random variables for each of two cells;
data p&mk;
set p&mk;
array y1(&n1) y1_1 - y1_&n1;
array y2(&n1) y2_1 - y2_&n1;
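*The let statements are resolved by the macro processor before the data step runs, so seed1 and seed2 hold expression text that is pasted into the calls to normal below;
*SAS uses the seed argument of normal only on its first call in a data step, so the draws form one reproducible stream and k does not reseed it;
%let seed1=&seed*110+k;
%let seed2=&seed*120+k;
do k=1 to &n1;
y1(k)=normal(&seed1);
y2(k)=normal(&seed2) + &d;
end;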
run;
*(3) Compute means and SDs at every peek between n0 and n1;
data p&mk;
set p&mk;
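*Number of peeks. Integer division truncates, so n1-n0 should be a multiple of every, otherwise the last peek falls short of n1;
%let ktot=%eval( 1+ (&n1-&n0)/&every);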
*Create arrays for each peek;
*means;
array av1(&ktot) av1_1-av1_&ktot;
array av2(&ktot) av2_1-av2_&ktot;
*sds;
array sd1(&ktot) sd1_1-sd1_&ktot;
array sd2(&ktot) sd2_1-sd2_&ktot;
*Compute the stats and p-values;
%do k=1 %to &ktot;
%let n1_temp=%eval(&n0+(&k-1)*&every );
*means;
av1_&k=mean(of y1_1-y1_&n1_temp);
av2_&k=mean(of y2_1-y2_&n1_temp);
*sds;
sd1_&k=std (of y1_1-y1_&n1_temp);
sd2_&k=std(of y2_1-y2_&n1_temp);
*se, t and p-value;
se_&k=(sd1_&k.**2/&n1_temp +sd2_&k.**2/&n1_temp)**.5;
t_&k=(av1_&k-av2_&k)/se_&k;
p_&k=2*(1-cdf("t",abs(t_&k),2*(&n1_temp)-2));
*if direction of test is as predicted, y2>y1, p-value is <.05 and none before was, keep the p-value and the associated t and d.f.;
if av2_&k>av1_&k and p_&k<.05 and p=. then do;
p=p_&k;
df=2*&n1_temp-2;
t=t_&k;
end;
%end;
run;
*(4) Keep only the simulated studies that reached significance at some peek;
data sig&mk;
set p&mk;
if p ne .;
keep p df t;
run;
*KEEPING TRACK OF COMPUTATION TIME;
%put STARTED : &a;
%put ENDED: %sysfunc(time(),time8.) ;
%mend;
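
/* Example call: a minimal sketch, not part of the original macro. The argument
   values are illustrative assumptions: 1000 simulated studies, peeking from
   n0=20 to n1=40 per cell in steps of 5, true effect d=0, seed=1, mk=1. */
%peeking(1000,20,40,5,0,1,1);

/* With d=0, every row kept in sig1 is a false positive in the predicted
   direction. The data step below prints the share of simulated studies that
   ended up in sig1. The names ntot, nsig and share are hypothetical and were
   chosen for this sketch only. */
data _null_;
if 0 then set p1 nobs=ntot;
if 0 then set sig1 nobs=nsig;
share=nsig/ntot;
put "Share of simulations with p<.05 at some peek: " share 6.4;
stop;
run;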