#### *3.1. Datasets for Mentor and Student*

#### 3.1.1. Dataset for Mentor Training

CIFAR-10 [45] is an established dataset used for object recognition. It consists of 60,000 RGB images of size 32 × 32 from 10 classes, with 6000 images per class. The official split contains 50,000 training images and 10,000 test images. The mentor is a model pretrained on the CIFAR-10 dataset. We use the test set to measure the success rates of both our mentor and student models. Note that the training set of CIFAR-10 is never used in the training process of the student (to conform to our assumption that the student has no access to the data used to train the mentor), and the test set, as mentioned above, is used for validation only (without training).
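For concreteness, the following is a minimal PyTorch sketch of this evaluation protocol; the names `mentor` and `student` stand for the two trained models and, like the batch size and paths, are our own illustrative choices:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Only the official CIFAR-10 test split is loaded (train=False); the official
# training split is deliberately never used by the student.
test_set = datasets.CIFAR10(root="./data", train=False, download=True,
                            transform=transforms.ToTensor())
test_loader = DataLoader(test_set, batch_size=256, shuffle=False)

def success_rate(model, loader, device="cpu"):
    """Fraction of test images the model classifies correctly."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, targets in loader:
            preds = model(images.to(device)).argmax(dim=1)
            correct += (preds == targets.to(device)).sum().item()
            total += targets.size(0)
    return correct / total

# success_rate(mentor, test_loader) and success_rate(student, test_loader)
# are measured on the same held-out split.
```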

#### 3.1.2. Dataset for Mimicking Process

ImageNet [46] is a dataset containing complex, real-world images. In particular, ImageNet\_ILSVRC2012 contains more than 1.2 million RGB images of size 256 × 256 from 1000 categories. We use this dataset without its labels (i.e., as an unlabeled dataset) for the mimicking process. Each image is down-sampled to 32 × 32 and fed into the mentor model, and the mentor's prediction is recorded (for later mimicking by the student). Note that any large unlabeled image dataset could be used instead; we used this common large dataset for convenience only.
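A minimal sketch of this preparation step is shown below, assuming a PyTorch mentor model (`mentor`) and the ImageNet images on disk; the path, batch size, and variable names are our own assumptions:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Down-sample each ImageNet image to the mentor's 32 x 32 input size; the
# ImageNet labels are discarded, since the dataset is used unlabeled.
downsample = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
])
imagenet = datasets.ImageFolder("/path/to/ILSVRC2012", transform=downsample)
loader = DataLoader(imagenet, batch_size=512, shuffle=False)

mentor.eval()
recorded = []
with torch.no_grad():
    for images, _ in loader:  # '_' drops the (unused) folder labels
        probs = mentor(images).softmax(dim=1)
        recorded.append(probs.argmax(dim=1))  # the mentor's predicted class
recorded = torch.cat(recorded)  # one recorded prediction per image
```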

#### *3.2. Composite Data Generation*

Our goal is to create a diverse dataset that allows us to observe the predictions of the mentor on many possible inputs, thereby gaining insight into how the mentor behaves on different samples. That is, the more adequately the input space is sampled, the better the mimicking process performs. The entire available unlabeled dataset, i.e., the down-sampled ImageNet, is stored in an array *dataArr*. For each training example to be generated, we randomly choose two indexes *i*<sub>1</sub>, *i*<sub>2</sub>, such that 0 ≤ *i*<sub>1</sub>, *i*<sub>2</sub> < *N*, where *N* is the number of samples we create and use for training the student model. In our composite method, we choose *N* = 1,000,000, so 1,000,000 training samples are generated in each epoch. Next, we randomly choose a ratio *p* ∈ [0, 1]. Given *i*<sub>1</sub>, *i*<sub>2</sub>, and *p*, we generate a composite sample by combining two existing images in the dataset, where the ratio *p* determines the relative influence of the two random images on the generated sample:

$$x\_{\text{gen}} = p \cdot \operatorname{dataArr}[i\_1] + (1 - p) \cdot \operatorname{dataArr}[i\_2]$$

where the label of *x*\_*gen* is a "one-hot" vector; i.e., the index containing the '1' (the index of the maximal softmax value) represents the label predicted by the mentor. For example, with *p* = 0.75 the generated image is a 75%/25% blend of the two source images, as in Figure 1a. The dataset is regenerated at every epoch; hence, our composite dataset changes continuously and is dynamic: we obtain the predictions of the mentor model on new images throughout the entire training process (which also reduces overfitting). Note that even though our data-generating mechanism composites two random images (with a random mixture between them), it is also possible to create composites of *k* images, where *k* > 2.

Algorithm 1 provides the complete composite data-generation method, which is run at the beginning of each epoch. Figure 1 illustrates composite data samples created by Algorithm 1.

#### *3.3. Student Model Architecture*

The mentor neural network (which we intend to mimic) is an already-trained model that reaches 90.48% accuracy on the CIFAR-10 test set. Our goal in choosing an architecture for the student is for it to be generic, so that it performs well regardless of the mentor we try to mimic. Thus, with small adaptations to the input and output sizes, we created a modification of the VGG-16 architecture [47] for the student model. In our model, we use two dense layers of size 512 each and a final dense layer of size 10 for the softmax output (whereas the original VGG-16 architecture has two dense layers of size 4096 and a final dense layer of size 1000 for the softmax output). Table 1 presents the architecture of our student model.
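For illustration, the following is a minimal PyTorch sketch of one way to realize this modified head on top of the standard torchvision VGG-16 backbone; the layer sizes follow the description above, while everything else (pooling choice, variable names) is our assumption:

```python
import torch.nn as nn
from torchvision.models import vgg16

# Start from an untrained VGG-16 convolutional backbone [47] and replace the
# original 4096-4096-1000 dense head with the 512-512-10 head described above.
student = vgg16(weights=None)
student.avgpool = nn.AdaptiveAvgPool2d((1, 1))  # 32 x 32 inputs leave 1 x 1 x 512 features
student.classifier = nn.Sequential(
    nn.Linear(512, 512), nn.ReLU(inplace=True),
    nn.Linear(512, 512), nn.ReLU(inplace=True),
    nn.Linear(512, 10),  # softmax is applied by the loss (e.g., cross-entropy)
)
```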

#### **Algorithm 1** Composite Data Generation.
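
A minimal NumPy sketch of this generation step, following the description in Section 3.2, is given below; function and variable names are ours, and `mentor_predict` stands for a forward pass through the pretrained mentor:

```python
import numpy as np

def generate_composite_dataset(data_arr, mentor_predict, n_samples=1_000_000, rng=None):
    """Build the composite training set for one epoch (run at each epoch start).

    data_arr:       down-sampled (32 x 32) ImageNet images; must contain at
                    least n_samples images, since indexes are drawn in [0, N)
    mentor_predict: returns the mentor's softmax vector for a given image
    """
    rng = rng if rng is not None else np.random.default_rng()
    images, labels = [], []
    for _ in range(n_samples):
        i1, i2 = rng.integers(0, n_samples, size=2)  # 0 <= i1, i2 < N
        p = rng.random()                             # random mixing ratio in [0, 1]
        x_gen = p * data_arr[i1] + (1 - p) * data_arr[i2]
        label = np.zeros(10)                         # "one-hot" label from the mentor
        label[np.argmax(mentor_predict(x_gen))] = 1.0
        images.append(x_gen)
        labels.append(label)
    return np.stack(images), np.stack(labels)
```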


**Figure 1.** Illustration of images created using our composite data-generation method. The source images and their relative mixtures are random: (**a**) 75% cat, 25% dog; (**b**) 70% horse, 30% kangaroo; (**c**) 30% horse, 70% ship; (**d**) 40% ship, 60% parrot; (**e**) 50% tiger, 50% dog; (**f**) 20% car, 80% elephant. Using this method, during each epoch we create an entirely new dataset of random data not seen before by the model.
