Prunability of Multi-Layer Perceptrons Trained with the Forward-Forward Algorithm
Abstract
1. Introduction
- Rigorous mathematical definition of three different neural network architectures compatible with the Forward-Forward algorithm;
- The first open-source definition and implementation of the FFRNN architecture;
- FFLib, an open-source library for testing, benchmarking and deploying the Forward-Forward algorithm;
- Sensitivity analysis of FF networks to training hyperparameters in terms of model sparsity and performance;
- Analysis of sparsity dynamics during training;
- Analysis of one-shot pruning robustness of FF networks compared to networks trained with the BP algorithm.
2. Methodology
2.1. Goodness Optimization and Negative Data
Let:
- $x \in \mathbb{R}^n$ be a flattened input image (e.g., an MNIST image [17]);
- $y$ be a control vector (e.g., a one-hot encoding of the class);
- $a \oplus b$ denote the concatenation of two vectors $a$ and $b$;
- $u \sim \mathcal{U}(S)$ denote a uniform random sample from the set $S$;
- $\mathcal{O}_m$ be the set of all possible one-hot encodings of length $m$.
- A batch of positive data $X_{\text{pos}}$, in which the $i$-th element $x_i \oplus y_i$ is a vector made by concatenating the $i$-th image with the one-hot encoding of its correct label;
- A batch of negative data $X_{\text{neg}}$, in which the $i$-th element $x_i \oplus \hat{y}_i$ concatenates the $i$-th image with an incorrect label $\hat{y}_i \sim \mathcal{U}(\mathcal{O}_m \setminus \{y_i\})$ chosen uniformly at random from all labels except the correct one; a minimal construction of both batches is sketched below.
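A minimal PyTorch sketch of this data construction and of the corresponding layer-local goodness loss is given below. It follows the formulation in [4]; the function names are illustrative and are not claimed to match the FFLib [16] API, and the softplus form of the loss with a threshold of 20 is an assumption consistent with the loss threshold listed in the hyperparameter table.

```python
import torch
import torch.nn.functional as F

def make_pos_neg_batches(images, labels, num_classes=10):
    """Concatenate each flattened image with its correct one-hot label (positive data)
    and with a uniformly sampled incorrect one-hot label (negative data)."""
    x = images.flatten(start_dim=1)                                   # (B, n)
    y_pos = F.one_hot(labels, num_classes).float()                    # (B, m)
    # Shift each label by a random offset in {1, ..., m-1} to obtain a wrong class.
    offset = torch.randint(1, num_classes, labels.shape, device=labels.device)
    y_neg = F.one_hot((labels + offset) % num_classes, num_classes).float()
    return torch.cat([x, y_pos], dim=1), torch.cat([x, y_neg], dim=1)

def ff_layer_loss(h_pos, h_neg, theta=20.0):
    """Layer-local FF loss: push the goodness (sum of squared activations) above
    theta for positive inputs and below theta for negative inputs."""
    g_pos = h_pos.pow(2).sum(dim=1)
    g_neg = h_neg.pow(2).sum(dim=1)
    return (F.softplus(theta - g_pos) + F.softplus(g_neg - theta)).mean()
```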
2.2. The Forward-Forward Neural Network
2.3. The Forward-Forward Neural Network + Classifier Layer
2.4. The Forward-Forward Recurrent Neural Network
2.5. Sparsity Metrics
- The $\ell_1$-normalized negative entropy is based on information theory and measures how evenly the absolute magnitudes are distributed;
- The $\ell_2$-normalized negative entropy emphasizes the contribution of larger values;
- The Hoyer sparsity metric combines the $\ell_1$ and $\ell_2$ norms;
- The Gini index captures the inequality in the distribution of component values, based on their rank after they are sorted in ascending order (a minimal implementation of these metrics is sketched after this list).
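The formulas for these metrics are not preserved in this extract. The NumPy sketch below implements the standard definitions from Hurley and Rickard [21]; the interpretation of the two entropy-based measures (normalizing the absolute, respectively squared, magnitudes into a probability distribution before taking the negative Shannon entropy) and any rescaling applied to produce the values reported in the tables are assumptions.

```python
import numpy as np

def l1_neg_entropy(w, eps=1e-12):
    """Negative Shannon entropy of |w_i| / ||w||_1 (less negative = sparser)."""
    p = np.abs(w) / (np.abs(w).sum() + eps)
    return np.sum(p * np.log(p + eps))

def l2_neg_entropy(w, eps=1e-12):
    """Negative Shannon entropy of w_i^2 / ||w||_2^2, emphasizing larger values."""
    p = np.square(w) / (np.square(w).sum() + eps)
    return np.sum(p * np.log(p + eps))

def hoyer(w):
    """Hoyer measure: 0 for a uniform vector, 1 for a one-hot vector."""
    n = w.size
    return (np.sqrt(n) - np.abs(w).sum() / np.linalg.norm(w)) / (np.sqrt(n) - 1)

def gini(w):
    """Gini index of |w|, computed from ranks after sorting in ascending order."""
    a = np.sort(np.abs(w))
    n = a.size
    k = np.arange(1, n + 1)
    return 1.0 - 2.0 * np.sum((a / a.sum()) * (n - k + 0.5) / n)
```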
2.6. Pruning Methodology
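The full pruning procedure is described in this section of the paper. As a concrete point of reference (an illustration, not necessarily the authors' exact procedure), one-shot unstructured magnitude pruning of the linear layers of an MLP can be written with PyTorch's pruning utilities as follows:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def one_shot_magnitude_prune(model: nn.Module, amount: float = 0.5) -> nn.Module:
    """Zero out the `amount` fraction of smallest-magnitude weights in every
    Linear layer in a single pass, with no iterative pruning or fine-tuning."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # bake the pruning mask into the weights
    return model
```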
3. Results
3.1. Sensitivity Analysis of Sparsity
3.2. Structural and Functional Sparsity Analysis
3.3. Inverse Optimization
3.4. Prunability
4. Discussion
5. Conclusions
- Comparing the performance of the architectures with and without fine-tuning after the pruning procedure.
- Researching the possibility of using techniques, such as quantization, knowledge distillation or low-rank factorization, on FF networks to achieve even smaller model sizes.
- How does adding regularization, such as peer normalization loss, affect the sparsity, performance and pruning outcomes?
- How robust are CNNs trained with the FF algorithm under pruning?
- What is the true compressibility of FF networks when the methodology is extended with compression algorithms such as ZIP, LZMA or arithmetic coding?
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Meaning |
---|---|
BP | Backpropagation Algorithm |
BPNN | Backpropagation Neural Network |
CE | Cross-Entropy Loss |
CIFAR | Canadian Institute for Advanced Research |
CNN | Convolutional Neural Network |
CUDA | Compute Unified Device Architecture |
FF | Forward-Forward Algorithm |
FFNN | Forward-Forward Neural Network |
FFNN2 | Forward-Forward Neural Network with 2 Layers |
FFNN3 | Forward-Forward Neural Network with 3 Layers |
FFNN+C | Forward-Forward Neural Network + Classifier |
FFRNN | Forward-Forward Recurrent Neural Network |
GNN | Graph Neural Network |
HSV | Hue Saturation Value |
IoT | Internet of Things |
LZMA | Lempel-Ziv-Markov Chain Algorithm |
MLP | Multi-Layer Perceptron |
MNIST | Modified National Institute of Standards and Technology |
References
- Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536.
- Hochreiter, S. Untersuchungen zu Dynamischen Neuronalen Netzen. Master’s Thesis, Technische Universität, München, Germany, 1991; Volume 91, p. 31.
- Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 1994, 5, 157–166.
- Hinton, G. The Forward-Forward Algorithm: Some Preliminary Investigations. arXiv 2022, arXiv:2212.13345.
- Carreira-Perpinan, M.A.; Hinton, G. On contrastive divergence learning. In International Workshop on Artificial Intelligence and Statistics; PMLR: Bridgetown, Barbados, 2005; pp. 33–40.
- Hinton, G.E.; Sejnowski, T.J. Learning and relearning in Boltzmann machines. Parallel Distrib. Process. Explor. Microstruct. Cogn. 1986, 1, 282–317.
- Tosato, N.; Basile, L.; Ballarin, E.; Alteriis, G.D.; Cazzaniga, A.; Ansuini, A. Emergent representations in networks trained with the Forward-Forward algorithm. Trans. Mach. Learn. Res. 2025. Available online: https://openreview.net/forum?id=JhYbGiFn3Y (accessed on 14 August 2025).
- Miller, J.e.K.; Ayzenshtat, I.; Carrillo-Reid, L.; Yuste, R. Visual stimuli recruit intrinsically generated cortical ensembles. Proc. Natl. Acad. Sci. USA 2014, 111, E4053–E4061.
- Yang, Y. A theory for the sparsity emerged in the Forward Forward algorithm. arXiv 2023, arXiv:2311.05667.
- Gandhi, S.; Gala, R.; Kornberg, J.; Sridhar, A. Extending the Forward Forward Algorithm. arXiv 2023, arXiv:2307.04205.
- Giampaolo, F.; Izzo, S.; Prezioso, E.; Piccialli, F. Investigating Random Variations of the Forward-Forward Algorithm for Training Neural Networks. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN), Gold Coast, Australia, 18–23 June 2023; pp. 18–23.
- Ororbia, A.; Mali, A. The Predictive Forward-Forward Algorithm. arXiv 2023, arXiv:2301.01452.
- Chen, X.; Liu, D.; Laydevant, J.; Grollier, J. Self-Contrastive Forward-Forward Algorithm. Nat. Commun. 2025, 16, 5978.
- Ghader, M.; Reza Kheradpisheh, S.; Farahani, B.; Fazlali, M. Enabling Privacy-Preserving Edge AI: Federated Learning Enhanced with Forward-Forward Algorithm. In Proceedings of the 2024 IEEE International Conference on Omni-layer Intelligent Systems (COINS), London, UK, 29–31 July 2024; pp. 1–7.
- Reyes-Angulo, A.A.; Paheding, S. Forward-Forward Algorithm for Hyperspectral Image Classification. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 16–22 June 2024; pp. 3153–3161.
- Nikov, M. FFLib: Forward-Forward Neural Networks Library. 2025. Available online: https://github.com/mitkonikov/ff (accessed on 14 August 2025).
- Deng, L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process. Mag. 2012, 29, 141–142.
- Householder, A.S. A Theory of Steady-State Activity in Nerve-Fiber Networks: I. Definitions and Preliminary Lemmas. Bull. Math. Biophys. 1941, 3, 63–69.
- Hinton, G. How to represent part-whole hierarchies in a neural network. arXiv 2021, arXiv:2102.12627.
- Huang, Y.; Rao, R.P. Predictive coding. Wiley Interdiscip. Rev. Cogn. Sci. 2011, 2, 580–593.
- Hurley, N.; Rickard, S. Comparing measures of sparsity. IEEE Trans. Inf. Theory 2009, 55, 4723–4741.
- Kuzma, T.; Farkaš, I. Computational analysis of learned representations in deep neural network classifiers. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–8.
- Frankle, J.; Carbin, M. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. arXiv 2019, arXiv:1803.03635.
- Ansel, J.; Yang, E.; He, H.; Gimelshein, N.; Jain, A.; Voznesensky, M.; Bao, B.; Bell, P.; Berard, D.; Burovski, E.; et al. PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS ’24), San Diego, CA, USA, 27 April–1 May 2024.
- Scodellaro, R.; Kulkarni, A.; Alves, F.; Schröter, M. Training Convolutional Neural Networks with the Forward-Forward algorithm. arXiv 2024, arXiv:2312.14924.
- Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv 2017, arXiv:1708.07747.
- Krizhevsky, A.; Nair, V.; Hinton, G. CIFAR-10 and CIFAR-100 (Canadian Institute for Advanced Research). 2009. Available online: http://www.cs.toronto.edu/~kriz/cifar.html (accessed on 14 August 2025).
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015.
- Ankner, Z.; Renda, A.; Dziugaite, G.K.; Frankle, J.; Jin, T. The Effect of Data Dimensionality on Neural Network Prunability. arXiv 2022, arXiv:2212.00291.
Parameter | BPNN | FFNN | FFNN+C | FFRNN |
---|---|---|---|---|
Optimizer | Adam | Adam | Adam | Adam |
Activation | ReLU | ReLU | ReLU | ReLU |
Learning rate | 0.0001 | 0.02 | 0.02 | 0.02 |
Loss threshold | N/A | 20 | 20 | 20 |
Epochs (MNIST) | 60 | 60 | 30+30 | 60 |
Epochs (FashionMNIST) | 60 | 60 | 30+30 | 60 |
Epochs (CIFAR-10) | 120 | 120 | 60+60 | 120 |
Batch size | 128 | 128 | 128 | 128 |
(FFRNN-only parameter) | N/A | N/A | N/A | 10 |
(FFRNN-only parameter) | N/A | N/A | N/A | 3 |
(FFRNN-only parameter) | N/A | N/A | N/A | 8 |
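As a rough guide to how these settings translate into code, the sketch below builds the optimizers implied by the table; the use of one Adam optimizer per FF layer is an assumption (layer-local training is the usual FF setup) and is not confirmed by the table itself.

```python
import torch

def make_optimizers(ff_layers, bp_model=None):
    """Adam with lr = 0.02 per FF layer (layer-local training) and a single
    Adam with lr = 0.0001 for the end-to-end BP baseline, as in the table."""
    ff_opts = [torch.optim.Adam(layer.parameters(), lr=0.02) for layer in ff_layers]
    bp_opt = torch.optim.Adam(bp_model.parameters(), lr=1e-4) if bp_model is not None else None
    return ff_opts, bp_opt
```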
Network | MNIST | FashionMNIST | CIFAR-10 |
---|---|---|---|
BPNN | 98.31 ± 0.08 | 89.77 ± 0.27 | 56.05 ± 1.11 |
FFNN | 98.18 ± 0.08 | 87.84 ± 0.10 | 51.80 ± 0.52 |
FFNN+C | 97.95 ± 0.10 | 87.09 ± 0.32 | 50.81 ± 0.47 |
FFRNN | 98.28 ± 0.10 | 88.69 ± 0.19 | 43.83 ± 0.77 |
Metric | Network | Layer | MNIST | FashionMNIST | CIFAR-10 |
---|---|---|---|---|---|
HOYER | BPNN | 1 | 0.182 ± 0.001 | 0.202 ± 0.000 | 0.238 ± 0.001 |
HOYER | BPNN | 2 | 0.221 ± 0.001 | 0.251 ± 0.001 | 0.278 ± 0.002 |
HOYER | BPNN | 3 | 0.185 ± 0.002 | 0.263 ± 0.003 | 0.253 ± 0.003 |
HOYER | FFNN | 1 | 0.494 ± 0.001 | 0.547 ± 0.002 | 0.697 ± 0.003 |
HOYER | FFNN | 2 | 0.442 ± 0.004 | 0.464 ± 0.003 | 0.559 ± 0.012 |
L1 | BPNN | 1 | 0.017 ± 0.000 | 0.019 ± 0.000 | 0.020 ± 0.000 |
L1 | BPNN | 2 | 0.019 ± 0.000 | 0.023 ± 0.000 | 0.025 ± 0.000 |
L1 | BPNN | 3 | 0.025 ± 0.000 | 0.037 ± 0.001 | 0.036 ± 0.001 |
L1 | FFNN | 1 | 0.046 ± 0.000 | 0.046 ± 0.000 | 0.053 ± 0.001 |
L1 | FFNN | 2 | 0.055 ± 0.001 | 0.059 ± 0.001 | 0.080 ± 0.002 |
L2 | BPNN | 1 | 0.523 ± 0.000 | 0.526 ± 0.000 | 0.529 ± 0.000 |
L2 | BPNN | 2 | 0.527 ± 0.000 | 0.532 ± 0.000 | 0.537 ± 0.000 |
L2 | BPNN | 3 | 0.533 ± 0.000 | 0.551 ± 0.001 | 0.547 ± 0.001 |
L2 | FFNN | 1 | 0.600 ± 0.000 | 0.626 ± 0.000 | 0.682 ± 0.001 |
L2 | FFNN | 2 | 0.555 ± 0.001 | 0.560 ± 0.001 | 0.577 ± 0.003 |
GINI | BPNN | 1 | 0.387 ± 0.001 | 0.404 ± 0.000 | 0.440 ± 0.001 |
GINI | BPNN | 2 | 0.425 ± 0.001 | 0.451 ± 0.001 | 0.473 ± 0.001 |
GINI | BPNN | 3 | 0.390 ± 0.002 | 0.468 ± 0.003 | 0.467 ± 0.004 |
GINI | FFNN | 1 | 0.578 ± 0.001 | 0.558 ± 0.002 | 0.603 ± 0.005 |
GINI | FFNN | 2 | 0.677 ± 0.003 | 0.693 ± 0.004 | 0.775 ± 0.008 |
Network Type | Loss | Layer | Min | Max | Mean | Std |
---|---|---|---|---|---|---|
BPNN | CE | Layer 1 | −0.14 | 0.10 | −0.0005 | 0.0249 |
BPNN | CE | Layer 2 | −0.12 | 0.13 | 0.0026 | 0.0207 |
BPNN | CE | Layer 3 | −0.13 | 0.10 | −0.0059 | 0.0352 |
FFNN2 | | Layer 1 | −176.08 | 103.74 | −0.97 | 8.06 |
FFNN2 | | Layer 2 | −62.85 | 20.29 | −0.77 | 4.14 |
FFNN2 | | Layer 1 | −215.04 | 91.84 | −0.49 | 7.21 |
FFNN2 | | Layer 2 | −41.44 | 15.28 | −0.44 | 3.11 |
FFNN3 | | Layer 1 | −176.18 | 106.26 | −0.97 | 8.06 |
FFNN3 | | Layer 2 | −64.33 | 19.11 | −0.75 | 4.15 |
FFNN3 | | Layer 3 | −79.26 | 23.53 | −0.49 | 3.68 |
FFNN3 | | Layer 1 | −218.23 | 92.61 | −0.49 | 7.20 |
FFNN3 | | Layer 2 | −40.78 | 16.68 | −0.43 | 3.11 |
FFNN3 | | Layer 3 | −43.37 | 19.47 | −0.12 | 2.50 |
FFNN+C | | Layer 1 | −122.65 | 77.63 | −0.61 | 5.10 |
FFNN+C | | Layer 2 | −44.54 | 15.38 | −0.37 | 2.33 |
FFNN+C | | Layer 1 | −113.77 | 73.05 | −0.26 | 4.65 |
FFNN+C | | Layer 2 | −20.74 | 12.11 | −0.22 | 1.75 |
FFRNN | | Layer 1 | −201.94 | 93.29 | −1.37 | 10.34 |
FFRNN | | Layer 2 | −149.80 | 69.57 | 1.98 | 10.01 |
FFRNN | | Layer 1 | −169.16 | 73.14 | −1.31 | 8.18 |
FFRNN | | Layer 2 | −109.88 | 61.73 | 2.08 | 7.84 |
Network | ± | ± | | | p-Value |
---|---|---|---|---|---|
FFNN2 | 98.15 ± 0.10 | 98.18 ± 0.08 | 98.33 | 98.36 | 4.92 × 10⁻¹ |
FFNN+C | 97.96 ± 0.14 | 97.95 ± 0.10 | 98.17 | 98.13 | 8.60 × 10⁻¹ |
FFRNN | 97.80 ± 0.12 | 98.28 ± 0.10 | 97.94 | 98.44 | 1.34 × 10⁻⁷ |
FFNN3 | 98.01 ± 0.09 | 98.22 ± 0.08 | 98.16 | 98.33 | 1.86 × 10⁻⁵ |
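The statistical test used to obtain the p-values in this table is not preserved in this extract. Assuming the two accuracy columns correspond to per-run samples from two training configurations, a Welch two-sample t-test in SciPy would be computed as below; the accuracy values and the choice of test are illustrative assumptions only.

```python
from scipy import stats

# Hypothetical per-run test accuracies (%) for two FFRNN training configurations,
# chosen only to be roughly consistent with the reported means and deviations.
acc_config_a = [97.80, 97.68, 97.92, 97.85, 97.74]
acc_config_b = [98.28, 98.17, 98.41, 98.30, 98.22]

# Welch's t-test does not assume equal variances; a small p-value indicates that
# the difference in mean accuracy is unlikely to arise by chance.
t_stat, p_value = stats.ttest_ind(acc_config_a, acc_config_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3g}")
```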
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).