1. Introduction
In the contemporary data-driven world, almost any organization has access to data, but how well it converts that data into insights dictates its market position. To gain a competitive edge over rivals, efficient transactional systems alone are not enough; it has become essential to analyze historical data promptly and propose the actions the business should take. Previous research has highlighted the significance of data-driven decisions and their potential to realize value compared with decisions based solely on opinions [1]. Other research has pointed out that understanding the data is key, and data science is the discipline that helps expand knowledge about available data and generate insights from them. Furthermore, that research explained that data science helps organizations attain data-driven decisions [2].
Data science is an umbrella term for various in-depth studies that can be classified into three major areas: data analytics, data mining, and machine learning (ML) [3]. The data science paradigm encompasses discrete roles and responsibilities, such as data engineers, data analysts, and data scientists, as well as external dependencies such as product owners, project sponsors, business analysts, IT managers, and C-level executives. A person in a specific role does not need to be an expert in other trades or have a cross-functional skillset. However, it would be advantageous if a person who understands the business requirements and objectives had access to a platform that allowed them to perform the basic operations of data-science-related roles without prior coding, analytics, or data engineering experience.
On the other hand, Alsharef et al. emphasized that developing an ML model necessitates domain expertise and advanced ML programming skills [4]. They highlighted the difficulties in finding trained ML experts in the market; hence, automatic ML is seen as an asset that bridges the gap between data-science use cases and the lack of appropriate ML resources [4]. This is where no-code/low-code ML platforms come into play. Before getting there, we first need to understand the evolution of ML and how we arrived at advanced ML platforms with low or no code. Both low-code and no-code approaches aid in the rapid development of ML models, the automation of data pipelines, and the visualization of findings. However, they differ greatly in their target audience. With the low-code approach, developers can leverage existing building blocks and libraries while retaining the flexibility to customize tasks as required. Conversely, no-code is primarily intended for domain experts with minimal to no prior software development knowledge [5]. With the no-code approach, users can execute the desired task via drag-and-drop functionality, with minimal to no flexibility to customize. We can categorize cloud-native and cloud-agnostic ML platforms as low-code platforms since they allow us to build custom ML models by writing code in platform-native notebooks.
When it comes to developing ML models using an AutoML service specifically, we categorize it as no-code, since the ML platform is expected to conduct all tasks in the ML lifecycle automatically with only a few initial inputs from its users. Low-code ML platforms can be used by different personas, including data scientists and ML developers. In addition, no-code AutoML services can be used by people with strong business or data domain knowledge, such as data engineers, data analysts, business analysts, or product owners. We can even form a cross-functional team comprising all the aforementioned personas to create ML models using AutoML services; in the long run, this would yield multiple benefits in terms of saving time and money. Research has supported the importance of emerging low-code cloud data platforms and their vital role in the speed of digitalization [6].
In this research, we examine similarities, differences, advantages, and limitations in leveraging some of the cloud-based low/no-code ML platforms. The Gartner Magic Quadrant published in 2020 for cloud-native AI developer service providers listed Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) as the top three leaders [7]. Hence, we chose these cloud-based platforms and their cloud-native ML services for further research. For cloud-agnostic ML platforms, we chose Databricks for further investigation, as it is an enterprise-scale open-source unified data engineering, ML, and AI platform that is already integrated with all three cloud-native ML platforms listed [8]. These ML platforms can handle the entire ML lifecycle. In the following sections, we highlight our findings from developing ML models without writing a single line of code using the ML services available on the above-mentioned cloud platforms.
2. Problematization
Previous research has demonstrated the dire need for a methodology that automatically selects the optimum ML model and tunes the hyperparameters to improve ML model performance [9]. Building an ML model is quite a laborious task. ML developers must try out multiple algorithms and tweak hyperparameters constantly to derive the best ML model for solving a given business problem. This requires not only a thorough understanding of developing ML models but also time-consuming and computationally intensive data processing. Luo also emphasized the skillsets required for building state-of-the-art ML models manually, i.e., by a human attendant. Even with higher competencies, we cannot reduce the time spent refining the model and its hyperparameters to derive the best results [9]. Because trying out different experiments to find the right model requires substantial computation and time, we may end up ramping up the computational resources as needed. The table below lists some common ML algorithms and their corresponding hyperparameters.
From Table 1, we can see that multiple hyperparameters are connected to each ML algorithm. For decision trees, the max_depth parameter defines how deep the tree may grow; once this depth is reached, nodes are not split any further. The min_impurity_split parameter sets an impurity threshold: a node is split only while its impurity is above this value. The min_samples_leaf parameter defines the minimum number of samples required to form a leaf node, and the max_leaf_node parameter defines the maximum number of leaf nodes the tree can have [9].
For random forest, n_estimators defines the number of decision trees to be generated, and max_features defines the maximum number of features considered for each split.
For the support vector machine, the kernel defines how the input data are represented, the penalty value is a regularization constant, and the tol parameter defines the stopping criterion: training stops when no significant improvement is observed over two consecutive iterations [3].
For k-nearest neighbor, n_neighbors defines how many neighbors are considered, and the metric parameter defines the distance metric, for example, Euclidean distance.
For Naïve Bayes, the kernel density estimator defines the kind of data distribution to be considered, and the window width controls the smoothing of the kernel.
For stochastic gradient boosting, learning_rate defines how fast the ML model learns the pattern of the given data distribution; n_estimators defines the number of boosting stages (trees); subsample defines the fraction of data considered for each stage; and max_depth defines the maximum depth of each tree [9].
For a neural network, we must find the ideal number of hidden layers, the number of nodes in each hidden layer, the activation function, the number of epochs (the maximum number of training iterations), and, finally, the learning rate [9].
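To make the manual effort concrete, the following is a minimal sketch of such a manual search using scikit-learn; the dataset is synthetic and the parameter ranges are illustrative assumptions rather than values taken from Table 1. Every additional hyperparameter multiplies the number of model fits, which is exactly the burden AutoML aims to remove.

```python
# Minimal sketch (assumed parameter ranges) of the manual tuning burden described above.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=10, random_state=42)

# Each algorithm brings its own grid of hyperparameters to explore.
search_spaces = {
    RandomForestRegressor(random_state=42): {
        "n_estimators": [100, 300],
        "max_features": ["sqrt", 1.0],
        "max_depth": [5, 10, None],
    },
    GradientBoostingRegressor(random_state=42): {
        "learning_rate": [0.01, 0.1],
        "n_estimators": [100, 300],
        "subsample": [0.5, 1.0],
        "max_depth": [3, 5],
    },
}

# Every combination is trained and cross-validated, so cost grows multiplicatively.
for estimator, grid in search_spaces.items():
    search = GridSearchCV(estimator, grid, cv=3, scoring="neg_mean_squared_error")
    search.fit(X, y)
    print(type(estimator).__name__, search.best_params_, -search.best_score_)
```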
Previous research on tuning hyperparameters for deep learning models by applying different optimization techniques underscores the complexity and expertise needed to transfer previous learning to every new iteration of testing ML model performance [10,11]. Other research highlighted that tuning hyperparameters is time-consuming [12]. That work also supports the notion that finding the optimal hyperparameter values for an ML model requires multiple iterations of testing. Although it does not cover all parameters, Table 1 shows how complex it would be to select the best algorithm for a business case by trying out different ML and ensemble models. This requires in-depth knowledge to address questions such as: Which type of ML algorithm should be used? How should the hyperparameters be configured? How should the model be evaluated? How is the best model selected? How is the model deployed to a given endpoint? This list is not comprehensive; it can grow depending on the business case being addressed. Another important aspect is how fast these questions can be answered, because time is an important factor for market competitiveness. Additionally, training and testing multiple models requires scaling the computational resources. This is where cloud-based ML platforms come into the picture, as they can address all these questions easily and in less time. Another advantage of a cloud-based ML platform is that we are not required to be masters of all trades; with only basic knowledge of the data and the business use case, we can still develop a solid ML model using the automatic ML features offered by different cloud vendors. Lastly, while the model is being trained, resources are scaled automatically in real time based on the requirements. As highlighted by Bahri et al., the automatic ML service helps in choosing the best ML model and tuning hyperparameters through multiple iterations of testing with different combinations of values [12].
5. Cloud-Based ML Platforms
Regarding cloud-based ML platforms, the three cloud vendors (AWS, GCP, and MS Azure) provide different services to address the use case of building an end-to-end ML model lifecycle without writing a single line of code and with minimal input. This helps business stakeholders who do not possess prior programming knowledge to develop ML models easily.
In Figure 1, we have depicted the ML architecture based on the Azure ecosystem. Azure Data Lake Storage Generation 2 can act as the central data store for structured, semi-structured, and unstructured data. We can also store all types of data in Azure Blob Storage; however, if we intend to use Azure Synapse, then using Azure Data Lake is a prerequisite. For data transformation requirements, Azure offers the Synapse service, which acts as a lakehouse. This means it can store data in conjunction with Azure Data Lake while still allowing queries with Transact-SQL (T-SQL) over the metadata of the stored data. Azure Synapse also supports atomic, consistent, isolated, and durable (ACID) transactions. Synapse additionally offers features such as ingesting data from different source connectors into Azure using linked services; similarly, we can transform the data using operation connectors in a pipeline. Synapse can invoke the Azure ML service for building manual or automatic ML models. Azure offers an ML service for building models using a pre-built model, through notebooks, or via the AutoML option. When the best model is built and evaluated, it is ready to be deployed to an endpoint. This is where the Azure Container Registry service comes into play; it containerizes the ML model and saves the container image, which can then be deployed using Azure's orchestration service, Azure Kubernetes Service. Once the model is built and deployed on the endpoint, it is continuously observed by Azure Monitor. When model performance degrades, for example due to a significant change in the underlying training dataset, retraining of the ML model is triggered automatically. If the retrained model still has a low performance score, it is time to build a new or ensemble model. Users registered with Azure Active Directory log in to the Azure portal once and are not prompted again when accessing any other service to which they have access, until they log out of the portal or their session times out after being idle for a long time. Furthermore, all sensitive assets, such as passwords, authentication tokens, or access keys, can be stored in Azure Key Vault; only users with access to the key vault can access the secrets stored inside it [13,14,15].
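As an illustration of how the AutoML step in this architecture can be driven programmatically, the following is a hedged sketch using the Azure ML Python SDK (v1); the workspace configuration, compute cluster, dataset name, and experiment name are assumptions for illustration only.

```python
# Hedged sketch: submitting an AutoML regression run via the Azure ML SDK v1.
# Assumes an existing workspace config and a registered tabular dataset named "housing".
from azureml.core import Workspace, Dataset, Experiment
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()                       # reads config.json for the workspace
training_data = Dataset.get_by_name(ws, name="housing")

automl_config = AutoMLConfig(
    task="regression",                             # problem type
    training_data=training_data,
    label_column_name="price",                     # target feature
    primary_metric="normalized_root_mean_squared_error",
    experiment_timeout_hours=1,                    # cap total runtime
    compute_target="cpu-cluster",                  # assumed existing compute cluster
)

experiment = Experiment(ws, "housing-automl")      # illustrative experiment name
run = experiment.submit(automl_config, show_output=True)
best_run, fitted_model = run.get_output()          # best model chosen by AutoML
```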
In Figure 2, we have depicted the ML platform architecture in the AWS ecosystem. AWS supports different types of data from heterogeneous sources. Data should first be uploaded into an Amazon S3 bucket or an Amazon EC2 instance. Once the data are within AWS, they can be transformed using Amazon SageMaker Studio. AWS has developed SageMaker as a unified ML platform that can handle the end-to-end ML lifecycle. If the requirement is to generate ML models automatically, we can make use of SageMaker's AutoPilot service. AutoPilot takes care of training the models, evaluating them, and choosing the best model based on the evaluation metric score. When the best model is identified and tested, we can register it in the Amazon Elastic Container Registry; we then have a container image of our model that can be deployed to any endpoint. The Amazon CloudWatch service is used to monitor AWS services. The AWS single sign-on service authenticates users to the AWS portal; once a user signs in to the portal, they are not prompted to log in again when they access any of the AWS services to which they have access. The AWS IAM service takes care of granting the required privileges on resources to a role or a user [14,16,17]. Researchers [18] have explained the two phases of an AutoPilot job as candidate generation and candidate exploration. The candidate generation phase is responsible for splitting the dataset into training, test, and validation sets, exploring the data distribution, and performing the necessary pre-processing. The candidate exploration phase is responsible for tuning different hyperparameters and finding the right values based on model performance metrics.
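To illustrate the two phases in practice, the following is a hedged sketch of launching an AutoPilot job and inspecting its candidates with the SageMaker Python SDK; the S3 path, IAM role, and job name are illustrative assumptions.

```python
# Hedged sketch: launching a SageMaker AutoPilot job from the SageMaker Python SDK
# and inspecting its candidates. Bucket, role, and job names are illustrative.
import sagemaker
from sagemaker.automl.automl import AutoML

session = sagemaker.Session()
automl = AutoML(
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # assumed execution role
    target_attribute_name="price",                         # target feature
    max_candidates=20,                                      # cap candidate generation
    sagemaker_session=session,
)

# Candidate generation and candidate exploration both happen inside fit().
automl.fit(inputs="s3://my-bucket/housing/train.csv", job_name="housing-autopilot")

# Inspect the explored candidates and the best one found.
for candidate in automl.list_candidates():
    print(candidate["CandidateName"],
          candidate["FinalAutoMLJobObjectiveMetric"]["Value"])
best = automl.best_candidate()
predictor = automl.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```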
In Figure 3, we have depicted the ML platform architecture based on the GCP ecosystem. Like Azure and AWS, GCP supports all types of data. The prerequisite for generating an ML model is that we upload the data to Google Cloud Storage and create a dataset. Google has developed Vertex AI as its unified ML platform. We can perform all data transformation tasks from Vertex AI by creating data pipelines with the help of native data-engineering task templates. For generating ML models automatically, we can make use of the Google AutoML service, which trains, evaluates, and chooses the best model automatically without writing a single line of code. When the model is ready, we can register it with the Google Container Registry. The container image can then be deployed to an endpoint using the orchestration service, Google Kubernetes Engine. The Google Cloud monitoring service monitors all Google resources and triggers auto-healing when required. For identity and access management, we can use the Cloud IAM service, and for storing secrets, we can use HashiCorp Vault integrated with GCP [14,16,17,19]. With AutoML in GCP, image data can belong to any of the following categories: single-label classification, multi-label classification, object detection, and segmentation. With tabular data, we can choose between regression, classification, or forecasting. For natural language processing (NLP) business cases and text data, we can choose between single-label classification, multi-label classification, entity extraction, and sentiment analysis. For video-related data, we can choose between action recognition, classification, and object tracking.
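As a brief illustration of this flow, the following is a hedged sketch using the Vertex AI Python SDK (google-cloud-aiplatform); the project, bucket, and display names are illustrative assumptions, and other data modalities use sibling training-job classes.

```python
# Hedged sketch: creating a tabular dataset and an AutoML training job in Vertex AI.
# Project, region, bucket, and display names are illustrative assumptions.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Step 1: register the uploaded CSV in Cloud Storage as a Vertex AI dataset.
dataset = aiplatform.TabularDataset.create(
    display_name="housing",
    gcs_source=["gs://my-bucket/housing.csv"],
)

# Step 2: define an AutoML training job; other modalities use sibling classes such as
# AutoMLImageTrainingJob, AutoMLTextTrainingJob, and AutoMLVideoTrainingJob.
job = aiplatform.AutoMLTabularTrainingJob(
    display_name="housing-automl",
    optimization_prediction_type="regression",
    optimization_objective="minimize-rmse",
)
```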
All the cloud-based ML platforms provide respective cloud-native security, monitoring, and deployment solutions. In principle, they all support identity and access management for granting role-based access and have vault services for storing the secrets, access keys, and certificates.
In Table 3, we present a summary comparison of the three platforms (AWS, GCP, and MS Azure).
7. Experimental Results
We studied three cloud ML platforms in this research, and all our further findings are connected to these cloud platforms. In the following sections, we present our findings from each cloud platform and compare the results descriptively. One thing to keep in mind is that our research focuses mainly on generating automatic ML (AutoML) models without writing a single line of code, or with low-level code. Even though it is possible to use other data formats, we chose tabular data for further analysis for simplicity and practical reasons.
7.2. Low-Code ML Platform Based on AWS
Amazon has published an open-source AutoML library called AutoGluon, through which developers can create ML models for tabular, text, and image data with just a few lines of code. In addition, Amazon has developed an in-house fully managed ML service called Amazon SageMaker.
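As a brief illustration of the low-code route, the following is a hedged sketch of AutoGluon on tabular data; the file names and target column are illustrative assumptions.

```python
# Hedged sketch: fitting an AutoGluon model on a tabular dataset with a few lines of code.
# The file paths and target column are illustrative.
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset("housing_train.csv")          # any pandas-compatible table works
predictor = TabularPredictor(label="price").fit(train_data, time_limit=600)

test_data = TabularDataset("housing_test.csv")
leaderboard = predictor.leaderboard(test_data)             # compare the auto-generated models
predictions = predictor.predict(test_data.drop(columns=["price"]))
```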
In our case, we used the SageMaker AutoPilot service to create the automatic ML model. The prerequisite for using the AutoPilot service is to upload the data to an Amazon S3 bucket so that AutoPilot can consume the data for further analysis. In SageMaker terminology, creating a new ML project is called an experiment. We first uploaded our dataset to the Amazon S3 bucket. Then, we created a new experiment, where we provided the experiment name, chose the S3 bucket name, and picked our dataset from the list. We then provided the target feature name; in our case, we chose the feature price. It is then possible to deploy the best model automatically to the desired endpoint by enabling the auto-deployment feature and specifying the endpoint name or leaving the default name. It is also possible to provide the output directory name, which must be present in the S3 bucket, where all the AutoPilot output logs will be stored. The above-mentioned options are the basic settings and are enough to create an AutoML model. However, we can constrain the ML generation behavior by tweaking the advanced settings. We can define the following vital options as part of the advanced settings (a hedged SDK sketch follows the list):
ML problem type: we can choose among auto, binary classification, multi-class classification, and regression.
Experiment run type: we can choose between executing the whole experiment or copying the generated code into a notebook and executing the commands cell by cell.
Runtime: we can define how long the experiment can execute, the maximum number of models it can generate, and the maximum time it can spend generating each model.
Access: we can restrict the experiment to a specific IAM role.
Encryption: we can enable encryption for data at the S3 bucket level.
Security: we can use a virtual private cloud (VPC) connection if a highly secure private connection is desired.
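As referenced above, the following is a hedged sketch of how these advanced settings map onto the SageMaker Python SDK's AutoML class; the values and resource identifiers are illustrative assumptions, not the settings used in our experiment.

```python
# Hedged sketch: how the advanced settings listed above map onto the SageMaker SDK's
# AutoML class. Values and resource names are illustrative assumptions.
from sagemaker.automl.automl import AutoML

automl = AutoML(
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # Access: IAM role used by the job
    target_attribute_name="price",
    problem_type="Regression",                              # ML problem type
    job_objective={"MetricName": "MSE"},                    # metric used to rank candidates
    generate_candidate_definitions_only=False,              # Experiment run type: full run vs. notebooks only
    max_candidates=50,                                      # Runtime: max models to generate
    max_runtime_per_training_job_in_seconds=3600,           # Runtime: cap per model
    total_job_runtime_in_seconds=2 * 3600,                  # Runtime: cap for the experiment
    output_path="s3://my-bucket/autopilot-output/",         # where logs and artifacts are written
    output_kms_key="alias/my-s3-key",                       # Encryption at the S3 level
    vpc_config={"SecurityGroupIds": ["sg-0123"],            # Security: private VPC connection
                "Subnets": ["subnet-0123"]},
)
```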
When we initiate a new experiment through AutoPilot, it automatically takes care of the following tasks: pre-processing, candidate definition generation, feature engineering, model training, explainability report generation, insights report generation, and the option to deploy the model to the desired endpoint.
The AutoPilot job generated different ML models and chose the best model with the least mean squared error (MSE). This model is built on the XGBoost algorithm. It took about two hours to generate all the models and choose the best model from the pool. The best model was automatically deployed to the endpoint specified. When we navigated to our model, it gave us richer information to understand the output results. It provided details on the explainability of the model, performance metrics, artifacts, and endpoints. The AutoPilot job also generates the feature importance based on the best model.
We noticed that distance, type of property, and number of rooms are considered the most important features of this model. As part of the automatic model build, AutoPilot automatically tested and tweaked hyperparameters to generate the best model.
There exists a list of artifacts generated from the AutoPilot job, which includes the input dataset, split of the training and validation sets, preprocessed training and validation sets, Python code for the feature engineering task, zipped folders consisting of all the feature engineering models, ML algorithm models, and other explainability artifacts. All the output data are stored inside the directory name that we specified earlier during experiment creation.
7.3. Low-Code ML Platform Based on GCP
We used GCP's Vertex AI to generate the AutoML model. It is a prerequisite that the dataset resides within Google Cloud for the models to consume; hence, the first step is to upload the dataset from the local machine to Google Cloud. Creating a dataset in Vertex AI is mandatory if we want to create a new model for it. While creating the dataset, Vertex AI can fetch the data from the local machine, Google Cloud Storage, or BigQuery. In any case, it will create a dedicated directory within Google Cloud Storage to store the dataset.
After creating the dataset, we can start training the model. While creating a training model, we must first choose the dataset that has been uploaded to Google Cloud Storage. Based on the type of dataset, we are given the option to choose the objective of the business problem. In our case, as we have tabular data, we are presented with regression and classification options, and we chose regression. We also have the option to choose whether the model should be created automatically without human intervention or whether a custom model should be trained using pre-built containers for the TensorFlow, Scikit-Learn, or XGBoost frameworks.
In the next step, we must provide a name for the new model, and we also have the option to either create a new model or retrain an existing model. Then, we should choose the target field; in our case, we chose the price feature. When it comes to splitting the data, Vertex AI provides us with three different ways to split the data. The first option is to choose the data for training, testing, and validation at random; the second option is to choose them manually; and the third option is to choose the data in chronological order: the first 80% would be assigned to training; the next 10% would be assigned to validation; and the last 10% would be assigned to the test set.
In the next step, we can define different training options, such as changing the data type of a feature that is auto-detected or excluding a feature from further analysis.
We also have the option of assigning a weight to each row of the dataset via a weight column; if none is specified, AutoML assigns equal weight to every row by default. Then, we can optimize the training model based on RMSE, MAE, or RMSLE. RMSE can be chosen if we intend to give high importance to extreme values; MAE can be chosen if extreme values should be treated as outliers with less impact; and RMSLE can be chosen if we intend to penalize errors on relative rather than absolute differences.
As the last step, we have the option to choose the maximum number of node hours for training the model. The minimum that can be chosen is one hour, and we can choose a higher value based on our requirements. Based on this value, the model is allowed to train by autoscaling the required computing resources. With this, we can train a new model or retrain an existing one. With the four steps mentioned above, we can create a new model and train it without writing a single line of code. Model training executes until the specified node-hour budget is exhausted and is then stopped automatically; no intervention is required.
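For illustration, the following is a hedged sketch of launching such a training run with the Vertex AI Python SDK, reusing the dataset and job objects from the earlier sketch; the budget, split fractions, and display name are illustrative assumptions.

```python
# Hedged sketch: launching the AutoML training run described in the steps above.
# Dataset/job objects follow from the earlier sketch; budget and splits are illustrative.
model = job.run(
    dataset=dataset,
    target_column="price",                 # target feature
    training_fraction_split=0.8,           # 80/10/10 split fractions
    validation_fraction_split=0.1,
    test_fraction_split=0.1,
    budget_milli_node_hours=1000,          # 1 node hour (minimum budget)
    model_display_name="housing-automl-model",
    disable_early_stopping=False,          # stop early if no further improvement
)
```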
Once the AutoML job has generated new ML models, we can see additional details such as when model training started and how long it was allowed to execute; the region in which compute resources were allocated for training; the type of encryption key; the dataset and data-split details; whether the model was trained with custom code or AutoML; and, finally, the type of problem being addressed, in our case a regression problem.
The trained model also generated the feature importance matrix. As per feature importance, we noticed that region name, land size, distance, and type of property are considered the most important features in deciding the price of the property.
We have the option of exporting our model as a TensorFlow SavedModel packaged in a Docker container. By packaging the model as a container, we can deploy it elsewhere promptly. We can also deploy our model directly to any desired endpoint. When the model is deployed to an endpoint, we can test it from the respective endpoint page without having to write test code or create test strategies and test cases. We also have the option of performing predictions in batches and storing the results in a specified Cloud Storage directory.
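As a brief illustration, the following is a hedged sketch of online deployment and batch prediction with the Vertex AI Python SDK; the machine type, feature values, and storage paths are illustrative assumptions.

```python
# Hedged sketch: deploying the trained AutoML model to an endpoint and running a
# batch prediction job; machine type, paths, and sample values are illustrative.
endpoint = model.deploy(
    deployed_model_display_name="housing-automl-endpoint",
    machine_type="n1-standard-4",
)
print(endpoint.predict(instances=[{"distance": "2.5", "rooms": "3", "type": "house"}]))

batch_job = model.batch_predict(
    job_display_name="housing-batch-predictions",
    gcs_source="gs://my-bucket/housing_to_score.csv",
    gcs_destination_prefix="gs://my-bucket/predictions/",
    instances_format="csv",
    machine_type="n1-standard-4",
)
```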
7.5. Low-Code ML Platform Based on Databricks
Databricks is cloud-agnostic, as we have the freedom to choose the data residency of our choice. Data can be stored and hosted on any of the cloud-service-provider ecosystems (AWS, GCP, or Azure). Before trying to ingest the data, it is a prerequisite to mount the storage on one of the cloud platforms and create a Databricks cluster for computing. When the prerequisites are met, we can easily ingest the data by either providing the dataset's path in the filestore or dragging and dropping the file from the local system. Once the dataset is uploaded to Databricks, we can perform different actions with it. For example, we can create an AutoML job with the given dataset to create an ML model, or we can create a table from the dataset and explore the data by executing a Spark or SQL query against the respective table. In our case, we chose an AWS S3 bucket as the data storage area for our Databricks AutoML experiments.
Configuring the AutoML experiment is smooth with Databricks; we must provision a Databricks cluster for computing, choose the type of problem, choose the dataset, provide the target class (in our case, price), and name the experiment to keep track of it. Databricks takes care of imputing missing data if we leave the default Auto option. Apart from the basic details, we can choose the evaluation metric: MSE, RMSE, MAE, or R-squared. In our case, we chose R-squared as the evaluation metric. We can then choose between three different training frameworks recommended by Databricks: LightGBM, Scikit-learn, and XGBoost; we chose all three for better comparison. We can set the timeout period for how long the experiment should run. It is also a good idea to provide a time feature from the dataset, which should be of the date/time data type; in our case, we chose the date feature. Databricks uses the time feature to split the data into training, test, and validation sets. Finally, we can provide the data storage location for storing the experiment results in the persistent storage area.
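For comparison with the user-interface flow, the following is a hedged sketch of launching the same experiment from a Databricks notebook with the Databricks AutoML Python API; the table name and parameter values are illustrative assumptions.

```python
# Hedged sketch: launching the experiment with the Databricks AutoML Python API
# from a notebook; table name and parameter values are illustrative.
from databricks import automl

df = spark.table("housing")               # dataset previously ingested into Databricks

summary = automl.regress(
    dataset=df,
    target_col="price",                   # target class
    time_col="date",                      # used to split into train/validation/test
    primary_metric="r2",                  # evaluation metric (R-squared)
    timeout_minutes=60,                   # experiment timeout
)
print(summary.best_trial.model_path)      # MLflow path of the best model
```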
When the AutoML job ends, we can view the Python code generated by the Databricks AutoML job for each model in depth, either by opening the notebook for the respective model or by viewing the data exploration notebook to understand the exploratory actions performed by the AutoML job. Models are sorted by test_r2 score in descending order, from the best-performing model to the worst.
If we want more details about a particular auto-generated ML model, we can simply click on its hyperlink; it takes us to a separate page with end-to-end details about the model description, the parameters provided, the evaluation metrics considered, and all the artifacts generated during model creation. It is also possible to register the model from this page so that it can be exposed to the outside world as a REST API endpoint. We noticed that 124 hyperparameters were set by the AutoML job; these parameters are tuned by testing different values across the generated models and validating the results against the evaluation metric. In our case, the AutoML job generated more than 100 models within 60 min. Moreover, for each model, the evaluation metrics and the artifacts required for deployment and inference were accessible both from the Databricks user interface and from the AWS S3 bucket.
We noticed that all the Databricks-related artifacts are available in the persistent Amazon S3 bucket storage. To deploy the model to an endpoint or containerize it, we can use the artifacts generated by the AutoML job: the model file in conjunction with the dependent pickle file to deploy the model to any endpoint, the conda file to install the necessary libraries, and the Python environment requirements file to install the necessary Python libraries. Detailed model inference steps can be found by opening the model notebook, which walks through importing the required libraries, ingesting the data, pre-processing, splitting the data into training, test, and validation sets, training the model, generating feature importance, and evaluating the model against different performance metrics.
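As a brief illustration, the following is a hedged sketch of registering and loading the best model via MLflow from the generated artifacts; the registered model name and the feature columns in the sample data frame are illustrative assumptions.

```python
# Hedged sketch: registering the best AutoML model with the MLflow Model Registry and
# loading it for inference from the generated artifacts; names are illustrative.
import mlflow
import pandas as pd

model_uri = summary.best_trial.model_path                     # from the AutoML summary
mlflow.register_model(model_uri, name="housing-price-model")  # expose via the Model Registry

loaded = mlflow.pyfunc.load_model(model_uri)                  # uses the conda/requirements files
new_listings = pd.DataFrame([{"distance": 2.5, "rooms": 3, "landsize": 400}])
print(loaded.predict(new_listings))
```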