The Cloud Native Computing Foundation (CNCF) recently published a white paper on Cloud Native Artificial Intelligence. It broadly discusses the challenges and benefits of using the cloud for artificial intelligence (AI) and machine learning (ML) applications. This article summarizes the paper and analyzes its presentation.
Summary
The Cloud Native Computing Foundation, a part of the Linux Foundation, advocates for the adoption of scalable applications in the cloud by “fostering and sustaining an ecosystem of open source, vendor-neutral projects.” Its white paper surveys the current gaps and opportunities in using the cloud to operate AI and ML applications. The paper chiefly covers the following topics:
- Current AI and ML techniques
- Cloud Native technologies
- Challenges in using the cloud for AI/ML applications
- Emerging opportunities and solutions
The CNCF envisions Cloud Native as a system of modular microservices designed for reusability, scalability, deployability, and resilience. While describing the two main categories of AI tools in use today, Predictive AI and Generative AI, it identifies these tools’ elastic requirements for data storage, access, compute, bandwidth, latency, and security as a hardware challenge. It proposes cloud infrastructure as a solution that can not only meet these needs but also improve performance and efficiency.
The paper’s review of the challenges begins by listing the typical stages of an ML pipeline: data preparation, model training, model registry, model serving, and observability. The paper elaborates on challenges for each stage; notable ones are as follows:
- Data Preparation: There is no industry standard for ML workload interfaces. Developers typically write their own scripts using local datasets, and engineers then rewrite and deploy them to the production environment. If the results are not as expected, the developers debug their scripts and the cycle repeats.
- Model Training: Training AI and ML applications requires a significant amount of computing resources, especially GPUs. The paper notes that dynamically allocating and de-allocating GPU compute in a Cloud Native environment requires careful management. Because GPU resources are expensive and in demand, careful scheduling of GPU-dependent AI workloads is necessary for maximum efficiency.
- Model Serving: Different AI tools (such as LLMs and text-to-video models) have different compute and latency requirements. Cloud Native is based on microservices, and breaking the ML pipeline into separate microservices makes synchronization a challenge. This also raises the barrier to entry for novice developers, who can no longer focus on just their ML scripts; they must also learn how to deploy their applications on the cloud.
- Observability and User Experience: The abundance of tools and projects across AI/ML and Cloud Native makes choosing complementary solutions difficult. A baseline reference implementation may help developers get started. Long-running AI workloads on the cloud will also need periodic diagnoses to ensure the models do not drift (a minimal sketch of one such check follows this list).
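To make the drift point concrete, here is a minimal sketch of a periodic drift check. It is not from the white paper: the two-sample Kolmogorov–Smirnov test, the threshold, and the synthetic data are my own illustrative assumptions.

```python
# Minimal drift-check sketch (illustrative; not from the CNCF paper).
# Assumes scalar feature or prediction values are logged for a
# reference window (e.g., training data) and a recent serving window.
import numpy as np
from scipy.stats import ks_2samp

DRIFT_P_VALUE = 0.01  # assumed significance threshold; tune per workload

def check_drift(reference: np.ndarray, recent: np.ndarray) -> bool:
    """Return True if the recent window's distribution differs
    significantly from the reference window (possible drift)."""
    _statistic, p_value = ks_2samp(reference, recent)
    return p_value < DRIFT_P_VALUE

# Synthetic example: the serving distribution has shifted slightly.
rng = np.random.default_rng(seed=42)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)
recent = rng.normal(loc=0.4, scale=1.0, size=5_000)

if check_drift(reference, recent):
    print("Drift detected: consider retraining or inspecting inputs.")
else:
    print("No significant drift in this window.")
```

In a Cloud Native setting, a check like this would typically run as a scheduled job (for example, a Kubernetes CronJob) against logged serving data, alerting operators when retraining may be due.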
The paper recommends assembling a reference combination of tools to give a product-like experience. This would help novice developers and new businesses capitalize on AI/ML tools with little setup time. The paper also lists specific projects within the Cloud Native AI Landscape that address different stages of the AI/ML pipeline.
Finally, the paper recognizes that educational institutions are wary of how students may exploit these technologies. This discourages students from experimenting with them and thereby deprives them of a vital advantage in a world witnessing rapid advancements in these fields. The paper invites students and novice researchers to get involved in this emerging landscape by learning about the underpinnings of these technologies. It encourages them to envision how responsible and progressive research in these fields will help them in their current and future careers.
Commentary
Artificial Intelligence and Machine Learning are vast subjects, and so is Cloud Infrastructure. The cloud gained much prominence as a computing platform in the 2010s, and it is a natural candidate for the substantial computing needs of AI/ML applications. This is especially so as smaller businesses take a keener interest in leveraging these applications and industry trends continue to capture the interest of younger generations. In such a scenario, a white paper addressing challenges and opportunities in this space is welcome.
With that said, however, a white paper is most effective when it clearly identifies its audience. This helps present just enough background information to bring readers up to speed, and it serves as a springboard to envision a future state that interests them. This white paper is geared toward a general audience rather than a well-defined one. The paper acknowledges that its readership includes everyone from those with only cursory knowledge of the topics to those with advanced knowledge. It isn’t targeted specifically at executives, researchers, teachers, or even students, but at anyone with a passing interest in the topic. Informational white papers like this one don’t necessarily have to cater to a narrow segment of a larger audience, but defining one anyway helps sequence the paper from the identification of a problem to possible solutions and on to future states.
It’s worth noting that the paper gives a reasonably good overview of current AI/ML techniques. It cleverly explains the different stages of a typical AI/ML pipeline while discussing the challenges of implementing them on the cloud. When discussing the opportunities and solutions, however, the paper does not follow the same template. It instead offers recommendations, lists opportunities, and provides a snapshot of a number of current projects that may apply to the audience. I think this is a missed opportunity: the paper would have read more smoothly, and its message would have landed harder, had it structured its solutions section like its challenges section. Committing to a pattern in a written piece of work is a good way to make it memorable.
Finally, I would like to comment on a few stylistic choices in the paper.
- The four images used in the paper don’t have a cohesive style. When creating a white paper, it is good practice to have a designer rework all supplied illustrations into a consistent theme or format.
- The paper could also benefit from a clearer delineation between sub-sections and sub-sub-sections.
- I strongly recommend that publishers not cite Wikipedia pages. It’s better to verify a Wikipedia page’s citations and cite those sources instead.
Disclaimer: I am not affiliated with the Cloud Native Computing Foundation. I do not endorse or refute the contents of the Cloud Native Artificial Intelligence white paper. If there are any errors or concerns about the contents of this article, please contact me.
If this is your first time visiting this website, welcome! I am Nimmit Prabhackar and I am a white paper specialist. Are you looking to have a white paper written for your organization? If so, please visit the solutions page to learn more about how I can assist you. For more white paper reviews or to read my other articles, please visit the journal page.