Intelligent Data Flow Automation for AI Systems via Advanced Engineering Practices

Arunkumar Medisetty

doi:10.70153/IJCMI/2021.13101

Authors

Arunkumar Medisetty, Staff Software Engineer, The Home Depot

DOI:

https://doi.org/10.70153/IJCMI/2021.13101

Keywords:

AI Data Pipeline, Data Engineering Automation, Data Orchestration, Real-time Streaming, DataOps, Machine Learning Infrastructure, Data Quality, Workflow Automation

Abstract

Modern Artificial Intelligence (AI) systems demand seamless, scalable, and intelligent data flows to support real-time analytics, model training, and automated decision-making. However, traditional data pipelines are often rigid, manual, and inefficient, leading to delays, data silos, and suboptimal model performance. This research explores how advanced data engineering techniques, such as real-time data streaming, automated ETL/ELT processes, data orchestration, schema evolution, and intelligent data validation, can automate and optimize the end-to-end data flow in AI systems. It proposes a comprehensive framework that integrates Apache Kafka, Apache Airflow, Delta Lake, and ML-based metadata management into a unified automation stack. Case studies in the healthcare, finance, and IoT domains demonstrate measurable improvements in pipeline efficiency, data quality, system scalability, and AI model readiness. The results underscore the potential of advanced data engineering to enable adaptive, self-healing, and intelligent data infrastructures that power modern AI ecosystems.

Author Biography

Arunkumar Medisetty

Staff Software Engineer

The Home Depot

2161 Newmarket Pkwy SE, Marietta, GA 30067

Email: arunkumar.medisetty@yahoo.com
