The art and practice of data science pipelines.

By Biswas, S., & Biswas, S.

Biswas, S., & Biswas, S. (2022). The art and practice of data science pipelines. Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice, 21-30. https://sumonbis.github.io/uploads/pipeline-ICSE22.pdf

This paper examines how data science pipelines are organized by comparing their descriptions in the academic literature with their implementations in small-scale and large-scale software projects. Analyzing 71 theoretical models, 105 competitive Kaggle solutions, and 21 mature open-source GitHub repositories, the authors identify a set of foundational stages (such as acquisition, preparation, modeling, and deployment) that define the data science lifecycle. A key finding is a gap between theory and practice: real-world code often lacks modularity and suffers from tangled stages, in which data processing and model training are mixed together. The paper also highlights the need for integration tools and standardized frameworks, particularly for large-scale projects that must bridge a model's initial development and its post-deployment phase. Finally, the work offers a canonical terminology for these stages to help developers improve the reusability and maintainability of data-intensive systems.
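
To make the idea of separated (rather than tangled) stages concrete, here is a minimal sketch in Python. It is not taken from the paper: the function names (`acquire`, `prepare`, `fit_model`, `evaluate`, `run_pipeline`), the toy data, and the choice of a least-squares line as the model are all assumptions made for illustration. Each stage is its own function, so a stage can be swapped or tested without touching the others.

```python
from statistics import mean
from typing import Callable, List, Optional, Tuple

# Toy schema assumed for this sketch: (feature, target), either may be missing.
RawRow = Tuple[Optional[float], Optional[float]]
Row = Tuple[float, float]


def acquire() -> List[RawRow]:
    """Acquisition: hypothetical in-memory data standing in for a real source."""
    return [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]


def prepare(rows: List[RawRow]) -> List[Row]:
    """Preparation: drop rows whose feature or target is missing (None)."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def fit_model(rows: List[Row]) -> Callable[[float], float]:
    """Modeling: ordinary least-squares fit of y = slope * x + intercept."""
    x_bar = mean(x for x, _ in rows)
    y_bar = mean(y for _, y in rows)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    s_xx = sum((x - x_bar) ** 2 for x, _ in rows)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def evaluate(model: Callable[[float], float], rows: List[Row]) -> float:
    """Evaluation: mean absolute error (on the training rows, for brevity)."""
    return mean(abs(model(x) - y) for x, y in rows)


def run_pipeline() -> Tuple[Callable[[float], float], float]:
    """Compose the stages; each one only sees the previous stage's output."""
    rows = prepare(acquire())
    model = fit_model(rows)
    return model, evaluate(model, rows)


if __name__ == "__main__":
    model, mae = run_pipeline()
    print(f"MAE on training rows: {mae:.3f}")
    print(f"Prediction for x=6: {model(6.0):.2f}")
```

A tangled version would interleave the cleaning rule and the fitting arithmetic in a single loop, which is the kind of structure the study reports as hard to reuse. Keeping the stages as separate functions is one simple way to avoid that, though real projects would typically rely on a pipeline framework rather than hand-written composition.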