Introduction

In the ever-evolving landscape of Data Science (DS) and Artificial Intelligence (AI), advanced models like GPT-3.5/4, Gemini, and LLaMA have revolutionized the approach to solving complex tasks.

Over the past year, various implementations and research efforts have leveraged Large Language Models (LLMs) in innovative ways. For instance, an article on Unite AI [2] highlights healthcare applications in which LLMs are integrated into clinical workflows to assist in gathering and analyzing patient data, significantly improving diagnostic accuracy and patient care. Advanced models like LLaVA-Med can answer inquiries related to biomedical images, enhancing the capabilities of healthcare providers. Ding et al. (2024) [1] highlight sample implementations in finance around personalized financial advice, helping customers make informed decisions about investments, insurance, and retirement plans. Similarly, education and learning are seeing the advent of personalization, built on contextual understanding of each student's needs [23]. In most industries, we see concrete applications in customer support, content creation, multimodal data processing, and linguistic intelligence [3].

Considering the improved data generation and comprehension capabilities of LLMs, they can be utilized across the entire DS pipeline. This motivates us to reimagine DS pipelines and solutions, and to explore how each stage can leverage the data processing and generation capabilities of large language models. LLMs can generate high-quality synthetic data to address data scarcity, imbalance, and privacy concerns, effectively augmenting datasets during the data generation phase [22]. They streamline data preparation by automating cleaning, labeling, and annotation through intelligent suggestions, reducing manual effort and enhancing efficiency. During EDA, LLMs assist in uncovering hidden patterns and insights by interpreting complex datasets through natural language queries, enabling a deeper understanding of the data. In model training and optimization, LLMs contribute to feature engineering and hyperparameter tuning by providing insights based on their vast learned knowledge. Finally, they enhance model evaluation and interpretation by explaining model predictions and suggesting improvements, thereby increasing transparency and trust. This article explores how the capabilities of LLMs can be strategically integrated into each phase of the DS model development pipeline, redefining data science practices and paving the way for more efficient and intelligent model development.

Model Development Pipeline

The standard DS model development/training pipeline is depicted in the diagram below. Each step is crucial to facilitate model development and AI-powered automation. Broadly, we can cluster this pipeline into two blocks: a) data and b) modeling.

Data 

The data block comprises data acquisition (availability and access), cleaning, and preparation for modeling.

Synthetic Data Generation

Stronger generation capabilities can be leveraged here to fill data gaps, specifically for complex data formats like text, images, and audio. Traditional methods often rely on statistical techniques to generate synthetic datasets, which can be limited by their inability to capture complex data patterns. In contrast, LLMs such as GPT-4 have shown the capability to generate high-quality, diverse synthetic data that maintains the statistical properties of real data.

There has been a core development effort to leverage the "generation" capability of GenAI models to facilitate LLM training. A case in point is the Nemotron-4 340B model by NVIDIA, which helps generate synthetic structured data. This model employs reward mechanisms to filter and enhance the quality of the generated data, ensuring it is aligned with real-world datasets and specific training requirements [12]. Research by Stanford NLP, conducted with GPT-3 before the advent of the current class of LLMs, demonstrated tabular data generation by fine-tuning GPT-3 on textually transformed tabular training data. Statistical evidence showed improved outcomes for generating both similar and new data [13]. These outcomes could improve considerably if the experiments were extended to current models (GPT-4, Gemini, LLaMA, etc.). Guo et al. [4] showcase a simple example of generating textual data that aligns with the required domain and context, which is then leveraged for data preparation through automated labeling. This is particularly helpful in industries like healthcare, where data privacy and confidentiality are a concern. Chintagunta et al. demonstrated that GPT-3 could generate synthetic electronic health records (EHRs) that preserved the statistical properties of real patient data while ensuring anonymity [6]. There is scope for further experimentation with improved models like GPT-4, LLaMA 3.1, and Claude 3.5 Sonnet, which have demonstrated enhanced generative capabilities. Using a small sample embedded in the context, contextual synthetic data generation can help simulate data to address volume issues without the risk of confidential data leaks. Further, advanced techniques such as attribute-controlled prompts and the verbalizer method enhance the diversity and relevance of the generated data [5].
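
To make the contextual approach concrete, below is a minimal sketch of few-shot synthetic record generation, in which a small, non-sensitive sample is embedded in the prompt. The `call_llm` helper, the schema, and the seed rows are illustrative assumptions rather than code from any of the cited works, and generated rows would still need statistical validation against the real data.

```python
# Minimal sketch: few-shot synthetic record generation with an LLM.
# `call_llm` is a placeholder for any chat-completion client (OpenAI, Gemini, etc.);
# the schema and seed rows below are illustrative only.
import json

def call_llm(prompt: str) -> str:
    """Swap in your provider's chat-completion call; should return the model's text reply."""
    raise NotImplementedError("wire up an LLM client here")

SEED_ROWS = [  # small, non-sensitive sample embedded in the context
    {"age": 54, "diagnosis": "type-2 diabetes", "hba1c": 7.9, "on_insulin": False},
    {"age": 61, "diagnosis": "type-2 diabetes", "hba1c": 8.6, "on_insulin": True},
]

def generate_synthetic_rows(n: int) -> list[dict]:
    prompt = (
        "You generate synthetic patient records for model training. "
        "Match the schema, value ranges, and plausible correlations of the examples, "
        "but do NOT copy any example verbatim.\n"
        f"Examples:\n{json.dumps(SEED_ROWS, indent=2)}\n"
        f"Return a JSON array of {n} new records, and nothing else."
    )
    return json.loads(call_llm(prompt))

# rows = generate_synthetic_rows(20)  # validate distributions against real data before use
```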

Data generation can also target specific data issues like imbalance, skewness, and absence. Kim et al. explore the effectiveness of LLMs in generating realistic tabular data to mitigate class imbalance [14]. They propose using a CSV-style format for prompts to maximize token efficiency, balancing classes within the prompts, and grouping class-specific data examples. These techniques allow the LLM to better understand the relationship between features and produce data that more closely mimics real-world distributions. The authors describe the In-Context Learning ability of LLMs, which uses examples to generate new, synthetic data without extensive retraining, making the method more efficient and scalable. Similar examples exist for image data generation to mitigate privacy issues in medical imaging, where images generated from textual descriptions can be used to train diagnostic models while maintaining patient privacy. Likewise, generating product images from text prompts helps retailers create synthetic data for product catalog training or marketing visuals. One such example is work in progress in our AI labs, where we aim to train an image-assessment model specific to relief provision in disasters; publicly available images are useful, but extending the training data with an image generation model adds variation and enhances data volume. In another example, from a recent client engagement at Sahaj, we leveraged a larger GenAI model to transform unstructured PDFs into input-output sequence pairs for fine-tuning smaller LLMs for domain-specific customization.
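
As an illustration of the CSV-style prompting idea above, the sketch below builds a token-efficient, class-balanced prompt and asks an LLM to generate additional minority-class rows. The balancing logic, column names, and prompt wording are our own illustrative assumptions rather than the authors' implementation, and `call_llm` stands in for any chat-completion client.

```python
# Sketch: CSV-formatted few-shot prompt that samples classes evenly so the LLM
# sees balanced examples before generating extra minority-class rows.
import pandas as pd

def build_balanced_prompt(df: pd.DataFrame, label_col: str, target_label: str,
                          per_class: int = 5, n_new: int = 20) -> str:
    # take an equal number of examples from each class (illustrative choice)
    sampled = (
        df.groupby(label_col, group_keys=False)
          .apply(lambda g: g.sample(min(per_class, len(g)), random_state=0))
    )
    csv_block = sampled.to_csv(index=False)  # compact, token-efficient serialization
    return (
        "Below are labelled examples in CSV format.\n"
        f"{csv_block}\n"
        f"Generate {n_new} additional realistic rows with {label_col}={target_label}, "
        "keeping the column order and value ranges consistent. Return CSV only, no header."
    )

# prompt = build_balanced_prompt(train_df, label_col="churned", target_label="yes")
# new_rows_csv = call_llm(prompt)  # then parse with pandas.read_csv and re-validate
```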

However, it’s crucial to note that synthetic data generation is not without challenges. Concerns about the potential amplification of biases present in the training data of LLMs necessitate careful validation and filtering of generated data [7]. Despite these challenges, using LLMs for synthetic data generation represents a significant advancement in data augmentation techniques, offering a promising solution to data scarcity issues across various domains of machine learning and artificial intelligence.

We have looked at how LLMs can enhance synthetic data generation, but it is equally important to consider how this improved data feeds into the subsequent stages of the pipeline. Next, we expand on the data-preparation stage and explore sample research on augmenting it.

Data Preparation

Multimodal LLMs can also be leveraged for initial data preparation and exploration. Data annotation, a crucial and effort-intensive step in training-data preparation, can leverage LLMs for quicker turnaround with reduced manual effort. [8] explores zero-shot scenarios, in which LLMs generate annotations based on carefully crafted prompts without prior examples, as well as few-shot learning, which provides a few examples to guide the annotation process. This approach leverages In-Context Learning (ICL), where LLMs use a combination of instructions and demonstration samples to generate accurate annotations. Studies have also compared annotation approaches and measured improvements in overall efficiency. [14] discusses how LLMs significantly accelerate and improve the accuracy of data labeling by automating initial annotations, which human annotators can then review and refine, thereby augmenting human expertise. This semi-automated process not only reduces the time and cost associated with data labeling but also helps maintain consistency across large datasets. For instance, in named entity recognition tasks, LLMs can pre-annotate text with potential entities, allowing human annotators to focus on verification and edge cases rather than starting from scratch. [17] demonstrates another example in sentiment analysis, where LLMs provide initial sentiment labels that humans then validate or adjust.
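
As a small illustration of the few-shot (in-context) annotation pattern described above, the sketch below pre-labels sentiment and leaves verification to a human reviewer. The prompt wording, label set, and `call_llm` placeholder are assumptions for illustration, not taken from the cited studies.

```python
# Sketch: few-shot sentiment pre-annotation that a human annotator then reviews.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat-completion client here")

FEW_SHOT = [  # illustrative in-context examples
    ("The checkout flow was painless and fast.", "positive"),
    ("Support never replied to my ticket.", "negative"),
]

def pre_annotate(texts: list[str]) -> list[dict]:
    examples = "\n".join(f'Text: "{t}"\nLabel: {label}' for t, label in FEW_SHOT)
    records = []
    for text in texts:
        prompt = (
            "Label the text as positive, negative, or neutral. Reply with the label only.\n"
            f'{examples}\nText: "{text}"\nLabel:'
        )
        label = call_llm(prompt).strip().lower()
        records.append({"text": text, "llm_label": label, "human_verified": False})
    return records  # route uncertain or disagreeing items to human annotators
```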

Several studies survey research efforts that use LLMs as annotators; from these we can infer their utility in terms of effort saved, although some limitations are highlighted as well. Notable limitations include issues with representativeness, potential biases, sensitivity to variations in prompts, and a preference for the English language [9].

In conclusion, leveraging multimodal LLMs for data annotation offers substantial benefits by significantly reducing manual effort and accelerating the data preparation process. These models can automate initial annotations, enhancing accuracy and consistency across large datasets, and allow human experts to focus on verification and complex cases. However, it’s essential to be mindful of their limitations, including potential biases, sensitivity to prompt variations, and language preferences. Careful oversight and a collaborative approach between LLMs and human annotators are crucial to maximize the advantages while mitigating the drawbacks in data annotation tasks.

Data Exploration

Similarly, exploratory data assessment, mainly data understanding, can also leverage LLMs for data description and for understanding parameters and their relationships. The ability of LLMs to consume data of different varieties and unify it helps with EDA on diverse datasets. An interesting usage, reported in several sample cases, is integrating the domain expert's view with the raw data during analysis. For example, transactional datasets can represent relational data between business functions, which can be interpreted from a functional point of view. Adding a contextual prompt integrates the domain point of view with the data being analyzed [10]. Instruction prompting can pair the data with functional and target objectives, and may therefore contribute to discovering patterns more relevant to the problem statement. A natural language interface for exploring data can expedite data discovery, specifically when the data is from a different domain and requires a certain level of subject matter expertise to assess, understand, or explore relationships between its features and parameters. Specific tools, such as LIDA, have been developed to leverage this. LIDA integrates LLMs to automatically generate visualization goals, identify important features, and suggest relevant data transformations. This automation streamlines the EDA process, enabling data scientists to quickly uncover patterns, anomalies, and relationships within datasets [11].
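
As a rough illustration of this kind of natural-language EDA, the sketch below pairs a lightweight dataset profile with a domain-context preamble and asks an LLM for exploration suggestions; LIDA automates a similar goal-generation step. The profiling choices, prompt, and `call_llm` placeholder are illustrative assumptions, not LIDA's API.

```python
# Sketch: natural-language EDA by pairing a dataset profile with domain context.
import pandas as pd

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat-completion client here")

def eda_suggestions(df: pd.DataFrame, domain_context: str) -> str:
    profile = {
        "columns": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "numeric_summary": df.select_dtypes(include="number").describe().round(2).to_dict(),
        "missing_pct": (df.isna().mean() * 100).round(1).to_dict(),
    }
    prompt = (
        f"Domain context: {domain_context}\n"
        f"Dataset profile: {profile}\n"
        "Suggest five exploratory analyses (slices, correlations, visualizations) "
        "that a domain expert would prioritise, and explain why."
    )
    return call_llm(prompt)

# print(eda_suggestions(transactions_df,
#       "Card transactions; the fraud team cares about merchant category and hour of day."))
```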

Exploratory data assessment includes data definition and understanding that leads to parametric evaluation for feature engineering. The ease of summarizing mixed data through natural language prompts helps explore data slicing, combinatorial featurization, and basket analysis [16]. Prompt strategies that explore data in sections help highlight hidden features and patterns that might otherwise only become visible post-modeling. For example, instead of performing a market basket analysis through algorithms like Apriori, visualizing independent data columns through prompt instruction can provide a high-level view of probable relationships. This further helps refine hypothesis definition while designing the modeling strategy.
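
To illustrate the basket-analysis example above, the sketch below serializes a sample of baskets into a prompt and asks an LLM for co-occurrence hypotheses, to be verified later with a formal association-rule analysis. The serialization and prompt wording are illustrative assumptions.

```python
# Sketch: a prompt-level first pass at basket relationships, as hypotheses only,
# before running formal association mining such as Apriori.
from collections import defaultdict

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat-completion client here")

def probe_associations(order_item_pairs: list[tuple[str, str]], max_baskets: int = 50) -> str:
    baskets = defaultdict(list)
    for order_id, item in order_item_pairs:  # group line items into baskets
        baskets[order_id].append(item)
    sample = list(baskets.values())[:max_baskets]
    lines = "\n".join(", ".join(items) for items in sample)
    prompt = (
        "Each line below is one shopping basket.\n"
        f"{lines}\n"
        "List item pairs that appear to co-occur often, phrased as hypotheses "
        "to verify with a proper association-rule analysis."
    )
    return call_llm(prompt)
```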

However, some considerations warrant caution when leveraging LLMs for EDA. Since we are not relying on statistical measures, interpretations of patterns and relationships are inferential rather than deterministic. A better use is to treat LLM-based instruction prompting as initial exploration that helps define a series of experiments around featurization, hypothesis formulation, or causality, which is then validated through concrete experimentation.

Post Model Training

Model training itself is an independent exercise, which can leverage LLMs to explore packages and code snippets to expedite training. We therefore focus on how LLMs can help at the post-modeling stage.

Model evaluation is crucial after training to ensure that the model aligns with the objectives defined earlier in the pipeline. The approach to model evaluation varies with the task and data type. For example, a text classification model typically uses metrics like F1 score, accuracy, and precision-recall, whereas tasks such as text summarization require metrics like ROUGE and BLEU. While LLMs may not directly provide evaluations like traditional metrics, they can act as interpreters of model outputs. For example, in medical image classification, LLMs can assist by summarizing model outcomes, providing insights into how the model arrived at certain decisions, and explaining results based on input parameters. In the domain of text generation, LLM-based evaluation has been explored far more thoroughly than in other applications [17][18]. When it comes to feedback, LLMs can be valuable in two key scenarios:

Providing Feedback to Non-LLM-Based Pipelines: LLMs can simulate human feedback, offering critiques of model outcomes and suggesting improvements, which can be particularly useful for AI models that require human feedback to improve [19]. A sample scenario is dialogue systems or chatbots that are not generative models: LLMs can help evaluate how well the system interacts with users by mimicking human responses and offering feedback on how natural or useful the conversation is. Additionally, in medical imaging, LLMs could critique the performance of a classification model by identifying misclassified areas in an X-ray and offering possible reasons for those errors, though this would need experiments to verify.

Offering Feedback within LLM-Based Pipelines or for Generative Tasks: Feedback within an LLM-based pipeline, particularly for generative tasks, has been explored more extensively. For example, in generative text tasks, LLMs can evaluate their own output, as seen in systems like CritiqueLLM, which leverages LLMs to act as critics, comparing model outputs against reference data and providing structured feedback. A more advanced method, known as self-feedback refinement, involves LLMs generating iterative feedback on their own outputs, refining the evaluation over multiple rounds to ensure higher quality and relevance [20][21].

These approaches are particularly beneficial in tasks like text generation, where the quality and coherence of the output are critical, but they are not yet widely applicable in more deterministic domains like classification or regression. There are notable limitations to LLM-based evaluation methods, particularly their subjectivity and the difficulty of capturing the full scope of language generation quality [18], as well as the non-deterministic nature of the model itself.
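
To make the critique-and-refine idea concrete, below is a minimal sketch in the spirit of CritiqueLLM-style evaluation and self-feedback refinement [20][21]. The prompts, the 1-5 scoring format, the fixed number of rounds, and the `call_llm` placeholder are our own assumptions, not the published systems.

```python
# Sketch: an LLM critiques a generated summary against a reference, then the
# feedback is applied in a refinement step; repeated for a fixed number of rounds.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat-completion client here")

def critique(candidate: str, reference: str) -> str:
    return call_llm(
        "You are an evaluator. Compare the candidate summary to the reference.\n"
        f"Reference:\n{reference}\n\nCandidate:\n{candidate}\n\n"
        "Give a 1-5 score for faithfulness and coverage, then list concrete fixes."
    )

def refine(candidate: str, feedback: str) -> str:
    return call_llm(
        f"Rewrite the summary, applying this feedback.\nFeedback:\n{feedback}\n\n"
        f"Summary:\n{candidate}"
    )

def critique_and_refine(candidate: str, reference: str, rounds: int = 2) -> str:
    for _ in range(rounds):
        candidate = refine(candidate, critique(candidate, reference))
    return candidate  # final output still needs human or metric-based validation
```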

Conclusion

In conclusion, LLMs have emerged as transformative tools across various stages of the data science pipeline, offering new capabilities for data generation, preparation, exploration, and post-model-training evaluation. By enabling synthetic data generation, LLMs help fill critical gaps in data availability, enhance the diversity of training sets, and address issues like data imbalance, while helping maintain privacy and confidentiality. They can streamline data preparation tasks, such as data annotation and exploration, allowing for faster and more consistent analysis. In the exploratory data analysis phase, LLMs enable richer contextual understanding and domain integration, providing insights that guide hypothesis formulation and feature engineering. Their natural language interfaces offer a more intuitive way to explore and unify diverse datasets, speeding up the discovery of relationships and patterns that might otherwise go unnoticed.

Moreover, in the post-model-training phase, LLMs have proven their value as interpreters and critics of model outcomes, particularly for generative tasks. While LLMs are not direct replacements for traditional evaluation metrics, their ability to provide qualitative feedback, simulate human judgment, and act as an aid to human feedback adds a new dimension to model evaluation. As the field evolves, the role of LLMs in providing feedback, refining outputs, and monitoring model performance will continue to expand, making them essential tools for building scalable, robust, and human-aligned AI systems.

However, challenges remain, such as addressing biases, ensuring transparency in model interpretations, and refining LLMs' feedback mechanisms in non-generative tasks. Despite these hurdles, the sample literature and reference examples make it evident that LLMs offer immense potential to revolutionize AI model development, and further research and experimentation will unlock even greater efficiencies and possibilities across industries.

References

  1. Ding, Q., Ding, D., Wang, Y., Guan, C., & Ding, B. (2024). “Unraveling the landscape of large language models: a systematic review and future perspectives”, Journal of Electronic Business & Digital Economics, Vol. 3 No. 1, pp. 3-19. https://doi.org/10.1108/JEBDE-08-2023-0015

  2. https://www.unite.ai/unveiling-of-large-multimodal-models-shaping-the-landscape-of-language-models-in-2024/

  3. https://springsapps.com/knowledge/large-language-model-statistics-and-numbers-2024

  4. https://ar5iv.labs.arxiv.org/html/2403.04190v1

  5. https://ar5iv.labs.arxiv.org/html/2406.14541

  6. Chintagunta, B., Katariya, N., Amatriain, X., & Kannan, A. (2021). Medically Valid Synthetic Healthcare Records Using Generative AI. arXiv preprint arXiv:2112.00160.

  7. Welbl, J., Glaese, A., Uesato, J., Dathathri, S., Mellor, J., Hendricks, L. A., … & Huang, P. S. (2021). Challenges in Detoxifying Language Models. arXiv preprint arXiv:2109.07445.

  8. https://ar5iv.labs.arxiv.org/html/2402.13446

  9. https://aclanthology.org/2024.nlperspectives-1.11.pdf

  10. https://towardsdatascience.com/how-llms-will-democratize-exploratory-data-analysis-70e526e1cf1c

  11. https://aclanthology.org/2023.acl-demo.11.pdf

  12. https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/

  13. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/final-reports/final-report-169369314.pdf

  14. Wang, Y., et al. (2023). “LLM-assisted Data Annotation: A Survey.” arXiv preprint arXiv:2303.13839.

  15. https://arxiv.org/abs/2305.15005

  16. https://towardsdatascience.com/how-llms-will-democratize-exploratory-data-analysis-70e526e1cf1c

  17. https://ar5iv.labs.arxiv.org/html/2401.07103

  18. https://datasciencedojo.com/blog/evaluating-large-language-models-llms/

  19. https://ar5iv.labs.arxiv.org/html/2306.09821

  20. https://ar5iv.labs.arxiv.org/html/2311.18702

  21. https://ar5iv.labs.arxiv.org/html/2303.17651

  22. https://arxiv.org/pdf/2403.02990

  23. Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., … & Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and individual differences, 103, 102274.