Model Development and Training

Model Development Tools -

Model Training & Hyperparameter Tuning

Steps involved in training a machine-learning model and optimizing its performance:


Once the model is ready, measure these metrics:
Precision - Correctly predicted positive instances / Total predicted positives
Recall - Correctly predicted positive instances / Total actual positives
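
The two ratios above can be sketched in plain Python. This is a minimal illustration, assuming binary labels where 1 is the positive class; the function name and the sample labels are made up for the example.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly predicted positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted positive, actually negative
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # tp / total predicted positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # tp / total actual positives
    return precision, recall


# Hypothetical labels: 4 actual positives, 3 predicted positives, 2 of them correct
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Here two of the three predicted positives are correct, and two of the four actual positives are found.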


Model Training with a larger dataset - Hyperparameter tuning for iterative model improvement.


Train the model with the larger dataset.
Evaluate precision and recall at each stage.
Training data partitioning - Split into 70% training, 15% validation, 15% testing.
Hyperparameter optimization (Grid Search)
Cross-validation (5-fold cross-validation)
Performance evaluation (85% accuracy, F1-score: 0.80)
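
The partitioning, grid search, and 5-fold cross-validation steps above can be sketched without any libraries. This is only an illustration under stated assumptions: the synthetic 1-D data, the tiny nearest-neighbour classifier, and the grid of `n_neighbors` values are all invented for the example. In a real project, scikit-learn's `GridSearchCV` with `cv=5` would normally take the place of the hand-written loops.

```python
import random
from statistics import mean


def split_70_15_15(rows, seed=42):
    """Shuffle, then split into 70% training, 15% validation, 15% testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 70 // 100
    n_val = len(shuffled) * 15 // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def k_folds(rows, n_folds=5):
    """Yield (fit_rows, held_out_rows) pairs for k-fold cross-validation."""
    folds = [rows[i::n_folds] for i in range(n_folds)]
    for i, held_out in enumerate(folds):
        fit_rows = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield fit_rows, held_out


def knn_predict(fit_rows, x, n_neighbors):
    """Majority vote among the nearest 1-D points (toy model for the sketch)."""
    nearest = sorted(fit_rows, key=lambda r: abs(r[0] - x))[:n_neighbors]
    return 1 if 2 * sum(label for _, label in nearest) > len(nearest) else 0


def accuracy(fit_rows, eval_rows, n_neighbors):
    hits = sum(knn_predict(fit_rows, x, n_neighbors) == y for x, y in eval_rows)
    return hits / len(eval_rows)


def grid_search(train_rows, grid, n_folds=5):
    """Score every grid value by its mean cross-validated accuracy."""
    scores = {
        n: mean(accuracy(fit, held, n) for fit, held in k_folds(train_rows, n_folds))
        for n in grid
    }
    return max(scores, key=scores.get), scores


# Synthetic data: the label is 1 when a noisy copy of x exceeds 5
rng = random.Random(0)
data = []
for _ in range(100):
    x = rng.uniform(0, 10)
    data.append((x, 1 if x + rng.gauss(0, 1.5) > 5 else 0))

train, val, test = split_70_15_15(data)
best_n, cv_scores = grid_search(train, grid=[1, 3, 5, 7])

print("mean CV accuracy per value:", cv_scores)
print("validation accuracy:", accuracy(train, val, best_n))
print("test accuracy:", accuracy(train, test, best_n))
```

The test partition is touched only once, after the hyperparameter has been chosen, so its score stays an unbiased estimate.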

CPUs vs GPUs Comparison

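GPUs are well suited to the vector and matrix operations that dominate model training, because they run many of them in parallel; CPUs can perform the same operations but are typically much slower at this scale.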

CPU vs GPU -

MLflow Development Lifecycle -

_images/mlflow-development-lifecycle.png

MLflow Intro

MLflow - Managing ML experiments


1. Model Selection - (Select from different machine learning models)
2. Run experiments on the selected models
3. Test the same data with different ML models

The important step is visualizing the data and trying to understand which model is the best fit for the use case.
This is an iterative process, and this is where MLflow is actually helpful.

MLflow is an open-source platform for managing the machine learning lifecycle, from tracking experiments to model deployment.

MLflow modules

1. Tracking - Record and query experiments (code, data, config, results).
2. Projects - Packaging format for reproducible runs on any platform.
3. Models - General format that standardizes the deployment paths.
            Once the model is finalized, it can be saved in MLflow's format and deployed through any serving framework, such as Flask, FastAPI, or BentoML.
4. Model Registry - Centralized and collaborative model lifecycle management.

Install MLflow

MLflow is an open-source platform for the complete machine-learning lifecycle.

pip install mlflow

(.venv) bharathkumardasaraju@4.Mode-Development-and-Training$ mlflow ui --port 5001
[2025-03-04 21:05:55 +0800] [94896] [INFO] Starting gunicorn 23.0.0
[2025-03-04 21:05:55 +0800] [94896] [INFO] Listening at: http://127.0.0.1:5001 (94896)
[2025-03-04 21:05:55 +0800] [94896] [INFO] Using worker: sync
[2025-03-04 21:05:55 +0800] [94897] [INFO] Booting worker with pid: 94897
[2025-03-04 21:05:55 +0800] [94898] [INFO] Booting worker with pid: 94898
[2025-03-04 21:05:55 +0800] [94899] [INFO] Booting worker with pid: 94899
[2025-03-04 21:05:55 +0800] [94900] [INFO] Booting worker with pid: 94900

MLflow UI -

MLflow UI Supports Multiple Models -

MLflow Log - LangChain

import mlflow
from langchain_openai import OpenAI
from langchain_core.prompts import PromptTemplate

mlflow.set_experiment(experiment_id="0")
mlflow.langchain.autolog()

# Ensure that the "OPENAI_API_KEY" environment variable is set
llm = OpenAI()
prompt = PromptTemplate.from_template("Answer the following question: {question}")
chain = prompt | llm

# Invoking the chain will cause a trace to be logged
chain.invoke("What is MLflow?")

MLflow Log - LlamaIndex

import mlflow
from llama_index.core import Document, VectorStoreIndex

mlflow.set_experiment(experiment_id="0")
mlflow.llama_index.autolog()

# Ensure that the "OPENAI_API_KEY" environment variable is set
index = VectorStoreIndex.from_documents([Document.example()])
query_engine = index.as_query_engine()

# Querying the engine will cause a trace to be logged
query_engine.query("What is LlamaIndex?")

MLflow Log - AutoGen

import os
import mlflow
from autogen import AssistantAgent, UserProxyAgent

mlflow.set_experiment(experiment_id="0")
mlflow.autogen.autolog()

# Ensure that the "OPENAI_API_KEY" environment variable is set
llm_config = {"model": "gpt-4o-mini", "api_key": os.environ["OPENAI_API_KEY"]}
assistant = AssistantAgent("assistant", llm_config=llm_config)
user_proxy = UserProxyAgent("user_proxy", code_execution_config=False)

# All intermediate executions within the chat session will be logged
user_proxy.initiate_chat(assistant, message="What is MLflow?", max_turns=1)

MLflow Log - OpenAI

import mlflow
from openai import OpenAI

mlflow.set_experiment(experiment_id="0")
mlflow.openai.autolog()

# Ensure that the "OPENAI_API_KEY" environment variable is set
client = OpenAI()

messages = [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "Hello!"}
]

# Inputs and outputs of the API request will be logged in a trace
client.chat.completions.create(model="gpt-4o-mini", messages=messages)

MLflow Log - Custom App

import mlflow

mlflow.set_experiment(experiment_id="0")

@mlflow.trace
def foo(a):
    return a + bar(a)

# Various attributes can be passed to the decorator
# to modify the information contained in the span
@mlflow.trace(name="custom_name", attributes={"key": "value"})
def bar(b):
    return b + 1

# Invoking the traced function will cause a trace to be logged
foo(1)

MLflow Setup Demo

Model development and Training

MLflow

Jupyter Notebook
Colab Notebook
Amazon SageMaker
GCP Vertex AI


Build the ML model, train it, and test it.

mlflow ui --host 0.0.0.0

MLflow Example Demo -

Requirements

mlflow
scikit-learn

Validation Script

from mlflow.tracking import MlflowClient

client = MlflowClient()
for rm in client.search_registered_models():
    print(f"Model name: {rm.name}")

Example MLflow Python Script

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, explained_variance_score
import numpy as np

# Set the MLflow tracking URI to the remote MLflow server
mlflow.set_tracking_uri("http://localhost:5001")

# Create synthetic data for regression
X, y = make_regression(n_samples=100, n_features=4, noise=0.1, random_state=42)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set the experiment name
mlflow.set_experiment("ML Model Experiment")

def log_model(model, model_name):
    with mlflow.start_run(run_name=model_name):
        # Train the model
        model.fit(X_train, y_train)
        
        # Make predictions
        y_pred = model.predict(X_test)
        
        # Calculate metrics
        mse = mean_squared_error(y_test, y_pred)
        rmse = np.sqrt(mse)
        mae = mean_absolute_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        evs = explained_variance_score(y_test, y_pred)

        # Log metrics
        mlflow.log_metric("mse", mse)
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("mae", mae)
        mlflow.log_metric("r2", r2)
        mlflow.log_metric("explained_variance", evs)

        # Log model
        mlflow.sklearn.log_model(model, model_name)
        
        print(f"{model_name} - MSE: {mse}, RMSE: {rmse}, MAE: {mae}, R2: {r2}, Explained Variance: {evs}")

# Linear Regression Model
linear_model = LinearRegression()
log_model(linear_model, "Linear Regression")

# Decision Tree Regressor Model
tree_model = DecisionTreeRegressor()
log_model(tree_model, "Decision Tree Regressor")

# Random Forest Regressor Model
forest_model = RandomForestRegressor()
log_model(forest_model, "Random Forest Regressor")

print("Experiment completed! Check the MLflow server for details.")

ML Model Experiment -

Run ML Model Experiment -

_images/Run-ML-model-experiment-python.png

Experiments in MLflow

Run the experiments locally and store the results in the MLflow service.

The example uses scikit-learn, a data science package, to train the models. It needs two packages:
mlflow
scikit-learn

When example-mlflow.py runs, each model is trained and all of its results are stored on the MLflow tracking server.


ML model:

1. Train a model
2. Make some Predictions
3. Calculate metrics
4. Log metrics
5. Log model
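
To make the "Calculate metrics" step concrete, here is a minimal pure-Python sketch of the regression metrics the example script logs (MSE, RMSE, MAE, R2). The function name and the four sample values are assumptions for illustration; the script itself relies on scikit-learn's metric functions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Hypothetical targets and predictions
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Lower MSE, RMSE, and MAE mean smaller errors, while R2 closer to 1 means the predictions explain more of the variance in the targets.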

The important part is running multiple models on the same data:
1. Linear Regression Model
2. Decision Tree Regressor Model
3. Random Forest Regressor Model

All of the metrics generated by these models are stored in our experiment.

So we have one experiment holding the metrics of three models.

Since MLflow is running on port 5001 here, the tracking URI in the script is set accordingly: mlflow.set_tracking_uri("http://localhost:5001")


(.venv) bharathkumardasaraju@2.demo-experimenting-storing-result-in-mlflow$ python3.12 ./example-mlflow.py
2025/03/05 02:51:40 INFO mlflow.tracking.fluent: Experiment with name 'ML Model Experiment' does not exist. Creating a new experiment.
2025/03/05 02:51:42 WARNING mlflow.models.model: Model logged without a signature and input example. Please set `input_example` parameter when logging the model to auto infer the model signature.
Linear Regression - MSE: 0.010896702038864626, RMSE: 0.10438726952490245, MAE: 0.08916727215687235, R2: 0.9999982925287189, Explained Variance: 0.9999983106045686
🏃 View run Linear Regression at: http://localhost:5001/#/experiments/984637440833736174/runs/fd6aa9c3311b474084d34fac4796f65a
🧪 View experiment at: http://localhost:5001/#/experiments/984637440833736174
2025/03/05 02:51:44 WARNING mlflow.models.model: Model logged without a signature and input example. Please set `input_example` parameter when logging the model to auto infer the model signature.
Decision Tree Regressor - MSE: 1371.940640735993, RMSE: 37.03971707148953, MAE: 31.852194743661595, R2: 0.7850221805576543, Explained Variance: 0.785065525456845
🏃 View run Decision Tree Regressor at: http://localhost:5001/#/experiments/984637440833736174/runs/6b0a7299af0e49b3b44ac213d065e8ca
🧪 View experiment at: http://localhost:5001/#/experiments/984637440833736174
2025/03/05 02:51:45 WARNING mlflow.models.model: Model logged without a signature and input example. Please set `input_example` parameter when logging the model to auto infer the model signature.
Random Forest Regressor - MSE: 692.3449318626951, RMSE: 26.31244823011905, MAE: 20.74315269960315, R2: 0.8915122131858741, Explained Variance: 0.8919768998456485
🏃 View run Random Forest Regressor at: http://localhost:5001/#/experiments/984637440833736174/runs/9b087190fcdf4840a9a1174237fe430b
🧪 View experiment at: http://localhost:5001/#/experiments/984637440833736174
Experiment completed! Check the MLflow server for details.
(.venv) bharathkumardasaraju@2.demo-experimenting-storing-result-in-mlflow$

Evaluate ML Models -

_images/Evaluate-ml-models-in-mlflow.png

Compare ML Model Runs -
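
Outside the UI, the same comparison can be sketched in plain Python. The metric values below are rounded from the run output shown earlier; the `runs` dictionary and the choice of RMSE as the ranking metric are assumptions for illustration, not part of MLflow.

```python
# Metrics rounded to six decimals from the three runs logged by example-mlflow.py
runs = {
    "Linear Regression":       {"rmse": 0.104387,  "mae": 0.089167,  "r2": 0.999998},
    "Decision Tree Regressor": {"rmse": 37.039717, "mae": 31.852195, "r2": 0.785022},
    "Random Forest Regressor": {"rmse": 26.312448, "mae": 20.743153, "r2": 0.891512},
}

# Rank runs from best to worst; a lower RMSE means smaller prediction errors
ranking = sorted(runs, key=lambda name: runs[name]["rmse"])
best_run = ranking[0]

for name in ranking:
    m = runs[name]
    print(f"{name:<24} rmse={m['rmse']:>10.4f} mae={m['mae']:>10.4f} r2={m['r2']:.4f}")
print("Lowest RMSE:", best_run)
```

Linear Regression ranks first on this dataset, which is unsurprising because `make_regression` generates targets from a linear model with a small amount of noise.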

Model Artifacts -

_images/decision-tree-regressor-model-artifacts.png

Store Model in Registry

MLflow model artifacts and versioning

Suppose the experiment leads us to go ahead with the Decision Tree Regressor model.
We then select that model's artifacts, essentially the model.pkl file for the Decision Tree Regressor model, and store them in the Model Registry.

Deploying the model means serving model.pkl behind an API, so treat the file like a versioned build artifact such as a JAR or a PyPI package.

New ML Deployment -

Open Model File

import pickle

with open("model.pkl", "rb") as file:
    model = pickle.load(file)

print("model name is", model)
print("model params are", model.get_params())


if hasattr(model, "feature_importances_"):
    print("Feature Importances:", model.feature_importances_)

Open Model File v1

import pickle
import json

with open("model.pkl", "rb") as file:
    model = pickle.load(file)

print("\n📌 Model Name:", model.__class__.__name__)
print("\n🔧 Model Parameters:")
print(json.dumps(model.get_params(), indent=4))

if hasattr(model, "feature_importances_"):
    print("\n📊 Feature Importances:")
    for idx, importance in enumerate(model.feature_importances_):
        print(f"  Feature {idx}: {importance:.4f}")

Data Handling in MLflow

Grid Search is the technique used to optimize hyperparameters in the training process.

The two metrics below should be monitored to ensure that a model is balanced between precision and recall:

Precision and recall together assess how well the model balances capturing relevant instances against minimizing false positives.
True Positive Rate is another name for recall, and False Negative Rate is its complement (FNR = 1 - recall), so both describe the recall side; precision depends on false positives instead.

After training a model, the precision is high, but recall is low. Which adjustment should be made to improve recall?
Lowering the decision threshold marks more instances as positive, which increases the number of true positives and therefore recall, usually at some cost to precision.
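
The effect of the threshold can be seen on a handful of made-up scores. This is a minimal sketch: the scores, labels, and the two thresholds are hypothetical, chosen so that the stricter threshold reproduces the "high precision, low recall" situation described above.

```python
def precision_recall_at(threshold, scores, labels):
    """Classify as positive when score >= threshold, then compute both metrics."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical model scores and true labels (1 = positive)
scores = [0.95, 0.85, 0.70, 0.55, 0.45, 0.40, 0.30, 0.10]
labels = [1,    1,    0,    1,    1,    0,    0,    0]

strict = precision_recall_at(0.8, scores, labels)   # few, confident positives
relaxed = precision_recall_at(0.4, scores, labels)  # more instances marked positive

print(f"threshold 0.8: precision={strict[0]:.2f} recall={strict[1]:.2f}")
print(f"threshold 0.4: precision={relaxed[0]:.2f} recall={relaxed[1]:.2f}")
```

With the stricter threshold, every predicted positive is correct but half of the actual positives are missed. The lower threshold recovers all of them and lets two false positives through.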

MLflow Runs Directory

mlruns/ – MLflow runs directory in the MLflow example

Supporting Folders

mlruns/ (root) – MLflow runs directory
mlruns/ (example) – MLflow run directory used in specific example
mlartifacts/ – MLflow model artifacts directory