As a continuation of the MATLAB diagnostic test and the conclusion to the trilogy of virion diagnostics using nanoparticles, my partner and I aimed to upgrade the virion quantification system with artificial intelligence, adding a feature that predicts prognosis, severity, and infectability based on factors such as age and weight, together with the virion count from our original project. The model we developed analyzes an infant's health and biological parameters to predict infection severity and other useful information that can be tracked over time. I trained the model on over 1 million samples generated from combinations of weight, age, and virion count, computing a severity index as the output. To ensure robustness, I introduced a dataset with added variance to simulate real-world noise. A unique aspect of my approach was incorporating domain-specific biological knowledge, treating weight as the dominant health indicator. Using Bayesian Ridge Regression, I captured the non-linear relationships among these factors. After validating on 20% of the dataset, the final model achieved an average R² score of 0.72 and a mean-squared error (MSE) of 0.122, demonstrating strong predictive capability.
Programming Languages: I implemented the model using Python, with Pandas for data handling and NumPy for efficient numerical computations.
Model Implementation: The Bayesian Ridge Regression model was developed using Scikit-learn, allowing for hyperparameter tuning of key parameters (α and λ) to ensure optimal model fitting for the complex relationships between the input variables.
Severity Index Calculation: Using domain-specific biological insights, the severity index formula emphasized weight deviations and age-related immunity differences. The formula incorporated coefficients that accounted for the squared impact of weight and a linear relationship for age, with virion count serving as a multiplying factor.
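The structure described above can be sketched as a small function. The reference weight `w_ref` and the coefficients `c_w`, `c_a`, and `scale` are illustrative placeholders, not the project's actual tuned values:

```python
import numpy as np

def severity_index(weight_kg, age_months, virion_count,
                   w_ref=7.0, c_w=0.5, c_a=0.1, scale=1e-4):
    """Hypothetical severity index: a squared weight-deviation term plus a
    linear age term, scaled by virion count (coefficients are illustrative)."""
    weight_term = c_w * (weight_kg - w_ref) ** 2   # squared impact of weight
    age_term = c_a * age_months                    # linear age/immunity term
    return (weight_term + age_term) * (scale * virion_count)

# Example: a 5 kg, 3-month-old infant with 50,000 virions
s = severity_index(5.0, 3, 50_000)  # (0.5*4 + 0.3) * 5 = 11.5
```

Because the weight term is squared and the virion count multiplies the whole expression, the index is non-linear in the raw inputs even though each engineered term enters linearly.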
Dataset Generation: I generated a dataset of over 1 million samples, each comprising three key input variables: weight, age, and virion count. The severity index was calculated based on a combination of these factors. I introduced a small 0.01% variance to simulate real-world noise in the data, making the model more resilient to discrepancies typically encountered in practical healthcare scenarios.
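A hedged sketch of the generation step, assuming uniform sampling ranges and a small multiplicative Gaussian noise term as stand-ins for the project's actual distributions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000_000

# Sample plausible infant parameters (ranges are illustrative assumptions)
weight = rng.uniform(2.0, 12.0, n)    # kg
age = rng.uniform(0.0, 24.0, n)       # months
virions = rng.uniform(0.0, 1e5, n)    # virion count

# Severity index: squared weight deviation + linear age, scaled by virions
severity = (0.5 * (weight - 7.0) ** 2 + 0.1 * age) * (1e-4 * virions)

# Small multiplicative noise to mimic real-world measurement variance
severity_noisy = severity * (1 + rng.normal(0.0, 1e-4, n))

df = pd.DataFrame({"weight": weight, "age": age,
                   "virions": virions, "severity": severity_noisy})
```

Vectorized NumPy operations make generating the full million-row table take only a few hundred milliseconds.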
Statistical Validation: I used mean-squared error (MSE) and coefficient of determination (R²) as key metrics to assess the model's performance. The model was tested across multiple iterations, producing consistent and reliable results with an average MSE of 0.122 and an R² of 0.72.
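The validation workflow can be reproduced in miniature with scikit-learn's train/test utilities; the synthetic data and engineered features here are illustrative, not the project's dataset:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(1)
n = 20_000  # smaller sample for illustration

weight = rng.uniform(2.0, 12.0, n)
age = rng.uniform(0.0, 24.0, n)
virions = rng.uniform(0.0, 1e5, n)
y = (0.5 * (weight - 7.0) ** 2 + 0.1 * age) * (1e-4 * virions)

# Engineered features let the linear model express the non-linear terms
X = np.column_stack([weight, age, virions,
                     (weight - 7.0) ** 2 * virions, age * virions])

# Hold out 20% of the data for validation, as in the project
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)
model = BayesianRidge().fit(X_tr, y_tr)
pred = model.predict(X_te)

mse = mean_squared_error(y_te, pred)
r2 = r2_score(y_te, pred)
```

On this noise-free synthetic target R² is near 1; the project's reported 0.72/0.122 reflects the noisier, more realistic data it was trained on.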
Matplotlib and Seaborn: These libraries were used to generate visual representations of the model's performance, including scatter plots, 3D visualizations, and error analysis. These visualizations helped identify key trends and assess prediction accuracy across different weight and age ranges.
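A minimal example of one such plot, a predicted-vs-actual scatter, using synthetic values in place of the model's real outputs:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the figure saves to file
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
actual = rng.uniform(0, 10, 500)
predicted = actual + rng.normal(0, 0.5, 500)  # illustrative predictions

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(actual, predicted, s=5, alpha=0.4)
ax.plot([0, 10], [0, 10], "r--", label="perfect prediction")
ax.set_xlabel("Actual severity index")
ax.set_ylabel("Predicted severity index")
ax.legend()
fig.savefig("severity_pred_vs_actual.png", dpi=150)
```

Points clustered tightly around the diagonal indicate accurate predictions; systematic departures in particular weight or age ranges show where the model struggles.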
Scikit-learn: This library was crucial for implementing the Bayesian Ridge Regression model, enabling easy model training, testing, and hyperparameter tuning. The model's parameters were adjusted to fit the dataset's non-linear relationships between weight, age, and virion count.
Pandas: Used for data manipulation and preprocessing. The dataset of over 1 million samples was loaded, cleaned, and structured using Pandas, facilitating easy integration with the regression model.
NumPy: Utilized for efficient computation of the severity index and statistical metrics. Its array-based operations enabled rapid data manipulation and mathematical transformations on large datasets.
Bayesian Ridge Regression: I chose Bayesian Ridge Regression because it is well suited to modeling non-linear relationships like those between weight, age, and virion count in this project. By using priors and regularization, I avoided overfitting while remaining flexible to new data.
Hyperparameter Tuning: I used a Gamma distribution to model the priors for the hyperparameters α (alpha) and λ (lambda), ensuring optimal regularization and precision in the final model. This was critical for achieving a balance between bias and variance, enabling the model to generalize well on unseen data.
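In scikit-learn's `BayesianRidge`, the Gamma priors over the noise precision α and weight precision λ are controlled by the shape/rate hyperparameters `alpha_1`, `alpha_2`, `lambda_1`, and `lambda_2`. A small sketch on synthetic data (the values shown are scikit-learn's defaults, not the project's tuned ones):

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 1000)

# alpha_1/alpha_2 parameterize the Gamma prior on the noise precision alpha;
# lambda_1/lambda_2 parameterize the Gamma prior on the weight precision lambda
model = BayesianRidge(alpha_1=1e-6, alpha_2=1e-6,
                      lambda_1=1e-6, lambda_2=1e-6)
model.fit(X, y)

# The fitted precisions are exposed as model.alpha_ and model.lambda_
coefs = model.coef_
```

Because α and λ are estimated from the data during fitting, the regularization strength adapts automatically, which is what provides the bias-variance balance described above.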