Sensitive Information Detection in MLOps: Why It Matters More Than Ever

Nov 9, 2023

In our journey as a startup, we’ve always been driven by the needs of our users and customers. Lately, we’ve seen growing demand for scanning .ipynb files.

Through our interactions, we’ve encountered numerous instances where sensitive information was mishandled in Jupyter notebooks. To the untrained eye, a commented-out API key or a hardcoded variable might seem harmless. In reality, either one poses a significant security risk. Often the danger is not only the direct use of that secret but also the insight it gives into the inner workings of an organization.

Over the lifecycle of a project, these notebooks, like many other files, pass through various hands. They move between departments, units, and teams, and sometimes even make their way to external consultants or third-party agencies.

Our mission at CodeThreat is to empower organizations with the tools they need to tackle such challenges head-on.

As AI and machine learning become deeply ingrained in everyday operations, it’s our collective responsibility to ensure that we’re progressing securely, maintaining the trust of stakeholders, and upholding the highest standards of cybersecurity.

Spotlight: Embedded Credentials in Notebooks

While developing, a data scientist might, for speed and ease, drop database credentials directly into a Jupyter notebook. Typical code might look something like this:

# Connect to database
database_URL = "jdbc:mysql://database/predictive_model"
username = "DataPro"
password = "PassXYZ"

When you commit changes to a repository, the CI/CD pipeline is triggered automatically. CodeThreat’s CI/CD integration then examines every .ipynb file in the commit and, when it detects hardcoded credentials such as those above, raises an alert to notify you of the security risk.
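To keep such values out of the notebook in the first place, the same connection settings can be read from the environment at run time. The following is a minimal sketch, not a prescribed setup: the variable names DB_URL, DB_USERNAME, and DB_PASSWORD and the helper load_db_settings are hypothetical and chosen only for illustration.

```python
import os


def load_db_settings(env=None):
    """Read database settings from environment variables.

    DB_URL, DB_USERNAME and DB_PASSWORD are hypothetical names chosen for
    this example; use whatever names your deployment defines.
    """
    env = os.environ if env is None else env
    required = ("DB_URL", "DB_USERNAME", "DB_PASSWORD")
    missing = [name for name in required if name not in env]
    if missing:
        # Fail loudly instead of falling back to a hardcoded default.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "database_url": env["DB_URL"],
        "username": env["DB_USERNAME"],
        "password": env["DB_PASSWORD"],
    }
```

In a notebook cell, calling load_db_settings() with no arguments reads from the process environment, so the committed .ipynb file contains variable names rather than the secrets themselves.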

💡 Pro Tip: Rather than directly embedding sensitive data, resort to environment variables or specialized secret management tools. This not only conceals the actual values but also ensures that the credentials remain dynamic, based on the environment in which the code is executed.

MLOps & Data Security: Why Should You Care?

Exponential AI Growth: AI’s expanding role enlarges the attack surface and, with it, the number of potential vulnerabilities.
The Price of Secrets: Hardcoded credentials can be goldmines for attackers. And the cost? Both reputational and financial.

Spotlight: Model Training Data Leakage

During the training of a machine learning model, a data scientist inadvertently included personally identifiable information (PII) in a public dataset. This dataset, used in a Jupyter notebook for model training, was pushed to a public repository during the model development phase. The exposure was not detected until an external security audit was conducted, leading to a potential GDPR violation and the risk of hefty fines.
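A lightweight check before a dataset is published can catch some of these cases early. The sketch below flags CSV columns that contain email-like values. It is only an illustration under simple assumptions: the function name columns_with_emails is hypothetical, the pattern is deliberately simple, and email addresses are just one kind of PII.

```python
import csv
import io
import re

# Deliberately simple pattern for email-like values; not a full PII detector.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def columns_with_emails(csv_text):
    """Return the sorted names of CSV columns that contain email-like values."""
    flagged = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            # Skip empty cells and the list DictReader uses for surplus fields.
            if isinstance(value, str) and EMAIL_PATTERN.search(value):
                flagged.add(column)
    return sorted(flagged)
```

A flagged column is a prompt for human review, not proof of a leak; names, phone numbers, and identifiers need their own checks.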

Spotlight: Exposed Cloud Storage Keys

In an effort to streamline access to cloud storage during the model training process, an API key for an Amazon S3 bucket was hardcoded into an .ipynb file. This notebook was then shared across teams via a version control system. An automated process that scrapes repositories for such keys found it, leading to unauthorized access and the potential compromise of sensitive company data stored in the cloud.
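To see why such keys are easy to harvest, consider how little code a basic notebook scan needs. An .ipynb file is JSON, so code cells can be read directly and matched against patterns. The sketch below is an illustration only and is not how CodeThreat's engine works: the function scan_notebook and the two rules are assumptions made for this example, and real scanners rely on far larger rule sets.

```python
import json
import re

# Illustrative patterns only; real scanners use far larger rule sets.
PATTERNS = {
    # Long-term AWS access key IDs start with "AKIA" followed by 16 characters.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # A string literal assigned to a name containing a credential-like word.
    "hardcoded_credential": re.compile(
        r"(?i)\b\w*(password|passwd|secret|token|api_key)\w*\s*=\s*[\"'][^\"']+[\"']"
    ),
}


def scan_notebook(notebook_json):
    """Return (cell_index, rule_name, line) for each suspicious line in code cells."""
    notebook = json.loads(notebook_json)
    findings = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format stores source as one string or a list of lines.
        text = "".join(source) if isinstance(source, list) else source
        for line in text.splitlines():
            for rule_name, pattern in PATTERNS.items():
                if pattern.search(line):
                    findings.append((index, rule_name, line.strip()))
    return findings
```

Because the scan works on raw source lines, a commented-out assignment is flagged just like a live one, which matches the point made earlier about commented-out API keys.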

CodeThreat and CI/CD Integrations

CodeThreat now works seamlessly with a variety of CI/CD platforms including Azure DevOps, Jenkins, and GitHub Actions. With these integrations, you can incorporate CodeThreat’s security scanning into your preferred development environment, ensuring consistent security practices across different systems.

For GitLab, see the plugin repository for details: https://github.com/CodeThreat/codethreat-gitlab-plugin

🔍 Instantaneous Scans: Every commit activates real-time security checks.
🚀 Empowering Developers: Real-time feedback refines coding practices.
⏱ Pace Over Everything: Robust security that doesn’t hamper development.

Start with Integration

Effortless Automation: Implement CodeThreat’s security scans within your pipeline by using GitLab’s include keyword, which pulls the necessary configuration from a remote YAML template:

include:
  - 'https://raw.githubusercontent.com/CodeThreat/codethreat-gitlab-plugin/main/templates/codethreat.gitlab-ci.yaml'

Custom Integration for Precise Control: For those who prefer a bespoke setup, manually copy the YAML configuration from CodeThreat’s plugin into your existing .gitlab-ci.yml to maintain granular control over your security protocols.

Configuration

Specialized CodeThreat Variables: Set the following CodeThreat-specific environment variables to tailor scanning to your needs:

CT_BASE_URL: Your CodeThreat instance URL.
CT_TOKEN: Your CodeThreat API token.
CT_ORGANIZATION: Your organization's name within CodeThreat.
FAILED_ARGS: Configure thresholds for security scan failures to suit your organization's risk appetite.

Example Configuration:

variables:
  FAILED_ARGS: '{
    "max_number_of_critical": 5,
    "max_number_of_high": 4,
    "weakness_is": "*.injection.*",
    "condition": "OR",
    "sync_scan": true
  }'
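Because FAILED_ARGS is a JSON document embedded in a YAML string, a stray quote or trailing comma is easy to introduce. A quick local check before committing can catch that. The sketch below only verifies that the value parses as a JSON object; it makes no assumption about how CodeThreat interprets the individual keys, and the helper parse_failed_args is a hypothetical name.

```python
import json

# The value of FAILED_ARGS as it appears in the example configuration.
FAILED_ARGS = """{
    "max_number_of_critical": 5,
    "max_number_of_high": 4,
    "weakness_is": "*.injection.*",
    "condition": "OR",
    "sync_scan": true
}"""


def parse_failed_args(raw):
    """Parse FAILED_ARGS and confirm it is a JSON object; raise ValueError otherwise."""
    parsed = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(parsed, dict):
        raise ValueError("FAILED_ARGS must be a JSON object")
    return parsed
```

Calling parse_failed_args(FAILED_ARGS) returns a dictionary for a well-formed value and raises ValueError for a malformed one, such as a value written with single-quoted keys.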

For a default setup without custom failure conditions:

variables:
  FAILED_ARGS: '{}'

Dynamic Scanning Options

Automated Scans (codethreat-sast-scan): CodeThreat scans automatically in response to codebase activity such as merge requests and branch updates.

Manual Scans (codethreat-sast-scan-manual): The manual scan option lets your team run a security assessment on demand.

Integration Outcomes

With CodeThreat integrated into your GitLab CI/CD pipeline:

Proactive Security: Scans start automatically in response to defined events, ensuring continuous security vigilance.
Insights and Oversight: Review scan results in the GitLab pipeline logs, or open CodeThreat’s own interface for in-depth analysis.