How to Set up a TensorFlow Workspace on a Vultr Cloud GPU Instance

Author: Mayank Debnath

Last Updated: Fri, Dec 9, 2022
Machine Learning Vultr Cloud GPU Vultr Marketplace

Introduction

TensorFlow is a popular open-source machine learning platform that helps users implement deep learning and machine learning models to solve common business problems. It offers an ecosystem for developers and enterprises to build scalable machine learning applications. TensorFlow represents a computation, such as training a neural network, as a dataflow graph in which the nodes represent operations and the edges between them carry multi-dimensional arrays (tensors).

This article demonstrates the steps to deploy a temporary or a persistent TensorFlow workspace using the official Docker image and the NVIDIA Docker Toolkit.

Prerequisites

Before you begin, you should:

  • Deploy a Vultr Cloud GPU instance with the NVIDIA NGC Marketplace Application.

  • Point a subdomain to your server using an A record. This article uses tensorflow.example.com for demonstration.

Verify the GPU Availability

The Vultr Cloud GPU servers feature NVIDIA GPUs for machine learning, artificial intelligence, and so on. They come with licensed NVIDIA drivers and the CUDA Toolkit, which are essential for the proper functioning of the GPUs. This section demonstrates the steps to verify the GPU availability on the server and inside a container.

Execute the nvidia-smi command on the server.

# nvidia-smi

The above command outputs the information about the connected GPU. It includes information such as the driver version, CUDA version, GPU model, available memory, GPU usage, and so on.
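The same information can also be collected from a script. The following is a minimal Python sketch, assuming the nvidia-smi binary is on the PATH. The helper names parse_gpu_csv and query_gpus are illustrative and not part of any NVIDIA tooling. It asks nvidia-smi for CSV output and turns each line into a dictionary.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each CSV line.
FIELDS = ["name", "memory.total", "driver_version"]

def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader` output into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) == len(FIELDS):
            gpus.append(dict(zip(FIELDS, values)))
    return gpus

def query_gpus():
    """Run nvidia-smi and return one dict per GPU, or [] if it is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)

if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

An empty result means that either nvidia-smi is missing or it reported no GPUs, which is worth resolving before continuing.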

Execute the nvidia-smi command inside a container.

# docker run --rm --gpus all nvidia/cuda:10.2-base nvidia-smi

The above command uses the official nvidia/cuda image to verify GPU access inside a container. The NVIDIA Docker Toolkit lets containers use the GPU through the --gpus option. The --rm option removes the container once it exits. If the 10.2-base tag is no longer available on Docker Hub, substitute a current tag from the nvidia/cuda repository.

Deploy a Temporary Workspace

The Vultr Cloud GPU servers offer access to high-end GPUs that you can leverage for training your machine learning models, saving a lot of time without paying the upfront cost of the hardware. This section explains the steps to deploy a temporary TensorFlow workspace on a Vultr Cloud GPU server.

Disable the firewall.

# ufw disable

The above command disables the firewall to allow inbound connections on all ports.

Deploy a new Docker container.

# docker run -p 8888:8888 --gpus all -it --rm -v /root/notebooks:/tf/notebooks tensorflow/tensorflow:latest-gpu-jupyter

The above command uses the official tensorflow/tensorflow image with the latest-gpu-jupyter tag that contains the GPU-accelerated TensorFlow environment and the Jupyter notebook server. Copy the token from the output of this command to access the Jupyter notebook interface.

The following is the explanation for each parameter used in the above command.

  • -p 8888:8888: Publish container port 8888 on host port 8888.

  • --gpus all: Give the container access to all GPUs on the host.

  • -it: Run an interactive session with a terminal attached, so that CTRL + C stops the container.

  • --rm: Remove the container when stopped.

  • -v /root/notebooks:/tf/notebooks: Store all the notebooks in the /root/notebooks directory.
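To see how the parameters fit together, the following Python sketch assembles the same command from its parts. The function name build_run_command and its keyword arguments are hypothetical and exist only for illustration. It prints the command rather than running it.

```python
import shlex

def build_run_command(host_port=8888, notebooks_dir="/root/notebooks",
                      image="tensorflow/tensorflow:latest-gpu-jupyter"):
    """Assemble the docker run command from this section as an argument list."""
    return [
        "docker", "run",
        "-p", f"{host_port}:8888",               # host port -> container port 8888
        "--gpus", "all",                         # expose every host GPU
        "-it",                                   # interactive terminal
        "--rm",                                  # delete the container on exit
        "-v", f"{notebooks_dir}:/tf/notebooks",  # persist notebooks on the host
        image,
    ]

print(shlex.join(build_run_command()))
```

Passing a different host_port, for example 9000, would publish Jupyter on that host port while the container still listens on 8888.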

Verify the GPU availability using the TensorFlow module.

  1. Open http://PUBLIC_IP:8888 in your web browser, replacing PUBLIC_IP with the public IP address of your server, and use the copied token to log in to the interface.

  2. In the Jupyter Notebook interface, navigate to the directory where you want to create your new notebook.

  3. Click the "New" button in the top right corner of the interface, and select the "Python 3" option from the dropdown menu. This will create a new Python 3 notebook in the selected directory.

  4. Give your notebook a name by clicking the "Untitled" title at the top of the page and typing in a new name.

  5. To add a new cell, click the "Insert" menu at the top of the interface and select the "Insert Cell Below" option. You can then enter your code into the cell and run it by clicking the "Run" button in the toolbar at the top of the page or by pressing SHIFT + ENTER.

Run the following code in a new cell.

import tensorflow as tf

tf.config.list_physical_devices()

Output.

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),

 PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

The output confirms that the GPU is available in the TensorFlow module. You can stop the workspace by returning to the terminal where the container is running and pressing CTRL + C. Stopping the container does not delete the notebooks. You can find them in the /root/notebooks directory.
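Listing the devices shows that TensorFlow can see the GPU. As a further check, you can run a small computation on it. The following is a minimal sketch for a notebook cell. The helper pick_device is hypothetical, and the guarded import only exists so the helper can be read or tested on a machine without TensorFlow.

```python
def pick_device(device_types):
    """Return a TensorFlow device string, preferring the first GPU."""
    return "/GPU:0" if "GPU" in device_types else "/CPU:0"

try:
    import tensorflow as tf
except ImportError:  # TensorFlow is not installed outside the workspace
    tf = None

if tf is not None:
    types = [d.device_type for d in tf.config.list_physical_devices()]
    with tf.device(pick_device(types)):
        a = tf.random.uniform((1024, 1024))
        b = tf.random.uniform((1024, 1024))
        c = tf.matmul(a, b)  # runs on the selected device
    print(c.device)
```

If the printed device string ends with a GPU entry, the matrix multiplication ran on the GPU.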

Deploy a Persistent Workspace

Deploying the TensorFlow Workspace on a Vultr Cloud GPU server provides more than just access to high-end GPUs. The Jupyter notebook interface allows you to work with others on a machine-learning project, offering more flexibility and scalability than a local setup. It also allows you to access and manage your machine learning resources from anywhere with an internet connection. This section demonstrates the steps to deploy a persistent TensorFlow workspace on a Vultr Cloud GPU server using Docker Compose.

Create and enter a new directory named tensorflow-workspace.

# mkdir ~/tensorflow-workspace

# cd ~/tensorflow-workspace

The above commands create and enter a new directory named tensorflow-workspace in the /root directory. You use this directory to store all the configuration files related to the TensorFlow Workspace, such as Nginx configuration, SSL certificate, and so on.

Create a new file named docker-compose.yaml.

# nano docker-compose.yaml

Add the following contents to the file.

services:



  jupyter:

    image: tensorflow/tensorflow:latest-gpu-jupyter

    restart: unless-stopped

    volumes:

      - "/root/notebooks:/tf/notebooks"

    deploy:

      resources:

        reservations:

          devices:

            - capabilities: [gpu]



  nginx:

    image: nginx

    restart: unless-stopped

    ports:

      - 80:80

      - 443:443

    volumes:

      - ./nginx/nginx.conf:/etc/nginx/nginx.conf

      - ./nginx/dhparam.pem:/etc/ssl/certs/dhparam.pem

      - ./certbot/conf:/etc/letsencrypt

      - ./certbot/www:/var/www/certbot





  certbot:

    image: certbot/certbot

    container_name: certbot

    volumes:

      - ./certbot/conf:/etc/letsencrypt

      - ./certbot/www:/var/www/certbot

    command: certonly --webroot -w /var/www/certbot --force-renewal --email YOUR_EMAIL -d tensorflow.example.com --agree-tos

The above configuration defines three services. The jupyter service runs the container that contains the GPU-accelerated TensorFlow workspace, and it uses the volumes attribute to store all the notebooks in the /root/notebooks directory. The nginx service runs a container using the official Nginx image that acts as a reverse proxy server between clients and the jupyter service. The certbot service runs a container using the official Certbot image that issues a Let's Encrypt SSL certificate for the specified domain name. Replace YOUR_EMAIL with your email address and tensorflow.example.com with your own subdomain.

Save the file and close the file editor using CTRL + X, then Y, then ENTER.

Create a new directory named nginx.

# mkdir nginx

Create a new file named nginx.conf.

# nano nginx/nginx.conf

Add the following contents to the file.

events {}



http {

    server_tokens off;

    charset utf-8;



    server {

        listen 80 default_server;

        server_name _;



        location ~ /.well-known/acme-challenge/ {

            root /var/www/certbot;

        }

    }

}

The above configuration instructs the Nginx server to serve the ACME challenge generated by Certbot. You must perform this step for the Certbot container to verify the ownership of the domain name and issue an SSL certificate for it. You swap this configuration in the later steps to set up the reverse proxy server.
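The location block works because the nginx and certbot services both mount ./certbot/www at /var/www/certbot, so a file written by Certbot is immediately visible to Nginx. The following Python sketch shows which file Nginx looks up for a given challenge. The function name and the EXAMPLE_TOKEN value are illustrative only.

```python
import posixpath

def challenge_file_path(token, root="/var/www/certbot"):
    """Return the file Nginx serves for an ACME HTTP-01 challenge token."""
    uri = "/.well-known/acme-challenge/" + token
    # With the `root` directive, Nginx appends the request URI to the root path.
    return posixpath.join(root, uri.lstrip("/"))

print(challenge_file_path("EXAMPLE_TOKEN"))
```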

Save the file and close the file editor using CTRL + X, then Y, then ENTER.

Create a new file named dhparam.pem using the openssl command.

# openssl dhparam -dsaparam -out nginx/dhparam.pem 4096

The above command generates Diffie-Hellman parameters. Nginx uses them for Diffie-Hellman (DHE) key exchange, which lets the server and the client agree on a session key without sending it over the network. Using your own parameters rather than widely shared defaults makes precomputation attacks against the key exchange harder. The -dsaparam option makes the generation much faster.

Before starting the services, confirm that the A record for your subdomain points to the server's IP address. Certbot cannot validate the domain name otherwise.
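A short pre-flight check can catch the most common problems before the services start. Docker typically creates a missing bind-mount source as an empty directory, which then prevents Nginx from reading its configuration, so the files are worth checking too. The following is a minimal Python sketch. The function names are hypothetical, and the lookup is IPv4 only.

```python
import os
import socket

# Files that the docker-compose.yaml in this article expects to exist.
REQUIRED_FILES = ["docker-compose.yaml", "nginx/nginx.conf", "nginx/dhparam.pem"]

def missing_files(base_dir, required=REQUIRED_FILES):
    """Return the required files that do not yet exist under base_dir."""
    return [p for p in required if not os.path.isfile(os.path.join(base_dir, p))]

def resolves_to(hostname, expected_ip, resolver=socket.gethostbyname):
    """Check whether hostname resolves to expected_ip (IPv4 only)."""
    try:
        return resolver(hostname) == expected_ip
    except OSError:  # includes socket.gaierror for failed lookups
        return False

def preflight(base_dir, hostname, expected_ip, resolver=socket.gethostbyname):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = [f"missing file: {p}" for p in missing_files(base_dir)]
    if not resolves_to(hostname, expected_ip, resolver):
        problems.append(f"{hostname} does not resolve to {expected_ip}")
    return problems
```

For example, preflight(os.path.expanduser("~/tensorflow-workspace"), "tensorflow.example.com", "192.0.2.10") returns an empty list when everything is in place, where 192.0.2.10 is a placeholder for your server's public IP address.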

Deploy the Docker Compose services.

# docker-compose up -d

The above command starts the services defined in the docker-compose.yaml file in detached mode. This means that the services will start in the background, and you can use your terminal for other commands.

Verify the SSL issuance.

# ls certbot/conf/live/tensorflow.example.com/

The above command lists the contents of the directory that Certbot creates for your domain name. The output should contain the fullchain.pem and the privkey.pem files. It may take up to five minutes to generate the SSL certificate. If the files do not appear after that, troubleshoot by viewing the logs using the docker-compose logs certbot command.

Stop the nginx container.

# docker-compose stop nginx

The above command stops the nginx container so you can swap the Nginx configuration in the next steps.

Swap the Nginx configuration.

# rm -f nginx/nginx.conf

# nano nginx/nginx.conf

Add the following contents to the file.

events {}



http {

    server_tokens off;

    charset utf-8;



    map $http_upgrade $connection_upgrade {

        default upgrade;

        '' close;

    }



    server {

        listen 80 default_server;

        server_name _;



        return 301 https://$host$request_uri;

    }



    server {

        listen 443 ssl http2;



        server_name tensorflow.example.com;



        ssl_certificate     /etc/letsencrypt/live/tensorflow.example.com/fullchain.pem;

        ssl_certificate_key /etc/letsencrypt/live/tensorflow.example.com/privkey.pem;



        ssl_protocols TLSv1.2 TLSv1.3;

        ssl_prefer_server_ciphers on;

        ssl_dhparam /etc/ssl/certs/dhparam.pem;

        ssl_ciphers 'ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:CAMELLIA:DES-CBC3-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA';

        ssl_session_timeout 1d;

        ssl_session_cache shared:SSL:50m;

        ssl_stapling on;

        ssl_stapling_verify on;

        add_header Strict-Transport-Security max-age=15768000;



        location / {

            proxy_pass http://jupyter:8888;

            proxy_set_header X-Real-IP $remote_addr;

            proxy_set_header Host $host;

            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;



            proxy_http_version 1.1;

            proxy_set_header Upgrade $http_upgrade;

            proxy_set_header Connection $connection_upgrade;

            proxy_set_header X-Scheme $scheme;



            proxy_buffering off;

        }



        location ~ /.well-known/acme-challenge/ {

            root /var/www/certbot;

        }

    }

}

The above configuration uses the SSL certificate generated by Certbot and additional SSL parameters to increase the security of the workspace. It configures a reverse proxy server that channels the incoming traffic to the jupyter container on port 8888. It also defines a location block to serve ACME challenge files for SSL renewals using Cron.
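The map block in this configuration matters because Jupyter uses WebSocket connections to communicate with its kernels. The following Python sketch mirrors the logic of that map for illustration. It is not code that Nginx runs.

```python
def connection_upgrade(http_upgrade):
    """Mirror the Nginx map: '' -> 'close', anything else -> 'upgrade'."""
    return "close" if http_upgrade == "" else "upgrade"

# A WebSocket handshake sends "Upgrade: websocket"; a plain request sends no
# Upgrade header, which Nginx exposes as an empty $http_upgrade variable.
for header in ("websocket", ""):
    print(repr(header), "->", connection_upgrade(header))
```

The result is passed upstream as the Connection header, so WebSocket handshakes are forwarded to Jupyter while ordinary requests close normally.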

Save the file and close the file editor using CTRL + X, then Y, then ENTER.

Start the nginx container.

# docker-compose start nginx

The above command starts the nginx container that uses the new configuration. You can confirm the deployment of the workspace by opening https://tensorflow.example.com in your web browser.

Fetch the token from the Docker logs.

# docker-compose logs jupyter

The above command outputs the logs generated by the jupyter container. It contains the token to access the Jupyter notebook interface.
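If the log output is long, a few lines of Python can pull the token out of it. This is a minimal sketch that assumes the token appears as a hexadecimal token= query parameter in a URL. The sample log line is hypothetical, and real output will differ.

```python
import re

# Matches "?token=<hex>" or "&token=<hex>" inside a URL.
TOKEN_PATTERN = re.compile(r"[?&]token=([0-9a-f]+)")

def extract_token(log_text):
    """Return the last token found in the log text, or None."""
    matches = TOKEN_PATTERN.findall(log_text)
    return matches[-1] if matches else None

# Hypothetical log excerpt for demonstration only.
sample = "jupyter_1  |     http://127.0.0.1:8888/?token=0123abcd4567ef89\n"
print(extract_token(sample))
```

The function returns the last match because the container prints a new token each time it is recreated, and the most recent one is the one that works.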

You can also set up a password for accessing the Jupyter notebook interface by following the steps given below.

  1. Open the Jupyter notebook interface on your web browser using https://tensorflow.example.com.

  2. Scroll down to the "Setup a Password" section.

  3. Enter the token fetched from the Docker Compose logs.

  4. Enter the password you want to use for protecting the interface. Ensure you use a strong password to protect your environment from brute-force attacks.

  5. Click the "Log in and set new password" button.

Append an entry to the Cron table.

# crontab -e

The above command opens the Cron table editor. Cron is the job scheduler on Linux systems that runs the specified commands at scheduled times. Refer to How to Use the Cron Task Scheduler to learn more.

Add the following lines to the table.

0 5 1 */2 *  /usr/local/bin/docker-compose -f /root/tensorflow-workspace/docker-compose.yaml start certbot

5 5 1 */2 *  /usr/local/bin/docker-compose -f /root/tensorflow-workspace/docker-compose.yaml restart nginx

The above statements define two tasks that run on the first day of every second month. The first task starts the certbot container at 05:00 to regenerate the SSL certificate, and the second restarts the nginx container at 05:05 so that it loads the latest SSL certificate.

If the editor is Vim, save and exit using ESC, then :wq and ENTER. If it is nano, use CTRL + X, then Y, then ENTER.
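To confirm that the scheduled renewal keeps up with the certificate lifetime, you can check how many days remain on the certificate that the server presents. The following is a minimal Python sketch using only the standard library. The function names are illustrative, and fetch_not_after needs network access to the workspace.

```python
import socket
import ssl
import time

def days_until_expiry(not_after, now=None):
    """Days between now and a certificate 'notAfter' string such as
    'Jan  5 09:34:43 2018 GMT'. Negative values mean it has expired."""
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400

def fetch_not_after(hostname, port=443, timeout=10):
    """Connect over TLS and return the server certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]
```

Calling days_until_expiry(fetch_not_after("tensorflow.example.com")) after a renewal run should return a value close to the full certificate lifetime. A value that keeps shrinking across renewal dates suggests that the Cron tasks are not running.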

Enable the firewall.

# ufw allow 22,80,443/tcp

# ufw enable

The above commands allow incoming connections on port 80 for HTTP traffic, 443 for HTTPS traffic, and 22 for SSH connections, and then enable the firewall.
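You can confirm the firewall rules from another machine by testing which ports accept TCP connections. The following is a minimal Python sketch. The function name is_port_open is hypothetical.

```python
import socket

def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or unreachable
        return False
```

With the persistent workspace running, is_port_open("tensorflow.example.com", 443) should return True, while port 8888 should report closed because the jupyter service is only reachable through the reverse proxy.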

Conclusion

This article demonstrated the steps to deploy a temporary or a persistent TensorFlow workspace using the official Docker image and the NVIDIA Docker Toolkit. You can deploy a temporary workspace to leverage the high-end hardware offered by Vultr to perform resource-hungry tasks like training a model or performing visualizations. You can also deploy a persistent workspace for an efficient remote development environment, as the Jupyter notebook interface allows you to work with others on a machine-learning project.
