Virtual Environments

data democratization
venv
renv
reproducibility
How to configure R and Python package dependencies for this repo for reproducibility
Author

Frank Aragona

Published

February 1, 2023

Modified

December 10, 2024

Introduction


Virtual environments allow us to execute code while accounting for software/package version differences we have on our local machines. This repo uses virtual environments to configure a user’s R and Python software and packages to the repo’s specific package versions.

For example, say you have dplyr version 2.0 installed but this repo uses dplyr version 1.1. You may not be able to run the scripts as the author intended, because functions in dplyr 1.1 may behave differently from those in 2.0. The virtual environment lets each user run this repo with the repo’s version of dplyr, so the code runs as intended.

Important

There are two different virtual environments for this repo, one for R and another for Python. Your workflows for opening R and Python, and for installing packages, need to use the virtual environments so that your code runs on every machine.

Python - Conda Environment Setup

This repo has a file named environment.yml. It contains a list of all packages and versions used for Python software. This file can be used as a set of instructions for your local machine when configuring your local environment.
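
If you’re curious what those “instructions” look like, here is a minimal Python sketch that reads the text of an environment.yml and lists the pinned versions. The sample file contents, the package versions in it, and the helper name list_pins are made up for illustration, and the parser only understands simple "- name=version" lines; a real YAML parser would be more robust.

```python
def list_pins(yml_text):
    """Return {package: version} for simple '- name=version[=build]' lines."""
    pins = {}
    in_deps = False
    for raw in yml_text.splitlines():
        line = raw.strip()
        if line == "dependencies:":
            in_deps = True
            continue
        # A new top-level key (no indent, not a list item) ends the block.
        if in_deps and raw and not raw[0].isspace() and not line.startswith("-"):
            in_deps = False
        if not in_deps or not line.startswith("- "):
            continue
        spec = line[2:].strip()
        if spec.endswith(":"):  # nested block such as '- pip:'
            continue
        # conda pins use '=', pip pins use '=='
        sep = "==" if "==" in spec else "="
        name, _, rest = spec.partition(sep)
        pins[name] = rest.split("=")[0] if rest else None
    return pins


# Hypothetical file contents, only for demonstration.
SAMPLE = """\
name: seq_env
channels:
  - conda-forge
dependencies:
  - python=3.10.9=h966fe2a_2
  - pandas=1.5.3
  - pip:
    - requests==2.28.2
prefix: C:/Users/XXXXX/Anaconda3/envs/seq_env
"""

for package, version in list_pins(SAMPLE).items():
    print(package, version)
```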

Step 1: Set Up Anaconda

You should already have Anaconda installed on your machine. If you don’t, install it before continuing.

(Installation instructions are under construction.)

Step 2: Open Anaconda Prompt

You may have different Anaconda prompts (prompts aligned with different shells, like PowerShell, bash, etc). There should be a generic Anaconda prompt. Open that one:


If the first line in the prompt doesn’t start with (base), write:

conda deactivate

and it will bring you back to your base environment.

Step 3: Change Directories

Change the directory of the prompt to the repo’s directory. The command is:

cd C:/Users/XXXXXXX/Projects/Sequencing_2.0

If you are already in your user directory, you can just type

cd projects/sequencing_2.0

Capitalization doesn’t matter on Windows.


Notice that the folder path is now changed to the sequencing repo folder.

Step 4: Copy the repo env

Now we’re ready to create a new environment based on the repo’s environment.

Type: conda env create --name seq_env --file=environment.yml

How you name your environment doesn’t really matter, but pick a name that resembles the repo. This will save you the headache of having random environments that you can’t match to a repo later.


  • conda env create creates a new environment under C:/Users/XXXXX/Anaconda3/envs
  • --name or -n names that environment, in this case seq_env
  • --file=environment.yml tells conda to build the environment from the environment.yml file in the Sequencing 2.0 repo. The result is essentially a copy of the software versions listed in that file.

Note: The whole process may take a few minutes.
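
Once the command finishes, you may want to confirm that the new environment exists. The sketch below is one way to do that from Python, assuming conda is on your PATH (as it is in an Anaconda prompt). It relies on conda env list --json, which prints the environment folders as JSON; the helper names and the sample paths are illustrative.

```python
import json
import shutil
import subprocess


def env_names_from_json(text):
    """Extract the folder name of each environment from the JSON output."""
    paths = json.loads(text).get("envs", [])
    # Handle both Windows and POSIX separators; the base env shows up
    # under its install folder name (for example "Anaconda3").
    return [p.replace("\\", "/").rstrip("/").split("/")[-1] for p in paths]


def list_conda_envs():
    """Run `conda env list --json` and return the environment names."""
    conda = shutil.which("conda")
    if conda is None:
        raise RuntimeError("conda not found; run this from an Anaconda prompt")
    result = subprocess.run(
        [conda, "env", "list", "--json"],
        capture_output=True, text=True, check=True,
    )
    return env_names_from_json(result.stdout)


# Hypothetical output, so the parsing can be tried without conda installed.
SAMPLE = json.dumps({"envs": [
    r"C:\Users\XXXXX\Anaconda3",
    r"C:\Users\XXXXX\Anaconda3\envs\seq_env",
]})
print("seq_env" in env_names_from_json(SAMPLE))
```

On a machine with conda available, "seq_env" in list_conda_envs() performs the real check.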

Step 5: Activate the environment

You can switch between environments in the conda prompt or in a programming IDE. To activate an environment and switch to it, write:

conda activate <env_name>

in this case

conda activate seq_env


Note

The environment you’re in will show on the left of the prompt message. In this case it says (seq_env) instead of (base). That way you know what env you’re working in.
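
You can make the same check from inside Python. This small sketch assumes the environment was activated with conda activate, which sets the CONDA_DEFAULT_ENV environment variable; the helper name active_conda_env is just for illustration.

```python
import os
import sys


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None."""
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV")


# Prints None if no conda environment is active in this session.
print("active conda env:", active_conda_env())
# sys.prefix is the folder of the Python installation that is running.
print("python prefix:", sys.prefix)
```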

Python - Programming Setup

Now that your virtual conda environment is set up, let’s set up your programming IDE to be able to recognize your environment.

IDE Setup - VS Code

VS Code has good built-in support for conda environments. Since your env is already activated, if you have VS Code installed, you can type

code

into the Anaconda prompt and it will open a VS Code window.

Step 1: Select a Python Interpreter

First we need to select the Python interpreter that lives in our env. On your keyboard, press

CTRL+SHIFT+P

This should bring up the command palette with an option that says Python: Select Interpreter. You may need to search for it.

Click it and you should see your new environment seq_env in the list. Select it.

Step 2: Write code

Now your VS Code is using your environment and the python version/packages in that environment. Check to see that your terminal is using the correct env.

Open the terminal (Terminal > New Terminal) and confirm that it is running a cmd prompt. On the right side of the terminal it should say cmd. If it says powershell or something else, let’s change it: there’s a dropdown there that lists the shell types. Change your default to Command Prompt.


Your terminal’s environment should also switch to seq_env. You should now be able to run code in a Python script. Notice that your terminal will change to run Python. If you get an error, write python in the terminal and hit Enter; this starts a Python session in the terminal. Now you can run Python code and it will output to this terminal.
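
A quick way to confirm the interpreter selection worked is to run a short script and look at which python.exe is executing it. This is a sketch; the helper name interpreter_in_env is hypothetical and seq_env is the environment name used in this guide.

```python
import sys


def interpreter_in_env(executable, env_name):
    """True if the interpreter path contains a folder named env_name."""
    parts = executable.replace("\\", "/").lower().split("/")
    # Ignore the last part, which is the executable file itself.
    return env_name.lower() in parts[:-1]


print("running:", sys.executable)
print("inside seq_env:", interpreter_in_env(sys.executable, "seq_env"))
```

If the second line prints False, VS Code is probably still pointing at a different interpreter.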

IDE Setup - PyCharm

PyCharm also works great with a conda environment.

Step 1: Select a Python Interpreter

You may also be able to open a PyCharm window from an Anaconda prompt like with VS Code (if it’s installed in your env). To do so, write pycharm in the prompt and it should open a new window with the env activated.

If that doesn’t work, open PyCharm and on the bottom right there is a python version and interpreter selected. Click it and open “Add New Interpreter” > “Add local interpreter”. This opens a new window. Click “Conda Environment” and under “Interpreter” click the dropdown. You should be able to see your new environment there. If not, click away and click the dropdown again. It’s weird sometimes.

Then click OK. Close and reopen the Python Console window and it should show your environment’s path to python.exe. Also, the Python Libraries window should now list all of the libraries in your environment.


Notice that there is now a list of interpreters for you to use. You can switch back and forth between environments. This is great if you have other repos to use or want to test out new packages that aren’t in the main environment.
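
To cross-check what the IDE shows, you can ask the selected interpreter which packages it can see. A minimal sketch using only the standard library (Python 3.8+); the helper name installed_packages is illustrative.

```python
from importlib import metadata


def installed_packages():
    """Return {distribution name: version} for the running interpreter."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installs that have no name recorded
            packages[name] = dist.version
    return packages


for name, version in sorted(installed_packages().items(),
                            key=lambda item: item[0].lower()):
    print(f"{name}=={version}")
```

Run it once per interpreter and compare the output if two environments seem to behave differently.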

Python - Installing New Packages

Let’s say you want to add a new Python package to the repo. I recommend installing it from an Anaconda prompt and then overwriting the repo’s environment.yml with the updated environment. Then you can push the new environment.yml to GitHub. Use these steps:

Step 1: Install a new package

Go to the Anaconda prompt, make sure you’re in the repo file path (cd projects/sequencing_2.0) and make sure you’re in the right conda env (conda activate seq_env).

Now, install the package. Packages can usually be installed with pip install <package>, conda install <package>, or conda install -c conda-forge <package>. Which one you need depends on the package: some need pip, others need conda. Check the package’s documentation to find out. Here I’m going to install a package from NCBI to demonstrate. The package is called ncbi-datasets-cli.

  • This package uses conda-forge to install. Type in conda install -c conda-forge ncbi-datasets-cli
  • It will give you a message Y/N to confirm. Type “y” and enter
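
After the install finishes, a quick sanity check is to see whether the package’s command-line tools are visible from the environment. The sketch below assumes ncbi-datasets-cli provides commands named datasets and dataformat; swap in whatever commands your package installs. The helper name missing_commands is hypothetical.

```python
import shutil


def missing_commands(commands, which=shutil.which):
    """Return the commands that cannot be found on the current PATH."""
    return [cmd for cmd in commands if which(cmd) is None]


# Assumed command names for ncbi-datasets-cli; adjust for your package.
missing = missing_commands(["datasets", "dataformat"])
if missing:
    print("not found on PATH:", ", ".join(missing))
else:
    print("all commands found")
```

Run it with the seq_env interpreter so that the PATH being searched is the environment’s.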

Step 2: Save the package to the repo

Now we need to save this package to the repo’s environment.yml

  • Type conda env export > environment.yml
  • Since the package is now in your environment, this command exports your updated environment over the repo’s environment.yml.
  • Now push to GitHub
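
One thing to be aware of: the exported file ends with a prefix: line that records the environment’s folder on your machine (something like C:/Users/XXXXX/Anaconda3/envs/seq_env). If you’d rather not commit your local path, a sketch like this one removes that line. The helper names and the sample text are illustrative, not part of conda.

```python
from pathlib import Path


def strip_prefix(yml_text):
    """Drop the top-level 'prefix:' line from exported environment text."""
    kept = [ln for ln in yml_text.splitlines() if not ln.startswith("prefix:")]
    return "\n".join(kept) + "\n"


def clean_file(path="environment.yml"):
    """Rewrite the file in place without its prefix line."""
    target = Path(path)
    target.write_text(strip_prefix(target.read_text()))


# Hypothetical export, only for demonstration.
EXPORTED = """\
name: seq_env
dependencies:
  - python=3.10.9
prefix: C:/Users/XXXXX/Anaconda3/envs/seq_env
"""
print(strip_prefix(EXPORTED))
```

Calling clean_file() from the repo folder after the export applies the same change to the real file.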

R Package Management - renv

Managing R packages is much easier than managing Python packages. The R environment uses a package called renv. It saves a list of packages to a lock file, similar to environment.yml. Then, every time you or a teammate opens the R project in your repo, renv activates in the background and checks whether any packages are out of line with the repo’s lock file. This ensures that everyone is using the same package versions.

Important

Your repo should have a .Rproj file at the root of the directory. If it doesn’t, you can create one by opening RStudio > File > New Project... > Existing Directory (or New Directory). Make sure .Rproj files are NOT in your .gitignore.

Creating renv in a project

Step 1: Open the .Rproj in your repo

The R project will open RStudio at the root of your directory path.

Step 2: Initialize renv for the repo

Now that we’re in the root of your repo directory, let’s initialize renv.

First, install renv by running install.packages("renv").

Then in your console write renv::init() and run it.

If you already have an existing repo, you will probably see warnings and errors from renv::init(), like I did. Not to worry! Read the warnings and follow the instructions. Usually you will need to reinstall a package. If you get this warning:

These may be left over from a prior, failed installation attempt.
Consider removing or reinstalling these packages.

  • Run renv::install("THAT PACKAGE") to install the package again,
  • then update the lock file (more on that later) by running renv::snapshot().

Now the package will be installed correctly.

renv::init() will:

  1. Search through all R scripts in your repo and find all packages used
  2. Create a snapshot of those packages
  3. Save all packages in the repo in a new renv libraries path (similar to your C drive R libraries paths)
  4. Create a .gitignore within the renv libraries path so that you don’t get spammed with thousands of libraries in your git commit
  5. Create a lock file - this is like the environment.yml for conda. Think of it as instructions for which packages your repo is using
  6. Save helper files, such as an activate R script that activates renv every time the repo is opened from the .Rproj

Step 3: Push to Github

Now look at your git stage and you will see all the files renv created.

We have

  1. .Rprofile, which sources the renv activate.R script - this activates the repo’s renv every time the project is opened
  2. renv.lock, which records information on each package used in the repo and is used to update collaborators’ environments to match the lock file
  3. renv/.gitignore, which keeps the project library and other machine-specific renv folders out of your git commits
  4. renv/activate.R, the script that activates the env whenever the R project is opened
  5. renv/settings.dcf, which stores renv’s project-level settings
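
The lock file is plain JSON, so you can inspect it with any tool. As a minimal sketch, the Python below pulls the R version and the package versions out of lock file text. The sample contents and versions are made up, and the helper name lock_versions is hypothetical; only the top-level "R" and "Packages" keys are assumed.

```python
import json


def lock_versions(lock_text):
    """Return (R version, {package: version}) from renv.lock text."""
    lock = json.loads(lock_text)
    r_version = lock.get("R", {}).get("Version")
    packages = {
        name: record.get("Version")
        for name, record in lock.get("Packages", {}).items()
    }
    return r_version, packages


# Hypothetical lock file contents, only for demonstration.
SAMPLE_LOCK = json.dumps({
    "R": {"Version": "4.2.2"},
    "Packages": {
        "dplyr": {"Package": "dplyr", "Version": "1.1.0", "Source": "Repository"},
        "renv": {"Package": "renv", "Version": "0.17.0", "Source": "Repository"},
    },
})

r_version, packages = lock_versions(SAMPLE_LOCK)
print("R", r_version)
for name, version in sorted(packages.items()):
    print(name, version)
```

To read the real file, pass the contents of renv.lock, for example open("renv.lock").read().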

Using renv in a project

Step 1: Open the .Rproj for your repo

Any time you need to write or run code from the repo, open the .Rproj file for the Sequencing 2.0 project. In your file explorer, go to the repo and open Sequencing_2.0.Rproj


This will open an RStudio window with the repo file path as the root directory. It will also activate renv; your console should print a message about renv when the project loads.

Step 2: Load renv packages

The first time you use renv you will need to configure it to your local machine. To do this, type:

  • renv::restore() in your console.
  • This builds a project library for R on your local machine from the packages listed in the lock file.


Now you’re ready to use the scripts! Way less complicated than conda.

R - Installing New Packages

If you need to install a new package and want to put it in the repo, you will need to update the lock file. To do this:

  • Install the package with renv::install("<PACKAGE_NAME>"), pak::pkg_install("<PACKAGE_NAME>") (the safe way) or even install.packages("<PACKAGE_NAME>")
  • Run renv::snapshot(); this will overwrite the lock file with the packages you added
  • Then push to GitHub
Warning

There may be version dependency issues when installing a package and running a script. You may need to use renv::history() to see the previous hash of the lock file and use renv::revert() to revert the lock file back to its previous, stable state. More on this here https://solutions.posit.co/envs-pkgs/environments/upgrades/

The video at the bottom of this page explains renv and its capabilities in detail: https://solutions.posit.co/envs-pkgs/environments/upgrades/
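
Before reverting, it can help to see exactly which package versions changed between two copies of the lock file. The sketch below compares two lock file texts; the helper name lock_diff and the sample versions are made up for illustration. One way to get the older text is git show HEAD~1:renv.lock, assuming the previous commit holds the stable lock file.

```python
import json


def _versions(lock_text):
    """Map each package in the lock file text to its version."""
    packages = json.loads(lock_text).get("Packages", {})
    return {name: record.get("Version") for name, record in packages.items()}


def lock_diff(old_text, new_text):
    """Return {package: (old version, new version)} for every change.

    A version of None means the package is absent from that lock file.
    """
    old, new = _versions(old_text), _versions(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


# Hypothetical lock files, only for demonstration.
OLD = json.dumps({"Packages": {"dplyr": {"Version": "1.1.0"},
                               "rlang": {"Version": "1.0.6"}}})
NEW = json.dumps({"Packages": {"dplyr": {"Version": "1.1.4"},
                               "rlang": {"Version": "1.0.6"},
                               "pak": {"Version": "0.7.0"}}})

for name, (before, after) in lock_diff(OLD, NEW).items():
    print(f"{name}: {before} -> {after}")
```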

Using Reticulate with Conda Env

Now that your R and Python environments are set up, if you have code that uses the reticulate package in R, it might still be pointing to your base Python environment. So, if you need to write python code in R (using reticulate), the code may break. Here’s what you need to do to make sure your reticulate python path is pointing towards your conda environment:

  1. Open an Anaconda prompt
  2. Activate your env, then write where python. It will print the path to the python.exe for that particular env. Copy that path
  3. Open the .Rprofile file in your repo and add this code, replacing the placeholder with the path you copied: Sys.setenv(RETICULATE_PYTHON = "PATH_TO_ENV_PYTHON")
    • Note, if you have other team members, make this code flexible to their usernames. It could look something like this: Sys.setenv(RETICULATE_PYTHON = file.path(Sys.getenv("USERPROFILE"), "Anaconda3/envs/seq_env/python.exe")) where Sys.getenv("USERPROFILE") returns C:\Users\XXXX for whoever is running the code, so the user’s name is filled in automatically
  4. Now restart R
  5. Reopen R and, without running any other code or loading any libraries, run reticulate::py_config() in your console. It should now show your conda environment path being used for reticulate
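
You can also check from the Python side that reticulate picked up the right interpreter. The sketch below rebuilds the expected python.exe path the same way the .Rprofile line does and compares it with the interpreter that is actually running. The helper names are hypothetical, and the Anaconda3/envs/seq_env layout is the one assumed throughout this guide.

```python
import os
import sys


def expected_env_python(userprofile, env_name="seq_env"):
    """Build <userprofile>/Anaconda3/envs/<env_name>/python.exe."""
    root = userprofile.replace("\\", "/").rstrip("/")
    return "/".join([root, "Anaconda3", "envs", env_name, "python.exe"])


def same_path(first, second):
    """Compare two Windows paths, ignoring slash direction and case."""
    def normalize(path):
        return path.replace("\\", "/").lower()
    return normalize(first) == normalize(second)


# USERPROFILE is a Windows variable; fall back to an empty string elsewhere.
expected = expected_env_python(os.environ.get("USERPROFILE", ""))
print("expected:", expected)
print("running: ", sys.executable)
print("match:", same_path(expected, sys.executable))
```

Running this through reticulate (for example with reticulate::py_run_file() on a saved copy) should print match: True once the setup above is in place.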