3 Tools

Before we continue with creating your results (i.e., actually starting to develop software), let’s take a quick tour of some tools that can make your software engineering journey smoother.

Although the code examples in this book use Python, the general principles discussed here apply to most programming languages.

Programming Languages

Different programming languages suit different needs. Here’s a quick overview of some popular ones used in science and engineering:

R: Commonly used for statistics, with rich functionality to create data visualizations, fit statistical models (like different types of regression), and conduct advanced statistical tests (like ANOVA). The popular Shiny framework also makes it possible to create interactive dashboards that run as web applications.
MATLAB: Once dominant in engineering, used for simulations. But due to its high licensing costs, MATLAB is being replaced more and more by Python and Julia.
Julia: Gaining traction in scientific computing for its speed and modern syntax.
Python: A versatile language with strong support for data science, AI, web development, and more. Its active open source community has created many popular libraries for scientific computing (numpy, scipy), machine learning (scikit-learn, TensorFlow, PyTorch), and web development (FastAPI, streamlit).

Due to its broad applicability and popularity in industry, Python is used for the examples in this book. However, you should choose the programming language that is most popular in your field as this will make it easier for you to find relevant resources (e.g., tailored libraries) and collaborate with colleagues.

There are plenty of great books and other resources available to teach you programming fundamentals, which is why this book focuses on higher level concepts. Going forward we’ll assume that you’re familiar with the basic syntax and functionality of your programming language of choice (incl. key scientific libraries). For example, to learn Python essentials, you can work through this tutorial.

Version Control

Version control is essential in software development to keep track of code changes and collaborate effectively. Think of it as a time machine that lets you revert to any version of your code or examine how it evolved.

Why Use Version Control?

Track changes: See what you’ve modified and when, with the ability to revert if necessary.
Review collaborators’ changes: When working with others, reviewing their changes before they are merged with the main version of the code (in so-called pull or merge requests) ensures quality and provides opportunities to teach each other better ways of doing things.
Not just for code: Version control can be used for any kind of file. While it’s less effective for binary formats like images or Microsoft Word documents where you can’t create a clean “diff” between two versions, you should definitely give it a try when writing your next paper in a text-based format like LaTeX.
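To make the idea of a “diff” more concrete, here is a minimal sketch that compares two versions of a short text with Python’s built-in difflib module. The example sentences and the labels draft_v1 and draft_v2 are made up for illustration. Version control tools compute their diffs with their own implementations, but the line-based output looks similar.

```python
import difflib

# Two hypothetical versions of the same text file, as lists of lines.
old_version = [
    "We measured the temperature every hour.",
    "The results are shown in Table 1.",
    "We conclude that the effect is small.",
]
new_version = [
    "We measured the temperature every hour.",
    "The results are shown in Table 2.",
    "We conclude that the effect is small.",
]

# unified_diff yields the lines that differ, prefixed with "-" (removed)
# and "+" (added), plus a few unchanged lines of context.
diff_lines = list(
    difflib.unified_diff(
        old_version,
        new_version,
        fromfile="draft_v1",
        tofile="draft_v2",
        lineterm="",
    )
)

print("\n".join(diff_lines))
```

Because the comparison works line by line, it is only meaningful for text-based formats. For a binary file, there are no readable lines to compare.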

Git

The go-to tool for version control is Git. While desktop clients exist, you can also use git directly in the terminal as a command line tool.

If you’re new to Git, this beginner’s guide is a great place to start.

Essential git commands

git init: Start a new repository in the current folder.
git status: View changes.
git diff: View differences between file versions before committing.
git add [file]: Stage files for a commit.
git commit -m "message": Save staged changes.
git push: Upload changes to a remote repository (e.g., on GitHub).
git pull: Download changes from a remote repository.
git branch: Create or list branches.
git checkout [branch]: Switch branches.
git merge [branch]: Combine branches.

By default, your repository’s files are on the main branch. Creating a new branch is like stepping into an alternate universe where you can experiment without affecting the main timeline. When making a major change or adding a new feature, it’s good practice to create a new branch, like new-feature, and implement your changes there. Once you’re satisfied with the result, you can merge the changes back into the main branch.

This approach keeps the main branch stable and ensures you always have a working version of your code. If you decide against your new feature, you can simply abandon the branch and start fresh from main. By creating a merge request (MR) once your new-feature branch is ready, you or a collaborator can review the changes thoroughly before merging them into main.

To publish your code or collaborate with others, your repository (i.e., the folder under version control) can be hosted on a platform like:

GitHub: Great for open-source projects and public personal repositories to show off your skills.
GitLab: Supports self-hosting, making it ideal for organizational needs.

We strongly encourage you to publish any code related to your publications on one of these platforms to promote reproducibility of your results! 👩‍🔬

Data versioning

In addition to the changes made to your code, you should also keep track of how your data is generated and transformed over time (data lineage). While small datasets can be included in your repository (e.g., in a separate data/ folder), there are also more tailored tools available specifically to version your data, like DVC.

Development Environment

The program you choose for writing code directly impacts your productivity. While you can technically write code using a plain text editor (like Notepad on Windows or TextEdit on macOS), special-purpose text editors and integrated development environments (IDEs) provide a tailored experience that boosts productivity.

Text Editors

Developer-focused text editors are lightweight tools with features like syntax highlighting and extensions for basic programming tasks.
Examples include:

Sublime Text: Lightweight and fast, with excellent customization through lots of plugins.
Atom: Open-source editor created by GitHub, which discontinued it in December 2022.
Vim and Emacs: Some of the first code editors, often used as command line tools and beloved by keyboard shortcut enthusiasts.

Terminal

When you write code in a text editor, you need a way to execute it. This is where the terminal comes in. A terminal, or console, lets you interact with your computer through the command line, using text-based commands. Think of it like stepping back to the 1970s—or like being one of those cool hackers you see on TV.

On macOS and Linux, a terminal app is already preinstalled. On Windows, different options exist to get a Unix-like terminal, like the Windows Subsystem for Linux (WSL), which you can open from the Windows Terminal app. Inside the terminal, there’s a shell: the actual program that processes the commands you type. The most common shells on Unix systems are bash and zsh, which are quite similar. For this book, we’ll assume you’re using one of these.

With the shell, you can navigate your computer’s file system and run programs through their command-line interface (CLI). Try it out!

Basic shell commands

Follow along by typing these commands into your terminal. In parallel, you can watch your normal file browser to see files and folders appear or disappear as you go.

pwd: Print the current working directory—this shows the path to where you opened the terminal.
ls: List files and directories in the current location. Use ls -la for more details, including hidden files (like .gitignore).
cd path/to/folder: Change directory to the specified path. Tips: Use tab to autocomplete names. If the path starts with /, it’s absolute (from the file system’s root). If it starts with ~/, it’s relative to your home directory. Use .. to move up one folder.
mkdir new_folder: Create a new directory named new_folder.
touch new_file.txt: Create an empty file named new_file.txt.
cp new_file.txt copied_file.txt: Copy new_file.txt to copied_file.txt. Use mv instead of cp to move or rename files.
rm new_file.txt: Delete new_file.txt. Add -r to delete directories. But be careful: files deleted this way bypass the trash and are gone for good, so double-check before hitting enter!
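If you later want to perform the same file operations from within a Python program, the standard library offers equivalents. The following sketch mirrors the commands above using pathlib and shutil. It works inside a temporary directory so that nothing on your machine is affected, and the variable names are illustrative.

```python
import shutil
import tempfile
from pathlib import Path

# Work inside a throwaway directory that is removed again at the end.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)

    new_folder = base / "new_folder"
    new_folder.mkdir()                  # mkdir new_folder

    new_file = new_folder / "new_file.txt"
    new_file.touch()                    # touch new_file.txt

    copied_file = new_folder / "copied_file.txt"
    shutil.copy(new_file, copied_file)  # cp new_file.txt copied_file.txt

    # ls: collect the names of everything inside new_folder
    names_after_copy = sorted(p.name for p in new_folder.iterdir())

    new_file.unlink()                   # rm new_file.txt
    names_after_delete = sorted(p.name for p in new_folder.iterdir())

print(Path.cwd())          # pwd
print(names_after_copy)
print(names_after_delete)
```

Like rm, unlink deletes the file permanently instead of moving it to the trash.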

You can also run other CLI programs in the terminal, like using the git commands described earlier.
A Python script can be executed with python script.py (assuming the script is in your current directory).

Not all CLI programs mentioned in this book will be preinstalled on your machine. Linux systems already come with a command-line package manager (like apt on Ubuntu), which can be used to install other tools. A popular package manager for macOS is brew, while for Windows you can use winget.

Once you get comfortable with your shell, you can also create shell scripts (files with a .sh extension) to automate tasks and handle more complex workflows. These scripts can include conditionals, loops, and other programming constructs. For more information on bash scripting, check out this resource.

Full IDEs

Integrated Development Environments (IDEs) combine all the tools you need in one place—file browser, editor, terminal, Git support, debugger, and more. They are ideal for larger projects and provide support for more complex tasks, like renaming variables across multiple files when you’re refactoring your code.
Examples include:

VS Code: Minimalist by default but highly customizable with plugins, making it suitable for everything from basic editing to full-scale development.
JetBrains IDEs (e.g., PyCharm): IDEs tailored to the needs of specific programming languages with very advanced features. You need to purchase a license to use the full version, but for many IDEs there is also a free community edition available.
JupyterLab: An extension of Jupyter notebooks (see below), popular for data science and exploratory coding.
RStudio: Tailored for R programming, with excellent support for data visualization, markdown reporting, and reproducible research workflows.
MATLAB: The MATLAB programming language and IDE are virtually synonymous. However, its rich feature set comes with steep licensing fees.

Jupyter Notebooks

Jupyter notebooks are a unique format that lets you mix code, output (like plots), and explanatory text in one document. The name Jupyter is derived from Julia, Python, and R, the programming languages for which the notebook format, and later the JupyterLab IDE, were created. The IDE itself runs inside your web browser.

Notebooks are great for exploratory data analysis and for creating reproducible reports. However, since notebooks are composed of individual interactive cells that can be executed in any order, developing in notebooks often becomes messy quickly. We recommend that you keep the main logic and reusable functions in separate scripts or libraries and primarily use notebooks to create plots and other results. Once you’re finished, it is also good practice to restart the kernel and run your notebook again from top to bottom to make sure everything still works and you’re not relying, for example, on variables that were defined in now-deleted cells.
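One way to follow this advice is sketched below. The file name analysis.py and the function name summarize are hypothetical. The point is only that the logic lives in a regular Python file, while the notebook merely imports and calls it.

```python
# Contents of a hypothetical file analysis.py that lives next to the notebook.
from statistics import mean, stdev


def summarize(values):
    """Return the mean and sample standard deviation of a list of numbers."""
    return {"mean": mean(values), "std": stdev(values)}


# In a notebook cell you would then only import and call the function:
#     from analysis import summarize
#     summarize(measurements)
measurements = [2.0, 4.0, 6.0]
summary = summarize(measurements)
print(summary)
```

Since the function is defined in a plain file, it can also be reused by other scripts and covered by unit tests, which is hard to do for code that only exists in notebook cells.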

Notebooks as text files

Jupyter notebooks, stored as files ending in .ipynb, are internally represented as JSON documents. If you have your notebooks under version control (which you should 😉), you’ll notice that the diffs between versions look quite bloated. But do not despair! Tools like Jupytext can convert notebooks into plain text without loss of functionality.

Parameterize notebooks

If you want to execute the same notebook with multiple different parameter settings (e.g., create the same plots for different model configurations), have a look at papermill.
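A minimal sketch of what this could look like with papermill’s Python API is shown below. The notebook name analysis.ipynb, the output folder runs, and the parameter names model and n_estimators are placeholders. The input notebook is assumed to contain a cell tagged “parameters” that defines default values for them.

```python
from pathlib import Path

# Hypothetical parameter settings; each one produces its own output notebook.
configs = [
    {"model": "linear", "n_estimators": 1},
    {"model": "forest", "n_estimators": 100},
]


def plan_runs(input_path, output_dir, configs):
    """Pair every parameter setting with a distinct output notebook path."""
    stem = Path(input_path).stem
    return [
        (str(Path(output_dir) / f"{stem}_{i}.ipynb"), params)
        for i, params in enumerate(configs)
    ]


def run_all(input_path, output_dir, configs):
    """Execute the input notebook once per parameter setting."""
    # Imported here so that the planning step works without papermill installed.
    import papermill as pm

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for output_path, params in plan_runs(input_path, output_dir, configs):
        pm.execute_notebook(input_path, output_path, parameters=params)


runs = plan_runs("analysis.ipynb", "runs", configs)
print(runs)
# To actually execute the notebooks: run_all("analysis.ipynb", "runs", configs)
```

Writing each run to its own output notebook keeps the results of all parameter settings side by side, so you can compare them later.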

In addition to the original JupyterLab IDE and notebooks that you install on your computer, there are also free cloud-based options available, such as Google Colab, which even gives you free compute time on GPUs.

Reproducible Setups

“It works on my machine” isn’t good enough for science. Reproducibility means your results can be replicated by others (and by you a few months later when the reviewers of your paper request changes to your experiments). The first step to achieve this is to manage your dependencies (i.e., external libraries used by your code) to ensure the environment in which your code is executed is identical for everyone who runs your code, every time. This can be done using virtual environments, or, if you want to go even further, container tools like Docker, which will be discussed in Chapter 6.

Virtual environments in Python with uv

Virtual environments isolate your project’s dependencies, thereby ensuring consistency. For Python, a common tool to do this is uv. It tracks the libraries and their versions in a pyproject.toml file like this:

[project]
name = "example-project"
version = "0.1.0"
description = "A sample Python project"
authors = [{name = "Your Name", email = "youremail@example.com"}]
requires-python = ">=3.10"
dependencies = [
    "matplotlib >=3.7.2",
    "numpy >=1.22.3,<2",
]

Basic commands:

uv init example-project: Create a new project (folder incl. pyproject.toml file).
uv add {package}: Add a dependency (can also be done directly in the file).
uv sync: Install all dependencies.
uv run python script.py: Run a Python script inside the virtual environment.

Handling Randomness

Your program will often depend on randomly sampled values, for example, when defining the initial conditions for a simulation or initializing a model before it is fitted to data (like a neural network). To ensure that your experiments can be reproduced, it is important that you always set a random seed at the beginning of your program so the random number generator starts from a consistent state.

Setting random seeds in Python

At the beginning of your script, set a random seed (depending on the library that you’re using this can vary):

import random
import numpy as np

random.seed(42)     # seed Python's built-in random module
np.random.seed(42)  # seed NumPy's global random number generator

To get a better idea of how much your results depend on the random initialization and therefore how robust they are, it is advisable to always run your code with multiple random seeds and compare the results (e.g., compute the mean and standard deviation of the outcomes of different runs like in Figure 2.2).
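The following sketch illustrates this procedure with a toy example. The function run_experiment is a hypothetical stand-in for your actual experiment (here it estimates π by random sampling), and the list of seeds is arbitrary.

```python
import random
from statistics import mean, stdev


def run_experiment(seed, n_samples=1000):
    """Toy stand-in for an experiment: estimate pi by random sampling."""
    rng = random.Random(seed)  # local generator, independent of global state
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4 * inside / n_samples


seeds = [0, 1, 2, 3, 4]
results = [run_experiment(seed) for seed in seeds]

print(f"mean = {mean(results):.3f}, std = {stdev(results):.3f}")
```

A small standard deviation across seeds suggests that the result is robust, whereas a large one indicates that a single run might be misleading.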

Random state at startup

Depending on the programming language that you’re using, if you run a script without executing any other code before, the random number generator may or may not always start in the same state. This means, if you don’t set a random seed and, for example, run your script ten times from scratch, you may always receive the same result even though the results would differ if the code was run under different circumstances. To avoid surprises, you should always explicitly set the random seed to have more control over the results.
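You can verify the effect of an explicit seed with a quick check like the one below. This is a minimal sketch using Python’s built-in random module, and the helper draw_samples is made up for this example.

```python
import random


def draw_samples(seed, n=5):
    """Draw n pseudo-random numbers after seeding the global generator."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first_run = draw_samples(seed=42)
second_run = draw_samples(seed=42)  # same seed: identical sequence
other_seed = draw_samples(seed=43)  # different seed: different sequence

print(first_run == second_run)  # True
print(first_run == other_seed)  # False
```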

Hardware differences

If your code is run on very different hardware, e.g., a CPU vs. a GPU (graphics card, used to train neural network models, for example), your results might still differ slightly despite setting a random seed. This is due to differences in how the architectures carry out floating-point computations, e.g., with what precision the numbers are stored in memory and in which order operations are executed.
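One factor that can contribute to such small deviations is that floating-point addition is not associative: summing the same numbers in a different order can give a slightly different result, and the order of operations may differ between devices. The sketch below demonstrates this in plain Python, using arbitrarily chosen values.

```python
import math

# Floating-point addition is not associative: the grouping changes the rounding.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)              # False
print(math.isclose(left_to_right, right_to_left))  # True
```

When you compare results across machines, it is therefore safer to check whether the values agree within a small tolerance than to test for exact equality.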

Clean and Consistent Code

Especially when working together with others, it can be helpful to follow a style guide to produce clean and consistent code. Google has published style guides for multiple programming languages, which are a great resource. Adhering to these rules will also help you avoid common sources of bugs.

Formatters & Linters

Since programmers are often rather lazy, they developed tools that automatically fix your code to implement these rules where possible:

Formatters rewrite code to follow a consistent style (e.g., add whitespace after commas).
Linters analyze code for errors, inefficiencies, and deviations from best practices.
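To illustrate the difference, the sketch below contrasts a made-up snippet before and after cleanup. The “before” version is kept in a string so that it is not executed. The “after” version shows the kind of result you can expect from these tools, although the exact output depends on the tool and its configuration.

```python
# Before: a hypothetical snippet that works but is hard to read.
# It is kept in a string here so that it is not executed.
messy_source = """
import os
def mean(values):return sum(values)/len(values)
numbers=[1,2,3]
"""


# After: the unused import is gone (the kind of issue a linter reports),
# and spacing and line breaks are consistent (the formatter's job).
def mean(values):
    return sum(values) / len(values)


numbers = [1, 2, 3]
print(mean(numbers))  # 2.0
```

Both versions behave the same. The tools only change how the code looks and flag likely mistakes, which keeps diffs small and reviews focused on the actual logic.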

Formatter & Linter in Python: ruff

ruff is a (super fast) formatter and linter for Python, written in Rust. You can install it via pip and configure it in the same pyproject.toml file that we also used to manage the dependencies of our project. Then run it over your code like this:

ruff check        # see which errors the linter finds
ruff check --fix  # automatically fix errors where possible
ruff format       # automatically format the code

You’ll probably want to add exceptions for some of the errors that the linter checks for in your pyproject.toml file as ruff is quite strict. 😉

It is important to have the configuration for your formatter and linter under version control as well, so that all collaborators use the same settings and you avoid unnecessary changes (and bloated diffs in merge requests) when different people format the code.

Pre-commit Hooks

In the heat of the moment, you might forget to run the formatter and linter over your code before committing your changes. To avoid accidentally checking messy code into your repository, you can configure so-called “pre-commit hooks”. Pre-commit hooks catch issues automatically by enforcing coding standards before committing or pushing code with git.

Setting up pre-commit hooks

First, you need to install the pre-commit tool, e.g., through Python’s package manager pip:

pip install pre-commit

Then configure it in a file named .pre-commit-config.yaml (here done for ruff):

repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
  rev: v2.3.0
  hooks:
    - id: check-yaml
    - id: end-of-file-fixer
    - id: trailing-whitespace
- repo: https://github.com/astral-sh/ruff-pre-commit
  # Ruff version.
  rev: v0.8.3
  hooks:
    # Run the linter.
    - id: ruff
      args: [ --fix ]
    # Run the formatter.
    - id: ruff-format

Then install the git hook scripts from the config file:

pre-commit install

Now the configured hooks will be run on all changed files when you try to commit them and you can only proceed if all checks pass.

To catch any style inconsistencies after the code was pushed to your remote repository (e.g., in case one of your collaborators has not installed the pre-commit hooks), you can also add these checks to your CI/CD pipeline (see Chapter 6).

Putting It All Together

Once you have set up all these tools, your repository should look something like this (see here for more details; setup for programming languages other than Python will differ slightly):

project-name/
├── .gitignore              # Exclude unnecessary files from version control
├── README.md               # Describe the project purpose and usage
├── .pre-commit-config.yaml # Pre-commit hook setup
├── pyproject.toml          # Python dependencies and configs
├── data/                   # Store (small) datasets
├── notebooks/              # For exploratory analysis
├── src/                    # Core source code
└── tests/                  # Unit tests

A clean project structure makes it easier to maintain your code.

Before you continue

At this point, you should have a clear understanding of:

How to set up your development environment to code efficiently.
How to host your version-controlled repository on a platform like GitHub or GitLab, complete with pre-commit hooks to ensure well-formatted code.
The fundamental syntax of your programming language of choice (incl. key scientific libraries) to get started.