3 Tools
Before we continue with creating your results—i.e., actually start developing software—let’s take a quick tour of some tools that can make your software engineering journey smoother.
Although the code examples in this book use Python, the general principles discussed here apply to most programming languages.
Programming Languages
Different programming languages suit different needs. Here’s a quick overview of some popular ones used in science and engineering:
- R: Commonly used for statistics, with rich functionality to create data visualizations, fit statistical models (like different types of regression), and conduct advanced statistical tests (like ANOVA). The popular Shiny framework also makes it possible to create interactive dashboards that run as web applications.
- MATLAB: Once dominant in engineering, where it is used for simulations. Due to its high licensing costs, MATLAB is increasingly being replaced by Python and Julia.
- Julia: Gaining traction in scientific computing for its speed and modern syntax.
- Python: A versatile language with strong support for data science, AI, web development, and more. Its active open source community has created many popular libraries for scientific computing (numpy, scipy), machine learning (scikit-learn, TensorFlow, PyTorch), and web development (FastAPI, streamlit).
Due to its broad applicability and popularity in industry, Python is used for the examples in this book. However, you should choose the programming language that is most popular in your field as this will make it easier for you to find relevant resources (e.g., tailored libraries) and collaborate with colleagues.
There are plenty of great books and other resources available to teach you programming fundamentals, which is why this book focuses on higher level concepts. Going forward we’ll assume that you’re familiar with the basic syntax and functionality of your programming language of choice (incl. key scientific libraries). For example, to learn Python essentials, you can work through this tutorial.
Version Control
Version control is essential in software development to keep track of code changes and collaborate effectively. Think of it as a time machine that lets you revert to any version of your code or examine how it evolved.
Why Use Version Control?
- Track changes: See what you’ve modified and when, with the ability to revert if necessary.
- Review collaborators’ changes: When working with others, reviewing their changes before they are merged with the main version of the code (in so-called pull or merge requests) ensures quality and provides opportunities to teach each other better ways of doing things.
- Not just for code: Version control can be used for any kind of file. While it’s less effective for binary formats like images or Microsoft Word where you can’t create a clean “diff” between two versions, you should definitely give it a try when writing your next paper in a text-based format like LaTeX.
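To get a feeling for what a clean diff of a text-based file looks like, here is a minimal sketch using Python's standard difflib module. The two versions of the LaTeX snippet and the file name paper.tex are made up for illustration. Git computes its diffs itself, but the output has a similar shape.

```python
import difflib

# Two hypothetical versions of a short LaTeX snippet.
old_version = [
    "\\section{Results}\n",
    "Our model reaches an accuracy of 0.91.\n",
    "This is better than the baseline.\n",
]
new_version = [
    "\\section{Results}\n",
    "Our model reaches an accuracy of 0.93.\n",
    "This is better than the baseline.\n",
]

# unified_diff yields header lines, then removed (-) and added (+) lines.
diff_lines = list(
    difflib.unified_diff(
        old_version,
        new_version,
        fromfile="paper.tex (old)",
        tofile="paper.tex (new)",
    )
)
print("".join(diff_lines))
```

Only the changed sentence shows up as a removed and an added line, which is what makes reviewing text-based formats practical.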
Git
The go-to tool for version control is Git. While desktop clients exist, many professionals use git directly in the terminal as a command line tool.
If you’re new to Git, this beginner’s guide is a great place to start.
Basic commands:
- git init: Start a new repository in the current folder.
- git status: View changes.
- git diff: View differences between file versions before committing.
- git add [file]: Stage files for a commit.
- git commit -m "message": Save staged changes.
- git push: Upload changes to a remote repository (e.g., on GitHub).
- git pull: Download changes from a remote repository.
- git branch: Create or list branches.
- git checkout [branch]: Switch branches.
- git merge [branch]: Combine branches.
By default, your repository’s files are on the main branch. Creating a new branch is like stepping into an alternate universe where you can experiment without affecting the main timeline. When making a major change or adding a new feature, it’s good practice to create a new branch, like new-feature, and implement your changes there. Once you’re satisfied with the result, you can merge the changes back into the main branch.
This approach keeps the main branch stable and ensures you always have a working version of your code. If you decide against your new feature, you can simply abandon the branch and start fresh from main. By creating a merge request (MR) once your new-feature branch is ready, you or a collaborator can review the changes thoroughly before merging them into main.
To publish your code or collaborate with others, your repository (i.e., the folder under version control) can be hosted on a platform like:
- GitHub: Great for open-source projects and public personal repositories to show off your skills.
- GitLab: Supports self-hosting, making it ideal for organizational needs.
We strongly encourage you to publish any code related to your publications on one of these platforms to promote reproducibility of your results! 👩‍🔬
In addition to the changes made to your code, you should also keep track of how your data is generated and transformed over time (data lineage). While small datasets can be included in your repository (e.g., in a separate data/ folder), there are also more tailored tools available specifically to version your data, like DVC.
Development Environment
The program you choose for writing code directly impacts your productivity. While you can technically write code using a plain text editor (like Notepad on Windows or TextEdit on macOS), special-purpose text editors and integrated development environments (IDEs) provide a tailored experience that boosts productivity.
Text Editors
Developer-focused text editors are lightweight tools with features like syntax highlighting and extensions for basic programming tasks.
Examples include:
- Sublime Text: Lightweight and fast, with excellent customization through lots of plugins.
- Atom: Open-source and originally developed by GitHub, which discontinued the project in December 2022 (included here because you may still encounter it).
- Vim and Emacs: Some of the first code editors, often used as command line tools and beloved by keyboard-shortcut enthusiasts.
Full IDEs
For more features, IDEs integrate tools like file browsers, Git support, and debuggers. They are ideal for larger projects and provide support for more complex tasks, like renaming variables across multiple files when you’re refactoring your code.
Examples include:
- VS Code: Minimalist by default but highly customizable with plugins, making it suitable for everything from basic editing to full-scale development.
- JetBrains IDEs (e.g., PyCharm): IDEs tailored to the needs of specific programming languages with very advanced features. You need to purchase a license to use the full version, but for many IDEs there is also a free community edition available.
- JupyterLab: An extension of Jupyter notebooks (see below), popular for data science and exploratory coding.
- RStudio: Tailored for R programming, with excellent support for data visualization, markdown reporting, and reproducible research workflows.
- MATLAB: The MATLAB programming language and IDE are virtually synonymous. However, its rich feature set comes with steep licensing fees.
Jupyter Notebooks
Jupyter notebooks are a unique format that lets you mix code, output (like plots), and explanatory text in one document. The name Jupyter is derived from Julia, Python, and R, the programming languages for which the notebook format, and later the JupyterLab IDE, were created. The IDE itself runs inside your web browser.
Notebooks are great for exploratory data analysis and for creating reproducible reports. However, since the notebooks themselves are composed of individual interactive cells that can be executed in any order, developing in notebooks often becomes messy quickly. We recommend that you keep the main logic and reusable functions in separate scripts or libraries and primarily use notebooks to create plots and other results. It is also good practice to run your notebook again from top to bottom once you’re finished to make sure everything still works and you’re not relying on variables that were defined in now-deleted cells, for example.
Jupyter notebooks, stored as files ending in .ipynb, are internally represented as JSON documents. If you have your notebooks under version control (which you should 😉), you’ll notice that the diffs between versions look quite bloated. But do not despair! Tools like Jupytext can convert notebooks into plain text without loss of functionality.
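To see why notebook diffs get bloated, the following sketch builds a simplified, hand-written notebook with a single code cell and prints its JSON representation. The structure follows the general layout of notebook format version 4. Real notebooks written by Jupyter contain additional metadata, so treat this as an illustration and not as a complete .ipynb file.

```python
import json

# A simplified, hand-written notebook with one code cell and its stored output.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [
                {"name": "stdout", "output_type": "stream", "text": ["hello\n"]}
            ],
            "source": ["print('hello')"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

serialized = json.dumps(notebook, indent=1)
print(serialized)
print(f"{len(serialized.splitlines())} lines of JSON for one line of code")
```

Because outputs and execution counts are stored next to the code, simply re-running a cell changes the file even if the code itself stays the same.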
If you want to execute the same notebook with multiple different parameter settings (e.g., create the same plots for different model configurations), have a look at papermill.
In addition to the original JupyterLab IDE and notebooks that you install on your computer, there are also free cloud-based options available, such as Google Colab, which even gives you free compute time on GPUs.
Reproducible Setups
“It works on my machine” isn’t good enough for science. Reproducibility means your results can be replicated by others (and by you a few months later when the reviewers of your paper request changes to your experiments). The first step to achieve this is to manage your dependencies (i.e., external libraries used by your code) to ensure the environment in which your code is executed is identical for everyone who runs your code, every time. This can be done using virtual environments, or, if you want to go even further, containers like Docker, which will be discussed in Chapter 6.
poetry
Virtual environments isolate your project’s dependencies, thereby ensuring consistency. For Python, a common tool to do this is poetry. It tracks the libraries and their versions in a pyproject.toml file like this:
[tool.poetry]
name = "example-project"
version = "0.1.0"
description = "A sample Python project"
authors = ["Your Name <youremail@example.com>"]
[tool.poetry.dependencies]
python = "^3.9"
requests = "^2.26.0" # external libraries incl. versions
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
Basic commands:
- poetry new example-project: Create a new project (folder incl. pyproject.toml file).
- poetry add [package]: Add a dependency (can also be done directly in the file).
- poetry install: Install all dependencies.
- poetry shell: Activate the virtual environment.
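Independent of the tool that manages your dependencies, it can help to record which versions were actually installed when a result was produced. The following is a minimal sketch using only the standard library. It is not a poetry feature, and the function name snapshot_environment is made up for this example.

```python
import importlib.metadata
import platform


def snapshot_environment():
    """Collect the Python version and the versions of all installed packages."""
    packages = {}
    for dist in importlib.metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installations that lack a package name
            packages[name] = dist.version
    return {"python": platform.python_version(), "packages": packages}


env = snapshot_environment()
print("Python", env["python"])
# Print only the first few packages to keep the output short.
for name in sorted(env["packages"], key=str.lower)[:5]:
    print(name, env["packages"][name])
```

Saving such a snapshot next to your results makes it easier to spot later whether a changed library version could explain a different outcome.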
Handling Randomness
Your program will often depend on randomly sampled values, for example, when defining the initial conditions for a simulation or initializing a model before it is fitted to data (like a neural network). To ensure that your experiments can be reproduced, it is important that you always set a random seed at the beginning of your program so the random number generator starts from a consistent state.
At the beginning of your script, set a random seed (depending on the library that you’re using this can vary):
import random
import numpy as np

random.seed(42)
np.random.seed(42)
To get a better idea of how much your results depend on the random initialization and therefore how robust they are, it is advisable to always run your code with multiple random seeds and compare the results (e.g., compute the mean and standard deviation of the outcomes of different runs like in Figure 2.2).
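A minimal sketch of such a comparison across seeds could look as follows. The function run_experiment is a hypothetical stand-in for your actual simulation or model training. Here it only averages some random samples.

```python
import random
import statistics


def run_experiment(seed):
    """Hypothetical stand-in for an experiment: the mean of 1000 noisy samples."""
    rng = random.Random(seed)  # separate generator, so runs don't affect each other
    samples = [rng.gauss(1.0, 0.5) for _ in range(1000)]
    return statistics.mean(samples)


seeds = [0, 1, 2, 3, 4]
results = [run_experiment(seed) for seed in seeds]

print(f"mean over seeds: {statistics.mean(results):.3f}")
print(f"std over seeds:  {statistics.stdev(results):.3f}")

# Re-running with the same seed reproduces the result exactly.
assert run_experiment(0) == run_experiment(0)
```

Passing the seed in as an argument, instead of setting it once globally, makes each run reproducible on its own and keeps the seeds visible in your results.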
Depending on the programming language that you’re using, if you run a script without executing any other code before, the random number generator may or may not always start in the same state. This means, if you don’t set a random seed and, for example, run your script ten times from scratch, you may always receive the same result even though the results would differ if the code was run under different circumstances. To avoid surprises, you should always explicitly set the random seed to have more control over the results.
If your code is run on very different hardware, e.g., a CPU vs. a GPU (graphics card, used to train neural network models, for example), your results might still differ slightly despite setting a random seed. This is due to differences in how the architectures handle floating-point numbers, i.e., with what precision the values are stored in memory and in which order the operations on them are executed.
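Small numerical differences of this kind can be reproduced on a single machine. The sketch below is only an illustration of floating-point behavior in plain Python, not a CPU vs. GPU comparison. It shows that both the order of additions and the storage precision can change a result.

```python
import struct

# Floating-point addition is not associative: the grouping changes the result.
left_to_right = (1e17 + 1.0) - 1e17  # the 1.0 is lost when added to 1e17
reordered = (1e17 - 1e17) + 1.0  # the 1.0 survives

print(left_to_right, reordered)  # prints 0.0 1.0

# Storing a value with 32 instead of 64 bits also changes it slightly.
as_float32 = struct.unpack("f", struct.pack("f", 0.1))[0]
print(as_float32 == 0.1)  # prints False
```

For this reason, it is usually safer to compare results from different machines with a small tolerance (e.g., using math.isclose) than to check for exact equality.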
Clean and Consistent Code
Especially when working together with others, it can be helpful to follow a style guide to produce clean and consistent code. Google has published style guides for multiple programming languages, which are a great resource. Adhering to these rules will also help you avoid common sources of bugs.
Formatters & Linters
Since programmers are often rather lazy, they developed tools that automatically fix your code to implement these rules where possible:
- Formatters rewrite code to follow a consistent style (e.g., add whitespace after commas).
- Linters analyze code for errors, inefficiencies, and deviations from best practices.
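To make the idea of a linter more concrete, here is a toy check that reports unused imports with the help of Python's standard ast module. This is only a sketch of the principle. It is not how any particular linter is implemented, it ignores many edge cases, and the names SOURCE and find_unused_imports are made up for this example.

```python
import ast

# Hypothetical code to check: "os" is imported but never used.
SOURCE = """
import os
import sys

def greet(name):
    print("Hello, " + name)
    return sys.version
"""


def find_unused_imports(source):
    """Return the names that are imported but never referenced in the source."""
    tree = ast.parse(source)
    imported = set()
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # "import a.b" binds the name "a"; "import a as b" binds "b".
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return sorted(imported - used)


print(find_unused_imports(SOURCE))  # prints ['os']
```

The check works on the parsed structure of the code without ever executing it, which is why linters can flag problems before you run your program.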
ruff
ruff is a (super fast) formatter and linter for Python, written in Rust. You can install it via pip and configure it in the same pyproject.toml file that we also used for poetry. Then run it over your code like this:
ruff check # see which errors the linter finds
ruff check --fix # automatically fix errors where possible
ruff format # automatically format the code
You’ll probably want to add exceptions in your pyproject.toml file for some of the errors that the linter checks for, as ruff is quite strict. 😉
It is important to have the configuration for your formatter and linter under version control as well, so that all collaborators use the same settings and you avoid unnecessary changes (and bloated diffs in merge requests) when different people format the code.
Pre-commit Hooks
In the heat of the moment, you might forget to run the formatter and linter over your code before committing your changes. To avoid accidentally checking messy code into your repository, you can configure so-called “pre-commit hooks”. Pre-commit hooks catch issues automatically by enforcing coding standards before committing or pushing code with git.
First, you need to install pre-commit through Python’s package manager pip:
pip install pre-commit
Then configure it in a file named .pre-commit-config.yaml (here done for ruff):
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
  - repo: https://github.com/astral-sh/ruff-pre-commit
    # Ruff version.
    rev: v0.8.3
    hooks:
      # Run the linter.
      - id: ruff
        args: [ --fix ]
      # Run the formatter.
      - id: ruff-format
Then install the git hook scripts from the config file:
pre-commit install
Now the configured hooks will be run on all changed files when you try to commit them and you can only proceed if all checks pass.
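To illustrate what such a hook does, the sketch below mimics, in a very simplified way, the kind of cleanup that whitespace-related hooks perform. It is not the code of the actual hooks, which handle more cases, and the function name fix_whitespace is made up for this example.

```python
def fix_whitespace(text):
    """Strip trailing whitespace from each line and end the text with one newline."""
    lines = [line.rstrip() for line in text.splitlines()]
    # Drop empty lines at the end of the file.
    while lines and lines[-1] == "":
        lines.pop()
    return "\n".join(lines) + "\n" if lines else ""


messy = "x = 1   \ny = 2\t\n\n\n"
clean = fix_whitespace(messy)
print(repr(clean))  # prints 'x = 1\ny = 2\n'

# A hook that had to modify a file counts as failed, so the commit is
# stopped and you can review and stage the fixed file before trying again.
hook_passed = clean == messy
print(hook_passed)  # prints False
```

Running the function a second time on the cleaned text changes nothing, which is why the commit succeeds once the fixed files are staged.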
To catch any style inconsistencies after the code was pushed to your remote repository (e.g., in case one of your collaborators has not installed the pre-commit hooks), you can also add these checks to your CI/CD pipeline (see Chapter 6).
Putting It All Together
Once you have set up all these tools, your repository should look something like this (see here for more details; setup for programming languages other than Python will differ slightly):
project-name/
├── .gitignore # Exclude unnecessary files from version control
├── README.md # Describe the project purpose and usage
├── .pre-commit-config.yaml # Pre-commit hook setup
├── pyproject.toml # Python dependencies and configs
├── data/ # Store (small) datasets
├── notebooks/ # For exploratory analysis
├── src/ # Core source code
└── tests/ # Unit tests
A clean project structure makes it easier to maintain your code.
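If you set up new projects regularly, a small script can create this layout for you. The following is a minimal sketch that creates the empty files and folders in a temporary directory. The function name create_skeleton is made up for this example, and the files still need to be filled with content afterwards.

```python
import tempfile
from pathlib import Path

FILES = [".gitignore", "README.md", ".pre-commit-config.yaml", "pyproject.toml"]
FOLDERS = ["data", "notebooks", "src", "tests"]


def create_skeleton(root):
    """Create the empty files and folders of the project layout under root."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    for name in FILES:
        (root / name).touch()
    for name in FOLDERS:
        (root / name).mkdir(exist_ok=True)
    return root


# Use a temporary directory here so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    project = create_skeleton(Path(tmp) / "project-name")
    created = sorted(path.name for path in project.iterdir())
    print(created)
```

Note that Git does not track empty folders, so folders like data/ and tests/ only show up in your repository once they contain at least one file.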
At this point, you should have a clear understanding of:
- How to set up your development environment to code efficiently.
- How to host your version-controlled repository on a platform like GitHub or GitLab, complete with pre-commit hooks to ensure well-formatted code.
- The fundamental syntax of your programming language of choice (incl. key scientific libraries) to get started.