Programming
Locality of Behavior
An interesting topic of discussion I came across online is whether the principles outlined in Uncle Bob's Clean Code book may have perverted the concept of good coding practice. One of Uncle Bob's rules that drew a lot of criticism is his strong emphasis on the Single Responsibility Principle (SRP), going so far as to suggest that all functions should be 4-6 lines long. Engineers and professors have rightfully pointed out that this absurd emphasis on short, isolated functions actually impedes maintainable code, because it is more difficult to keep, say, 40 tiny functions in your head than a few longer ones.
Ultimately, a good principle to follow is instead the Locality of Behavior principle, which states that the behavior of a unit of code should be as obvious as possible to the programmer looking only at that unit of code.
"The primary feature for easy maintenance is locality: locality is that characteristic of source code that enables a programmer to understand that source by looking at only a small portion of it." - Richard Gabriel
Inversion of Control
The design principle Inversion of Control (IoC) is key to simplifying abstractions. The problem with abstractions is that they are often leaky. As a developer you fall into cycles of refining an abstraction, constantly adding new options and cases it must support, until it becomes unwieldy, bug-riddled, and difficult to maintain.
Inversion of Control alleviates this problem by inverting the control flow of the program.
For a simple illustration of how inversion of control deals with these headaches, take a basic function that filters missing values out of an array:
# start with a simple filtering functionality
def filter_arr(arr):
    return [elem for elem in arr if elem is not None]
# now let's refine this abstraction to support many missing types
def filter_arr(arr, filter_none=True, filter_empty_string=False, filter_zero=False):
    lst = []
    for elem in arr:
        if filter_none and elem is None:
            continue
        if filter_empty_string and elem == "":
            continue
        if filter_zero and elem == 0:
            continue
        lst.append(elem)
    return lst
To invert control, we could do the following:
def filter_arr(arr, filter_fn):
    """filter_fn returns True if element is kept"""
    return [elem for elem in arr if filter_fn(elem)]
Now the responsibility for deciding how elements are filtered has moved out of the function and onto the caller. Control-inverted APIs like this can take slightly more effort to use, but in return they let you build further abstractions on top of them with ease (extensibility). Often, when you find yourself adjusting an API for someone who uses it slightly differently, consider inversion of control instead of reaching for another if branch or ternary operator. It also avoids baking bad assumptions about future uses into a reusable piece of code.
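For instance, each caller can now compose its own predicate on top of the inverted filter_arr without the function itself ever changing (the sample data below is just illustrative):

data = [0, "", None, "a", 3]

# Each caller decides what counts as "missing"; filter_arr never changes.
non_none = filter_arr(data, lambda e: e is not None)       # [0, '', 'a', 3]
truthy   = filter_arr(data, lambda e: bool(e))             # ['a', 3]
strings  = filter_arr(data, lambda e: isinstance(e, str))  # ['', 'a']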
Inversion of Control is the fundamental characteristic behind frameworks. Instead of your code calling the framework, the framework calls your tailored code (the Hollywood Principle: "don't call us, we'll call you"). The framework plays the role of coordinator, which makes it powerful as an extensible skeleton: the user supplies the methods that tailor the framework's generic algorithm.
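A minimal sketch of that idea (the class and hook names are invented for illustration): the framework owns the control flow and calls back into code the user supplies.

class Pipeline:
    """Generic skeleton: the framework, not the user, owns the control flow."""
    def __init__(self, load, transform):
        self.load = load              # user-supplied hooks
        self.transform = transform

    def run(self):
        data = self.load()            # the framework calls your code
        return [self.transform(x) for x in data]

# The user only tailors the steps, never the overall algorithm.
pipeline = Pipeline(load=lambda: [1, 2, 3], transform=lambda x: x * 2)
print(pipeline.run())  # [2, 4, 6]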
Do not confuse the Inversion of Control principle with the SOLID Dependency Inversion Principle, which is about making both high- and low-level modules depend on abstractions (interfaces or abstract classes) rather than having high-level modules depend on low-level concrete implementations (to encourage loose coupling).
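To make the distinction concrete, here is a minimal Dependency Inversion sketch (the names are invented): both the high-level service and the low-level implementation depend on an abstraction, not on each other.

from abc import ABC, abstractmethod

class Storage(ABC):                        # the abstraction both sides depend on
    @abstractmethod
    def save(self, record: str) -> None: ...

class DiskStorage(Storage):                # low-level module implements Storage
    def save(self, record: str) -> None:
        print(f"writing {record} to disk")

class ReportService:                       # high-level module depends only on
    def __init__(self, storage: Storage):  # the Storage abstraction
        self.storage = storage

    def publish(self, report: str) -> None:
        self.storage.save(report)

ReportService(DiskStorage()).publish("q3-report")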
Lessons
One of the lessons on maintainable code that I learnt (regrettably, the hard way) is the importance of separating data-loading I/O from data manipulation. In my Argonne flood mapping project, I made the mistake (many times!) of encapsulating in one single function both the data-loading logic (say, reading data from a file) and some operational logic on that data (say, computing a mean and standard deviation). At first this seems attractive because it keeps all the logic in one place (cough cough, locality). The problem later on, however, is that this hinders TDD, because you cannot easily test the data loading and the computation separately. It forces you to invent messy ways of handing dummy data over to the function in order to evaluate it at test time. The rule here is that for proper testing, the input to a function should be constructible from the outside, so you do not need to create mocks. Keeping data retrieval and data manipulation as separate functions makes things much easier to test, as sketched below.
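A hypothetical before/after version of that refactor (the file format and statistics are invented for illustration): once the computation is a pure function, tests can feed it plain data with no mocking.

import json
import statistics

# Before: loading and computation are fused, so testing the math needs a file.
def load_and_summarize(path):
    with open(path) as f:
        values = json.load(f)
    return statistics.mean(values), statistics.stdev(values)

# After: I/O and computation are separate; summarize is trivially testable.
def load_values(path):
    with open(path) as f:
        return json.load(f)

def summarize(values):
    return statistics.mean(values), statistics.stdev(values)

assert summarize([1, 2, 3]) == (2, 1.0)  # plain data in, no mocks required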
There are always opposing tensions between programming principles; here it is the tension between the locality principle and SRP for testability. In many cases you may never need to test a particular function, but it is still worth making it testable. A good rule of thumb: if you can't get it right in three tries, you need to test it.
Programming Patterns
Context Object Pattern
The pattern involves bundling function arguments into a context object (e.g. a dataclass like AppContext) and passing that object as a single argument. This avoids unwieldy function calls with dozens of parameters.
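A minimal sketch of the pattern (the field and function names are invented for illustration):

from dataclasses import dataclass

@dataclass
class AppContext:
    db_url: str
    cache_ttl: int
    debug: bool = False

# One object replaces a long tail of individual parameters.
def start_server(ctx: AppContext):
    print(f"connecting to {ctx.db_url} (debug={ctx.debug})")

start_server(AppContext(db_url="postgres://localhost/app", cache_ttl=60))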
The context object pattern must be used with care; in some scenarios it becomes an antipattern. For instance, if it is applied too widely throughout the codebase, down to low-level functions that do not need all the parameters, the result is unnecessary coupling and heavy dependency on the context object.
In my own experience, the context object pattern has been really useful in an ML pipeline setting, where many different parameters necessary for configuring an experiment need to be passed through multiple layers of execution.
I learned this from experience. At the beginning of my project, I tried to keep my code explicit, with long function signatures of 6-10 arguments, for transparency and readability. I soon discovered that this choice traded away maintainability: it was frustrating to manage as new parameters were constantly added and old ones removed. The better solution was to use OmegaConf objects, instantiated from YAML files, to encapsulate the many experiment arguments. I also chose to pass the cfg object only to high-level functions like run_experiment, and to avoid doing so for lower-level functions like calculate_weight_decay: passing a small function many parameters it will never use adds unnecessary coupling and code brittleness.
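A sketch of that split, assuming a hypothetical experiment.yaml (the decay formula here is invented for illustration); OmegaConf.load turns the YAML file into an attribute-accessible cfg object:

from omegaconf import OmegaConf

# experiment.yaml (hypothetical) might contain:
#   optimizer:
#     lr: 0.001
#     weight_decay: 0.01
cfg = OmegaConf.load("experiment.yaml")

def calculate_weight_decay(base_decay, epochs):
    # low-level function: takes only the plain values it needs, not cfg
    return base_decay / epochs

def run_experiment(cfg):
    # high-level function: receives the whole config and unpacks it
    lr = cfg.optimizer.lr
    wd = calculate_weight_decay(cfg.optimizer.weight_decay, epochs=10)
    print(f"training with lr={lr}, weight_decay={wd}")

run_experiment(cfg)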
For more detail, see the following video.