Have you ever needed to reuse code you wrote earlier? The first time the need arises, we simply copy the required function. But if you get carried away with Ctrl-C and Ctrl-V, the files gradually turn into a mess and the code stops working properly. Almost every developer runs into this problem of accumulating code. PsyQuation’s Data Science team ran into it too, which gave us the idea of creating our own library: DS-Lib.
When we started, there was no need for a library. We received tasks such as writing a particular algorithm or creating a function to calculate some value. But then we needed to use these functions to test the correctness of other code and to write backtests. That is when we slowly slipped back onto the evil path. The first thought that comes to mind is: “I’ll just quickly copy this function.” After all, it’s only a few lines of code. Copied, tested, forgotten. However, we need the code written in the past to test new features of our algorithms. So the question was how to handle such cases. Use copy-paste again?
Many data scientists work in Jupyter notebooks, where copying the same code from one notebook to another is even less pleasant. It’s much easier to create a notebook with functions and import it wherever it’s needed. But this approach still has problems.
First, we often need only one function but end up importing the whole notebook. Or the functionality we need lives in another folder with a long and complicated path. In that case, out of laziness, we don’t import the notebook at all; we just copy it into the current directory.
Another unpleasant situation arises when one notebook executes another. Suppose we have three notebooks, A, B, and C. We execute notebook A inside notebook B, and then execute B inside notebook C. When we did this, we got both cluttered output on the screen (intermediate results) and import errors. Also, the current notebook may contain a function similar to, or with the same name as, a function in the notebook being imported, so we need to keep track of namespaces to avoid conflicts.
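To illustrate the namespace point, here is a minimal sketch of how keeping shared code behind a module name avoids a clash with a local function of the same name. The `helpers` module and both `normalize` functions are hypothetical and are not taken from DS-Lib; the module is built in memory only so that the example is self-contained.

```python
import types


def _normalize_by_sum(values):
    """Scale values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


# Hypothetical stand-in for a shared module; in practice this would be
# a .py file inside the library, imported with "import helpers".
helpers = types.ModuleType("helpers")
helpers.normalize = _normalize_by_sum


# A function with the same name defined locally, as could happen in a notebook.
def normalize(values):
    """Scale values so that the largest one equals 1."""
    top = max(values)
    return [v / top for v in values]


# Both versions stay reachable because the shared one lives in its own namespace.
local_result = normalize([1, 3])
shared_result = helpers.normalize([1, 3])
```

Executing one notebook inside another, by contrast, puts every name into the same global namespace, so whichever definition runs last silently wins.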
We also often need to use code written by a colleague, and questions come up about the exact format of the input parameters, the details of how it works, and so on. We do comment our code, but it usually stays at the level of comments and rarely ends up as documentation with a specific description of the parameters.
The solution doesn’t seem difficult at all: keep the functionality in one place, document it for ease of use, and enjoy using it.
So, is a data scientist a programmer or not?
For me, the answer can be found in this quote:
“A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician”. — Josh Wills.
That is, a data scientist’s code should be of high quality not only in that it works correctly but also in that it can be reused in the future. After all, no one but us will take care of that.
So we identified the following main tasks for the first stage of DS-Lib development:
- agree on a general architecture for the library that will be convenient to use in the future, and decide which modules are needed and which functions go into each;
- write parameter descriptions and documentation for all functions;
- refactor the code and parametrize functions that perform the same actions with only slight changes in the input parameters.
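The last two tasks can be sketched with a small example. Instead of keeping several near-identical copies of a function that differ only in a hard-coded window length, we keep one function with a `window` parameter and document it. The function `rolling_mean` and its parameters are hypothetical and serve only as an illustration; this is not actual DS-Lib code.

```python
from typing import List, Sequence


def rolling_mean(values: Sequence[float], window: int) -> List[float]:
    """Return the simple moving average of ``values``.

    Parameters
    ----------
    values : sequence of float
        Observations in chronological order.
    window : int
        Number of consecutive observations in each average.
        Must be between 1 and ``len(values)``.

    Returns
    -------
    list of float
        One average per full window, so the result has
        ``len(values) - window + 1`` items.
    """
    if not 1 <= window <= len(values):
        raise ValueError("window must be between 1 and len(values)")
    # Slide the window over the data and average each slice.
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# One parametrized function replaces separate "2-point" and "4-point" copies.
short_average = rolling_mean([1, 2, 3, 4], window=2)
long_average = rolling_mean([1, 2, 3, 4], window=4)
```

With a docstring like this, a colleague can check the expected input format through `help(rolling_mean)` without reading the implementation.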
Finally, we built the first version of our library. Let’s take a closer look at its structure. At the moment, all of the functionality is organized into 3 modules and 2 packages.
rates. Designed to diagnose FxRates on company servers. The need arose when we accidentally discovered an error in our data (equity) caused by incorrect rates. We hoped it was an isolated case, but once we understood how to detect such errors, we wrote a script that checks for this error on all servers. Later, we found that this wasn’t the only problem in the data, and we wrote new scripts to search for each type of error. We therefore combined this functionality into a separate module that lets you select the server and the problem to look for. The diagnostic now takes no more than 1 minute to run.
trade. Contains helper functions for working with trades and checking their correctness.
account. A large module that contains functionality for parsing standard statements, merging accounts, creating virtual accounts, starting the statistics calculation, and deleting this information.
support. Contains documented scripts to perform tasks from the support team.
account_selection. Contains 4 modules:
risk_parity. Implements the risk parity algorithm (see the article about risk parity) and supports running this algorithm in practice.
support_for_backtest. Contains functions for managing backtests.
calculate accounts. Contains functions for managing allocations and loading this information into the database.
account_select_pipe. Contains filters for selecting accounts for allocation.
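As an illustration of the idea behind the rates module (one entry point where you choose the server and the problem to look for), here is a minimal sketch. All names in it (`CHECKS`, `diagnose`, the two checks, the server name, and the sample rates) are assumptions made up for this example; the real module works with data from company servers and its checks are not shown here.

```python
from typing import Callable, Dict, List, Optional

# symbol -> rate, where None marks a missing value (hypothetical data shape)
Rates = Dict[str, Optional[float]]


def find_missing(rates: Rates) -> List[str]:
    """Return symbols that have no rate at all."""
    return sorted(symbol for symbol, rate in rates.items() if rate is None)


def find_non_positive(rates: Rates) -> List[str]:
    """Return symbols whose rate is zero or negative."""
    return sorted(
        symbol for symbol, rate in rates.items()
        if rate is not None and rate <= 0
    )


# Registry of available checks: adding a new problem type means adding one entry.
CHECKS: Dict[str, Callable[[Rates], List[str]]] = {
    "missing": find_missing,
    "non_positive": find_non_positive,
}


def diagnose(rates_by_server: Dict[str, Rates], server: str, problem: str) -> List[str]:
    """Run the selected check on the selected server's rates."""
    if server not in rates_by_server:
        raise KeyError(f"unknown server: {server}")
    if problem not in CHECKS:
        raise KeyError(f"unknown problem: {problem}")
    return CHECKS[problem](rates_by_server[server])


# Made-up sample data, for illustration only.
sample = {"demo": {"EURUSD": 1.1, "USDJPY": None, "GBPUSD": -1.0}}
missing_symbols = diagnose(sample, server="demo", problem="missing")
bad_symbols = diagnose(sample, server="demo", problem="non_positive")
```

Keeping the checks in a registry means that each newly discovered type of error becomes one more function rather than one more standalone script.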
Of course, this is not the final version of DS-Lib. We plan to define an Account class with functionality similar to the PerformanceAnalytics package (R language). Our main goal now is to create this Account class and optimize the calculation of all statistics.
To be continued…