Analyzing Software Development using data science and analytic tools

Yesterday I had the pleasure of hosting Markus Harrer, who gave a talk about analyzing software code using data science, a topic he describes in detail on his blog.

There is often a communication gap between software developers and management. While good developers can see the big picture of the code and identify in time the need to restructure or even rewrite it, they often fail to see the time and cost risks of such an operation.

Management, on the other hand, while able to identify the risks of embarking on a rewrite of a legacy system, often fails to understand the outcome and the risks of neglecting maintenance.

One way to bridge this 'gap of ignorance' is to communicate using data science and analytics. Jupyter notebooks, for example, which combine textual explanations with analytics and diagrams, can easily communicate the risks in a software system to management.

Before applying data science to software code, one should decide what the purpose of the analysis is, and which features should be chosen to analyze that target.

The main targets that Markus was discussing were:

  • Un-maintained code (old code)
    Code that has not been maintained for more than six months, for example, is considered old. As such, the company may be lacking the maintenance know-how for those modules.
  • Structure
    By analyzing the connection between different objects or function calls, one can find 'dead' parts that are not being called and can be removed from the system.
  • Performance
    By analyzing logs and execution timestamps, the bottlenecks of the software can be identified.
To illustrate these ideas, he demonstrated them using a Jupyter notebook and Python.

Old code

He retrieved all the commits of a project from GitHub (Linux, in his example) as a CSV containing the user, timestamp, filename and line number, and used Pandas to analyze it. With simple group-by statements he showed how to find the top contributors and the most and least maintained files (and, with simple string manipulations, modules/folders as well). Using the delta between the current date and the timestamp column, he also found the files and modules which were not updated recently.
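The steps above can be sketched in a few lines of Pandas. The column names and the inlined sample data here are assumptions for illustration; a real analysis would read the full commit-log CSV exported from the repository.

```python
import io
import pandas as pd

# Tiny stand-in for the commit-log CSV described above
# (columns and values are illustrative assumptions).
csv = io.StringIO("""author,timestamp,filename
alice,2016-01-10,kernel/sched/core.c
bob,2017-03-02,drivers/net/e1000.c
alice,2017-04-15,kernel/sched/core.c
carol,2015-06-01,fs/ext4/inode.c
""")
commits = pd.read_csv(csv, parse_dates=["timestamp"])

# Top contributors
top_authors = commits["author"].value_counts()

# Most / least maintained files
file_counts = commits["filename"].value_counts()

# Derive the module (top-level folder) with a simple string manipulation
commits["module"] = commits["filename"].str.split("/").str[0]

# Files not touched recently: delta between "now" and the last commit
last_touch = commits.groupby("filename")["timestamp"].max()
age_days = (pd.Timestamp("2018-01-01") - last_touch).dt.days
stale = age_days[age_days > 180]  # older than ~6 months
```

Sorting `stale` in descending order then surfaces the least maintained files first, and grouping by the derived `module` column gives the same picture at folder level.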

Code Structure

For analyzing the structure, he used an automated tool to convert Java object-oriented code into a graph database schema (Neo4j). Every class was represented as a node, and the relations between classes as edges.

The same can also be done for function-oriented programming, by representing a function invocation as an edge between two nodes (which represent the functions).

Once the graph is ready, it is easy to analyze it using graph-theory algorithms, such as finding isolated islands (dead code), hubs (the nodes with the most edges), and so on.
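Markus used Neo4j for this, but the same ideas can be sketched with plain Python on a toy call graph. The function names below are made up for illustration; the point is the two graph queries: unreachable components and highest-degree nodes.

```python
from collections import defaultdict, deque

# Toy call graph: function -> functions it calls (names are illustrative).
calls = {
    "main": ["parse_args", "run"],
    "run": ["load_config", "process"],
    "process": ["load_config"],
    "legacy_export": ["legacy_format"],  # an island: never reached from main
}

def reachable(graph, start):
    """Breadth-first search from an entry point."""
    seen, queue = {start}, deque([start])
    while queue:
        for callee in graph.get(queue.popleft(), []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen

# Isolated islands: everything not reachable from the entry point
# is a candidate for dead code.
all_funcs = set(calls) | {c for cs in calls.values() for c in cs}
dead = all_funcs - reachable(calls, "main")

# Hubs: the nodes with the most edges (in-degree + out-degree).
degree = defaultdict(int)
for caller, callees in calls.items():
    for callee in callees:
        degree[caller] += 1
        degree[callee] += 1
hubs = sorted(degree.items(), key=lambda kv: kv[1], reverse=True)
```

In a graph database the same two questions become short Cypher queries, which is what makes the Neo4j representation attractive for larger codebases.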

Performance

To analyze the performance, he analyzed execution logs by converting them into a Graph as well. 
A similar approach for web development is to export the performance recording from Chrome Developer Tools and convert it from a tree into a graph.

Since each function invocation includes a timestamp, it is easy to detect not only the heaviest functions but also the heaviest function chains, and then address them in the code.

Interesting?

For more information, please visit his blog and his GitHub repository for examples; the original talk slides are here. Feel free to leave feedback on this post.

Keep up with our meetups; we're also on Facebook and Twitter.
