Project management and collaborating on data science projects

Nigel Zhuwaki


Hello All,

I am currently leading a data and analytics team. We receive data from multiple sources that requires cleaning, analysis, and reporting. As we model data pipelines for the different projects we work on, we have faced some challenges that affect how we work.

All data analysis work up to now has been happening in spreadsheets. Data pipelines were not accurately defined. Each data project lacked clear documentation of how the data was handled, reshaped, and analysed. Reproducing analysis and accounting for changes between raw and clean data was impossible. Only the individuals who worked on the data could recall the analysis that had occurred, which meant all knowledge and quality assurance workflows developed on a specific project vanished once it was completed. When new data was received, it was not clear who was responsible for it or what approach was best to process it.

In this post, I will describe how we are using open source tools to solve the challenge of collaborating on data science projects. This solution is not comprehensive; rather, it has come about as a response to the challenges of reproducibility of analysis and management of data workflows highlighted in the previous paragraph.

We decided as a team to extend our analytic capabilities beyond Excel and Google Sheets by adopting either Python or R as a programming language. The team chose Python as the default language for two reasons: firstly, I am more proficient in Python for data analysis than in R, which makes it easier for me to support the team; and secondly, I personally find Python easier to learn than R. For some projects we do, however, integrate R into our data workflows, using packages such as tidytransit for analysing GTFS feeds and choiceDes for designing choice experiments.

After picking a language, the next step was to train the team to use the pandas package to explore and analyse data. Essentially, the objective was to bridge the gap between spreadsheets and pandas by encouraging team members to perform the same tasks in Python as they would in a spreadsheet.
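As an illustration of that bridge, two everyday spreadsheet operations, filtering rows and building a pivot-table-style summary, translate directly into pandas. The data and column names below are hypothetical, just the kind of table we would previously have kept in a sheet:

```python
import pandas as pd

# A small, hypothetical trips table standing in for a spreadsheet tab
df = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "mode": ["bus", "bus", "rail", "rail"],
    "trips": [120, 80, 60, 40],
})

# Spreadsheet filter -> boolean indexing
bus_only = df[df["mode"] == "bus"]

# Spreadsheet pivot table -> groupby and aggregate
summary = df.groupby("region")["trips"].sum()
print(summary)
```

The payoff over a spreadsheet is that every step is written down: the filter and the aggregation are explicit, repeatable lines of code rather than clicks that leave no trace.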

To perform the analysis, the team uses Jupyter Notebooks. Jupyter Notebooks, developed by Project Jupyter, make it easy for the team to document their analysis and report on their workflows. According to the Jupyter notebook documentation:

The notebook extends the console-based approach to interactive computing in a qualitatively new direction, providing a web-based application suitable for capturing the whole computation process: developing, documenting, and executing code, as well as communicating the results.

The Jupyter notebook combines two components:

A web application: a browser-based tool for interactive authoring of documents which combine explanatory text, mathematics, computations, and their rich media output.

Notebook documents: a representation of all content visible in the web application, including inputs and outputs of the computations, explanatory text, mathematics, images, and rich media representations of objects.

Documentation of analysis improves problem-solving skills for data scientists for two reasons. Firstly, documenting forces the analyst to fully understand the problem and the objective of the analysis before embarking on it. Secondly, documentation makes it easy to share references across data projects, saving time and coding effort.

For project organization, we adopted the cookiecutter Python package. Cookiecutter provides a logical, reasonably standardized, but flexible project structure for doing and sharing data science work. We selected it for many reasons, one being that it is compatible with our current data science workflows and can easily be implemented. The template also contains Python boilerplate for Python projects. A good project structure encourages practices that make it easier to come back to old work, and supports engineering best practices like version control.
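To give a sense of what that structure looks like, here is a trimmed sketch of the kind of layout the Cookiecutter Data Science template generates (abridged; the annotations are ours, and the exact directories depend on the template version):

```text
├── README.md          <- Top-level description of the project
├── data
│   ├── raw            <- The original, immutable data dump
│   ├── interim        <- Intermediate data that has been transformed
│   └── processed      <- The final, canonical data sets for analysis
├── notebooks          <- Jupyter notebooks
├── reports            <- Generated analysis and figures
├── src                <- Source code for use in this project
└── requirements.txt   <- Dependencies for reproducing the environment
```

The raw/interim/processed split is what lets us account for changes between raw and clean data: raw inputs are never edited in place, and every transformation lives in code under src or notebooks.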

For storage, sharing, reviewing, and version control of notebooks, we use GitHub repositories and the GitHub flow. GitHub, apart from being a good place to develop your data science portfolio, is also a good place to learn from and share data science projects. It also provides tools for project management and collaboration. With GitHub pull requests, collaborators can review, comment, and see exactly which lines of code changed. Collaborators can also set timelines and raise and track issues against a specific repository.

Implementing these tools and practices has resolved the challenge of documenting data analysis workflows, and the challenge of sharing and collaborating on data science projects.

As we reinforce the new practices and iterate to improve reproducibility in our analysis, we are still researching better ways to improve collaboration and project management of our data science workflows using open source tools.
