How do Data Scientists use GitHub? (Hons) [Open]




Previous work has highlighted that data scientists may use version control platforms such as GitHub differently from other software developers. In this project, you will leverage the GitHub API to extract information about data science projects written in multiple languages, and compare the version control operations performed in them. Expect to assess repository types and user activity, perform topic modelling on commit messages and issues, and build participation graphs between developers. In addition, you will be part of an ethics application so that developers can be surveyed through an anonymous online survey.
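As a rough illustration of the kind of data extraction involved, the sketch below builds a query for the GitHub REST API repository search endpoint and tallies the returned repositories by primary language. The function names, the `data-science` topic qualifier, and the sorting choice are assumptions made for this example only; the actual selection criteria will be defined during the project.

```python
import json
from collections import Counter
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_ROOT = "https://api.github.com"


def build_repo_search_url(language, topic="data-science", per_page=30, page=1):
    """Build a repository search URL (the topic filter is an assumption)."""
    params = {
        "q": f"topic:{topic} language:{language}",
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,
        "page": page,
    }
    return f"{API_ROOT}/search/repositories?{urlencode(params)}"


def fetch_json(url, token=None):
    """Download and decode one page of results; a token raises the rate limit."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    with urlopen(Request(url, headers=headers)) as response:
        return json.load(response)


def count_by_language(search_payload):
    """Count repositories per primary language in a search response."""
    items = search_payload.get("items", [])
    return Counter(item.get("language") or "unknown" for item in items)
```

For example, `count_by_language(fetch_json(build_repo_search_url("R")))` would summarise one page of results for R repositories. Unauthenticated requests are rate limited, so a real extraction would need a personal access token and pagination.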

Note: This project is open and recruiting students.



  • Programming knowledge, preferably either Python or R. Other languages are welcome but not needed.
  • Knowledge of (or willingness to quickly learn about) using APIs to download data.
  • Demonstrated academic writing skills.
  • Excellent attention to detail.
Please contact me via email with a detailed resume and a statement (1 page only) explaining why you are interested in this project.
Anybody is welcome to apply. However, female (or female-identifying) candidates are especially encouraged to apply.

Background Literature

Z. Codabux, M. Vidoni and F. Fard, "Technical Debt in the Peer-Review Documentation of R Packages: a rOpenSci Case Study," in 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), 2021, pp. 195-206.

  • Empirical Software Engineering. Mixed-Methods. Developers Survey.
  • Natural Language Processing
  • Data Science Software, Scientific Software
  • Developers' Challenges


