Category Archives: best practices

ChatGPT and AI-generated Code: The Impact of Natural Language Models on Software Creation and Sharing

The following guest post is by John Wallin, the Director of the Computational and Data Science Ph.D. Program and Professor of Physics and Astronomy at Middle Tennessee State University

Photo: Dr. John Wallin

Since the 1960s, scientific software has undergone repeated innovation cycles in languages, hardware capabilities, and programming paradigms. We have gone from Fortran IV to C++ to Python. We moved from punch cards and video terminals to laptops and massively parallel computers with hundreds to millions of processors. Complex numerical and scientific libraries and the ability to immediately seek support for these libraries through web searches have unlocked new ways for us to do our jobs. Neural networks are commonly used to classify massive data sets in our field. All these changes have impacted the way we create software.

In the last year, large language models (LLMs) have been created to respond to natural language questions. The underlying architecture of these models is complex, but the current generation is based on generative pre-trained transformers (GPT). In addition to the base architecture, they have recently incorporated supervised learning and reinforcement learning to improve their responses. These efforts have resulted in a flexible artificial intelligence system that can help solve routine problems. Although the primary purpose of these large language models was to generate text, it became apparent that they could also generate code. These models are in their infancy, but they have been very successful in helping programmers create code snippets that are useful in a wide range of applications. I want to focus on two applications of transformer-based LLMs: ChatGPT by OpenAI and GitHub Copilot.

ChatGPT is perhaps the most well-known and used LLM. The underlying GPT LLM was released about a year ago, but a newer interactive version was made publicly available in November 2022. The user base exceeded a million after five days and has grown to over 100 million. Unfortunately, most of the discussion about this model has been either dismissive or apocalyptic. Some scholars have posted something similar to this:

“I wanted to see what the fuss is about this new ChatGPT thing, so I gave it a problem from my advanced quantum mechanics course. It got a few concepts right, but the math was completely wrong. The fact that it can’t do a simple quantum renormalization problem is astonishing, and I am not impressed. It isn’t a very good “artificial intelligence” if it makes these sorts of mistakes!”

The other response comes from some academics:

“I gave ChatGPT an essay problem that I typically give my college class. It wrote a PERFECT essay! All the students are going to use this to write their essays! Higher education is done for! I am going to retire this spring and move to a survival cabin in Montana to escape the cities before the machine uprising occurs.”

Of course, neither view is entirely correct. My reaction to the first viewpoint is, “Have you met any real people?” It turns out that not every person you meet has advanced academic knowledge in your subdiscipline. ChatGPT was never designed to replace grad students. A future version of the software may be able to incorporate more profound domain-specific knowledge, but for now, think of the current generation of AIs as your cousin Alex. They took a bunch of college courses and got a solid B- in most of them. They are very employable as an administrative assistant, but you won’t see them publish any of their work in Nature in the next year or two. Hiring Alex will improve your workflow, even if they can’t do much physics.

The apocalyptic view also misses the mark, even if the survival cabin in Montana sounds nice. Higher education will need to adapt to these new technologies. We must move toward more formal proctored evaluations for many of our courses. Experiential and hands-on learning will need to be emphasized, and we will probably need to reconsider (yet again) what we expect students to take away from our classes. Jobs will change because of these technologies, and our educational system needs to adapt.

Despite these divergent and extreme views, generative AI is here to stay. Moreover, its capabilities will improve rapidly over the next few years. These changes are likely to include:

  • Access to live web data and current events. Microsoft’s Bing (currently in limited release) already has this capability. Other engines are likely to become widely available in the next few months.
  • Improved mathematical abilities via linking to other software systems like Wolfram Alpha. ChatGPT makes mathematical errors routinely because it is doing math via language processing. Connecting this to symbolic processing will be challenging, but there have already been a few preliminary attempts.
  • Increased ability to analyze graphics and diagrams. Identifying images is already routine, so moving to understanding and explaining diagrams is not an impossible extension. However, this type of future expansion would impact how the system analyzes physics problems.
  • Accessing specialized datasets such as arXiv, ADS, and even astronomical data sets. It would be trivial to train GPT-3.5 on these data sets and give it domain-specific knowledge.
  • Integrating the ability to create and run software tools within the environment. We already have this capability in GitHub Copilot, but the ability to read online data and immediately do customized analysis on it is not out of reach for other implementations either.

Even without these additions, writing code with GitHub Copilot is still a fantastic experience. Based on what you are working on, your comments, and its training data, it attempts to anticipate your next line or lines of code. Sometimes it might try to write an entire function for you based on a comment or the name of the last function. I’ve been using this for about five months, and I find it particularly useful when using library functions that are a bit unfamiliar. For example, instead of googling how to add a window with a pulldown menu in Python, you write a comment explaining what you want to do, and the code is generated below your comment. It also works exceptionally well at solving simple programming tasks such as creating a Mandelbrot set or downloading and processing data. I estimate that my coding speed for solving real-world problems using this interface has tripled.
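To make the comment-driven workflow concrete, here is a minimal sketch of the kind of snippet such a tool might produce from a single descriptive comment; the function name, parameter values, and plotting details are my own illustrative choices, not actual Copilot output.

    import numpy as np
    import matplotlib.pyplot as plt

    # Compute an escape-time image of the Mandelbrot set and display it.
    def mandelbrot(width=800, height=600, max_iter=100):
        x = np.linspace(-2.5, 1.0, width)
        y = np.linspace(-1.25, 1.25, height)
        c = x[np.newaxis, :] + 1j * y[:, np.newaxis]
        z = np.zeros_like(c)
        counts = np.zeros(c.shape, dtype=int)
        for i in range(max_iter):
            mask = np.abs(z) <= 2.0           # points that have not yet escaped
            z[mask] = z[mask] ** 2 + c[mask]
            counts[mask] = i                  # record the last bounded iteration
        return counts

    if __name__ == "__main__":
        plt.imshow(mandelbrot(), cmap="magma", extent=(-2.5, 1.0, -1.25, 1.25))
        plt.xlabel("Re(c)")
        plt.ylabel("Im(c)")
        plt.show()

The value of the tool is less in any single snippet like this and more in not having to switch to a search engine for routine boilerplate.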

However, two key issues need to be addressed when using the code: authorship and reliability.

When you create code using an AI, it draws on millions of lines of publicly available code to find matches to your current coding. It predicts what you might be trying to do based on what others have done. For simple tasks like creating a call to a known function in a Python library, this is not likely to infringe on the intellectual property of someone’s code. However, when you ask it to create functions, it is likely to draw on other codes that accomplish the task you want to complete. For example, there are perhaps thousands of examples of ODE integrators in open-source codes. Asking it to create such a routine for you will likely result in inadvertently using one of those codes without knowing its origin.
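To see why attribution is so hard to trace here, consider the kind of routine just mentioned. A classical fourth-order Runge-Kutta step is only a few lines, and near-identical versions appear in countless open-source codes; the sketch below is written from the textbook algorithm rather than taken from any particular project, which is exactly the ambiguity an AI-generated version would share.

    import numpy as np

    # One classical fourth-order Runge-Kutta step for dy/dt = f(t, y).
    def rk4_step(f, t, y, h):
        k1 = f(t, y)
        k2 = f(t + 0.5 * h, y + 0.5 * h * k1)
        k3 = f(t + 0.5 * h, y + 0.5 * h * k2)
        k4 = f(t + h, y + h * k3)
        return y + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

    # Quick check: integrate dy/dt = -y from t = 0 to t = 1; the result
    # should be close to exp(-1) ~ 0.3679.
    y, t, h = np.array([1.0]), 0.0, 0.01
    while t < 1.0:
        y = rk4_step(lambda t, y: -y, t, y, h)
        t += h
    print(y[0], np.exp(-1.0))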

The only thing of value we produce in science is ideas. Using someone else’s thoughts or ideas without attribution can cross into plagiarism, even if that action is unintentional. Code reuse and online forums are regularly part of our programming process, but we have a higher level of awareness of what is and isn’t allowed when we are the ones googling the answer. Licensing and attribution become problematic even in a research setting. There may be problems claiming a code is our intellectual property if it uses a public code base. Some major companies have banned the use of ChatGPT for this reason. At the very least, acknowledging that you used an AI to create the code seems like an appropriate response to this challenge. Only you can take responsibility for your code, but explaining how it was developed might help others understand its origin.

The second issue for the new generation of AI assistants is reliability. When I asked ChatGPT to write a short biographical sketch for “John Wallin, a professor at Middle Tennessee State University,” I found that I had received my Ph.D. from Emory University and studied Civil War and Reconstruction-era history. It confidently cited two books that I had authored about the Civil War. All of this was nonsense: a computer generating text that it thought I wanted to read.

It is tempting to think that AI-generated code will produce correct results. However, I have regularly seen major and minor bugs in the code it has generated. Some of the mistakes can be subtle but could lead to erroneous results. Therefore, no matter how the code is generated, we must continue to use validation and verification to confirm both that the code correctly implements our algorithms and that it is the right code to solve our scientific problem.
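One lightweight way to apply verification to generated code is to pin it against cases with known answers before trusting it. The sketch below is illustrative: the trapezoid routine stands in for any AI-generated function, and the expected values are standard mathematical facts rather than anything from this post.

    import numpy as np
    import pytest

    # Stand-in for an AI-generated routine: composite trapezoidal integration.
    def trapezoid(y, x):
        return np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x))

    def test_matches_analytic_integral():
        # Verification against a known answer: integral of sin(x) on [0, pi] is 2.
        x = np.linspace(0.0, np.pi, 10001)
        assert trapezoid(np.sin(x), x) == pytest.approx(2.0, rel=1e-6)

    def test_exact_for_linear_functions():
        # The trapezoidal rule is exact for straight lines: integral of 2x on [0, 1] is 1.
        x = np.linspace(0.0, 1.0, 11)
        assert trapezoid(2.0 * x, x) == pytest.approx(1.0)

Running these with pytest makes the verification repeatable every time the generated code is modified.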

Both authorship and reliability will continue to be issues when we teach our students about software development in our fields. At the beginning of the semester, I had ChatGPT generate “five group coding challenges that would take about 30 minutes for graduate students in a Computational Science Capstone course.” When I gave them to my students, it took them about 30 minutes to complete. I created solutions for ALL of them using GitHub Copilot in under ten minutes. Specifying when students can and can’t use these tools is critical, along with developing appropriate metrics for evaluating their work when using these new tools. We also need to push students toward better practices in testing their software, including making testing data sets available when the code is distributed.

Sharing your software has never been more important, given these challenges. Although we can generate codes faster than ever, the reproducibility of our results matters. The only accurate description of your methodology is the code you used to create the results. Publishing your code when you publish your results will increase the value of your work to others. As the abilities of artificial intelligence improve, the core issues of authorship and reliability will still need to be addressed by human intelligence.

Addendum: The Impact of GPT-4 on Coding and Domain-Specific Knowledge
Written with the help of GPT-4; added March 20, 2023

Since the publication of the original blog post, there have been significant advancements in the capabilities of AI-generated code with the introduction of GPT-4. This next-generation language model continues to build on the successes of its predecessors while addressing some of the limitations that were previously observed.

One of the areas where GPT-4 has shown promise is in its ability to better understand domain-specific knowledge. While it is true that GPT-4 doesn’t inherently have access to specialized online resources like arXiv, its advanced learning capabilities can be utilized to incorporate domain-specific knowledge more effectively when trained with a more specialized dataset.

Users can help GPT-4 better understand domain-specific knowledge by training it on a dataset that includes examples from specialized sources. For instance, if researchers collect a dataset of scientific papers, code snippets, or other relevant materials from their specific domain and train GPT-4 with that data, the AI-generated code would become more accurate and domain-specific. The responsibility lies with the users to curate and provide these specialized datasets to make the most of GPT-4’s advanced learning capabilities.
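As a rough sketch of what curating such a dataset might look like, the snippet below packages plain-text domain documents into the JSON Lines format commonly used for supervised fine-tuning. The directory name, prompt wording, and output file are all hypothetical, and whether a given model (GPT-4 included) can actually be fine-tuned this way depends on what the provider exposes.

    import json
    from pathlib import Path

    # Hypothetical layout: one plain-text abstract per file in ./abstracts/.
    records = []
    for path in sorted(Path("abstracts").glob("*.txt")):
        text = path.read_text(encoding="utf-8").strip()
        records.append({
            "prompt": f"Summarize the key result of the paper '{path.stem}'.",
            "completion": text,
        })

    # One JSON object per line, the layout many fine-tuning services accept.
    with open("domain_finetune.jsonl", "w", encoding="utf-8") as out:
        for rec in records:
            out.write(json.dumps(rec) + "\n")

    print(f"Wrote {len(records)} training records to domain_finetune.jsonl")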

By tailoring GPT-4’s training data to be more suited to their specific needs and requirements, users can address the challenges of authorship and reliability more effectively. This, in turn, can lead to more efficient and accurate AI-generated code, which can be particularly valuable in specialized fields.

In addition to the advancements in domain-specific knowledge and coding capabilities, GPT-4 is also set to make strides in the realm of image analysis. Although not directly related to coding, these enhancements highlight the growing versatility of the AI engine. While the image analysis feature is not yet publicly available, it is expected to be released soon, allowing users to tap into a new array of functionalities. This expansion of GPT-4’s abilities will enable it to understand and interpret images, diagrams, and other visual data, which could have far-reaching implications for various industries and applications. As GPT-4 continues to evolve, it is crucial to recognize and adapt to the ever-expanding range of possibilities that these AI engines offer, ensuring that users can leverage their full potential in diverse fields.

With the rapid advancements in AI capabilities, it is essential for researchers, educators, and developers to stay informed and adapt to the changes that GPT-4 and future models bring. As AI-generated code becomes more accurate and domain-specific, the importance of understanding the potential benefits, limitations, and ethical considerations of using these tools will continue to grow.

Best Practices for Software Registries and Repositories

by Alejandra Gonzalez-Beltran, Alice Allen, Allen Lee, Daniel Garijo, Thomas Morrell, SciCodes Consortium

This post is cross-posted on the SciCodes website, the US Research Software Sustainability Institute blog, the UK Software Sustainability Institute blog, and the FORCE11 blog.

Software is a fundamental element of the scientific process, and cataloguing scientific software is helpful to enable software discoverability. During the years 2019-2020, the Task Force on Best Practices for Software Registries of the FORCE11 Software Citation Implementation Working Group worked to create Nine Best Practices for Scientific Software Registries and Repositories. In this post, we explain why scientific software registries and repositories are important, why we wanted to create a list of best practices for such registries and repositories, the process we followed, what the best practices include, and what the next steps for this community are.

Why are scientific software registries and repositories important?

Scientific software registries and repositories support identifying and finding software, provide information for software citation, foster long-term preservation and reuse of computational methods, and ultimately, improve research reproducibility and replicability.

Why did we write these guidelines?

Managers of scientific software registries and repositories have been working independently to run their services and provide useful information and tools to users in different communities. The Best Practices for Software Registries Task Force participants had different perspectives representing a heterogeneous set of resources, but came together for the common goal of creating a list of best practices for scientific software registries. These shared practices help to raise awareness of software as a research output, enable credit for software creators, and guide curators working on software catalogues through the steps to consider when setting up their software registries. In the longer term, we hope to improve the interoperability of the software metadata supported by different services.

The goals that we considered for writing the guidelines were:

  • to have a minimal number of best practices, easy to adopt by repository managers
  • to be broadly applicable to most or all of our resources
  • to be descriptive on a meta level, not prescriptive, and focused on what the best practices should do or provide, not on what a suggested policy or element should specifically say.

What are the best practices?

Our guidelines, listed below, provide an overview of the key points to take into consideration when creating a software registry. They are:

  • Provide a public scope statement (examples)
  • Provide guidance for users
  • Provide guidance to software contributors
  • Establish an authorship policy (examples)
  • Share your metadata schema (examples)
  • Stipulate conditions of use (examples)
  • State a privacy policy (examples)
  • Provide a retention policy (examples)
  • Disclose your end-of-life policy (examples)

Our pre-print offers more explanation about each guideline and a longer list of implementations that we found when we were doing our work on these practices.

What process did we follow to produce the guidelines?

Representatives from numerous software registries and repositories were involved in the FORCE11 Software Citation Implementation Working Group (SCIWG). Alice Allen proposed that we form a task force within the SCIWG for writing up some best practices for the registries and repositories, and with acceptance by the co-chairs of the SCIWG and interest from relevant people, the Task Force on Best Practices for Software Registries was formed. Initially, we gathered information from members of this Task Force to learn more about each resource and to identify some of our overlapping interests. We then identified potential best practices based on prior issues we experienced running our services and discussed what each potential practice might include or exclude.

Through iterative deliberations, we determined which of the potential practices were the most broadly applicable. With generous funding from the Alfred P. Sloan Foundation, we hosted a workshop for scientific registries and repositories, part of which was devoted to gathering final consensus around the Best Practices. The workshop included registries that were not part of the Task Force, resulting in a broader set of contributions to the final list.

What are the next steps for the group?

Our goal is to continue our efforts by implementing these practices more uniformly in our own registries and repositories and reducing the burdens of adoption. We have created SciCodes, a consortium of scientific software registries and repositories, which is now defining the next priorities to tackle, such as tracking the impact of good metadata, improving interoperability between registries, and making our metadata more discoverable by search engines and services such as Google Scholar, ORCID, and discipline indexes. We are also sharing tools and ideas in a series of presentations that are recorded and available for viewing on the SciCodes website, so please check them out!

A workshop for scientific software registries and repositories

I am involved in several efforts, in addition to the ASCL, to improve recognition and credit for software authors; one such effort is the FORCE11 Software Citation Implementation Working Group (SCIWG), in which several software registries and repositories are involved. These resources, along with others not part of the SCIWG, have formed a Repository Best Practice Task Force, which has held monthly conference calls this year to collaboratively develop a list of best practices for such resources. This has also been an excellent vehicle for enabling people who run these resources to share information about managing software registries and working with software authors, researchers, and journal editors to improve software citation.

Thanks to funding from the Sloan Foundation, members of this Task Force and other software resources are coming together in a Scientific Software Registry Collaboration Workshop to demonstrate unique aspects of our respective services, discuss challenges and share solutions to common issues that arise in managing our resources, finalize a list of best practices for our resources, and work cooperatively to speed adoption of the CodeMeta and/or Citation File Format standards. The workshop has been organized by the Caltech Library and ASCL, and takes place at the University of Maryland (College Park) this coming Wednesday and Thursday (November 13-14). It includes presentations by software registry managers and subject matter experts, break-out sessions for collaborative work, and group discussion.
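For readers who have not seen CodeMeta before, it is a JSON-LD vocabulary for describing software. A minimal record can be written by hand or generated with a few lines of code; in the sketch below, every field value is a placeholder for illustration rather than a real project.

    import json

    # Minimal CodeMeta record; all values are placeholders for illustration.
    codemeta = {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": "ExampleAstroCode",
        "description": "Illustrative entry only; replace with your code's metadata.",
        "author": [{"@type": "Person", "givenName": "Jane", "familyName": "Doe"}],
        "codeRepository": "https://github.com/example/exampleastrocode",
        "license": "https://spdx.org/licenses/BSD-3-Clause",
        "programmingLanguage": "Python",
        "version": "1.0.0",
    }

    with open("codemeta.json", "w", encoding="utf-8") as f:
        json.dump(codemeta, f, indent=2)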

I’m happy to say we are able to provide remote access to most of the plenary portions of the workshop through Webex; links on the workshop agenda identify the sessions available over Webex. As the workshop has an element of unconferencing, it’s possible that additional portions of the workshop will be suitable for Webex and if so, we will update the agenda accordingly. In addition, we will have someone live-scribing the event; a link to the Google Doc for these notes will be added to the agenda webpage before the workshop begins.

A major focus of this workshop is to discuss and finalize the best practices that have been identified so far in our monthly conference calls. A draft list of the practices (PDF) is available for download below; these are the practices we will be working on in break-out groups during the workshop. Links to the Google Docs we will be using for these breakout sessions are listed on the agenda; this offers another way for anyone interested to see the work being done in this meeting.

For a long time, I have wanted to meet with others doing work similar to what I do on the ASCL, and am very grateful to Tom Morrell, Mike Hucka, and Stephen Davison from Caltech Libraries for partnering with me to organize this workshop, and to Josh Greenberg at the Sloan Foundation for thinking this workshop was a good idea and funding the project. My thanks to all of them!

Draft list of Best Practices for research software registries (pdf)

In conclusion 1…

Here are a few slides from presentations mentioned in a previous blog post; slides from more of the talks at EWASS will be covered in another post.


Software development best practices from Astropy, Thomas Robitaille (slides: PDF)

All contributions are made in GitHub repositories. All contributions are reviewed via pull requests. Test suite run using pytest. Docs written in Sphinx, hosted on Read the Docs. Continuous integration on Travis CI, AppVeyor, and Circle CI.


A Computer Science Perspective on the Astronomy Research Software Process by John Wenskovitch (slides: PDF)

Summary: 1. Acknowledgement of strengths. 2. Version control: use it, and commit often. 3. Frequent communication. 4. Manage feature requests. 5. Collaborate with an expert.


TARDIS: A radiative transfer code, an open source community, and an interdisciplinary collaboration by Wolfgang Kerzendorf (slides: PDF)

Developing simulation codes: science discovery needs to be the key driver (everything else is secondary). Only write code that doesn't exist anywhere else. Many of the software engineering techniques are geared toward team development and are not always applicable.


Research software best practices: Transparency, credit, and citation by yours truly (slides: PDF, PPTX)

You can change the world! Or at least a little piece of it. Release your code. Specify how you want your code to be cited. License your code. Register your code. Archive your code.

Dagstuhl Manifesto on Citation: I will make explicit how to cite my software. I will cite the software I used to produce my research results. When reviewing, I will encourage others to cite the software they have used.

ASCL at AAS 227

Posters! Sessions! Meetings! The upcoming AAS meeting in Kissimmee, Florida is shaping up to be the busiest ever! Here are the formal meeting activities the ASCL is participating in.


Special Session: Tools and Tips for Better Software (aka Pain Reduction for Code Authors)
Tuesday, January 05, 2:00 PM – 3:30 PM; Sanibel
Organizers: Astrophysics Source Code Library (ASCL)/Moore-Sloan Data Science Environment at NYU

Research in astronomy is increasingly dependent on software methods and astronomers are increasingly called upon to write, collaborate on, release, and archive research quality software, but how can these be more easily accomplished? Building on comments and questions from previous AAS special sessions, this session, organized by the Astrophysics Source Code Library (ASCL) and the Moore-Sloan Data Science Environment at NYU, explores methods for improving software by using available tools and best practices to ease the burden and increase the reward of doing so. With version control software such as git and svn and companion online sites such as GitHub and Bitbucket, documentation generators such as Doxygen and Sphinx, and Travis CI, Intern, and Jenkins available to aid in testing software, it is now far easier to write, document and test code. Presentations cover best practices, tools, and tips for managing the life cycle of software, testing software and creating documentation, managing releases, and easing software production and sharing. After the presentations, the floor will be open for discussion and questions.

The topics and panelists are:

Source code management with version control software, Kenza S. Arraki
Software testing, Adrian M. Price-Whelan
The importance of documenting code, and how you might make yourself do it, Erik J. Tollerud
Best practices for code release, G. Bruce Berriman
Community building and its impact on sustainable scientific software, Matthew Turk
What to do with a dead research code, Robert J. Nemiroff


Poster 247.07: Astronomy education and the Astrophysics Source Code Library
Wednesday, January 06, Exhibit Hall A

The Astrophysics Source Code Library (ASCL) is an online registry of source codes used in refereed astrophysics research. It currently lists nearly 1,200 codes and covers all aspects of computational astrophysics. How can this resource be of use to educators and to the graduate students they mentor? The ASCL serves as a discovery tool for codes that can be used for one’s own research. Graduate students can also investigate existing codes to see how common astronomical problems are approached numerically in practice, and use these codes as benchmarks for their own solutions to these problems. Further, they can deepen their knowledge of software practices and techniques through examination of others’ codes.


Poster 348.01: Making your code citable with the Astrophysics Source Code Library
Thursday, January 07, Exhibit Hall A

The Astrophysics Source Code Library (ASCL, ascl.net) is a free online registry of codes used in astronomy research. With nearly 1,200 codes, it is the largest indexed resource for astronomy codes in existence. Established in 1999, it offers software authors a path to citation of their research codes even without publication of a paper describing the software, and offers scientists a way to find codes used in refereed publications, thus improving the transparency of the research. Citations using ASCL IDs are accepted by major astronomy journals and, if formatted properly, are tracked by ADS and other indexing services. The number of citations to ASCL entries increased sharply from 110 citations in January 2014 to 456 citations in September 2015. The percentage of code entries in ASCL that were cited at least once rose from 7.5% in January 2014 to 17.4% in September 2015. The ASCL’s mid-2014 infrastructure upgrade added an easy entry submission form, more flexible browsing, search capabilities, and an RSS feed for updates. A Changes/Additions form added this past fall lets authors submit links to papers that use their codes, for addition to the ASCL entry even if those papers don’t formally cite the codes, thus increasing the transparency of that research and capturing the value of their software to the community.

Months away, but AAS 227 Kissimmee meeting is already software rich!

Already it’s shaping up to be a software maven’s dream AAS meeting, with workshops and Special Sessions focused on expanding your software skills and a Hack Day to put them to use! We’ll have a comprehensive listing closer to the meeting date, but here are the activities already on the schedule, with more to come!

Introduction to Software Carpentry 2 Day Workshop
Astrostatistics and R
Using Python for Astronomical Data Analysis
SciCoder Presents: Developing Larger Software Projects
Bayesian Methods in Astronomy: Hands-on Statistics
Tools and Tips for Better Software (aka Pain Reduction for Code Authors)
Lectures in AstroStatistics
Hack Day

Closure of Google Code and impact on ASCL records

Google has announced the closure of its Google Code service. Google suggests several courses of action and states, “We … offer stand-alone tools for migrating to GitHub and Bitbucket, and SourceForge offers a Google Code project importer service.”

Please take steps to save your software! If you migrate your code to another site, I would appreciate knowing the new URL. If you are no longer working on your software and do not want to migrate it to another project hosting site, please allow the ASCL to store an archive file (tarball/zip/etc.) of it with the ASCL entry so it remains available to support your written research record, or select another option to preserve your code. If you would like to have the ASCL host an archive file, please contact me; thank you.

Licensing Astrophysics Codes session at AAS 225

On Tuesday, January 6, the ASCL, AAS Working Group on Astronomical Software (WGAS), and the Moore-Sloan Data Science Environment at NYU sponsored a special session on software licenses, with support from the AAS. This subject was suggested as a topic of interest in the Astrophysics Code Sharing II: The Sequel session at AAS 223.

Frossie Economou from the LSST and chair of the WGAS opened the session with a few words of welcome and stressed the importance of licensing. I gave a 90-second overview of the ASCL before turning the podium over to Alberto Accomazzi from the NASA Astrophysics Data System (ADS), who introduced the panel of speakers and later moderated the open discussion (opening slides), after which Frossie again took the podium for some closing remarks. The panel of six speakers discussed different licenses and shared considerations that arise when choosing a license; they also covered institutional concerns about intellectual property, governmental restrictions on exporting codes, concerns about software beyond licensing, and information on how much software is licensed and characteristics of that software. The floor was then opened for discussion and questions.

Photo: Discussion period moderated by Alberto Accomazzi

Presentations
Some of the main points from each presentation are summarized below, with links to the slides used by the presenters.

    • Copy-left and Copy-right, Jacob VanderPlas (eScience institute, University of Washington)
      Jake urged everyone to always license their codes because, in the US, copyright law defaults to “all privileges retained” unless otherwise specified. He pointed out that “free software” can refer to the freedoms that are available to users of the software. He covered the major differences between BSD/MIT-style “permissive” licensing and GPL “sticky” licensing while acknowledging that the difference between them can be a contentious issue.
      slides (PDF)
    • University tech transfer perspective on software licensing, Laura L. Dorsey (Center for Commercialization, University of Washington)
      Universities care about software licenses for a variety of reasons, Laura stated, which can include limiting the university’s risk, respecting IP rights, complying with funding obligations, and retaining academic and research use rights. She also covered factors software authors may care about, among them receiving attribution, controlling the software, and making money. She reinforced the importance of licensing code and discussed the common components of a software license.
      slides (PDF)
    • Relicensing the Montage Image Mosaic Engine, G. Bruce Berriman (Infrared Processing and Analysis Center, Caltech)
      In last year’s Astrophysics Code Sharing session, Bruce had discussed the limitations of the Caltech license under which the code Montage was licensed; since then, Montage has been relicensed to a BSD 3-Clause License. Following on the heels of Laura’s discussion and serving as a case study for institutional concerns regarding software, Bruce related the reasons for and concerns about the relicensing, and discussed working with the appropriate office at Caltech to bring about this change.
      slides (PDF)
    • Export Controls on Astrophysical Simulation Codes, Daniel Whalen (Institute for Theoretical Astrophysics, University of Heidelberg)
      Slide: Restricted algorithms (image by Adam M. Jacobs)

      Dan’s presentation covered some of the government issues that arise from research codes, including why certain codes fall under export controls; a primary reason is to prevent the development of nuclear weapons. Dan also brought up how foreign intelligence agencies collect information and what specific simulations are restricted, and stated that Federal rules are changing, but slowly.
      slides (PDF)

    • Why licensing is just the first step, Arfon M. Smith (GitHub Inc.)
      Arfon went beyond licensing in his presentation to discuss open source and open collaborations, and how GitHub delivers on a “theoretical promise of open source.” He shared statistics on the growth of collaborative coding using GitHub, demonstrated how a collaborative coding process can work, and pointed out that through this exposed process, community knowledge is increased and shared. He challenged the audience to contemplate the many reasons for releasing a project and to ask themselves what kind of project they want to create.
      slides (PDF)
    • Licenses in the wild, Daniel Foreman-Mackey (New York University)
      First, I have to note that Dan made it through 41 slides in just over the six minutes allotted for his talk, covering about seven slides/minute; I don’t know whether to be more impressed with his presentation skills or the audience’s information-intake abilities!

      Chart: 17% of the GitHub repositories examined are licensed (image by Arfon Smith)

      After declaring that he knows nothing about licensing, Dan showed us, and how, that he knows plenty about mining data and extracting information from it. From his “random” selection of 1.6 million GitHub repositories, he noted with some glee that 63 languages are more popular on GitHub than IDL is, that the percentage of repositories with licenses has increased since 2012 to 17%, and that only 28,972 of the 1.6 million mention the license in the README file. Dan also determined the popularity of various licenses overall and by language and shared that information as well.
      slides (PDF)

Open Discussion
After Dan’s presentation, Alberto Accomazzi opened the floor for discussion. Takeaway points included:

  • Discuss licensing with your institution; it likely has an office or personnel devoted to dealing with these issues
  • That office is likely very familiar with the issues you bring to it, including whom to refer you to when an issue is outside its purview
  • “Friends don’t let friends write their own licenses.” In other words, select an existing license rather than writing your own
  • License your code
  • Let others know how you want your code cited/acknowledged

My thanks to David W. Hogg, Kelle Cruz, Matt Turk, and Peter Teuben for work — which started last March! — on developing the session, to Alberto for his excellent moderating and to Frossie for opening and closing it. My thanks also to the wonderful Jake, Laura, Bruce, Dan W, Arfon, and Dan F-M for presenting at this session, and to the Moore-Sloan Data Science Environment at NYU and AAS for their sponsorship.

Resources
Many resources on licensing, including excellent posts by Jake and Bruce, can be found here.

Software licensing resources

Below, a list of informative, interesting (or both!) writings about software licensing; the ASCL doesn’t necessarily agree with all positions in these articles, but we want to know what people are thinking even when we don’t agree with them.

EUDAT License Wizard
http://www.eudat.eu/news/eudat-license-wizard-guides-you-through-legal-maze
http://ufal.github.io/lindat-license-selector/

A Quick Guide to Software Licensing for the Scientist-Programmer
By Andrew Morin, Jennifer Urban, Piotr Sliz
http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002598

Relicensing yt from GPLv3 to BSD
By Matthew Turk
http://blog.yt-project.org/post/Relicensing.html

Best Practices for Scientific Computing
Greg Wilson, D. A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H. D. Haddock, Katy Huff, Ian M. Mitchell, Mark Plumbley, Ben Waugh, Ethan P. White, Paul Wilson
http://arxiv.org/abs/1210.0530v4

The Whys and Hows of Licensing Scientific Code
By Jake VanderPlas
http://www.astrobetter.com/the-whys-and-hows-of-licensing-scientific-code/

Licensing your code
ASCL blog post https://ascl.net/wordpress/?p=726 lists the following:

Making Sense of Software Licensing
Choose a license
Open Source Initiative also offers information on licenses
White paper from the Software Freedom Law Center
Bruce Berriman’s post on relicensing Montage

The Gentle Art of Muddying the Licensing Waters
by Glyn Moody
http://blogs.computerworlduk.com/open-enterprise/2014/08/the-gentle-art-of-muddying-the-licensing-waters/index.htm

STM open license suggestions and aftermath

Open Access Licensing
Don’t Muddy the “Open” Waters: SPARC Joins Call for STM Association to Rethink New Licenses
Global Coalition of Access to Research, Science and Education Organizations calls on STM to Withdraw New Model Licenses
STM response to ‘Global Coalition of Access to Research, Science and Education Organisations calls on STM to Withdraw New Model Licenses’
New “open” licenses aren’t so open

Interesting talk on ITAR
http://www.state.gov/e/stas/series/154211.htm
Discusses dual-use technologies, which is how codes are classified under ITAR. These are governed by the Wassenaar Arrangement. The participating countries meet three times a year to decide what restrictions to put on dual-use technologies. Dr. James Harrington was the speaker; slides are available on that page.

Update: Where the codes are; also, a bit about citing software

This is an update on figures I’ve previously shared (most recently here). Currently, the ASCL indexes 977 codes. The percentages of these codes housed on social coding sites are:

GitHub: 8.1%
SourceForge: 4.2%
Code.Google: 2.8%
Bitbucket: 1.3%

This gives us 16.4% of the codes listed on the ASCL housed on a public social coding site, an increase of 5.4 percentage points since February, most of it from GitHub (up from 4.2% in February), though the percentages for all four sites have increased.
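A quick sanity check of the arithmetic above (the February total is inferred from the stated 5.4-point increase):

    # Hosting percentages quoted above.
    sites = {"GitHub": 8.1, "SourceForge": 4.2, "Code.Google": 2.8, "Bitbucket": 1.3}

    total_now = sum(sites.values())               # 16.4%
    increase_since_february = 5.4                 # stated in the post
    total_february = total_now - increase_since_february

    print(f"Total now: {total_now:.1f}%")
    print(f"Implied February total: {total_february:.1f}%")   # 11.0%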

As I said in February, I expect the percentage of codes on social coding sites to continue to grow, especially since GitHub’s use is increasing quickly in the community. One factor some credit for this increase is that GitHub has made it easy to push code to Zenodo for archiving and DOI minting, providing another way to cite code.*

As mentioned in my previous post, how codes are cited varies. Software citation will be the main topic at Tuesday’s inaugural Software Publishing Special Interest Group meeting at AAS 225, which will be held at 3:45 PM in Room 615 of the Convention Center. If you are at AAS this week, you are welcome to attend, and I hope to see you there!

 

*It was reported at .Astronomy6 that “some astro journals won’t even accept a DOI as a citation.” I don’t know which journals and hope someone will enlighten me; I would like to know the rationale for that stance and would gladly take this up with publishers.