Artful Computing

I spent a lot of May 2021 acting as a Section Leader in Stanford University's "CodeInPlace" project, which aimed to teach Python remotely to a very large number of students worldwide (about 12,000), staffed largely by volunteers. It was a great experience, and I am posting here some of the general advice I gave to my students.

I thought that it might be worth giving the perspective of a Python user who still regards himself primarily as a physicist/engineer (rather than a computer scientist/software engineer). We do more heavy data handling than most other scientific disciplines, but the actual computing knowledge and skill requirements may be fairly modest: 90% of the work requires 10% of the available functionality. We learn Python in order for it to be useful, not because it is a neat, elegant tool, or because we enjoy programming (though many of us do). It has to earn its keep.

In fact, CodeInPlace has covered most of that essential 10%, so a practicing scientist needs to add just a few more areas from the standard library and the Python module index.

  • I tend to use Regular Expressions quite a lot, as part of parsing data in input text files. (These would not have the entirely regular structure of CSV or JSON files.) Best of all, of course, is to settle on some widely supported data format, such as CSV or JSON, for passing data between codes, so that you do not need code-specific parsing. Astronomers, for example, generally agree to pass observational data in their domain-specific FITS format. You do, however, at some point have to accept instructions from human operators, and regular-expression parsers are often useful here. (Really experienced programmers devise a regular command language that can be described formally with something like Backus-Naur Form, and then invoke one of the widely available tools that writes a parser for you. It took me 30 years to learn this! Profit from my experience!)
  • Most science ends with graphs, so I use matplotlib a lot. While there is a lot of tutorial material on the matplotlib website, I feel that RealPython at https://realpython.com/python-matplotlib-guide/ gives the best introduction, because it explains some of the history and design logic.
  • I also sometimes output data in columns, as CSV files, which makes using gnuplot very convenient (or any of a number of other graphing tools - but gnuplot is one of the most sophisticated and flexible of them, produces publication-quality output, and so is widely used by scientists). Hence, you should get familiar with the CSV module, both for input and output.
  • If you want to produce user-readable output, you need to learn Python's string formatting: the str.format() method and f-strings.
  • I have used scripting (not always Python) a lot to handle the running and checking of large test-case suites. So we need to look at the operating-system interface, os, including how to traverse file systems, iterate over files, initiate execution of other codes and system utilities, and interface with configuration-control systems (e.g. Git or Mercurial). I would, for example, start up a code run, with specified input files extracted from configuration control, then parse all the output files, comparing them with examples from previous runs stored in configuration control, and highlighting any differences for closer examination.
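As a sketch of the regular-expression parsing mentioned above - the log-line format and field names here are invented purely for illustration:

```python
import re

# Hypothetical instrument log line: "T= 293.15 K  P= 101.3 kPa"
# (the format is made up for this example).
LINE_RE = re.compile(r"T=\s*(?P<temp>[\d.]+)\s*K\s+P=\s*(?P<pressure>[\d.]+)\s*kPa")

def parse_line(line):
    """Return (temperature, pressure) as floats, or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return float(m.group("temp")), float(m.group("pressure"))

print(parse_line("T= 293.15 K  P= 101.3 kPa"))   # (293.15, 101.3)
print(parse_line("corrupted record"))            # None
```

Named groups (`?P<temp>`) keep the extraction readable, and returning None for non-matching lines lets the caller decide whether a bad line is an error or just noise to skip.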
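A minimal sketch of the csv module in both directions; an in-memory buffer stands in for a real file, whose name would be up to you:

```python
import csv
import io

# Write two columns of (x, x squared) data. io.StringIO stands in for
# a real file opened with open("results.csv", "w", newline="").
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["x", "x_squared"])
for x in range(5):
    writer.writerow([x, x * x])

# Read it back, skipping the header row and converting strings to numbers
# (csv always yields strings; the conversion is your job).
buf.seek(0)
reader = csv.reader(buf)
header = next(reader)
rows = [(int(x), int(x2)) for x, x2 in reader]
print(rows)   # [(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]
```

A file written this way can be fed straight to gnuplot (e.g. `plot "results.csv" using 1:2` with the appropriate `set datafile separator ","`).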
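On string formatting: format specifications control width, alignment and precision, which is most of what tabular, human-readable scientific output needs. A small sketch:

```python
energy = 1234.56789  # an arbitrary value for illustration

print(f"Energy: {energy:10.3f} J")   # fixed width 10, 3 decimal places
print(f"Energy: {energy:.3e} J")     # scientific notation
print("{:>8} {:>10}".format("step", "value"))  # right-aligned column headers
```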
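The file-system traversal and code-launching parts of that workflow can be sketched with os.walk and subprocess; the directory layout and the ".out" suffix here are invented for illustration:

```python
import os
import subprocess
import sys
import tempfile

def find_outputs(root, suffix=".out"):
    """Walk the tree under root and return every file ending in suffix."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)

# Build a small throwaway tree to demonstrate the traversal.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "run1"))
    open(os.path.join(root, "run1", "case.out"), "w").close()
    open(os.path.join(root, "run1", "case.log"), "w").close()
    found = find_outputs(root)
    print(found)   # just the .out file

# Launching another program and capturing its output for checking:
result = subprocess.run(
    [sys.executable, "-c", "print(2 + 2)"], capture_output=True, text=True
)
print(result.stdout.strip())   # "4"
```

In a real test harness the captured stdout (and the files found by the traversal) would be diffed against reference copies held in configuration control.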

That actually covers at least 90% of what I have ever done with scripting languages, and captures the majority of the benefit for many practicing scientists.

I do feel, however, that those who need to do a lot of heavy data handling and statistical analysis might also wish to look at numpy and pandas. You do not need to do a lot to get most of the benefits of learning Python. By all means go a lot further (and become even more valuable to your employer), but do not imagine that you need to absorb loads of material from a CS degree to be sufficiently competent.
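To give a taste of why numpy is worth a look (assuming it is installed, e.g. via pip): its arrays support vectorised, element-wise arithmetic with no explicit Python loop, which is both faster and closer to the mathematics.

```python
import numpy as np

samples = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

print(samples.mean())        # 3.0
print(samples.std(ddof=1))   # sample standard deviation (divides by n-1)
print(samples * 2 + 1)       # arithmetic applies to every element at once
```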

There is one other tool - not specifically Python - that any serious programmer really needs to use on a regular basis: install a Software Configuration Manager (SCM), which will allow you to keep a history of your code and associated documents, scripts and test cases. You do not need to pay for these (unless you have a particularly well-disposed employer who is prepared to equip the team with the best commercial tools). Many Integrated Development Environments (if you like that type of thing) come with a built-in SCM, but if you like stand-alone tools (I do) then Open Source products such as Mercurial or Git are excellent and very widely used. (There is no lock-in: it is fairly easy to transfer information from one of these to the other, and the way they are applied is remarkably similar. They each have minor advantages in particular contexts. I happen to use Mercurial just because it is more familiar. Choice often depends on what those around you are using - because they can give you support.) Do it now!

Any serious project involving more than one person, or with external stakeholders, will also need an Issue Tracker. You will be surprised at how quickly issues build up even in comparatively small projects, and you might as well get a handle on them at the start. There are free tools available (e.g. Bugzilla) which are widely used on large Open Source projects - though they may be overkill for your individual needs. Even spreadsheets are better than nothing for small projects. Employers looking to maximise the productivity of their development teams (and keep a close eye on progress) may well invest in commercial tool suites that integrate project management and issue tracking with SCM and other software-development support tools. In such environments you will not have any choice about what tools you use and how you work.

 
