Data Evolve
The noun data implies something static. However, data evolve in multiple ways.
I started thinking about how data evolve in our lives as a verb when I met Jhon Goes-in-Center at the Exploring Data Sovereignty Workshop in Feb 2024. He was musing on how difficult it is to understand the noun data when his Lakota language is woven with verbs. Later, when we were looking out at the huge cottonwood trees surrounding the hogan where we were meeting, he talked about how everything is related by evolution–how these very trees have evolved to be successful in this seemingly dry land just a short walk from the Rio Grande River.
- data evolve through new collection and editing of extant data
- data evolve in the context of changing metadata and through harmonizing and cleaning
- measurement tools evolve, changing precision and focus, which alters meaning
- methods to examine data evolve insights through summaries and visualizations
- questions about data evolve as the broader context changes
- “small data” augmented by “Big Data” evolve our ability to ask questions
One datum (a single piece of data) may not change, but everything around it may, and hence its importance and value may change. A datum to one person or situation may be a world of data to another–consider a painting, which to one person may represent a single datum in a collection, such as current owner, while to another person may contain a rich history of technique, setting, artist, and provenance.
Software Evolves
Much earlier, in Fall 2017, I wrote a living (evolving) document as a guide for the language R for teams in the data sciences.
I organized it as a series of verbs–curate, visualize, organize, analyze, profile, and connect–to emphasize how to think about examining data with such a language.
About that time, I and others were asking, What is Data Science? Interestingly, science is also a noun. What is Science?, a 101-level explanation from UC-Berkeley, points out that science is both a body of knowledge (noun) and a process (verb) of studying the world.
Science is very much verb-driven, as is the incorporation of data science into science to make sense of complex patterns and relationships.
Someone once told me that software rots. That is, software/code cannot be static, or it becomes irrelevant as the computing system, data, and context around it evolve. I once heard Daryl Pregibon, a member of the incredible team at Bell Labs that developed C and S (precursors to R and many other language systems), say something like this (paraphrased from memory): “We used to be able to provide an equation, then an algorithm, and later a string of code perhaps organized as a function. Later we found it useful to organize code into a package, say to use in a system like R or S. This was useful for fellow statisticians, but often didn’t reach our collaborators effectively. So we developed standalone widgets to share with colleagues.” See for instance Daryl Pregibon’s “Incorporating statistical expertise into computer software,” a talk at the Second International Tampere Conference in Statistics (1987), later published in the 1991 Future of Statistical Software National Academies report. Roger Peng (2013), (Back to) The Future of Statistical Software, provides a short critique of this volume and Pregibon’s chapter.
Now there is recognized value in web-based apps, such as Shiny apps, which in their simplest form can be created in a few minutes or hours (see the sketch below). More complicated platforms have emerged, such as Galaxy and Ramadda, that enable teams to collaboratively share data and code, possibly with protections to respect data sovereignty. This raises the question of emerging tools and platforms for AI, notably large language models (LLMs).
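As a minimal sketch of that simplicity, here is a complete Shiny app; the dataset and labels are just illustrations, not from any particular project. A slider controls the bins of a histogram, and the whole app fits in one short file.

```r
# A complete, minimal Shiny app: a slider controls the number of
# histogram bins for the built-in Old Faithful dataset.
library(shiny)

ui <- fluidPage(
  titlePanel("Histogram explorer"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 30),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$waiting, breaks = input$bins,
         main = "Old Faithful waiting times", xlab = "Minutes")
  })
}

shinyApp(ui, server)
```

Running this in R launches a local web app; hosting it, say on shinyapps.io or a Connect server, makes it shareable with collaborators.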
Learning from Data with R
The organic process of learning from data while developing useful study tools might be somewhat planned in advance, but typically, for me, evolves based on insights and collaboration along the way. Good collaboration includes documenting at every step, including notes on what is still broken and what is hoped for future development. Some of this is written for collaborators, but much is for my “future self” so that I can recall my strategy and challenges along the way.
Consider a data project that benefits from software scripting (that is, most data projects), here illustrated with the R language system. My involvement is usually organic, starting with a few lines of code, likely developing into an R markdown document that enables me to document my ideas, present visualizations, and even share results with others. As the project develops, some code gets reused and is better organized as a function. Eventually, those functions get organized into a folder. Ideally, each function of import is documented, say in R using Roxygen2 (see the sketch below). As the collection of functions grows, it may be useful to organize them into an R package, particularly if I am using them for more than one project. Typically, both for safety and sharing, it is helpful to put packages online, often on GitHub; then others may view or use them as well. Some projects have broader appeal and longevity, leading to submission to an archive such as CRAN.
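Here is a minimal sketch of one step in that progression: a reusable function with Roxygen2 comments, ready to be collected into a package. The function name and purpose are hypothetical illustrations, not from a real package.

```r
#' Summarize a numeric column by group
#'
#' @param data A data frame.
#' @param group Name of the grouping column (string).
#' @param value Name of the numeric column to summarize (string).
#' @return A data frame with one row per group, giving the mean and
#'   standard deviation of `value`.
#' @export
summarize_by_group <- function(data, group, value) {
  stats <- tapply(data[[value]], data[[group]], function(x) {
    c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
  })
  out <- data.frame(names(stats), do.call(rbind, stats), row.names = NULL)
  names(out)[1] <- group
  out
}

# Example with a built-in dataset:
# summarize_by_group(mtcars, "cyl", "mpg")
```

With `devtools::document()`, those `#'` comments become the function’s help page once the folder of functions grows into a package.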
Blending R and Python to Leverage Data
The Environmental Data Science Innovation & Inclusion Lab (ESIIL) has the ambitious goal of “Enabling an inclusive community of practice to leverage the wealth of environmental data for challenges in the environmental sciences”. That is, it guides communities to use data at scale with online tools, particularly for geospatial data but also for many other types. I have gotten involved in this environmental data science center for a variety of reasons. Last week (March 22, 2024) I attended a virtual training session with two Tribal colleges where I learned a great deal (see my notes on Python in https://github.com/byandell/geospatial/). The following is based on follow-up conversations.
Compute platform
- The session last Friday used Google Colab, which seems to be an intuitive tool for Python notebooks. ESIIL set up an empty Python notebook, which overcame the hassles of configuring one’s own laptop.
- ESIIL is also using the CyVerse Discovery Environment, which supports R/RStudio as well as versions of Jupyter Notebook and a Linux command-line interface. CyVerse is a heavier lift, particularly for an intro.
Language choice
- Generic scripting languages (say R or Python) are your (and my) way to customize how we play with data. They give us local control of how we use our understanding of a project to make a story from data, which might include available data from NSF or NASA or other sources, possibly along with our own sovereign data.
- R and Python were designed for different purposes—R for data and statistical analysis, Python for computing at scale—so they naturally have different strengths and weaknesses.
- R can be quite slow, but it should not be dismissed as such; it can be lightning fast. Speed takes a mix of art and careful choice of methods and add-on packages. A key area involves for loops, where judicious use of apply and parallel approaches can greatly improve speed (see the sketch after this list). The apply tools in R are actually quite old, and have largely been supplanted by cool new tools in the tidyverse.
- Quick searches turn up many comparisons of R and Python that capture some of this essence.
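To make the speed point concrete, here is a minimal sketch (sizes and numbers are arbitrary) of the same column means computed three ways; growing an object inside a for loop is a classic source of slow R code.

```r
set.seed(1)
m <- matrix(rnorm(1e6), ncol = 100)

# Slow pattern: growing a vector inside a for loop.
means_loop <- numeric(0)
for (j in seq_len(ncol(m))) {
  means_loop <- c(means_loop, mean(m[, j]))
}

# Better: the apply family, with preallocation handled for you.
means_apply <- vapply(seq_len(ncol(m)), function(j) mean(m[, j]), numeric(1))

# Best here: a dedicated vectorized function.
means_vec <- colMeans(m)

all.equal(means_loop, means_apply)  # TRUE
all.equal(means_apply, means_vec)   # TRUE
```

Wrapping each version in `system.time()` shows the gap; parallel versions (say via `parallel::mclapply`) follow the same apply idiom.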
Working environment
- I personally find RStudio more intuitive than a Jupyter notebook. You can actually do R or Python in either these days, and with care share data between them in the same document (see the sketch after this list).
- Rmarkdown is great for developing and documenting one’s data story, and later turning that into a separate, refined document. Again, one can use Python and Linux code in Rmarkdown code chunks.
- Quarto seems to have the benefits of Rmarkdown with the flexibility to switch between source and rendered document. It is also newer technology. I have not really explored this much yet.
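As a minimal sketch of that sharing, assuming the reticulate R package (and pandas on the Python side) is installed, an R session or Rmarkdown document can pass a data frame to Python and pull a result back:

```r
library(reticulate)

# R side: create a data frame.
dat <- data.frame(x = 1:5, y = (1:5)^2)

# Hand it to Python (it arrives as a pandas DataFrame) and compute there.
py$dat <- dat
py_run_string("total = float(dat['y'].sum())")

# Pull the result back into R.
py$total  # 55
```

In an Rmarkdown or Quarto document, the same bridge works between R and Python code chunks, with `py$` reaching Python objects from R and `r.` reaching R objects from Python.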
Personal choice
Yeah, we all have feelings about this stuff, and we have our comfort levels. There seems to be a lot of ambiguity and overlap, which means one size does not fit all. But hey, that gives us the opportunity to do what we want. It sort of reminds me of the emotional battles over frequentist and Bayesian approaches to statistics decades ago; that seems to have settled into a blended approach these days, now that we have the data and computing resources to learn from data in the context of (partially) parametric models.
Beyond Data to Broader Communications
Software is itself a type of data, and data carry us along the path from ignorance to wisdom (see Clifford Stoll quote). But software is used for many things beyond computation and data analysis.
Software, of course, is used to write articles and books. I am working on a book, Quantitative Population Ethology, written in Bookdown and hosted on the UW-Madison campus Connect server (part of the Data Science Platform), with source on GitHub. The Connect server uses Posit’s continuous integration, so that when I update GitHub, the book on Connect is soon updated. Notice in particular the equations in section 5.7 on competing risks (see chapter 5 source). Chapters are written in Rmarkdown using MathJax to invoke LaTeX (see the sketch below).
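As a minimal sketch of those mechanics (generic notation, not the book’s own), a display equation in an Rmarkdown chapter goes between `$$` fences and MathJax renders the LaTeX; a standard cause-specific hazard for competing risks looks like:

```latex
$$
h_k(t) = \lim_{\Delta t \to 0}
  \frac{\Pr(t \le T < t + \Delta t,\; K = k \mid T \ge t)}{\Delta t}
$$
```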
This blog is part of my personal website https://byandell.github.io, which is largely written in markdown with source at https://github.com/byandell/byandell.github.io. Fork this repo (or the original at https://github.com/barryclark/jekyll-now) and make your own. Slide decks can also be software, and can be hosted online, such as Srikanth Aravamuthan’s Posit Day slides https://connect.doit.wisc.edu/posit-day/.
Nowadays, I continue to rely on the lessons I learned building that guide and various projects in my GitHub repositories, including this website, byandell.github.io. I now think more about how teams evolve their relationship with data, and their relationship with tools to make sense of data. Data are really about people and about how we form our data-informed stories, much in the way Jaron Lanier thinks about AI.
Resources
- What is Science?
- What is Data Science?
- Why Software Rots
- R language system
- GitHub
- Extensible Work Environments
- Large Language Models (LLMs)
Updated March 22-25, 2024.