What is Data Science?

Data include numbers, text, images, graphs, sounds, code, and metadata. Data literacy is the ability to read, work with, analyze, and communicate with data. Data science is the study, development, or application of methods that reveal new insights from data.

Wikipedia: Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, machine learning, data mining, and predictive analytics.

“Defining data science is like defining the internet - ask 10 people and you get 10 different answers. What most would likely agree on, at a high level of abstraction, is that it draws from statistics, computer science, and applied mathematics to operate on data from one or more domains leading to outcomes not achieved otherwise. The extent to which domain knowledge is incorporated in the work of data science varies, but it is essential for achieving meaningful outcomes…. data science transcends traditional disciplinary boundaries to discover new insights not owned by any one existing discipline, driven by endless streams of digital data with the promise of translation to societal benefit.” Ten Simple Rules for Starting (and Sustaining) a Data Science Initiative (Parker et al., ADSA 2020; PLoS Comp Bio in process)

Many sites have attempted to define a data scientist, often distinguishing this person from a data engineer or a data analyst. Another approach is to consider such a person to be a “researcher” who now happens to be swamped by “data” in a quest to make headway on a “project”. Here are references comparing labels for these roles: O’Reilly | DataQuest | Medium with WordCloud | OpenSystemsTech | AnalyticsVidhya | Google Search

History of Data Science

  • Biometry is the active pursuit of biological knowledge by quantitative methods … improve the thought forms, which make possible an understanding of variable phenomena … constant experience in analysing and interpreting observational data of the most diverse types … we come to think of ourselves … in terms of the community of our interests with those doing similar work in other departments (RA Fisher, Biometrics, 1948).
  • John Tukey (1962) The Future of Data Analysis
    • The formal theories of statistics
    • Accelerating developments in computers and display devices
    • The challenge, in many fields, of more and ever larger bodies of data
    • The emphasis on quantification in an ever wider variety of disciplines
    • Apprenticeship as the primary mode of learning how to learn from data
  • Peter Naur (1960 & 1974) Concise Survey of Computer Methods defined “data science” as “the science of dealing with data”
  • George Box (1980 JRSSA) “Sampling and Bayes’ Inference in Scientific Modelling and Robustness” (From Andrew Gelman): On the proper role of a statistician: “a colleague working with an investigator throughout the whole course of iterative deductive-inductive investigation… scientific process … leading to (a) the study of a number of different sets of already existing data and/or (b) the devising of appropriate surveys”
  • Gil Press (2013) A Very Short History Of Data Science
  • ‘Statistics at a Crossroads’ (NSF 2019)

Writings 2009-2016

Donoho’s Talk and Responses

Broader History of Data Science

Data

Data include numbers, text, images, graphs, sounds, code, metadata. “Data, the units of information observed, collected, or created during the course of research, is not limited to scientific data, but includes social science statistical and ethnographic data, humanities texts, or any other data used or produced in the course of academic research, whether it takes the form of text, numbers, image, audio, video, models, analytic code, or some yet-to-be-identified data type…. Data can be repurposed in ways not foreseen by the originating researchers, inspiring collaborations and new areas of research.” Starting the Conversation: University-wide Research Data Management Policy (Educause 6 Dec 2013)

In 2001, Doug Laney introduced the 3 Vs of big data: volume, velocity, and variety. Inderpal Bhandari added veracity. Kevin Normandeau added validity and volatility. George Firican extended this with four more: variability, vulnerability, visualization, and value.

See original at https://go.wisc.edu/22ict8 (compiled in 2015; rev 2017, 2018, 2019, 2021, 2022, 2023).

Written on May 17, 2017