What is Data Science?
Data includes numbers, text, images, graphs, sounds, code, and metadata. Data literacy is the ability to read, work with, analyze and communicate with data. Data science is the study, development, or application of methods that reveal new insights from data.
Wikipedia: Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, machine learning, data mining, and predictive analytics.
- Wikipedia: Data Science
- Data Science: An Introduction (WikiBooks)
- What is Data Science? (UC Berkeley)
- What is Data Science? (Oracle)
- Data Literacy: Why it Matters for Your Business (Qlik)
“Defining data science is like defining the internet - ask 10 people and you get 10 different answers. What most would likely agree on, at a high level of abstraction, is that it draws from statistics, computer science, and applied mathematics to operate on data from one or more domains leading to outcomes not achieved otherwise. The extent to which domain knowledge is incorporated in the work of data science varies, but it is essential for achieving meaningful outcomes…. data science transcends traditional disciplinary boundaries to discover new insights not owned by any one existing discipline, driven by endless streams of digital data with the promise of translation to societal benefit.” Ten Simple Rules for Starting (and Sustaining) a Data Science Initiative (Parker et al., ADSA 2020; PLoS Comp Bio in process)
Many sites have attempted to define a data scientist, often distinguishing this person from a data engineer or a data analyst. Another approach is to consider such a person to be a “researcher” who now happens to be swamped by “data” in a quest to make headway on a “project”. Here are references on comparing labels for people: O’Reilly | DataQuest | Medium with WordCloud | OpenSystemsTech | AnalyticsVidhya | Google Search
History of Data Science
- Biometry is the active pursuit of biological knowledge by quantitative methods … improve the thought forms, which make possible an understanding of variable phenomena … constant experience in analysing and interpreting observational data of the most diverse types … we come to think of ourselves … in terms of the community of our interests with those doing similar work in other departments (RA Fisher, Biometrics, 1948).
- John Tukey (1962) The Future of Data Analysis
  - The formal theories of statistics
  - Accelerating developments in computers and display devices
  - The challenge, in many fields, of more and ever larger bodies of data
  - The emphasis on quantification in an ever wider variety of disciplines
  - Apprenticeship as primary mode of learning how to learn from data
- Peter Naur (1960 & 1974) Concise Survey of Computer Methods defined “data science” as “the science of dealing with data”
- George Box (1980 JRSSA) “Sampling and Bayes’ Inference in Scientific Modelling and Robustness” (From Andrew Gelman): On the proper role of a statistician: “a colleague working with an investigator throughout the whole course of iterative deductive-inductive investigation… scientific process … leading to (a) the study of a number of different sets of already existing data and/or (b) the devising of appropriate surveys”
- Gil Press (2013) A Very Short History Of Data Science
- ‘Statistics at a Crossroads’ (NSF 2019)
Writings 2009-2016
- Peter Norvig (2009) The Unreasonable Effectiveness of Data
- Drew Conway (2010) Data Science Venn Diagram
- Mac Slocum (2010) Data Analysis Path is Built on Curiosity …
- O’Reilly (2010) What is Data Science?
- DJ Patil (2011) Building Data Science Teams
- Hadley Wickham (2013) Simply Statistics Unconference (http://simplystatistics.org)
- V Dhar (2013) Data science and prediction. Communications of the ACM 56 (12): 64.
- Cathy O’Neil & Rachel Schutt (2014) Doing Data Science
- Michael Hochster (2014) What is Data Science?
- Jenny Bryan (2014-15) Stat 545
- What is Data Science? (2015) UC-Berkeley School of Information
- ASA (2015) Data Science Statement
- Diggle PJ (2015) Statistics: a data science for the 21st century. J R Statist. Soc A 178: 793–813.
- Priceonomics (2015) What is the difference between data science and statistics?
- Ryan L (2016) But I’m a Data Scientist too, Aren’t I?
Donoho’s Talk and Responses
- David Donoho (2015) 50 Years of Data Science, Tukey Centennial workshop, Princeton NJ, Sept 18 2015 (JCGS 2017)
- Rafa Irizarry (2015) 20 Years of Data Science … from music to genomics
- Chris Wiggins (2015) ICERM talk on Data Science
- Tommy Jones (2015) The Identity of Statistics in Data Science
- Sean Owen (2015) What 50 Years … Leaves Out
- Wray Buntine Blog: above and beyond Donoho’s Greater Data Science
- Jennifer Priestley (2016) Data Science: The Evolution or the Extinction of Statistics?
- Citations of Donoho paper
Broader History of Data Science
- Karina Gibert, Jeffery S. Horsburgh, Ioannis N. Athanasiadis, Geoff Holmes (2018) Environmental Data Science (Env Model Software) [See “2. Origins and brief history” of the term and field “data science”.]
- ‘Statistics at a Crossroads’ (NSF 2019)
- Ten Simple Rules for Starting (and Sustaining) a Data Science Initiative. Parker, M.S., Burgess, A.E., Bourne, P.E. 2020. PLOS Comp Bio (in revision). OSF Preprints. June 2.
- Integrating computing in the statistics and data science curriculum: Creative structures, novel skills and habits, and ways to teach computational thinking (JSDSE 2021)
- Data Science Initiatives
Data
Data include numbers, text, images, graphs, sounds, code, and metadata. “Data, the units of information observed, collected, or created during the course of research, is not limited to scientific data, but includes social science statistical and ethnographic data, humanities texts, or any other data used or produced in the course of academic research, whether it takes the form of text, numbers, image, audio, video, models, analytic code, or some yet-to-be-identified data type…. Data can be repurposed in ways not foreseen by the originating researchers, inspiring collaborations and new areas of research.” Starting the Conversation: University-wide Research Data Management Policy (Educause 6 Dec 2013)
Doug Laney in 2001 introduced the 3 Vs of big data: volume, velocity, and variety. Inderpal Bhandari added veracity. Kevin Normandeau added validity and volatility. George Firican extended this with four more: variability, vulnerability, visualization, and value.
See original at https://go.wisc.edu/22ict8 (compiled in 2015; rev 2017, 2018, 2019, 2021, 2022, 2023).