Data Sciences Collaboratory
This was an idea for a collaborative environment for data science activity. It has been successful at various other institutions, including Stanford, Columbia and U VA. See the Harvard Data Science Review of Columbia’s Collaboratory.
Purpose: The Data Sciences Collaboratory (DSC) will promote educational innovation in the data sciences. DSC will work closely with a federation of campus units to enhance scholarship with big data.
Mission: DSC will catalyze learning from data to address questions in a team setting by fostering development of data science training resources and enhancing data literacy in research teams.
Vision: A campus community that thrives on data literacy, using data science to make decisions to improve society and the natural and physical world through teaching, research and outreach.
Rationale: Campus units want to make better use of data to make decisions. Emerging non-pooled programs value sharing data sciences curriculum resources and expertise. Research teams thrive when data science expertise leads to deeper results and sustained funding. DSC will be a hub for creating and sharing data sciences curriculum and other resources to complement and enhance other campus units, and in the process raise new revenue and sustain its own operation.
Plan: Create DSC as an umbrella around the Biometry Program, which was founded on the principle of mutual collaboration between quantitative and discipline scientists in research, teaching and outreach. Biometry has been a campus hub for data science across non-human biology for over 30 years (BMI in SMPH focuses on human health). Recently, Biometry faculty led in creation of new data science courses and revenue programs and in building connections with social, physical and environmental sciences. Biometry would bring experience and leadership to DSC with minimal restructuring.
Why Biometry? Sir RA Fisher, preeminent statistician and co-founder of population genetics, opened the first Biometric Society in 1948, stating biometry is “the active pursuit of biological knowledge by quantitative methods … [through] constant experience in analysing and interpreting observational data of the most diverse types…. [W]e come to think of ourselves … in terms of the community of our interests with those doing similar work in other departments.” UW Biometry has run the CALS Consulting Facility for over 30 years to collaborate with students, faculty and staff on research design and analysis. Biometry faculty have been chairs of two departments, and have extensive experience in campus-level leadership and in development of data science courses and programs.
Training Need: The need for expertise and infrastructure to use data well is widespread and growing rapidly. However, skilled personnel are in short supply, and retraining in the data sciences is crucial. Technology alone cannot match this growth or meet the need with pre-packaged tools. The most efficient approach is to develop workforce retraining programs targeted to different data science needs in different disciplines. This educational innovation is best accomplished with an inclusive group across the data sciences to encourage partnerships to share best practices.
Research Need: Research is increasingly team based, built on highly structured knowledge requiring a diverse range of expertise. Further, simply stated research questions, such as what is the nature of a disease or how can we achieve a clean environment while meeting growing energy needs, rely heavily on design, collection, inference, visualization and interpretation of data. These data science skills are themselves diverse, cutting across traditional disciplinary boundaries. Retraining researchers will be enhanced through new curriculum aimed at the workforce but accessible to campus staff.
Program Opportunities: Biometry created a unique MS degree (now a Statistics MS degree option, Applied Statistics) to enhance research bridging between quantitative and biological fields by having two co-advisors per student. Biometry faculty envisioned the MS Statistics Option Data Science, with the aim to attract students from the workforce for retraining. In addition, Biometry faculty have created new data science courses, and are in dialog with developers of a variety of new data science programs on campus, in engineering, social sciences, business and environmental sciences. Biometry faculty have also been in discussion with other faculty about developing an undergraduate major in data science.
Nationwide Trends: Student demand for STEM is high, particularly for data science related fields. Many campuses have developed data science degrees in response to widespread demand, beginning with NCSU’s Master of Science in Analytics. Beyond courses and programs, several universities have created data science groups, centers or departments, including UC Berkeley, Carnegie Mellon U, U Chicago, Columbia U, Cornell U, Duke U, Indiana U, Iowa St U, Johns Hopkins U, U Michigan, New York U, U Penn, Penn St U, Stanford U and U Washington. Wisconsin’s modest, bottom-up approach leads to creative innovations but lacks focus and visibility.
Industry Opportunities: Forbes claims data scientist is the best job for 2016. Multiple industries in Wisconsin and beyond are asking for retraining of their workforce in data science skills. Madison has leverage with Microsoft, Google, American Family and Epic nearby, among other engines of innovation. A growing community of data-rich startups and innovations is emerging in the greater Madison area, with considerable interest in big data meetups, coworking spaces and tech incubators. DSC would work with WID’s Hybrid Zone X creative exchange.
Technology Infrastructure: The Information Technology Committee (ITC) wrote a white paper in 2012 on Elevating Research Computing Cyberinfrastructure at UW-Madison (broken link: http://itc.wisc.edu/documents/12/UW_maci_march4.pdf). This led to the formation of the Advanced Computing Initiative (ACI), the one-stop shop for learning about campus IT services, which interfaced with all aspects of data technology, notably the Center for High Throughput Computing (CHTC). ACI later evolved into datascience@uw, which sponsors data- and software-carpentry workshops and is a clearinghouse for data science seminar series. However, there appear to be two unmet needs: 1) program-level access to a variety of data science courses for emerging non-pooled programs; and 2) campus-level access to data science consultation to focus research questions at design and analysis planning phases, before getting to IT issues. DSC addresses the first need, and Biometry faculty are experienced with the second. Research consulting often leads to long-term collaboration that transforms research direction, enables richer use of technology, and enhances the quality and quantity of publications and grants.
Other Campus Units: UW-Madison has a number of discipline-area core facilities that provide high value, data-rich support to selected audiences, including the Cancer Informatics Shared Resource, Institute for Clinical and Translational Research (ICTR), Bioinformatics Resource Center, Social Science Computing Cooperative (SSCC), and Industrial and Systems Engineering. The Biotechnology Center and other facilities focus on collecting rather than analyzing data. Statistics, mathematics and computer sciences departments contribute theory and methods through course offerings, and many other departments offer a range of useful data science courses. DSC would complement these resources.
Funding: DSC will generate revenue through non-pooled instructional programs as well as federal and private training grants such as NSF/NRT and the Sloan Foundation. DSC research grant collaborations would leverage projects across campus and with industry. Non-pooled programs build new types of relationships with industry to train their workforce and address forefront data science challenges. The interdisciplinary nature of DSC will require particular attention to the campus funding model, and how non-pooled revenue-generating programs and interdisciplinary grants share resources. Sufficient incentives are needed to encourage scholarly collaboration and to address issues of merit.
Definitions: Data science is the study of the generalizable extraction of knowledge from data. Data literacy is the ability to read, create and communicate data as information, now expected across all branches of academia. A collaboratory is an open-space, creative process where a group of people work together to generate solutions to complex problems; an environment to experiment on ways to use data to share concepts and results.
ACI Response: Paul Wilson (Director of ACI) liked the idea and suggested a group meeting later in Spring. He also wanted to invite a current Sloan Fellow to visit to learn about opportunities for research enhancement. Others noted emerging interest for a Computing Group, which would complement / overlap this proposed effort.