What counts as data?
Observational: data captured in real time, usually irreplaceable (e.g., sensor data, telemetry, survey data, sample data, neuroimages)
Experimental: data from lab equipment, often reproducible, but can be expensive (e.g., gene sequences, chromatograms, toroid magnetic field data)
Simulation: data generated from test models, where the model and its metadata (inputs) are more important than the output data (e.g., climate models, economic models)
Derived or compiled: data that can be reproduced, but only at great expense (e.g., text and data mining, compiled databases, 3D models, data gathered from public documents)
Evaluate your data needs
- What type of data will be produced? Will it be reproducible? What would happen if it got lost or became unusable later?
- How much data will there be? How quickly will it grow? How often will it change? Once archived/stored, what kind of access will be needed to use it?
- Who will use the data now, and in the future?
- Who controls the data (PI, student, lab, CUNY, funding agency)? What intellectual property considerations might apply?
- How long should the data be retained? How long would you expect it to be useful, e.g., through the end of the grant/experiment, 3-5 years, 10-20 years, permanently?
- Is there good project and data documentation?
- What directory and file naming conventions will be used?
- What project and data identifiers will be assigned?
- What file formats are used? Are they standards-based or proprietary?
- Are there tools or software needed to create/process/visualize the data? Are the tools or software proprietary?
- Is there an ontology or other community standard for data sharing/integration?
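As an illustration of the naming-convention question above, here is a minimal Python sketch of how a group might generate and check file names. The pattern `<project>_<YYYYMMDD>_<description>_v<NN>.<ext>` and the helper names `build_name` and `is_valid_name` are assumptions made for this example only, not a required or recommended standard; adapt them to whatever convention your group agrees on.

```python
import re
from datetime import date

# Hypothetical convention: <project>_<YYYYMMDD>_<description>_v<NN>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<date>\d{8})_"
    r"(?P<description>[a-z0-9-]+)_v(?P<version>\d{2})\.(?P<ext>[a-z0-9]+)$"
)

def build_name(project, day, description, version, ext):
    """Build a file name that follows the hypothetical convention above."""
    # Lowercase the description and collapse anything non-alphanumeric to "-".
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"{project.lower()}_{day:%Y%m%d}_{slug}_v{version:02d}.{ext.lower()}"

def is_valid_name(name):
    """Return True if the name matches the hypothetical convention."""
    return NAME_PATTERN.match(name) is not None

if __name__ == "__main__":
    name = build_name("soil", date(2024, 3, 5), "Core Sample A", 1, "csv")
    print(name, is_valid_name(name))
```

Putting the date in year-month-day order keeps files sorted chronologically in a directory listing, and a zero-padded version number avoids ambiguous labels such as "final" or "new".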
Access, Sharing, and Re-use
- Any special privacy or security requirements? e.g., personal data, high-security data
- Any sharing requirements? e.g., funder data sharing policy
- Any other funder requirements? e.g., data management plan in grant proposals
- What is your storage and backup strategy?
- When will it be shared and where? How broadly will it be shared? Are there I/O throughput issues with respect to the size of the datasets?
- Who in the research group will be responsible for data management?
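For the throughput question above, a rough estimate of transfer time can help decide whether a dataset is practical to share over the network. The sketch below is a back-of-the-envelope calculation only: the function name `transfer_hours`, the default `efficiency` of 0.8, and the use of decimal gigabytes are assumptions for illustration, and real transfers depend on storage speed, protocol overhead, and network load.

```python
def transfer_hours(dataset_gb, throughput_mbit_per_s, efficiency=0.8):
    """Estimate the hours needed to move a dataset over a network link.

    dataset_gb            -- dataset size in decimal gigabytes (1 GB = 1e9 bytes)
    throughput_mbit_per_s -- nominal link speed in megabits per second
    efficiency            -- assumed fraction of nominal speed actually achieved
    """
    if dataset_gb < 0:
        raise ValueError("dataset_gb must be non-negative")
    if throughput_mbit_per_s <= 0:
        raise ValueError("throughput_mbit_per_s must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")

    total_bits = dataset_gb * 8e9                       # bytes to bits
    effective_bits_per_s = throughput_mbit_per_s * 1e6 * efficiency
    return total_bits / effective_bits_per_s / 3600     # seconds to hours

if __name__ == "__main__":
    # Example: a 500 GB dataset over a nominal 1 Gbit/s link.
    print(f"{transfer_hours(500, 1000):.2f} hours")
```

If the estimate runs to days rather than hours, that is a signal to consider alternatives such as sharing derived subsets or depositing the data in a repository close to where it will be analyzed.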
Data Management Class Materials