MASTERING YOUR DATA – FROM EXPLORATION TO VISUALIZATION
RSU Introduction to Data Science with R and Tidyverse
The traditional approach in research programmes is to assume that students will find their own way to analyze and visualize their data. This assumption creates problems for students and their supervisors and wastes a significant amount of time. Many students become intimidated by their data rather than curious about it: they skip exploratory data analysis and go straight to advanced statistical models that they later cannot explain, because they do not understand their data in depth.
This course aims to provide the basic knowledge, skills and tools to perform such an exploratory data analysis, with a major focus on publication-ready data visualization: detecting patterns and trends in the data, extracting meaningful information from it, and preparing for further inferential analysis.
On successful completion of the course, students will have the knowledge and practical skills to apply the R statistical software and its essential functions and packages to wrangle and transform their research data, perform informative exploratory analyses, and create publication-ready visualizations of their data, enabling effective interpretation and communication of research results and findings to the scientific community.
About the lecturer, prof. Sergio Uribe: Hi! I am a Maxillofacial Radiologist (DDS, PhD) and Clinical Researcher. My research focuses on developing and evaluating strategies for enhanced oral health, particularly caries and temporomandibular conditions. By analysing valid and reliable epidemiological data from clinical studies and utilising cutting-edge technologies, including artificial intelligence applications, my objective is to enhance diagnostic accuracy and prognosis and to positively impact oral health. Furthermore, I strive to provide valuable insights and innovative solutions that advance the dental field. For updates on my research and related topics, you can follow me on Twitter @sergiouribe. If you have any inquiries or wish to reach out, please feel free to email me at sergio.uribe@rsu.lv.
Before the 1st session
Do
Install R first, then RStudio
Then read and view
Read Chapter 1, Getting Started with Data in R, from ModernDive
Suggested: view Introduction to R (video, 60 min)
Download and print the Cheat Sheets
Print preferably in color
How to import data into R and RStudio: data-import.pdf
Managing the RStudio interface: rstudio-ide.pdf
Data transformations: data-transformation.pdf
Working with dates.pdf
Working with factors.pdf
Working with strings.pdf
Reproducible and communicable results with R Markdown: rmarkdown-2.0.pdf
Books
These books are free to access and read. Click on any of them for more information.
Wickham's book is indispensable. He is the creator of several of the packages included in the tidyverse, such as dplyr, ggplot2, and tidyr, among others.
ModernDive is a general guide to doing data science.
The next two books, by Healy and by Wilke, cover practically everything you need to make high-quality visualizations, from theory to practice, with code included.
Lastly, the Big Book of R is a consolidated list of online resources, useful to bookmark.
Also, two books that show the power of using data to generate information are Factfulness by Rosling and Enlightenment Now by Pinker. Both use data and simple graphs to describe the current state of the world, and the care with which those graphs are built illustrates the proper process of expressing ideas through graphics.
Important: all your data must be correctly formatted!
Every column is a variable.
Every row is an observation.
Every cell is a single value.
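A minimal sketch of what these three rules look like in R, using a small hypothetical caries follow-up dataset (the variable names are made up for illustration):

library(tibble)

# Every column is a variable, every row is one observation (a patient visit),
# and every cell holds a single value
visits <- tribble(
  ~patient_id, ~visit_date,  ~dmft,
  "P01",       "2023-03-01",     4,
  "P01",       "2023-09-01",     5,
  "P02",       "2023-03-02",     1
)
visits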
Compulsory reading: Data Organization in Spreadsheets
Data organization: folders, files and projects
How to name things slides
First session
Lecture: Descriptive statistics are essential to making complex analyses useful
Lecture: Data Visualization
Lecture: A gentle introduction to some key concepts in ggplot2
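As a preview of those key ggplot2 concepts (data, aesthetic mappings, and geometry layers), a minimal sketch using the mpg data that ships with ggplot2:

library(ggplot2)

# Map variables to aesthetics, then add a geometry layer and readable labels
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  labs(x = "Engine displacement (litres)", y = "Highway miles per gallon")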
Code First session
Self-Tutorial https://rstudio.cloud/learn/primers/1.1
Recommended video: A Gentle Introduction to Tidy Statistics in R
Lectures First Session
A ggplot2 Tutorial for Beautiful Plotting in R: Blog post: https://cedricscherer.netlify.app/2019/08/05/a-ggplot2-tutorial-for-beautiful-plotting-in-r/
Datawrapper GmbH, 2020. How to pick more beautiful colors for your data visualizations [WWW Document]. URL https://blog.datawrapper.de/beautifulcolors/ (accessed 9.5.20).
A detailed explanation of how to choose colors for your graphics
Evanko, D., n.d. Data visualization: A view of every Points of View column: Methagora [WWW Document]. URL http://blogs.nature.com/methagora/2013/07/data-visualization-points-of-view.html (accessed 5.13.19).
A list of articles published in Nature Methods that deal in detail with generating quality graphics. Each article is one or two pages and very practical.
Holtz, Y., n.d. The R Graph Gallery [WWW Document]. URL https://www.r-graph-gallery.com/ (accessed 10.13.20).
A gallery of graphs made with ggplot2, with code included
Second session
Lectures pre-session
Broman, K.W., Woo, K.H., 2018. Data Organization in Spreadsheets. Am. Stat. 72, 2–10. URL
Chapter 1: Look at Data (Healy)
Chapter 2: Get Started (Healy)
Video: Effective Visualizations for Data-Driven Decisions (30 min)
Code: Second session
Self-tutorial: https://rstudio.cloud/learn/primers/3
Cheat sheet data visualization
Data Viz chart here
Different charts in R r-charts
Cheatsheet - 70+ ggplot Charts here
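To complement the chart galleries above, a minimal sketch of two common chart types and small multiples, again using ggplot2's built-in mpg data:

library(ggplot2)

# A distribution comparison: boxplots of highway mileage by car class
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

# The same variable as histograms, split into small multiples with facets
ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ class)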
Third session
Lectures pre-session
Video Hans Rosling (LV) https://www.ted.com/talks/hans_rosling_new_insights_on_poverty/transcript?language=lv#t-463101
Data Transformations: https://r4ds.had.co.nz/transform.html
Self-tutorial: https://rstudio.cloud/learn/primers/2.1
Self-tutorial: https://rstudio.cloud/learn/primers/2.2
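The transform chapter and primers above cover the main dplyr verbs; as a preview, a minimal sketch chaining them on the mpg data bundled with ggplot2 (the litres-per-100-km conversion is only an illustrative derived variable):

library(dplyr)
library(ggplot2)   # only for the bundled mpg data

# Keep compact cars, derive fuel use in litres per 100 km (approximate
# conversion from miles per gallon), then summarise by manufacturer
mpg %>%
  filter(class == "compact") %>%
  mutate(l_per_100km = 235.2 / hwy) %>%
  group_by(manufacturer) %>%
  summarise(mean_l_per_100km = mean(l_per_100km), n = n())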
Code 3rd Session
Fourth session
Self-tutorial: https://rstudio.cloud/learn/primers/2.3
Self-tutorial: https://rstudio.cloud/learn/primers/4
Code 4th Session
How to download the code
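If you want a quick preview of reshaping data into the tidy format discussed earlier, here is a minimal sketch with tidyr's pivot_longer() on a small made-up table of yearly case counts:

library(tidyr)
library(tibble)

# A wide table with one column per year is not tidy
cases_wide <- tribble(
  ~country,  ~`2019`, ~`2020`,
  "Latvia",      120,      95,
  "Estonia",      80,      70
)

# pivot_longer() reshapes it so that every row is one country-year observation
pivot_longer(cases_wide,
             cols      = c(`2019`, `2020`),
             names_to  = "year",
             values_to = "cases")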
Fifth session
How to make your data FAIR here
Compulsory read: Data Organization in Spreadsheets
How to create codebooks in R here
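If you prefer to build a simple codebook by hand instead of (or before) using a dedicated package, here is a minimal sketch with dplyr, purrr and tibble; the helper name make_codebook() is hypothetical:

library(dplyr)
library(purrr)
library(tibble)

# One row per variable, with its type, number of missing values and
# number of distinct values
make_codebook <- function(df) {
  tibble(
    variable   = names(df),
    type       = map_chr(df, ~ class(.x)[1]),
    n_missing  = map_int(df, ~ sum(is.na(.x))),
    n_distinct = map_int(df, n_distinct)
  )
}

make_codebook(ggplot2::mpg)   # works on any data frame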
Version control introductory video here and introductory info here
Data Management Plan (DMP)
Data Management Plan
Readings
Wilkinson et al., 2016. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018.https://www.nature.com/articles/sdata201618
Checklist for Data Management Plan
Here is a checklist to consider as you write your NSF Data Management Plan (generic):
Data Description: What data will be generated? How will you create the data? (simulated, observed, experimental, software, physical collections)
Existing Data: Will you be using existing data? Relationship between the data you are collecting and existing data.
Audience: Who will potentially use the data?
Access and Sharing: How will data files be shared? How will others access them?
Formats: What data formats will you be creating?
Metadata and Documentation: What documentation will you provide to describe the data? Metadata formats and standards?
Storage, backup, replication, versioning: Are the data files backed up regularly? Are there replicas in different locations? Are older versions of the data kept?
Security: Are the system and storage that will be used secure?
Budget: Any costs for preparing the data? Costs for storage and long-term access?
Privacy, Intellectual Property: Does the data contain private or confidential information? Any copyrights?
Archiving, Preservation, Long-term Access: What plans do you have to archive the data and other research products? Will it have long-term accessibility?
Adherence: How will you check for adherence of this plan?
Template for Data Management Plan
[ Note: This DMP describes how the project will conform to the RSU Dataverse recommendations on the dissemination and sharing of research results, including the requirement to “share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered.” In addition, check for specific directorate/program requirements.]
1. Data description
[ Briefly describe nature & scale of data {simulated, observed, experimental information; samples; publications; physical collections; software; models} generated or collected. ]
2. Existing Data [ if applicable ]
[Briefly describe existing data relevant to the project; added value/justification of new data collection/generation; and plans for integration with existing data]
3. Audience
[ Briefly describe potential secondary users; scope and scale of use]
4. Access and Sharing
All data collected or generated will be deposited in the RSU Dataverse. The RSU Dataverse is a public repository, hosted and maintained by RSU Information Technology. The RSU Dataverse facilitates data access by providing descriptive and variable/question-level search; topical browsing; data extraction and re-formatting; and on-line analysis.
All data will be deposited at least 90 days prior to the expiration of the award. Such data may be embargoed until the publication of research based on the data or until 1 year after the expiration of the award, whichever is sooner. Users will be required to agree to click-through terms that prohibit unlawful uses and intentional violations of privacy, and require attribution. Use of the data will be otherwise unrestricted and free of charge.
5. Formats
Immediately after collection, quantitative data will be converted to [ SELECT ALL THAT APPLY: Stata, SPSS, R, Excel, CSV] formats. These formats are fully supported by the RSU Dataverse, which will perform archival format migration, metadata extraction, and validity checks. Deposit in these formats will also enable on-line analysis; variable-level search; data extraction and re-formatting; and other enhanced access capabilities. Documentation will be deposited in PDF/A or plain-text formats to ensure long-term accessibility, with any accompanying sound (in WAV), video, or images deposited separately from the documentation as JPEG 2000 files (with lossless compression) or uncompressed TIFF files.
6. Documentation, Metadata and Bibliographic Information
The project will create documentation detailing the sources, coding, and editing of all data, in sufficient detail to enable another researcher to replicate them from original sources; and descriptive metadata for each dataset including a title, author, description, descriptive keywords, and file descriptions. The project will include bibliographic information for any publication by the project based on that data.
The Dataverse application’s “templating” feature will be used for consistency of information across datasets. The Dataverse repository automatically generates persistent identifiers and Universal Numeric Fingerprints (UNF) for datasets; extracts and indexes variable descriptions, missing-value codes and labels; creates variable-level summary statistics; and facilitates open distribution of metadata in a variety of standard formats (DataCite, DDI v2.5, Dublin Core, VOResource, and ISA-Tab) and protocols (OAI-PMH, SWORD).
[ If applicable, briefly describe additional metadata/documentation to be provided; standards used; treatment of field notes and collection records; and quality assurance procedures for all of these ]
7. Storage, backup, replication, and versioning
The Dataverse repository provides automatic version (revision) control over all deposited materials and no versions of deposited material are destroyed except where such destruction is legally required. All systems providing on-line storage for the Dataverse are contained in a physically secured facility that is continually monitored. System backups are made on a daily basis. [For social science data: ] Replicas of data are held by independent archives as part of the Data-PASS archival partnership, regularly updated, and regularly validated, using the LOCKSS system.
8. Security
The RSU Dataverse complies with RSU requirements for good computer use practices. RSU has developed extensive technical and administrative procedures to ensure consistent and systematic information security. “Good practice” requirements include system security requirements (e.g., idle session timeouts; disabling of generic accounts; inhibiting password guessing); operational requirements (e.g., breach reporting; patching; password complexity; logging); and regular auditing and review.
9. Budget
The cost of preparing data and documentation will be borne by the project, and is already reflected in the personnel costs included in the current budget. The incremental cost of permanent archiving activities will be borne by RSU Dataverse.
[IF the data requires storage over 5GB, cannot be ingested using the acceptable formats above, requires extensive documentation, or is unusually complex in structure include: Staff time has been allocated in the proposed budget to cover the costs of preparing data and documentation for archiving for [describe complexities and management]. RSU has estimated their additional cost to permanently archive the data is [insert dollar amount, to be agreed with Dataverse Project team at RSU]. This fee appears in the budget for this application as well. ]
10. Privacy, Intellectual Property, Other Legal Requirements
Information collected can be released without privacy restrictions because [ it does not constitute private information about identified human subjects; informed consent for full public release of the data will be obtained; the data will be anonymized using an IRB-approved protocol prior to the conduct of analysis ]. The data will not be encumbered with intellectual property rights (including copyright, database rights, license restrictions, trade secret, patent or trademark) by any party (including the investigators, the investigators’ institutions, and data providers), nor are the data subject to any additional legal requirements. Depositing with the RSU Dataverse does not require a transfer of copyright, but instead grants permission for the RSU Dataverse to re-disseminate the data and to transform the data as necessary for preservation and access.
11. Archiving, Preservation, Long-term Access
The RSU Dataverse commits to good archival practice, including independent geo-spatially distributed replication, a succession plan for holdings, and regular content migration. Should the archiving entity be unable to perform, transfer agreements with the Data-PASS partnership ensure the continued preservation of the data by partner institutions. All data under this dataset will also be made available for replication by any party under the CC-attribution license, using the LOCKSS protocols – which is fully supported by the Dataverse application.
12. Adherence
[If not the PI, briefly describe who/which project role is responsible for managing data for the project]
Adherence to this plan will be checked at least ninety days prior to the expiration of the award by the P.I. Adherence checks will include a review of the RSU Dataverse content; the number of datasets released; the availability for each dataset of subsettable/preservation-friendly data formats (possibly embargoed, but listed); the availability of (public) documentation; and the correctness of data citation, including a UNF integrity check.
Data Management Tool for DMP creation
Click the https://dmptool.org/ link to open the resource.
Template Data Management Plan (DMP) Horizon 2020
Final project
When you have a question, for example “how do I limit the y-axis in ggplot2?”, you will most likely google it. Some recommendations to make your search more efficient (a code sketch for this particular example follows the list):
Limit the results to one year back.
Stack Overflow usually has the exact answer.
Consult the official documentation: the ggplot2 book and the tidyverse website.
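For the y-axis example mentioned above, a minimal sketch showing the two usual approaches; note that they treat out-of-range observations differently:

library(ggplot2)

p <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

# Zoom in on a range without dropping the observations outside it
p + coord_cartesian(ylim = c(10, 40))

# Remove observations outside the range before the plot is computed
p + ylim(10, 40)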
Seminar examples
Seminar example here in webpage version
Seminar example in PDF format here
Seminar example code in Rmd format here
Codebook example here
Traffic accidents in Latvia, code here
Tips and tricks
Tidyverse tips and tricks video
Ten Tremendous Tricks in the Tidyverse video
The Lesser Known Stars of the Tidyverse video
RStudio Tips and Tricks video
Visualizations here
Some Open databases
Learn more
Biomedical interest
Article: O’Donoghue et al. 2018. Visualization of Biomedical Data. Annu. Rev. Biomed. Data Sci. 1, 275–304.
Article: Hattab G, Rhyne T-M, Heider D (2020) Ten simple rules to colorize biological data visualization. PLoS Comput Biol 16(10): e1008259.
Series of articles: the Data Visualization series in the Nature Methods journal
Code examples of complex figures: ggplot2 complex figures code examples Link
Visualising harms in publications of randomised controlled trials: consensus and recommendations Link
Social / Media / Humanities interest
BBC Visual and Data Journalism cookbook for R graphics. Link
BOOK: Machlis: Practical R for Mass Communication and Journalism. Link
Data cleaning
WHAT IT TAKES TO TIDY CENSUS DATA
The janitor and tidyxl packages, useful for cleaning dirty Excel files
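A minimal sketch of the typical janitor cleaning steps (standardised column names, dropping empty rows and columns) on a small made-up table:

library(dplyr)
library(janitor)
library(tibble)

# A small made-up table standing in for a messy Excel import
messy <- tibble(
  `Patient ID`   = c(1, 2),
  `DMFT Score `  = c(4, NA),
  `Empty column` = c(NA, NA)
)

messy %>%
  clean_names() %>%                 # patient_id, dmft_score, empty_column
  remove_empty(c("rows", "cols"))   # drops the all-NA column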
General visualization
Data Visualization with R (book)
Website and code: ggplot2 code examples. Link
Website: ggplot2 cookbook. Link
Multiple views on how to choose a visualization and the visualization chooser guide
Blog Post: Why scientists need to be better at data visualization. Link
BOOK: Tufte. The Visual Display of Quantitative Information (classic textbook on visualization) Link
Twitter thread from Cedric Scherer with several presentations and resources thread + original tweet
Animations tweet
Some inspiration from Data is Beautiful
Tables
Integration with Word: https://ardata-fr.github.io/officeverse/index.html
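A minimal sketch using flextable (part of the officeverse linked above) to export a table to Word; the output file name is arbitrary:

library(flextable)

# Turn a data frame into a Word-ready table and write it to a .docx file
ft <- flextable(head(mtcars))
save_as_docx(ft, path = "example_table.docx")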
Statistics
Courses
Data Visualization Courses on DataCamp Link
Video Plot anything with ggplot2 Part 1 + Part 2 by the ggplot2 maintainer
Documentation
Utilities
Collaboration
More tricks
dplyr::relocate()
dplyr::count() / n()
dplyr::distinct()
dplyr::glimpse()
dplyr::slice()
ggplot2::geom_count()
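A minimal sketch putting these functions together on ggplot2's built-in mpg data:

library(dplyr)
library(ggplot2)

mpg %>%
  distinct(manufacturer, model) %>%     # unique manufacturer/model pairs
  count(manufacturer, sort = TRUE) %>%  # number of models per manufacturer
  slice(1:5) %>%                        # keep only the first five rows
  relocate(n)                           # move the count column to the front

glimpse(mpg)                            # compact overview of every column

# geom_count() sizes points by how many observations overlap at each position
ggplot(mpg, aes(x = class, y = drv)) +
  geom_count()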