There are currently about 5,400 different species of mammals on Earth, but this wasn’t the case until about 66 million years ago. As non-avian dinosaurs faded away, mammals began to fill the empty niches and extraordinary things happened, like when hippo-esque creatures returned to the sea and became what we now know as whales, dolphins and porpoises (cetaceans). Fast forward to the present day and we have mammals that can fly, live with very little water, run super-fast and eat just about anything.
Despite this apparent diversity, mammals are fundamentally very similar. We share the same basic organs, digest food essentially the same way and have similar skeletal systems, but more importantly, we have a very similar repertoire of genes and proteins. But, and I apologize for using car analogies, this is like saying a VW Beetle is the same as a Formula 1 race car. True, they have the same basic parts, but each is optimized for a very specific purpose. The same goes for mammals. For example, an elephant seal can shut off blood flow to its vital organs, dive almost a mile underwater for over an hour, and not die when it resurfaces and the oxygen-rich blood rushes back. Black bears can hibernate for over seven months, yet their muscles don’t wither away and the toxins that accumulate in their blood don’t kill them. These different adaptations use the same basic mammalian blueprint and machinery, but they are tweaked to give these animals superpowers, i.e., advantageous phenotypes.
Biologists have for years been using insights from nature to advance technology, e.g., gecko feet and adhesives. Studying naturally occurring advantageous phenotypes at the molecular level holds enormous promise for human health applications. If we can determine how diving mammals’ genes and the proteins they encode enable them to restore the flow of oxygen-rich blood to organs after resurfacing without suffering tissue damage, then we could use that knowledge to limit damage from heart attacks, strokes and kidney injury. By understanding why elephants have a much lower incidence of cancer than humans, or why dogs and sea lions have a much higher one, we could do the same for cancer research. However, the insights nature can provide us are limited by our ability to make the necessary comparisons. There are myriad reasons for this, but two are simply that the analytical tools aren’t there to be used and standardized data aren’t there to be mined.
Getting NISTy with it
This is where NIST Charleston, in Charleston, South Carolina, comes in. We have state-of-the-art instrumentation and experience at diving deep into proteomes, i.e., the collection of proteins in a sample. For example, over the past decade, I have used proteomic techniques to study everything from cancer biomarkers to marine mammal health and developed analytical workflows for numerous applications. Currently, we can routinely identify and measure over 300 different proteins in blood and over 8,000 in tissues (for reference, humans potentially have over 20,000 different proteins). More important than our ability to do cutting-edge proteomics is our ability to do so in a standardized manner. This point may seem insignificant, but it is actually unbelievably important. For example, we already have published blood proteomic datasets from dolphins, humans, rats and monkeys, but it is nearly impossible for us to compare the results because they were generated with different sample extraction or data acquisition methods. We use standardized methods from sample to data and human standard reference materials to ensure quality along the way. This focus on quality ensures the resulting data has the greatest impact on our stakeholders.
In addition to having the requisite analytical capabilities, which isn’t surprising given we are NIST, we also collaborate with an incredible group of veterinarians and field biologists who are part of the Comparative Mammalian Proteome Aggregator Resource (CoMPARe) Program. Our collaborators frequently take and archive small blood samples as part of health assessments or as standard care at zoos and clinics, and they can often spare the three microliters (about 1/10 of a drop of blood) we need to perform a proteomic analysis. With their help, we are building out a phylogenetically diverse data set.
Each species we survey is an opportunity to demonstrate the feasibility of modern bioanalytical techniques in a new system. Typically, scientists use model organisms of human development and disease, such as rodents, nematodes, fruit flies and zebrafish, for which we have a plethora of associated biochemical tools and decades of proven bioanalytical capabilities. Classic dogma suggests that it is risky to study non-model organisms (like bears, seals or bats) to answer questions about interesting biology because the analytical tools just aren’t there. This, in turn, hampers progress in critical research that could yield more impactful results. Simply by working with a new species for the first time, NIST is answering the question, “Can it be done?” Taking that first step of demonstrating the power of modern techniques in non-model organisms gives researchers the evidence they need to propose studies using transcriptomics (the RNA molecules in one cell or a population of cells) and proteomics in their own laboratories.
I like to think what we do at NIST is like building scientific lighthouses. We can see that where we’re trying to go could benefit research and commerce, so we brave and map the rocks, come ashore and build the infrastructure others can use to navigate these new waters and explore this new land. Blazing that path is difficult and not something many companies or researchers want to risk. Because we’re taking this first step, we’re not only helping to advance human medical research, but we are also accelerating research in non-model species. It is a common misconception that we live in a post-genomic era; it is more apt to say that we are on the cusp of the post-model organism era, when cutting-edge work on any species is possible.
Genomes in a post-model organism era
But to perform this type of proteomic analysis, you have to know the sequence of amino acids that are a protein’s building blocks, which is in turn spelled out in our genetic code (the nucleotides A, T, C and G). And if you want to know about all of the more than 20,000 proteins, you need to know the whole genome, which in mammals runs to roughly 2.5 to 3 billion base pairs. The first eukaryotic genome (yeast) was sequenced 22 years ago, yet there are only 120 mammalian genomes currently annotated, and one-third of those are rodents and primates. The good news is that it is easier than ever to sequence a genome, and there is a veritable explosion of genomic sequencing going on (prime examples are DNA Zoo and the Earth BioGenome Project).
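To make the link between those nucleotides and the proteins we measure concrete, here is a minimal sketch of the translation step. It uses Biopython, which is an assumption on my part rather than the project’s actual software, and a made-up DNA fragment; a real proteomic search database would contain the translated sequence of every annotated gene in a genome.

# Minimal sketch: translating DNA into the amino acid sequence of a protein.
# Assumes Biopython is installed (pip install biopython); the fragment below
# is illustrative only, not real dolphin or human DNA.
from Bio.Seq import Seq

# Codons (triplets of A, T, C and G) spell out amino acids, one by one.
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

protein = coding_dna.translate(to_stop=True)  # stop at the first stop codon
print(protein)  # -> MAIVMGR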
Sequencing and assembling a genome, which once took an army of bioinformaticians years to accomplish, is now possible in less than two weeks. From there, it moves to the team behind the National Center for Biotechnology Information’s Reference Sequence Database (RefSeq), which annotates the genome to determine where genes start and stop and assigns names and amino acid sequences to the proteins. Our project currently has samples from more species than we have genomes for, but we are actively working to either sequence key genomes (like the Atlantic bottlenose dolphin in 2016) or help generate data for RefSeq to annotate genomes (like the California sea lion). Moreover, when we find phylogenetic deserts (groups of species lacking genomes, like marsupials), we can enlist CoMPARe program collaborators to help fill in the blanks.
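As a rough illustration of what leaning on RefSeq looks like in practice, here is a short sketch that pulls a few annotated protein records for the Atlantic bottlenose dolphin from NCBI. Biopython and the specific query are again my assumptions for illustration, not the project’s actual pipeline, and a real search database would fetch the complete annotated protein set rather than a handful of records.

# Sketch: retrieving annotated RefSeq protein sequences for a species from NCBI.
# Assumes Biopython is installed; the query and record count are illustrative only.
from Bio import Entrez, SeqIO

Entrez.email = "you@example.org"  # NCBI asks for a contact email with each request

# Find RefSeq protein records for the Atlantic bottlenose dolphin.
search = Entrez.esearch(
    db="protein",
    term="Tursiops truncatus[Organism] AND refseq[Filter]",
    retmax=5,  # keep the example small
)
ids = Entrez.read(search)["IdList"]
search.close()

# Download those records as FASTA and print their accessions, lengths and names.
fetch = Entrez.efetch(db="protein", id=",".join(ids), rettype="fasta", retmode="text")
for record in SeqIO.parse(fetch, "fasta"):
    print(record.id, len(record.seq), record.description)
fetch.close()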
Data, data everywhere
Once we have generated data from this menagerie of species, we have to determine how to compare the results. At a very basic level, this can be challenging, to say the least. Proteins with homologous amino acid sequences don’t always carry the same name. For example, the walrus protein annotated as serine protease hepsin is 92 percent identical to the human protein named simply hepsin. Small naming differences like this are compounded across many species and hundreds of proteins. Along with our CoMPARe program collaborators, we are building informatics tools that “humanize” results so that we know which proteins are the same across species, e.g., every form of hepsin is simply called hepsin (the sketch below gives the flavor of the idea). But we are also working with the groups in charge of assigning protein names so that we aren’t just making one-off solutions but are trying to change the system itself. Finally, these results will live on a web portal designed to let researchers retrieve data, look for patterns between groups of species, or even look at relative levels of proteins of interest across the different species represented.
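Here is a toy sketch of what “humanizing” means in practice. The ortholog lookup table and the abundance numbers are hypothetical and hard-coded for illustration; in the real tools, that mapping would come from orthology evidence such as sequence comparison against the human proteome.

# Toy sketch of "humanizing" protein names so cross-species results line up.
# The lookup table and abundances are hypothetical; real mappings would be built
# from orthology evidence (e.g., sequence comparison against human proteins).

# Species-specific protein names as they come out of each annotated genome.
results = [
    {"species": "walrus",  "protein": "serine protease hepsin", "abundance": 1.8},
    {"species": "human",   "protein": "hepsin",                 "abundance": 1.0},
    {"species": "dolphin", "protein": "hepsin isoform X1",      "abundance": 2.3},
]

# Assumed ortholog lookup: species-specific name -> shared human-centric name.
to_human = {
    "serine protease hepsin": "hepsin",
    "hepsin": "hepsin",
    "hepsin isoform X1": "hepsin",
}

# Group measurements under one human-centric name so species can be compared.
humanized = {}
for row in results:
    name = to_human.get(row["protein"], row["protein"])  # fall back to the original name
    humanized.setdefault(name, []).append((row["species"], row["abundance"]))

print(humanized)  # {'hepsin': [('walrus', 1.8), ('human', 1.0), ('dolphin', 2.3)]}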
Of special importance to me is the fact that the data generated by this project will be made publicly available as quickly as possible to help researchers clear obstacles to discovery and to encourage open-science practices. Each solution we create, be it an informatics tool or a new genome, will be available to the world, and I’m excited to see the creative uses people find for them.
As we measure thousands to tens of thousands of biological constituents, it becomes more and more apparent to me that no single group can effectively understand all the nuances contained in a data set. It is hubristic to think one could. Our goal isn’t to understand everything ourselves but to facilitate discovery by others. We are starting with very basic questions, like whether something in blood can provide clues about certain traits in mammals. Regardless of whether this first question is the best one to ask, it will empower others to ask more elegant questions that will lead to a deeper understanding of the machinery of life.