Behind-the-Scenes Look at Content Migration for NASA
To say NASA has a lot of web content would be quite the understatement. As one of the first government websites to come online, NASA’s site has amassed almost 20 years’ worth of content. That also means 20 years of different content management systems (CMSs), image managers, re-platformings, fixes, and conversions. Even a complete migration would have been tricky with that history, but luckily, that wasn’t our situation. Because some of the content was outdated or had been rewritten more recently, we could work under the constraint that not all content would be migrated. This migration was also a consolidation, so we began by identifying which content was intended to come over at all.
The most recent CMS in place for NASA was a Drupal 7 build. The Drupal database was much more expansive and detailed than the one WordPress sets up out of the box. In addition, the site used a JavaScript-based templating system, so simply scraping the content from its URLs was not an option. Instead, by comparing the database against the live content, we were able to see where each field lived in the database. By manually piecing together data files this way, we could select exactly what information we wanted to bring over. Some values, such as the source ID, were set manually, but every bit of data brought over was saved and tucked off to the side in case it needed to be referenced later on.
To help move the content, we wrote command-line scripts using the WP-CLI framework. The custom scripts started from JSON data files we generated from the Drupal database. During processing, each piece of content was checked against what had already been brought in to avoid duplicates. Since the new CMS did not have all the authors from the Drupal site, each migrated item was assigned a dedicated migration author, which made migrated content easier to identify later on.
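The duplicate check and author assignment described above can be sketched roughly as follows. This is an illustration in Python, not the WP-CLI scripts used on the project. The field names (`source_id`, `title`), the `MIGRATION_AUTHOR` value, and the `import_records` helper are all assumptions made for the example.

```python
import json
from dataclasses import dataclass, field

# Hypothetical placeholder account; the real migration author is not named in this article.
MIGRATION_AUTHOR = "migration-author"


@dataclass
class ImportResult:
    imported: list = field(default_factory=list)  # posts created in this run
    skipped: list = field(default_factory=list)   # source IDs seen before


def import_records(raw_json: str, existing_source_ids: set) -> ImportResult:
    """Import records from a JSON export, skipping anything already migrated."""
    result = ImportResult()
    seen = set(existing_source_ids)  # copy so the caller's set is not mutated

    for record in json.loads(raw_json):
        source_id = record["source_id"]
        if source_id in seen:
            # Already brought in (earlier run or earlier in this file): skip it.
            result.skipped.append(source_id)
            continue

        result.imported.append({
            "source_id": source_id,
            "title": record["title"],
            "author": MIGRATION_AUTHOR,  # every migrated item gets the same author
            "original": record,          # keep the raw data for later reference
        })
        seen.add(source_id)

    return result
```

Keying the check on the source ID rather than the title means re-running the script over the same data file is harmless, since records that were already imported are skipped instead of duplicated.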
At this point, everything brought over went automatically into a dedicated content type used only for migration, so that details like the original publish date could easily be preserved and embedded or linked media assets could be ingested into the WordPress media library. This also gave authors and editors time to check their content before converting it into its final post type.
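As a rough sketch of that staging step, the Python below builds a staged item that keeps the original publish date and lists the media URLs to ingest. It does not reproduce the project's actual code. The post type name `migrated_content`, the record fields (`created` as a Unix timestamp, `title`, `body`), and the list of file extensions are assumptions chosen for the example.

```python
import re
from datetime import datetime, timezone

# Hypothetical name for the migration-only content type.
STAGING_TYPE = "migrated_content"

# Matches src/href attributes that point at common media file types (assumed list).
MEDIA_PATTERN = re.compile(
    r'(?:src|href)="([^"]+\.(?:jpg|jpeg|png|gif|mp3|pdf))"',
    re.IGNORECASE,
)


def build_staged_post(record: dict) -> dict:
    """Turn one exported record into a staged post awaiting editorial review."""
    # Preserve the original publish date instead of stamping the import time.
    published = datetime.fromtimestamp(record["created"], tz=timezone.utc)

    # Collect embedded or linked media once each, in order of first appearance.
    media_urls = list(dict.fromkeys(MEDIA_PATTERN.findall(record["body"])))

    return {
        "post_type": STAGING_TYPE,
        "post_date": published.isoformat(),
        "title": record["title"],
        "body": record["body"],
        "media_to_ingest": media_urls,
    }
```

Keeping staged items in their own type keeps unreviewed content separate from the final post types until an editor has signed off on it.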
With NASA moving to a completely new design system and information architecture, the content needed human eyes to confirm that it was still accurate and reliable, and to apply any new formatting or style options that the new site offered. In all, we migrated 70,000 pages, which included 36,430 content items, 30,731 image features, and 845 podcasts, on top of over 100,000 media assets. As a result, amazingly, users never had to enter content into two different CMSs.