USA Census - Population, households, gender and age

Document created by Bob Gill on Jun 8, 2018. Last modified by Bob Gill on Jul 2, 2018.

About the Solution/Project

As part of my learning, I wanted to explore more geodata analysis. Last year, I used the Canadian census data (Exploring Geographic Data with the GeoMap Chart) to put together a tool to browse the dataset they have available. This year, I wanted to try something similar with the 2010 USA census data. The USA data source is more detailed and more complex. I've tried to make its wealth of information more accessible for anyone who wants to use it within BOARD. The model has definitions for many geographic regions, with lat/long coordinates to facilitate mapping. I've organized the population and households data into screens focused on each of these major region types, available from the screen list menu at the top left of each screen.

         

 

The support team at the Census Bureau was very helpful too. For anyone looking to do something similar, I encourage you to reach out to them at +1 301 763-1128 or geo.geography@census.gov.

 

Functionalities Used

The Geomap tool was used extensively to provide navigation through this large dataset. Latitude and longitude coordinates were available as part of the dataset for many different geographic regions. Relationships were also used extensively. The relationships between individual geographic points, rolling up to school district, native region, voting district, county and state, play a crucial role in making the data friendlier to navigate. I decided to focus each screen on one region type. For that region type, I used the folder container to show a few different objects with the same context. For example, the congressional districts screen has folder container tabs for each of these items:

  • Geomap showing population and households in each district
  • Dataview of the numbers displayed on the map
  • National congressional district map
  • State congressional district map with a link to the detailed PDF 

 

I decided to include pagers by state on most screens to limit the dataset when it first loads. More data can certainly be loaded, but plotting more data points on a map just looks messy and slows down the load time. I loaded a picture cube with the congressional district maps by state, and a PDF cube with the more detailed congressional district wall map PDFs by state. I created datareaders that cycle through 50+ files of the same type in a folder, rather than building one datareader for each file. These are large datasets, so please use selections while navigating the data. Query times can quickly get out of control when selections are missing.
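The one-datareader-for-many-files idea is a BOARD feature, but it can be illustrated outside BOARD. The Python below is only a sketch of the same pattern: the `gaz_places_*.txt` file name pattern, the tab-delimited layout, the column names and the `load_matching_files` helper are all assumptions made for the example, not details taken from the model.

```python
import csv
import tempfile
from pathlib import Path


def load_matching_files(folder, pattern, delimiter="\t"):
    """Read every file in `folder` whose name matches `pattern` into one list of rows."""
    rows = []
    # sorted() keeps the load order stable from run to run
    for path in sorted(Path(folder).glob(pattern)):
        with path.open(newline="", encoding="utf-8") as handle:
            for record in csv.DictReader(handle, delimiter=delimiter):
                record["source_file"] = path.name  # remember where each row came from
                rows.append(record)
    return rows


# Demo with tiny made-up state files (values are placeholders, not census data)
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "gaz_places_01.txt").write_text("GEOID\tPOP10\n0100001\t10\n", encoding="utf-8")
(demo_dir / "gaz_places_02.txt").write_text(
    "GEOID\tPOP10\n0200002\t20\n0200003\t30\n", encoding="utf-8"
)
(demo_dir / "other.txt").write_text("GEOID\tPOP10\n9900009\t99\n", encoding="utf-8")

all_rows = load_matching_files(demo_dir, "gaz_places_*.txt")
files_used = {row["source_file"] for row in all_rows}
print(len(all_rows), "rows loaded from", len(files_used), "files")
```

Only the files that match the mask are read, so unrelated files in the same folder are ignored.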

 

What Can We Do With This Model

Here are some questions you could investigate with this model

  • How many people or households are in a region?
  • What is the distribution of men and women, by age, across a set of counties or county subdivisions?
  • Where are regions located throughout the country?
  • What is the ratio of people per household in a particular area?
  • How many more women are there than men in an area?
  • How many 35 year-olds are there in a particular county subdivision?
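Two of the questions above are simple arithmetic on the counts in the model. As a minimal sketch, assuming a hypothetical `RegionCounts` record with made-up numbers (none of these names or values come from the BOARD model):

```python
from dataclasses import dataclass


@dataclass
class RegionCounts:
    name: str
    population: int
    households: int
    males: int
    females: int


def people_per_household(region):
    """Average number of people per household; None when there are no households."""
    if region.households == 0:
        return None
    return region.population / region.households


def female_surplus(region):
    """How many more women than men live in the region (negative if fewer)."""
    return region.females - region.males


# Made-up numbers for illustration only
sample = RegionCounts("Example County", population=1000, households=400, males=480, females=520)
print(people_per_household(sample))  # 2.5
print(female_surplus(sample))        # 40
```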

 

Screens

Congressional Districts

Population and households across each congressional district. GIF and PDF maps of each state's districts are available. These are the districts USA voters use to vote for their national representatives.

 

Here is a national map showing each congressional district. 

Depending on the state selected, a GIF image of the state's congressional districts is displayed. This image is saved in a picture cube. A more detailed wall map PDF is also available, stored in a BLOB cube.

 

State Legislative Districts (Upper and Lower)

Population and households are shown across each state legislative district. For each state, we have both the upper (senate) and lower (house) district data. These are the districts USA voters use to vote for their state legislative representatives.

Places

Population and households are shown across places in each state. Places are unique areas for which census data is tracked. Places are one of the most detailed datasets.

Counties

Population and households are shown across counties and county subdivisions.

School Districts

Population and households are shown across school districts. There are three types of school districts: elementary, secondary and unified. 

 

Urban Areas

Population and households are shown for urban areas across the nation.

 

Age vs Gender

Aside from population and household information, I loaded the P12 Census table with population data for each county subdivision, by age bucket, by gender. This is a more detailed dataset than we have shown above. I've struggled to show a visualization for this dataset, so I've just left it as a treemap and dataview for now. 
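To make the shape of this dataset concrete, here is a small sketch of summing rows by age bucket and gender, which is the kind of table a dataview or population pyramid would be built on. The row layout, the subdivision names and the counts are placeholders assumed for the example; they are not values from the P12 load.

```python
from collections import defaultdict

# (county subdivision, sex, age bucket, people): placeholder values
p12_rows = [
    ("Subdivision A", "Male", "30 to 34 years", 120),
    ("Subdivision A", "Female", "30 to 34 years", 130),
    ("Subdivision A", "Male", "35 to 39 years", 110),
    ("Subdivision A", "Female", "35 to 39 years", 115),
    ("Subdivision B", "Male", "35 to 39 years", 90),
    ("Subdivision B", "Female", "35 to 39 years", 95),
]


def pivot_age_by_sex(rows):
    """Sum counts into {age bucket: {sex: people}} across all regions in `rows`."""
    table = defaultdict(lambda: defaultdict(int))
    for _region, sex, age_bucket, people in rows:
        table[age_bucket][sex] += people
    return {age: dict(by_sex) for age, by_sex in table.items()}


pivot = pivot_age_by_sex(p12_rows)
for age_bucket, by_sex in pivot.items():
    print(f"{age_bucket:<16} M={by_sex.get('Male', 0):>4} F={by_sex.get('Female', 0):>4}")
```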

Architecture

Relationships

The graphical relationship diagram shows how important the GEOID field is in the model: it maps directly to many different region types. This is also why so many versions are included. The versions allow data to be saved at the lowest level (GEOID) while aggregates are calculated at the reporting level (region).
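The idea of storing values at the GEOID level and keeping aggregates ready at each reporting level can be sketched in a few lines of Python. This is an analogy rather than how BOARD implements versions; the IDs, the parent maps and the `roll_up` helper are hypothetical.

```python
from collections import defaultdict

# Lowest-level facts keyed by GEOID (placeholder IDs and counts)
population_by_geoid = {"G1": 100, "G2": 250, "G3": 75, "G4": 300}

# Relationship tables: each child rolls up to exactly one parent
geoid_to_county = {"G1": "County X", "G2": "County X", "G3": "County Y", "G4": "County Y"}
county_to_state = {"County X": "State 1", "County Y": "State 1"}


def roll_up(values, parent_of):
    """Aggregate child-level values to their parents using a child -> parent map."""
    totals = defaultdict(int)
    for child, value in values.items():
        totals[parent_of[child]] += value
    return dict(totals)


# Pre-computed aggregates, similar in spirit to keeping one version per reporting level
population_by_county = roll_up(population_by_geoid, geoid_to_county)
population_by_state = roll_up(population_by_county, county_to_state)
print(population_by_county)
print(population_by_state)
```

Computing the totals once and storing them trades extra storage for faster reads, which is the same trade-off the versions make.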


 

Entities

Here are the entities used in the model. Since no new data will be added, I've allowed the saturation to be very high.

 

Cubes

Here are the main cubes. The Gazetteer cubes hold statistics for each geographic area (GEOID). The 2010 Population and 2010 Households cubes are counts available from the same Gazetteer load files. I made them year-specific because I don't have any comparison data at the same grain. The P12 Sex by age cube is modeled on the P12 table from the https://www.census.gov/prod/cen2010/doc/sf1.pdf document.

 

Challenges

  • Concatenated codes - Depending on the source file, most keys are concatenated. This made it a bit tricky to set up relationships and data readers. Since BOARD prefers strict balanced relationships, I found I had to create relationships between concatenated keys that roll up to their individual components. Using the Type entity to denote which type of GEOID is being defined has helped keep the model simple.
  • Massive dataset - The measures shown here are a small piece of what is available. Block level data is available from the census bureau. I decided not to go down to the block level of detail because lat/long coordinates are not available for each block, and the data size grew tremendously. Just for California, there were 800k blocks with data. Loading a national model at that grain would be well over 100GB of data.
  • Many, many files - Some of the data is partitioned by state. That means around 50 files for a datareader. Using the pattern mask, I was able to use one datareader to cycle through all the files of the same type, rather than having to build a datareader for each file.
  • Versions - To improve performance, I've created 21 versions across most cubes. That helps ensure aggregated values are already available when requested.
  • If this were to be used in a production environment, I would build target-specific cubes for the appropriate analysis. The current design is very broad, whereas performance can be improved by making more specific cubes at the appropriate grain.
  • Hidden cubes - I needed 49 stage cubes to be able to load files in one pass. All of these are now hidden cubes, so the model is not cluttered for screen developers. I highly recommend using the hidden cube functionality to hide any intermediate or stage cubes that don't have value for most folks.
  • If/when you find a bug, let me know and I'll try to update the post with a corrected model. If/when you have ideas for improvements or follow-up experiments, post a comment to let me know, or to let the community know how your experiment worked out. You can also post issues on the GitHub repository here.
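The concatenated codes challenge in the list above is easier to see with an example. Census county subdivision GEOIDs are 10 digits: a 2-digit state code, a 3-digit county code and a 5-digit county subdivision code. The sketch below derives the roll-up keys from one such GEOID; the `split_county_subdivision_geoid` helper and the sample digits are illustrative assumptions, not part of the BOARD model.

```python
from typing import NamedTuple


class CountySubdivisionKey(NamedTuple):
    state: str               # 2-digit state code
    county: str              # state + county code (5 digits)
    county_subdivision: str  # full concatenated key (10 digits)


def split_county_subdivision_geoid(geoid):
    """Split a 10-digit county subdivision GEOID into its roll-up keys."""
    if len(geoid) != 10 or not geoid.isdigit():
        raise ValueError(f"expected 10 digits, got {geoid!r}")
    return CountySubdivisionKey(
        state=geoid[:2],
        county=geoid[:5],
        county_subdivision=geoid,
    )


# Illustrative digits, not a looked-up area
key = split_county_subdivision_geoid("0600112345")
print(key)
```

The GEOID is kept as a string throughout so that leading zeros in state codes are preserved.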

 

How to reuse this model

For anyone who wants to reuse this model, here are some recommended steps to get set up.

  1. Save model to your server
  2. Move the Census USA capsule into your capsules folder
  3. Move the CensusUSA database folder into your databases folder
  4. Restart the BOARD service, so the new database and capsules can be opened
  5. Review the Population and Household screens to see how the data can be navigated
  6. You can navigate the model without loading any additional data, but if you want to load more, continue with the steps below
  7. Visit the bulk data census site at Download Bulk Census Data as CSV - census.ire.org 
  8. Select the state
  9. Select the grain (state, county, county subdivision, place, census tract)
  10. Select the measure of interest from the long list available
  11. Create new data readers based on the P12 Sex by age one already created.
  12. Add the new datareaders to the Rebuild All procedure
  13. When you run Rebuild All, it should load everything into the model and you should be able to start navigating the data. If you want to clear out all cubes and entities, running Rebuild All will reload everything.

Roadmap

Here are some features I'd like to add, if I have time and interest to do so

  • Add other detailed census tables like the P12 table
  • Add more interesting visualizations to showcase some of the BOARD features
  • Add 2020 census data, once it's available

 
