As part of my learning, I wanted to explore more geodata analysis. Last year, I used the Canadian census data (Exploring Geographic Data with the GeoMap Chart) to put together a tool to browse the dataset they have available. This year, I wanted to try doing something similar with the 2010 USA census data. The USA datasource is more detailed and more complex. I've tried to make its wealth of information more accessible for anyone who wants to use it within BOARD. The model has definitions for many geographic regions, with lat/long coordinates to facilitate mapping. I've organized the population and households data into screens focused on each of these major region types, available from the screen list menu on the top left of each screen.
The support team at the census bureau were very helpful too. For anyone looking to do something similar, I encourage you to reach out to them at +1 301 763-1128 or email@example.com.
- Geomap showing population and households in each district
I decided to include pagers by state on most screens to limit the dataset when it first loads. More data can certainly be loaded, but plotting more datapoints on a map just looks messy and slows down the load time. I loaded a picture cube with the congressional district maps by state. I loaded a PDF cube with the more detailed congressional district wall map PDFs by state. I created datareaders to cycle through 50+ files of the same type in a folder, rather than building one datareader for each one. These are some large datasets, so please use selections while navigating the data. Query times can quickly get out of control when selections are missing.
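For readers who want to try the "one reader, many files" idea outside BOARD, here is a minimal Python sketch. It is not how the BOARD datareader is implemented; it only illustrates using a file-name pattern to cycle through every file of the same type in a folder. The file names, column names, and values are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def read_matching_files(folder, pattern):
    """Yield (file name, row dict) for every file in folder that matches pattern."""
    for path in sorted(Path(folder).glob(pattern)):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                yield path.name, row


# Demo with tiny made-up per-state files (names, columns, and values are hypothetical).
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "places_AK.csv").write_text("GEOID,POP\n0200001,10\n")
    Path(folder, "places_AL.csv").write_text("GEOID,POP\n0100001,20\n0100002,30\n")
    Path(folder, "counties_AL.csv").write_text("GEOID,POP\n01001,99\n")
    # One pattern picks up every "places" file and skips the "counties" file.
    loaded = list(read_matching_files(folder, "places_*.csv"))

total_population = sum(int(row["POP"]) for _, row in loaded)
print(len(loaded), total_population)
```

The pattern plays the same role as the mask on the datareader: adding another state's file to the folder requires no change to the reader itself.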
What Can We Do With This Model
Here are some questions you could investigate with this model:
- How many people or households are in a region?
- What is the distribution of men and women, by age, across a set of counties or county subdivisions?
- Where are regions located throughout the country?
- What is the ratio of people per household in a particular area?
- How many more women are there than men in an area?
- How many 35 year-olds are there in a particular county subdivision?
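As a rough illustration of the people-per-household question, here is a small Python sketch of the calculation. The region names and counts are made up for the example and are not taken from the census data or the model.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    population: int
    households: int


def people_per_household(region):
    """Return population divided by households, or None when there are no households."""
    if region.households == 0:
        return None
    return region.population / region.households


# Hypothetical sample regions (not real census figures).
regions = [
    Region("District A", 700_000, 280_000),
    Region("District B", 650_000, 250_000),
    Region("Empty area", 0, 0),
]

ratios = {region.name: people_per_household(region) for region in regions}
for name, ratio in ratios.items():
    print(name, "n/a" if ratio is None else round(ratio, 2))
```

Guarding against zero households matters here because some geographic areas have no residents at all.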
Population and households are shown across each congressional district. GIF and PDF maps of each state's districts are available. These are the districts USA voters use to vote for their national representatives.
Here is a national map showing each congressional district.
Depending on the state selected, a GIF image of the state's congressional districts is displayed. This image is saved in a picture cube. There is a more detailed wallmap PDF, also stored in a BLOB cube.
State Legislative Districts (Upper and Lower)
Population and households are shown across each state legislative district. For each state, we have both the upper (senate) and lower (house) district data. These are the districts USA voters use to vote for their state legislative representatives.
Population and households are shown across places in each state. Places are unique areas for which census data is tracked. Places are one of the most detailed datasets.
Population and households are shown across counties and county subdivisions.
Population and households are shown across school districts. There are three types of school districts: elementary, secondary and unified.
Population and households are shown for urban areas across the nation.
Age vs Gender
Aside from population and household information, I loaded the P12 Census table with population data for each county subdivision, by age bucket, by gender. This is a more detailed dataset than we have shown above. I've struggled to show a visualization for this dataset, so I've just left it as a treemap and dataview for now.
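To show the kind of question this dataset can answer, here is a minimal Python sketch that pivots rows shaped like "county subdivision, age bucket, gender, count" into an age-by-gender table and then computes how many more women than men fall in each bucket. The subdivision names, bucket labels, and counts are hypothetical sample values, not figures from the P12 table.

```python
from collections import defaultdict

# (county subdivision, age bucket, sex, count): hypothetical sample rows.
rows = [
    ("Subdivision X", "30-34", "Male", 410),
    ("Subdivision X", "30-34", "Female", 430),
    ("Subdivision X", "35-39", "Male", 390),
    ("Subdivision X", "35-39", "Female", 385),
    ("Subdivision Y", "35-39", "Male", 120),
    ("Subdivision Y", "35-39", "Female", 150),
]


def sex_by_age(rows):
    """Sum counts across all subdivisions into {age bucket: {sex: count}}."""
    table = defaultdict(lambda: {"Male": 0, "Female": 0})
    for _subdivision, age, sex, count in rows:
        table[age][sex] += count
    return dict(table)


def female_surplus(table):
    """Return {age bucket: women minus men}."""
    return {age: counts["Female"] - counts["Male"] for age, counts in table.items()}


table = sex_by_age(rows)
surplus = female_surplus(table)
print(table)
print(surplus)
```

A table in this shape could feed a population pyramid, which may be a more natural visualization for age and gender than a treemap.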
The graphical relationship diagram showcases how important the GEOID field is in the model. It maps directly to many different region types. This is also why so many versions are included. The versions allow data to be saved at the lowest level (GEOID) while aggregates are calculated at the reporting level (region).
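The idea of storing values at the GEOID level and aggregating them up to a region can be sketched in Python as follows. This is an illustration only, not BOARD's version mechanism. It assumes the usual layout of a county subdivision GEOID: a 2-digit state code, a 3-digit county code, and a 5-digit county subdivision code. The GEOID values and populations below are made up.

```python
from collections import Counter


def split_cousub_geoid(geoid):
    """Split a 10-digit county subdivision GEOID into the keys it rolls up to.

    Assumed layout: state (2 digits) + county (3 digits) + county subdivision (5 digits).
    GEOIDs are kept as strings so leading zeros are not lost.
    """
    if len(geoid) != 10 or not geoid.isdigit():
        raise ValueError(f"expected a 10-digit GEOID, got {geoid!r}")
    return {"state": geoid[:2], "county": geoid[:5], "cousub": geoid}


def roll_up(values, level):
    """Aggregate GEOID-level values to the requested level ('state' or 'county')."""
    totals = Counter()
    for geoid, value in values.items():
        totals[split_cousub_geoid(geoid)[level]] += value
    return dict(totals)


# Hypothetical GEOIDs and populations stored at the lowest level.
population_by_geoid = {
    "0100190171": 1200,
    "0100192106": 800,
    "0100390315": 500,
    "0201390000": 300,
}

by_county = roll_up(population_by_geoid, "county")
by_state = roll_up(population_by_geoid, "state")
print(by_county)
print(by_state)
```

Note that the county key keeps the state prefix. County codes repeat from state to state, so the concatenated key is what makes each county unique.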
Here are the entities used in the model. Since no new data will be added, I've allowed the saturation to be very high.
Here are the main cubes. The Gazetteer cubes hold statistics for each geographic area (GEOID). The 2010 Population and 2010 Households cubes are counts available from the same Gazetteer load files. I made them year-specific because I don't have any comparison data at the same grain. The P12 Sex by age cube is modeled on the P12 table from the https://www.census.gov/prod/cen2010/doc/sf1.pdf document.
- Concatenated codes - Depending on the source file, most keys are concatenated. This made it a bit tricky to set up relationships and data readers. Since BOARD prefers strict, balanced relationships, I found I had to create relationships between concatenated keys that roll up to their individual components. Using the Type entity to denote which type of GEOID is being defined has helped keep the model simple.
- Massive dataset - The measures shown here are a small piece of what is available. Block level data is available from the census bureau. I decided not to go down to the block level of detail because lat/long coordinates are not available for each block, and the data size grew tremendously. Just for California, there were 800k blocks with data. Loading a national model at that grain would be well over 100GB of data.
- Many, many files - Some of the data is partitioned by state. That means around 50 files for a datareader. Using the pattern mask, I was able to use one datareader to cycle through all the files of the same type, rather than having to build a datareader for each file.
- Versions - To improve performance, I've created 21 versions across most cubes. That helps ensure aggregated values are already available when requested.
- If this were to be used in a production environment, I would build target-specific cubes for the appropriate analysis. The current design is very broad, whereas performance can be improved by making more specific cubes at the appropriate grain.
- Hidden cubes - I needed 49 stage cubes to be able to load files in one pass. All of these are now hidden cubes, so the model is not cluttered for screen developers. I highly recommend using the hidden cube functionality to hide any intermediate or stage cubes that don't have value for most folks.
- If/when you find a bug, let me know and I'll try to update the post with a corrected model. If/when you have ideas for improvements or follow-up experiments, post a comment to let me know, or to let the community know how your experiment worked out. You can also post issues on the GitHub repository here.
How to reuse this model
For anyone who wants to reuse this model, here are some recommended steps to get set up.
- Save model to your server
- Move the Census USA capsule into your capsules folder
- Move the CensusUSA database folder into your databases folder
- Restart the BOARD service, so the new database and capsules can be opened
- Review the Population and Household screens to see how the data can be navigated
- You can navigate the model without loading any additional data, but if you want to load more, continue with the steps below
- Visit the bulk data census site at Download Bulk Census Data as CSV - census.ire.org
- Select the state
- Select the grain (state, county, county subdivision, place, census tract)
- Select the measure of interest from the long list available
- Create new data readers based on the P12 Sex by age one already created.
- Add the new datareaders to the Rebuild All procedure
- When you run Rebuild All, it should load everything into the model and you should be able to start navigating the data. If you want to clear out all cubes and entities, running Rebuild All will reload everything.
Here are some features I'd like to add, if I have the time and interest to do so:
- Add other detailed census tables like the P12 table
- Add more interesting visualizations to showcase some of the BOARD features
- Add 2020 census data, once it's available
- Download the latest ZIP file for the model - CensusUSA/Releases at master · grobertgill/CensusUSA · GitHub
- Census data repository folder - Index of /census_2010/04-Summary_File_1
- Census data geography explained - https://www.census.gov/prod/cen2010/doc/sf1.pdf
- Census geographic data definitions - 2017 U.S. Gazetteer Files - Geography - U.S. Census Bureau
- Census geographic region relationship definitions - 2010 Census Block Assignment Files - Geography - U.S. Census Bureau
- Bulk data CSV data selection page - Download Bulk Census Data as CSV - census.ire.org
- Thank you to Paola Mason for her assistance
- This is not production-ready code. This is a learning project to be shared with the community. If you want to use it, you do so at your own risk...that being said, I'll still probably help, if you want.
- Census analysis maps - The census department has produced nice PDFs with some of their analysis of the data. Index of /geo/pdfs/maps-data/maps/2010pop
- Reference Maps - Geography - U.S. Census Bureau
- Zipcode Reference - https://www.unitedstateszipcodes.org/98052/
- As I refine the model, I'll update this GitHub repository. If you'd like to help me, I'd love the help and the opportunity to learn from others.
GitHub - grobertgill/CensusUSA: BOARD capsule and database based on the USA census bureau's 2010 dataset
- Installing Git