In this section we will look at data visualization, exploration, validation and analysis.
This is an exceedingly rich and vital part of working with your data and should be included in most phases of your data pipeline.
Wait, what? All phases? Yep, pretty much all phases, including in many cases gathering/generating the raw data itself.
By the end of this section you will begin to get the first glimmer of a glimpse as to why this is so valuable and important!
But alas, we will not cover everything, because I have spent a lifetime exploring visualization, going back to when the first scientific visualizations were published in the 70's (yes, there are people that old still breathing!). And the fun part is this area changes, grows, and evolves EVERY single day!!!
There are a HUGE number of different packages that you can use for plotting data, including, but far from limited to, Matplotlib and Plotly, the two we will talk about here.
For the longest time Matplotlib ruled the roost, and it is probably still by far the most often used package for Python data visualization.
However, a few years ago another old-school package, Plotly, beefed itself up with some mega coding steroids of the D3 variety, and it is now the king of the hill both for straight-out visualization and for user-interactive exploration (spelled GUI-enabled and dashboard-driven via its sibling library Dash).
One of the advantages of Plotly is that it works in many different languages, almost exactly the same in each. This includes Python, R, Julia, its native JavaScript, and others.
Did I mention that in the background it's pretty much all JavaScript? That makes it valuable for web presentations too!
Another advantage is that if you can dream up a plot you can create that plot, albeit at the cost of a bit (spelled: a lot) of work on your part, but it is very doable.
How big and powerful is Plotly now? Well, Matplotlib has long been the plotting package built into Pandas, but Pandas now lets you swap Plotly in as its plotting backend with a single option, so you can tell Pandas which one you wish to use.
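A minimal sketch of what that looks like (an assumption-laden illustration, not part of our pipeline; it requires Plotly to be installed alongside a reasonably recent Pandas, and the demo data is made up):
import pandas as pd

pd.options.plotting.backend = "plotly"                   # df.plot(...) now returns a Plotly figure
demo = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 9]})    # made-up data for illustration only
fig = demo.plot(x="x", y="y")                            # a Plotly Figure, not a Matplotlib Axes
fig.show()

pd.options.plotting.backend = "matplotlib"               # switch back to the default backend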
Compared to Matplotlib, Plotly has a simpler, more refined 'layout'-based approach, as shown below. Note that the same architecture you see in, say, Matplotlib still exists here; it just hides in the background.
The real power of customizing your plots to look exactly how you want/need them lies in the expansive use of 'update_traces' and 'update_layout'. More on this later, but just know that, once you are used to it, this makes life easier than working in other visualization packages.
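As a quick taste of that pattern, here is a hedged sketch (with made-up data, not our dataset) of building a quick figure and then restyling it after the fact:
import plotly.express as px

fig = px.scatter(x=[1, 2, 3, 4], y=[10, 11, 12, 13])           # quick-and-dirty figure
fig.update_traces(marker=dict(size=14, symbol="diamond",        # restyle every trace after the fact
                              opacity=0.6))
fig.update_layout(title="<b>Restyled after the fact</b>",       # retitle/resize the whole figure
                  width=600, height=400)
fig.show()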
Plotly Express (plotly.express): Plotly Express is the high-level API of Plotly, and it is much easier to draw charts with this module. We can even draw a whole figure with a single line of code. That being said, it is relatively newer, and not every option is covered as deeply in the documentation.
Graph objects (plotly.graph_objects): This is the module that contains the objects or shape templates used to visualize. Graph objects are the low-level interface to figures, traces, and layout. Graph objects can be turned into their Python dictionary representation, and, similarly, you can turn a JSON representation back into a graph object.
Subplots (plotly.subplots.make_subplots): This module contains the helper functions for laying out multi-plot figures, i.e. figures with predefined subplots configured in the 'layout'.
Figure Factories (plotly.figure_factory): This module provides many special types of figures that are quite difficult to draw directly with graph objects or Plotly Express. These figures can be easily plotted with Figure Factories. These charts include: Annotated Heatmaps, Dendrograms, Gantt Charts, Quiver Plots, Streamline Plots, Tables, Ternary Contour Plots, and Triangulated Surface Plots.
Note: Figure factories appear to be slowly going away with some of their features already moved into express and/or graph_objects.
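Since we will not use make_subplots later in this section, here is a minimal sketch of it, along with the dict/JSON round-trip mentioned above (the data values are made up for illustration):
from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(rows=1, cols=2, subplot_titles=("A scatter", "A bar chart"))
fig.add_trace(go.Scatter(x=[1, 2, 3], y=[4, 2, 5], mode="markers"), row=1, col=1)
fig.add_trace(go.Bar(x=["a", "b", "c"], y=[3, 1, 2]), row=1, col=2)
fig.show()

fig_dict = fig.to_dict()        # graph object -> plain Python dict
fig_json = fig.to_json()        # graph object -> JSON string
fig2 = go.Figure(fig_dict)      # ...and back to a graph object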
It's always important to know where the documentation and examples are. You will find yourself referring to the two following reference documentation pages a lot, and happily they are really well written!
https://plotly.com/python/
https://plotly.com/python-api-reference/generated/plotly.data.html
Let's start off by looking at a simple Plotly Express scatter plot.
We begin as always with importing all the packages we will need for this section.
import numpy as np
import pandas as pd
import os
import plotly.graph_objects as go
import plotly.express as px
import nltk
from nltk.corpus import stopwords
import re
import unicodedata
from collections import Counter
from ast import literal_eval
import pycountry
We will work with our WoS data we just got through playing with by loading it up into a fresh DataFrame.
df = pd.read_csv(os.path.join('Processed_data', 'Combined_lists.csv'))
For this first plot let's look at the page counts for each item in our DataFrame.
To do this let's make a little function that creates and returns a new DataFrame holding the number of pages for each item.
Along with this we will generate a count of the number of times each page count is found in the DataFrame.
Since the page numbers end up as the index of value_counts(), we will rename that index to 'NumPages', reset it into a regular column, and name the counts column 'Counts'.
We then will go ahead and call our new function.
def get_page_counts(df):
    pc = df['Number of Pages'].value_counts().rename_axis('NumPages').reset_index(name='Counts')
    return pc
pc = get_page_counts(df)
pc
  | NumPages | Counts
---|---|---
0 | 12 | 824 |
1 | 10 | 806 |
2 | 11 | 741 |
3 | 14 | 716 |
4 | 13 | 685 |
... | ... | ... |
96 | 0 | 1 |
97 | 201 | 1 |
98 | 85 | 1 |
99 | 67 | 1 |
100 | 359 | 1 |
101 rows × 2 columns
The advantage of Plotly Express, as already stated, is that it makes it fast and easy to create simple plots.
We will create a figure ('fig') based on an Express scatter plot.
We then assign the x values to the 'NumPages' column and the y values to the 'Counts' column.
fig = px.scatter(x=pc['NumPages'],
y=pc['Counts'],
)
fig.show()
Let's just explore this plot for a minute.
Notice anything uncomfortable?
Surely something with >350 pages must be a book.
Right?
# Let's look at the number of pages just for Journals and send it into a new DataFrame
jpdf = df.loc[df['Publication Type'] == 'Journal']
jpdf = jpdf['Number of Pages'].to_frame()
# Now let's run this new DataFrame of just the Journals info through our handy function.
pc = get_page_counts(jpdf)
pc
  | NumPages | Counts
---|---|---
0 | 12 | 755 |
1 | 14 | 676 |
2 | 11 | 659 |
3 | 13 | 654 |
4 | 10 | 649 |
... | ... | ... |
83 | 69 | 1 |
84 | 101 | 1 |
85 | 73 | 1 |
86 | 0 | 1 |
87 | 359 | 1 |
88 rows × 2 columns
Buggers - there's something wrong here: what journal has >350 pages, let alone for a single article?!?! Let's extract the information for this article.
maxdf = df[df['Number of Pages'] == df['Number of Pages'].max()]
maxdf
  | Unnamed: 0 | Publication Type | Authors | Author Full Names | Article Title | Source Title | Book Series Title | Book Series Subtitle | Language | Document Type | ... | Publication Year | Volume | Issue | Start Page | DOI | Number of Pages | WoS Categories | Web of Science Index | Research Areas | Country
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
12684 | 12684 | Journal | Narasimhan, VM; Patterson, N; Moorjani, P; Roh... | Narasimhan, Vagheesh M.; Patterson, Nick; Moor... | The formation of human populations in South an... | SCIENCE | NaN | NaN | English | Article | ... | 2019.0 | 365 | 6457 | 999 | 10.1126/science.aat7487 | 359 | Multidisciplinary Sciences | Science Citation Index Expanded (SCI-EXPANDED)... | Science & Technology - Other Topics | ['Afghanistan', 'Austria', 'Canada', 'China', ... |
1 rows × 41 columns
We can look deeper at this entry, but all we will find is more troubling issues.
So we use our good friend Google to find the article here:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6822619/pdf/nihms-1053677.pdf
So most of the info is correct, just not the number of pages nor its end page (if we were to extract that info we would find +999).
This is just one advantage of using data visualization to explore and possibly clean up your data!
There's more to explore here, but let's move on.
The NLTK book, Natural Language Processing with Python, provides a practical introduction to programming for language processing. Written by the creators of NLTK (the Natural Language Toolkit), it guides the reader through the fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic structure, and more.
As per the official website: https://www.nltk.org/
We will use the toolkit to look for common phrases found in each item's Abstract.
We will look at what are known as ngrams: groupings of neighboring words. This will make sense as we go (see the small sketch below).
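To make the idea concrete before we touch the real data, here is a tiny sketch on a made-up sentence; nltk.ngrams is the same helper our build_gram() function will use later:
import nltk

sample = "climate change is an established scientific fact".split()
print(list(nltk.ngrams(sample, 2)))   # bigrams: pairs of neighboring words
# [('climate', 'change'), ('change', 'is'), ('is', 'an'), ...]
print(list(nltk.ngrams(sample, 3)))   # trigrams: triples of neighboring words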
The first thing we need to do is download some special data needed for nltk.
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
[nltk_data] Error loading stopwords: <urlopen error [WinError 10054]
[nltk_data]     An existing connection was forcibly closed by the
[nltk_data]     remote host>
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\tdunn\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\tdunn\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
True
We will also build an (empty, for now) list of additional 'stopwords'.
Stopwords are common words such as 'the', 'and', 'this', 'there', etc.
The set we will be using today is small (~180 words) but pretty much sufficient for our needs.
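If you are curious what is actually in that set, here is a quick peek (this assumes the nltk.download('stopwords') call above succeeded):
from nltk.corpus import stopwords

sw = stopwords.words('english')
print(len(sw))     # roughly 180 words
print(sw[:10])     # the first few entries, e.g. ['i', 'me', 'my', 'myself', ...]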
We will pretend this is a constant. However, Python has no concept of true constant variables the way, say, C, C++, or Fortran does (see the short aside after the next cell).
ADDITIONAL_STOPWORDS = ['']
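A small aside on that 'constant' point, using a hypothetical name (MAX_PAGES is not part of our pipeline): typing.Final lets a type checker such as mypy flag reassignment, but the runtime itself will not complain.
from typing import Final

MAX_PAGES: Final[int] = 500   # hypothetical constant, for illustration only
MAX_PAGES = 600               # runs fine at runtime; a type checker would flag this line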
We will now create a new function we will simply call 'basic_clean', and I'll walk you through it.
def basic_clean(text):
    """ basic_clean() - A simple function to clean up the data. All the words that
                        are not designated as stop words are lemmatized after
                        encoding and basic regex parsing are performed.
                        Uses 'NFKD' normalization (https://docs.python.org/3/library/unicodedata.html)
    Params:
        text (str) - Text to clean and lemmatize
    Returns:
        lemmed (list) - list of lemmatized words
    """
    # Create a lemmatizer to reduce each word to its base (lemma) form
    wnl = nltk.stem.WordNetLemmatizer()
    # Build the list of stop words to remove
    # Note the use of our ADDITIONAL_STOPWORDS.
    stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
    # Normalize the text using NFKD normalization (https://docs.python.org/3/library/unicodedata.html),
    # strip anything that will not encode to ASCII, and lowercase it all
    text = (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('utf-8', 'ignore')
            .lower())
    # Use regex to remove every character that is not a word character or whitespace ('[^\w\s]'),
    # then split the text into individual words
    words = re.sub(r'[^\w\s]', '', text).split()
    # Lemmatize every word that is not a stop word
    lemmed = [wnl.lemmatize(word) for word in words if word not in stopwords]
    return lemmed
You Try It: Load up our saved dataset into a fresh DataFrame.
Solution:
df = pd.read_csv(os.path.join('Processed_data', 'Combined_lists.csv'))
We will now call our function with our DataFrame.
We will also use what is known as a Jupyter magic.
In this case we will measure the amount of time it takes to run this function (~10 seconds for my laptop, and with only 1 of its 2 power supplies attached).
Note: Our function requires a single string of all the 'Abstract' entries 'joined' together!
%%time
words = basic_clean(''.join(str(df['Abstract'].tolist())))
Wall time: 19.2 s
Sooooo, the reason I timed this is that on the first go-around of writing this code I did things the 'documented' way and used the NLTK built-in 'tokenizer' to break the words down before lemmatizing them.
The problem with this approach is that it took >15 minutes to run.
Not ideal for our workshop!!!
Contrary to the normal Python route, I skipped using the library's function and used my own methodology.
Note: Usually Python code which is extremely process intensive is actually written in C (e.g. NumPy is pretty much all C!!!)
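For comparison, here is a rough sketch of the 'documented' tokenizer route (it assumes the 'punkt' tokenizer data has been downloaded); it handles punctuation and contractions more carefully, but it was far too slow on our multi-million-character string:
import nltk
# nltk.download('punkt')   # tokenizer models, if not already present

tokens = nltk.word_tokenize("Climate change is an established scientific fact.")
print(tokens)   # ['Climate', 'change', 'is', 'an', 'established', 'scientific', 'fact', '.']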
len(words)
1785509
words[0:15]
['climate', 'change', 'established', 'scientific', 'fact', 'dealing', 'may', 'require', 'significant', 'shift', 'consumption', 'economic', 'organization', 'key', 'question']
We will now concentrate on generating our most common ngrams.
def build_gram(words, depth, num):
    """ build_gram() - Builds ngram word groupings for the lemmatized text.
    Params:
        words (list) - cleaned, lemmatized text (text cleaned by running basic_clean() first)
        depth (int) - Depth of the ngram build out (e.g. 2 for bigrams, 3 for trigrams, etc.)
        num (int) - Number of top ngrams to return
    Returns:
        ngram (DataFrame)
    """
    # Create a DataFrame of words in ngram groupings (depth) with the top 'num' entries
    ngram = (pd.Series(nltk.ngrams(words, depth)).value_counts())[:num].to_frame()
    # Reset the DataFrame index, for cleanliness more than anything else
    ngram.reset_index(inplace=True)
    # Rename the header column names
    ngram = ngram.rename(columns={'index': 'ngram', 0: 'count'})
    # Cast all ngrams to strings for cleanliness when plotting
    ngram['ngram'] = ngram['ngram'].astype(str)
    # Sort the DataFrame based on counts
    ngram.sort_values('count', ascending=False, inplace=True)
    # Strip the leftover tuple punctuation from the ngram strings
    ngram['ngram'] = ngram['ngram'].map(lambda x: x.lstrip("('").rstrip("',)"))
    return ngram
Next we will build a function to use Plotly's Graph Objects (go) Bar charts.
The advantage of plotly.go is that it is VASTLY more customizable using those 'update_traces' and 'update_layout' functions we talked about earlier.
Again, I'll quickly walk you through the code.
def plot_ngram(ngram, title):
    # Create a figure we will add our chart to
    fig = go.Figure()
    # Add a bar chart to the figure
    fig.add_trace(go.Bar(x=ngram['ngram'],          # x-axis data
                         y=ngram['count'],          # y-axis data
                         text=ngram['count'],       # Textual data displayed above each bar
                         textposition='outside',    # Location of 'text'
                         orientation='v',           # We want vertical bars; 'h' would give us horizontal bars
                         )
                  )
    fig.update_layout(title=f'<b>{title} Analysis</b>',                  # Title of figure/chart
                      width=1400,                                         # Width of figure
                      height=800,                                         # Height of figure
                      xaxis_title="<b>ngrams</b>",                        # x-axis label
                      yaxis_title="<b>ngram count (in thousands k)</b>",  # y-axis label
                      font=dict(family="Arial",                           # dict specifying font details
                                size=16,
                                )
                      )
    fig.show()  # Display the figure
Now to run both our ngram and plotting functions.
ngram = build_gram(words, depth=1, num=20)
plot_ngram(ngram,title='Unigram')
You try it: Create a plot of bigrams and then trigrams.
Solution:
ngram = build_gram(words, depth=2, num=20)
plot_ngram(ngram,title='Bigram')
ngram = build_gram(words, depth=3, num=20)
plot_ngram(ngram,title='Trigram')
Well, we started to get good information that may be of value to us, except now we have a bunch of word associations we do not need.
Remember that ADDITIONAL_STOPWORDS list?
Let's make use of it and rerun everything.
ADDITIONAL_STOPWORDS = ['ltd', 'right', 'reserved', 'elsevier']
words = basic_clean(''.join(str(df['Abstract'].tolist())))
ngram = build_gram(words, depth=3, num=20)
plot_ngram(ngram,title='Trigram')
Obviously we could do a lot more work here, but I think you are beginning to see the power of this approach.
But let's go ahead and move on to our next exploration.
We will now quickly produce another type of scatter plot, this time using plotly.go.
For this we will explore the number of publications per year as well as their number of citations.
We naturally begin with getting a clean DataFrame.
df = pd.read_csv(os.path.join('Processed_data', 'Combined_lists.csv'))
Next we create a new DataFrame with just the 'Publication Year' and 'Cited Reference Count' columns.
Then we get rid of any rows that did not have a publication year in our new 'Pub_Year' column.
Finally we will confirm we got rid of all of them.
ndf = pd.DataFrame({'Pub_Year':df['Publication Year'], 'Cited_Count':df['Cited Reference Count']})
ndf.dropna(inplace=True)
ndf['Pub_Year'].isna().sum()
0
We will quickly look at 'Pub_Year' to make sure it's a decent enough data type for our needs (we want integer values).
ndf['Pub_Year']
0        2010.0
1        2014.0
2        2022.0
3        2015.0
4        2020.0
          ...
12681    2020.0
12682    2021.0
12683    2011.0
12684    2019.0
12685    2021.0
Name: Pub_Year, Length: 12527, dtype: float64
Since our column actually holds 'floats', we will 'cast' them to integers using astype('int32').
Following this we will look at all the unique values (publication years) we have in our data.
ndf['Pub_Year'] = ndf['Pub_Year'].astype('int32')
ndf['Pub_Year'].unique()
array([2010, 2014, 2022, 2015, 2020, 2018, 2017, 2019, 2009, 2012, 2006, 2011, 2021, 2013, 2008, 2016, 2007, 2002, 2005, 1995, 2001, 2004, 2023, 1997, 2000, 1999, 2003, 1996, 1956, 1998, 1978, 1991, 1990, 1979, 1994, 1982, 1969, 1963, 1992, 1988, 1993, 1973])
We should note that there is an issue here.
First, note that I generated the data we are using in Dec 2022.
Let's just assume all rows with a publication date of 2023 are pre-prints, and let's ignore those.
You Try it: Go ahead and remove all rows with 'Pub_Year' equal to 2023 and confirm it. Hint: instead of getting rid of 2023, keep everything that is not equal to 2023.
Solution:
#Remove preprints and errors for 2023 (data gathered in Dec 2022)
ndf = ndf[ndf['Pub_Year'] != 2023]
ndf['Pub_Year'].unique()
array([2010, 2014, 2022, 2015, 2020, 2018, 2017, 2019, 2009, 2012, 2006, 2011, 2021, 2013, 2008, 2016, 2007, 2002, 2005, 1995, 2001, 2004, 1997, 2000, 1999, 2003, 1996, 1956, 1998, 1978, 1991, 1990, 1979, 1994, 1982, 1969, 1963, 1992, 1988, 1993, 1973])
Now we will go through and create yet another DataFrame which includes the 'Pub_Year'.
Then we create a new column, 'Num_Published', which is the number/count of articles published each year.
ndf2 = ndf['Pub_Year'].value_counts().to_frame()   # Create a new DataFrame with the counts for each publication year (the years end up in the index)
ndf2.reset_index(inplace=True)                     # Make the index (which holds the years) into a regular column instead of the index
ndf2.rename({'Pub_Year':'Num_Published', 'index':'Pub_Year'}, axis=1, inplace=True) # Rename the column header names
ndf2.sort_values(by='Pub_Year', inplace=True) # Sort the data by publication year in ascending order
ndf2.reset_index(drop=True, inplace=True) # Clean up the index
ndf2 # Display results
  | Pub_Year | Num_Published
---|---|---
0 | 1956 | 1 |
1 | 1963 | 1 |
2 | 1969 | 1 |
3 | 1973 | 1 |
4 | 1978 | 1 |
5 | 1979 | 1 |
6 | 1982 | 1 |
7 | 1988 | 1 |
8 | 1990 | 8 |
9 | 1991 | 11 |
10 | 1992 | 8 |
11 | 1993 | 10 |
12 | 1994 | 20 |
13 | 1995 | 24 |
14 | 1996 | 29 |
15 | 1997 | 40 |
16 | 1998 | 42 |
17 | 1999 | 47 |
18 | 2000 | 61 |
19 | 2001 | 47 |
20 | 2002 | 52 |
21 | 2003 | 57 |
22 | 2004 | 73 |
23 | 2005 | 159 |
24 | 2006 | 185 |
25 | 2007 | 234 |
26 | 2008 | 250 |
27 | 2009 | 245 |
28 | 2010 | 311 |
29 | 2011 | 385 |
30 | 2012 | 436 |
31 | 2013 | 510 |
32 | 2014 | 520 |
33 | 2015 | 593 |
34 | 2016 | 767 |
35 | 2017 | 910 |
36 | 2018 | 996 |
37 | 2019 | 1150 |
38 | 2020 | 1370 |
39 | 2021 | 1453 |
40 | 2022 | 1498 |
Next we want to get the number of citations per year as another DataFrame.
Then we store the results as a new column in our main DataFrame (ndf2).
cit_df = ndf.groupby(['Pub_Year']).sum()
ndf2['Cited_Count'] = cit_df['Cited_Count'].to_list()
ndf2.tail()
  | Pub_Year | Num_Published | Cited_Count
---|---|---|---
36 | 2018 | 996 | 70067 |
37 | 2019 | 1150 | 87021 |
38 | 2020 | 1370 | 101347 |
39 | 2021 | 1453 | 120629 |
40 | 2022 | 1498 | 125599 |
We can finally look at what our data looks like.
Our goal is to create a three (3) dimensional plot. NOT a 3D plot! I'll explain this when we look at the plot itself.
You can actually easily display 6, 7, 8, and, if careful, more than 8 dimensions of data using size, color, position, etc. (a quick sketch follows below).
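As a hedged illustration of packing several data dimensions into one flat plot, here is a sketch using the gapminder sample data that ships with Plotly (not our WoS data):
import plotly.express as px

gap = px.data.gapminder().query("year == 2007")
fig = px.scatter(gap,
                 x="gdpPercap", y="lifeExp",   # dimensions 1 and 2: position
                 color="continent",            # dimension 3: color
                 size="pop",                   # dimension 4: marker size
                 hover_name="country")         # dimension 5: revealed on hover
fig.show()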
We will create a plotly.go.Scatter plot somewhat like we previously made.
In addition we will customize the size, colors and even the hover text for our data.
fig = go.Figure()                                                 # Create a new plotly.go figure
fig.add_trace(go.Scatter(x=ndf2['Pub_Year'],                      # X-axis = publication year
                         y=ndf2['Num_Published'],                 # Y-axis = number of items published each year
                         name='Cited',                            # A special name which has numerous uses which we will explain when we look at the plot
                         mode='markers',                          # The kind of scatter plot, just markers; we could add lines but we don't want any
                         marker=dict(color=ndf2['Cited_Count'],   # Describe what the markers look like, set their color based on citation count
                                     size=(np.log(ndf2['Cited_Count']+.01)**2),  # Set the size of the markers on a log scale
                                     showscale=True,              # Display the color scale (transfer function colorbar)
                                     colorbar=dict(title='<b>Number of Citations per Year</b>'),  # Set the title for the colorscale
                                     opacity=0.75,                # Reduce the opacity of the markers
                                     ),
                         hovertemplate="<b>Year:</b> %{x}" +                   # Display the publication year for the moused-over/hover marker
                                       "<br><b># Publications:</b> %{y}" +     # Display the number of publications/year for the hover marker
                                       "<br><b># Citations made:</b> %{text}", # Display the citations/year for the hover marker - note we need the text
                         text=ndf2['Cited_Count']                 # Set the 'text' used for the line above
                         )
              )
fig.update_layout(title="<b>Web of Science: 'Climate and Art'<br>Number of Publications and Citations per Year</b>", # Set the title for the figure/plot
xaxis_title='<b>Year</b>', # Set the X-axis label title
yaxis_title='<b># Publications</b>', # Set the Y-axis label title
width=1200, # Set the figure width
height=800, # Set the figure height
)
fig.update_layout(hoverlabel=dict(bgcolor="white", # Set the marker hover over panel to background color to white
font_size=16, # Set hover over font size
font_family="Arial", # Set hover over font family
)
)
fig.show() # Display the plot
Remember all that work we did to create the country location for each author?
Let's not let all that work go in vain; let's create a nice plot of it.
One of the greatest benefits of using Plotly over Matplotlib, or even 'most' other packages, is the ease of creating geographic plots.
If you have ever dealt with geographic plotting in other packages you know all the work needed to deal with 'projections' of the locations (latitude/longitude).
But Plotly does all this in the background for you!!!
A choropleth plot is a geographic plot based on some regional definition and statistics for each segment of those regions (for us, countries).
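Before building ours with graph objects, here is a minimal Plotly Express sketch of the idea, using the gapminder sample data that ships with Plotly (its 'iso_alpha' column already holds 3-letter country codes):
import plotly.express as px

gap = px.data.gapminder().query("year == 2007")
fig = px.choropleth(gap,
                    locations="iso_alpha",           # 3-letter country codes define the regions
                    color="lifeExp",                 # the statistic mapped to color
                    hover_name="country",
                    color_continuous_scale="Turbo")
fig.show()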
df = pd.read_csv(os.path.join('Processed_data', 'Combined_lists.csv'))
We need to flatten out our df['Country'] column, which consists of a list of strings (country names) in each row.
NOTE: Saving our data out as a .csv means we saved our data in a textual format!
This is a MAJOR problem for things like our df['Country'] column, because the list in each row becomes a string representation of a list (NOT a real list data type!!!)
There are two ways around this issue: rebuild the lists while reading the file, or rebuild them afterwards with ast.literal_eval.
We will go through the process using the latter method (a sketch of the former follows below).
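For completeness, here is a hedged sketch of the read-time route (df_alt is just an illustrative name); it assumes every 'Country' entry parses as a Python literal, and note that the string 'None' would become the value None here instead of being skipped, so downstream code would need to handle that:
from ast import literal_eval
import os
import pandas as pd

df_alt = pd.read_csv(os.path.join('Processed_data', 'Combined_lists.csv'),
                     converters={'Country': literal_eval})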
temp = []                                             # Create a temporary empty list
for i in range(df.shape[0]):                          # Iterate through each row (yes, this is different from iterrows but does much the same thing)
    if df['Country'][i] == 'None':                    # If the row contains the 'None' we added earlier, skip it and continue with the for-loop
        continue
    else:
        temp.append(literal_eval(df['Country'][i]))   # Use ast.literal_eval to turn the stringified list back into a real list, then append it to our temporary list
author_countries = [item for sublist in temp for item in sublist]  # Iterate through the temporary list of lists and flatten it into a single list of countries
Next we will create our final DataFrame of the data we want, which will include the count of authors per country and the 3-letter country code (ISO alpha-3) for each country, stored in a column we will call 'FIPS'.
u_countries = Counter(author_countries).keys()             # Create a 'Counter' for each country and get the dictionary 'keys' from the Counter
u_count = Counter(author_countries).values()               # Create a 'Counter' for each country and get the dictionary 'values', which is the count of authors per country
df_countries = pd.DataFrame({'Country':u_countries, 'Counts':u_count})  # Create a new DataFrame with this data
temp = []                                                  # Create a temporary list to store the country codes in
for i in df_countries['Country']:                          # Iterate through each row/country
    temp.append(pycountry.countries.get(name=i).alpha_3)   # Use pycountry to find the 3-letter 'alpha_3' country code and append it to our temporary list
df_countries['FIPS'] = temp                                # Add this information to our DataFrame as a new 'FIPS' column
df_countries
  | Country | Counts | FIPS
---|---|---|---
0 | United Kingdom | 2226 | GBR |
1 | Argentina | 84 | ARG |
2 | Canada | 813 | CAN |
3 | Netherlands | 820 | NLD |
4 | Spain | 824 | ESP |
... | ... | ... | ... |
154 | Seychelles | 2 | SYC |
155 | Uzbekistan | 2 | UZB |
156 | Guyana | 1 | GUY |
157 | Brunei Darussalam | 2 | BRN |
158 | Yemen | 2 | YEM |
159 rows × 3 columns
And we finally get to making our last plot - the Choropleth of number of authors per country.
fig = go.Figure()                                      # Create a new plotly.go figure
fig.add_trace(go.Choropleth(                           # Add a Choropleth plot to the figure
              locations = df_countries['FIPS'],        # Use our 3-letter country codes (the 'FIPS' column) to define the regions of interest (ROI) for the choropleth
              z = df_countries['Counts'],              # Use our author counts/country as the data source for each coded country
              colorscale = 'Turbo',                    # Set the transfer function color map to 'Turbo'
              marker_line_color='darkgray',            # Set the country border color to dark gray
              marker_line_width=0.5,                   # Set the country border line width
              )
             )
scale = 1.5                                            # Create a scale factor for sizing the figure
fig.update_layout(
    width = 1024*scale,                                # Set the figure width
    height=728*scale,                                  # Set the figure height
    geo=dict(                                          # Create a dictionary describing how to customize the choropleth
        showframe=False,                               # Do not show a frame around the plot
        showcoastlines=True,                           # Show coastline borders
        projection_type='equirectangular'              # Set the geo 'projection' we want to display our map in
    ),
)
fig.show()                                             # Display the figure
You Try It: Modify the choropleth plot to add the following: a colorbar title, hover text showing the country name and number of authors, and a figure title.
Solution:
fig = go.Figure()                                              # Create a new plotly.go figure
fig.add_trace(go.Choropleth(                                   # Add a Choropleth plot to the figure
              locations = df_countries['FIPS'],                # Use our 3-letter country codes (the 'FIPS' column) to define the regions of interest (ROI) for the choropleth
              z = df_countries['Counts'],                      # Use our author counts/country as the data source for each coded country
              colorscale = 'Turbo',                            # Set the transfer function color map to 'Turbo'
              marker_line_color='darkgray',                    # Set the country border color to dark gray
              marker_line_width=0.5,                           # Set the country border line width
              colorbar_title = '<b>Authors per Country</b>',   # Add a title to our colorbar
              text = df_countries['Country'],                  # The textual name for each coded region (which is the country name in our case)
              hovertemplate="<b>Country: </b> %{text}" +       # Display the country name for the moused-over/hover region
                            "<br><b># Authors:</b> %{z}",      # Display the number of authors/country for the moused-over/hover region
              )
             )
scale = 1.5                                                    # Create a scale factor for sizing the figure
fig.update_layout(
    title_text='<b>Number of Authors per Country</b>',         # Add a title to our figure/plot
    width = 1024*scale,                                         # Set the figure width
    height=728*scale,                                           # Set the figure height
    geo=dict(                                                   # Create a dictionary describing how to customize the choropleth
        showframe=False,                                        # Do not show a frame around the plot
        showcoastlines=True,                                    # Show coastline borders
        projection_type='equirectangular'                       # Set the geo 'projection' we want to display our map in
    ),
)
fig.show()                                                      # Display the figure
I normally teach a couple different multi-week series on data visualization for CRDDS.
I'm not sure if I will be teaching those, but I will be teaching something graphical in nature for Love Data Week in February.
So please stay tuned to Kim's weekly CRDDS Newsletters with updates.
If there is something that you would dearly love to see in a workshop PLEASE reach out to us to let us know.
No promises but we LOVE to help!!!