By now there's no need for an introduction to the importance of data and working with data.
Today we will explore the use of Python for data and data analysis.
We will follow the schedule below, and by the end you will be exhausted - promise!
The good news is that I've tried to make these Notebooks as easy to follow as possible.
Sometimes you need to do something outside the normal realm of Jupyter Notebooks.
One major example, which we will take advantage of right now, is the ability to install packages from Jupyter.
For instance we can use pip or conda, but for speed please consider always using pip from Jupyter, or else jump back out to the command line.
There's a delimiter to tell Jupyter that you want to invoke command-line commands, and that's a bang/exclamation mark, '!'
Example:
!pip install pandas
If you have not already done so, or are not sure whether you have installed the following packages, and you receive a 'module not found' error, then please uncomment the required lines in the cell below the imports cell and execute/run that cell.
If you have run any install line(s), then make sure you go back and re-execute the imports cell.
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import glob
import os
import pycountry
import requests
from io import BytesIO
from bs4 import BeautifulSoup
from collections import Counter
#!pip install pandas
#!pip install numpy
#!pip install bs4
#!pip install plotly
#!pip install ipywidgets
#!pip install requests
#!pip install fastparquet
#!pip install pycountry
#!pip install nltk
NOTE: There will be a couple of others we need, but I'm intentionally going to wait until 'after' we need them to install them!
We will move on and talk really quickly about Python in general while these are installing.
Actually, let's step back a little bit first and just look at simple variables.
A variable is just a user-defined 'name' which references some sort of data. For example:
foo = 37 # The true answer to life, the universe, and just everything! At least nature seems to think so!
Here we have created a variable named foo and stored the value 37 in it. It is followed by a comment on what the value represents.
The true power, beauty, and, sadly, part of its dark side, is that EVERYTHING in Python is a non-typed object. Here a type can be defined as a bunch of text, an integer value, a float value, a long list or matrix of values, or, well, anything you can dream up. In Python, at least until recently, you never worry about defining the type of data you are creating and/or using like you do in most every other language. Python 'interprets' (spelled: best guess, but almost always correct) the type for you behind the scenes. Hence why Python is known as an interpreted language, as compared to a compiled language.
NOTE: In the past few releases, and even more so in the future, the concept of type hinting has been made available to the hardcore coders that seek performance and data-type safety. However, due to how it works even deeper under the covers (the GIL, if you know a bit about Python), even these type hints are just that, 'hints', and Python could conceivably interpret something different at run time. It won't 'most of the time', but it could! We will not be covering type hints at all today!
Beyond storing a single piece of data in a variable, we have specialized data structures, also referred to as containers, to help us out.
There are many, but the vast majority of the time the base containers you will run into and use are:
Lists []: Lists are what the name claims - a list of values.
Each item is indexed (behind the scenes) starting from zero: the second index is 1, the third 2, etc.
Lists use '[]' to enclose the list of data.
Example: foo = ['Zeroth item', 1, '2', '3rd item']
NOTE: All 'text' items are enclosed in either single quotes or double quotes. It pretty much makes no difference which (well, till you need both at the same time - more later!).
Lists are 'mutable', meaning you can change them at will!
<b>NOTE:</b> Arrays and matrices are specialized lists in numpy; outside of numpy, arrays are indeed lists and matrices are lists of lists.
Tuples (): Tuples are just like lists, but they are 'immutable', meaning once defined they can not be changed.
Tuples use '()' to enclose the list (err, tuple) of data.
Example: months = ('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December')
Sets {}: Think of sets just like a tuple, except they can not have multiple occurrences of the same element.
Also, they are unordered and thus have no slicing/indexing available to them. Normally you will not ever have to deal with sets.
However, there are occasions with Pandas where you will, and we will see one of those occasions today.
Lastly, there's the all-powerful dictionary.
Dictionaries use '{}' to enclose their key/value pairs. In addition, you assign each key's value using a colon ':'.
Example: colors = {'Sun': 'is Orange', 'Grapes': 'are purple', 'red': 'rgb(255, 0, 0)'}
Dictionaries are mutable.
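Here is a minimal sketch of all four containers side by side (the values are just illustrative):

```python
# Lists are mutable and zero-indexed
foo = ['Zeroth item', 1, '2', '3rd item']
foo[1] = 'one'                # fine: lists can be changed in place
print(foo[1])                 # one

# Tuples look similar but are immutable
months = ('January', 'February', 'March')
# months[0] = 'Jan'           # would raise a TypeError

# Sets silently drop duplicates and keep no order
colors = {'red', 'green', 'red', 'blue'}
print(len(colors))            # 3 - the second 'red' is gone

# Dictionaries map keys to values
info = {'Sun': 'is orange', 'Grapes': 'are purple'}
print(info['Sun'])            # is orange
```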
Here's a quick comparison of 'some' of their key traits:
Now the real fun comes in when you understand that you can compose each of these containers into super-containers which hold all kinds of other containers. For example:
a list of lists
a list of dictionaries
a list of lists of lists of lists of- (think matrices, as both vectors and matrices are actually lists in Python land!)
a dictionary of lists
a dictionary of dictionaries
a dictionary of functions (I personally love using these in my apps as part of callback functionality!)
The list (sorry, pun not intended) goes on as far as your imagination will take you!
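A few of these compositions in action (the names and values are just illustrative):

```python
# A list of dictionaries - a very common shape for record-style data
records = [
    {'name': 'foo', 'value': 37},
    {'name': 'bar', 'value': 42},
]
print(records[1]['name'])     # bar

# A dictionary of lists - one list per 'column'
columns = {'name': ['foo', 'bar'], 'value': [37, 42]}
print(columns['value'][0])    # 37

# A dictionary of functions - handy for callback-style dispatch
actions = {'double': lambda x: 2 * x, 'square': lambda x: x * x}
print(actions['square'](6))   # 36
```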
For data science, the two you will use almost all the time are lists and dictionaries.
Well, till we dive into Pandas and DataFrames, which are different, yet under the covers are just composed of standard containers!
There are several other main Python features we need to discuss, such as for-loops, functions, magics, and others. But we will explore those when we run into them.
Almost all programming languages today are built on the concept of object-oriented design.
Meaning code should be made modular, or grouped, such that sections that can or need to be used more than once are separated out from the base 'linear' code.
We will mention a few major categorizations used in Python and most languages, but ignore the details, as this would become a semester-long class!
Functions - At the basest level of object orientation (think repeatable and reusable design, for those new to coding) we have functions.
A function is nothing more than a special 'grouping' of code that can be called whenever needed to execute the code therein.
Functions can take any number of parameters, whether constant or variable in length.
Functions in Python can return nothing, or as many values as desired.
Functions have an 'internal' scope, meaning whatever variables are used in them stay in them unless 'returned'.
We will use functions here and there today and we will explore both their structure and importance when we use them.
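A minimal sketch of those traits (the function name and values here are just made up for illustration):

```python
def describe(name, *values):
    """Return how many values were given and their total."""
    total = sum(values)           # 'total' lives only inside the function
    return len(values), total     # a function can return multiple values

count, total = describe('foo', 1, 2, 3)
print(f'{count} values summing to {total}')   # 3 values summing to 6
```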
Classes - Beyond functions, there's another important object-oriented collection of code known as classes.
We will use classes today, as provided by the packages we import.
However, we will not be talking about classes themselves at all today.
Packages - Think of these as groups of classes (OK, this is like defining 'food' as 'something we eat', but---)
In the wonderful world of Python there are 10 million ways to do anything/everything (and that's just scratching the surface!)
Printing output is a prime example, as there are NUMEROUS ways to format print statements!
For today I will use the current standard, the f-string, as follows;
print(f'stuff {variable or code}')
foo = 37
notfoo = 42
print(f'The closer to true answer to Life, the Universe, and just Everything is {37} and not {notfoo}!')
The closer to true answer to Life, the Universe, and just Everything is 37 and not 42!
This is a VAST area to explore, as data is stored in diverse places, ways, formats, and all sorts of other guises, making 'Where can I find, and how do I get, such-and-such data?' one of the most asked questions.
Sadly this is such a dense and complex area that we will look at just one of the many ways to obtain data that you don't already have.
As we all, hopefully, know, there are major differences between how macOS, Windows, Linux, and others handle files.
Even the organization of the directory/folder paths is different between operating systems.
To work safely, so anyone on most any system can reproduce your code, we will use Python's 'os' package, which does most of the nasty work for us.
In particular, for today, we will utilize the 'os.path' sub-module, which allows us to work with file/directory/folder paths and naming.
What we are most interested in is joining file names to their directory/folder paths.
We can load the package using the Python import as;
import os
We can then use the 'join' function to join a directory/folder path to the file name.
For example:
```python
import os

folderpath = 'Data'
filename = 'foo.txt'

fname = os.path.join(folderpath, filename)
print(fname)
```
Note: In the above example the path is relative to our current working directory. We could just as easily create a direct path to data anywhere on any drive you have access to.
This is all fine and dandy, and exactly what we need, but we are interested in reproducible practices.
So let's pretend you are in a workshop and will be placing data into a folder that was not pre-created for you (oh wait, that's exactly what's happening today!)
When working reproducibly we should check to see if the desired directory/folder exists, and if not, then create it.
Yes, you can do this manually but, again, we are interested in reproducible workflows, so don't assume ANYTHING about an end user.
We will once again make use of both os and os.path to create a directory if it does not already exist.
First we will check to see if the desired folder exists, using a function nicely named 'exists()'.
If os can not find such a directory then we will use an os function called 'makedirs()' to create it.
if not os.path.exists('Raw_data'):
    os.makedirs('Raw_data')
Yes, for those that are familiar with this method, you may be thinking 'Uhmmm, why not just use the input parameter setup with exist_ok=True?'
It's nothing more than coding style. Both will work, and it would shorten the above code to a single line;
os.makedirs('Raw_data', exist_ok=True)
But often you may want to add other things into the code, and the first method we covered is more extensible for this behavior.
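For example, the explicit check extends naturally when you want different behavior on each branch; a minimal sketch (the printed messages are just illustrative):

```python
import os

folder = 'Raw_data'   # the folder name used in this session
if not os.path.exists(folder):
    os.makedirs(folder)
    print(f'Created {folder}')                 # extra behavior only on creation
else:
    print(f'{folder} already exists - nothing to do')
```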
Alternatively, for the more advanced developers building larger, more complex Python programs, you may opt for something known as try-except exception handling.
I will not be going into advanced Python coding today.
But to give you an idea how this works, you would do something like this.
import errno

try:
    os.makedirs('Data')
except OSError as e:
    if e.errno != errno.EEXIST:
        raise
Now that we know how to work with folders and file paths, it's time to start thinking about our data!
The data we will be working with today is from Web of Science, which we talked about on the first day.
I generated the data by searching for all topics relating to 'Climate and Art'.
I then saved the complete results out in an Excel file format.
Note that you are limited in how many entries you can save at a time; therefore we have 13 files to cover all results.
I had planned on creating and posting a quick video on how this data was generated, but have yet to actually do that. I am still hoping to do so.
I have intentionally stored the data for this session in a different repo, just to demo one way to access data via coding.
This method works great for data found via URLs, such as GitHub, GitLab, OSF, Kaggle, etc.
NOTE: This will NOT work for data served off of specialized data servers such as THREDDS (TDS), EROS, NOMADS, EOSDIS, etc.
All of these have their own Python or other APIs you will be required to use to access those data.
url = 'https://github.com/auroraTDunn/CRDDS_Data_Bootcamp_Jan_2023/'
soup = BeautifulSoup(requests.get(url).content, "html.parser")
files = []
for link in soup.select('a[href*=".xls"]'):
file = link['href']
files.append(file)
files
['/auroraTDunn/CRDDS_Data_Bootcamp_Jan_2023/blob/main/ClimateAndArt_01.xls', '/auroraTDunn/CRDDS_Data_Bootcamp_Jan_2023/blob/main/ClimateAndArt_02.xls', '/auroraTDunn/CRDDS_Data_Bootcamp_Jan_2023/blob/main/ClimateAndArt_03.xls', '/auroraTDunn/CRDDS_Data_Bootcamp_Jan_2023/blob/main/ClimateAndArt_04.xls', '/auroraTDunn/CRDDS_Data_Bootcamp_Jan_2023/blob/main/ClimateAndArt_05.xls', '/auroraTDunn/CRDDS_Data_Bootcamp_Jan_2023/blob/main/ClimateAndArt_06.xls', '/auroraTDunn/CRDDS_Data_Bootcamp_Jan_2023/blob/main/ClimateAndArt_07.xls', '/auroraTDunn/CRDDS_Data_Bootcamp_Jan_2023/blob/main/ClimateAndArt_08.xls', '/auroraTDunn/CRDDS_Data_Bootcamp_Jan_2023/blob/main/ClimateAndArt_09.xls', '/auroraTDunn/CRDDS_Data_Bootcamp_Jan_2023/blob/main/ClimateAndArt_10.xls', '/auroraTDunn/CRDDS_Data_Bootcamp_Jan_2023/blob/main/ClimateAndArt_11.xls', '/auroraTDunn/CRDDS_Data_Bootcamp_Jan_2023/blob/main/ClimateAndArt_12.xls', '/auroraTDunn/CRDDS_Data_Bootcamp_Jan_2023/blob/main/ClimateAndArt_13.xls']
When you run the above code as-is, you will end up with a list of mostly complete links to each Excel (.xls) file in the GitHub repository.
From here we have two obvious choices we can make to get the info we need to download our files (yes, there are many others too - welcome to Python);
We will choose the latter, as in the end it's powerful to learn this method for a great many data processing needs.
To do this we will use the Python split() function.
There are many things we can do with split(), but for us we will tell it specifically where we want to split.
For us the place to split at is each '/'.
We will walk through this process below so you can see it in action.
test = files[0]
test
test = test.split('/')
test
test[-1]
'ClimateAndArt_01.xls'
We can accomplish all of that with Python method chaining, like this:
file = file.split('/')[-1]
We can now rewrite our code to get just the file names for each Excel file in that particular GitHub repo.
url = 'https://github.com/auroraTDunn/CRDDS_Data_Bootcamp_Jan_2023/'
soup = BeautifulSoup(requests.get(url).content, "html.parser")
files = []
for link in soup.select('a[href*=".xls"]'):
file = link['href']
file = file.split('/')[-1]
files.append(file)
files
['ClimateAndArt_01.xls', 'ClimateAndArt_02.xls', 'ClimateAndArt_03.xls', 'ClimateAndArt_04.xls', 'ClimateAndArt_05.xls', 'ClimateAndArt_06.xls', 'ClimateAndArt_07.xls', 'ClimateAndArt_08.xls', 'ClimateAndArt_09.xls', 'ClimateAndArt_10.xls', 'ClimateAndArt_11.xls', 'ClimateAndArt_12.xls', 'ClimateAndArt_13.xls']
Now that we have a list of the files we can now work on downloading and storing them locally.
We can actually merge the two 'file = ' lines and we will discuss how and why this works later.
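As a quick preview of that merge, a plain dict can stand in here for a BeautifulSoup link tag, since both support the ['href'] lookup:

```python
# Hypothetical stand-in for one of the soup.select() results
link = {'href': '/auroraTDunn/CRDDS_Data_Bootcamp_Jan_2023/blob/main/ClimateAndArt_01.xls'}

# The two separate assignments...
file = link['href']
file = file.split('/')[-1]

# ...collapse into one chained expression
file_merged = link['href'].split('/')[-1]
print(file_merged)   # ClimateAndArt_01.xls
```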
Next we need to download each of the files and save them to our desired drive and directory.
Again, there are near endless ways to do this; we will look at two:
Why both? Many of you will spend most of your time working with Pandas for your data processing needs.
But Pandas is not the solution for everything. Often you have other needs, or file formats Pandas does not understand, therefore it's important to know how to work without it!
The requests library is a large HTTP package allowing all manner of Python control of HTTP needs.
For us, we will just concentrate on the get() function which, for our needs, will download the file at its URL into local memory.
Then we simply save the content of that file by opening and writing a new file.
url = 'https://github.com/auroraTDunn/CRDDS_Data_Bootcamp_Jan_2023'
for file in files:
    url_file = url + '/raw/main/' + file
    req = requests.get(url_file)
    fName = os.path.join('Raw_data', file)
    with open(fName, 'wb') as f:
        f.write(req.content)
You can also download and store a remote file with Pandas.
Some formats, such as .csv files, just need the URL to the file, and you load it as you would a local file.
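For example, read_csv() looks the same whether the source is a local path or an http(s) URL; here we write a tiny throwaway CSV first so the sketch is self-contained (the file name is just illustrative):

```python
import pandas as pd

# Create a tiny throwaway CSV so this example stands on its own; with
# remote data you would pass the file's URL straight to read_csv()
with open('tiny_example.csv', 'w') as f:
    f.write('name,value\nfoo,37\nbar,42\n')

# read_csv() accepts a local path or a URL interchangeably
df_tiny = pd.read_csv('tiny_example.csv')
print(df_tiny.shape)   # (2, 2)
```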
Excel stores its data in a more binary format, and that requires a bit more effort.
We will simplify this process a bit by once again utilizing the requests.get() function to actually download the remote data's binary 'content'.
Then we read that 'content' data into an Excel-derived DataFrame by casting the binary data into a Pandas-understandable format with BytesIO().
Lastly we save it out as an .xls file (which, happily, Pandas will understand automatically from here on out).
NOTE: Using to_excel() will in all probability give you a deprecation warning. Just ignore it until you want to work on updating your software stack to use openpyxl.
Alternatively, and better yet, save the file in a different format. We will talk about this more shortly.
#url = 'https://github.com/auroraTDunn/CRDDS_Data_Bootcamp_Jan_2023'
for file in files:
    url_file = (url + '/raw/main/' + file)
    data = requests.get(url_file).content
    df = pd.read_excel(BytesIO(data))
    fName = os.path.join('Raw_data', file)
    df.to_excel(fName, index=False)
QUESTION: Which of these two methods is best for you to use?
Answer:
If you only want to work with one Excel file, then I would suggest using the latter, as you will already have all your Excel data up and ready in a DataFrame.
If you have multiple files and want to keep them in an Excel format, then use the first, more direct method.
We now have all our files downloaded, but what we really want is a single file to work with.
Luckily our files have the exact same columns in each file. If your data has different column names, layouts, etc. then additional prep work should be done first!
Such methods we will discuss shortly, but for different purposes.
The first thing we will do is create a "path" to our raw data.
# Create a path to the data files from where our Notebook is at
path = os.path.join("Raw_data","*.xls")
With the path to all our Excel files we can now work on merging them all into one file.
In our first attempt we will do it in the most simplistic way, which is perfectly fine for our small number of files and the data inside those files.
# Create an empty DataFrame
df = pd.DataFrame([])
# Loop through each Excel in the Raw_data folder
# read it into a temporary DataFrame
# then concatenate the file to our DataFrame
for file in glob.glob(path):
    print(f'file name: {file}')
    df_temp = pd.read_excel(file)
    df = pd.concat([df, df_temp], ignore_index=True)
file name: Raw_data\ClimateAndArt_01.xls
file name: Raw_data\ClimateAndArt_02.xls
file name: Raw_data\ClimateAndArt_03.xls
file name: Raw_data\ClimateAndArt_04.xls
file name: Raw_data\ClimateAndArt_05.xls
file name: Raw_data\ClimateAndArt_06.xls
file name: Raw_data\ClimateAndArt_07.xls
file name: Raw_data\ClimateAndArt_08.xls
file name: Raw_data\ClimateAndArt_09.xls
file name: Raw_data\ClimateAndArt_10.xls
file name: Raw_data\ClimateAndArt_11.xls
file name: Raw_data\ClimateAndArt_12.xls
file name: Raw_data\ClimateAndArt_13.xls
Note that concat needs a series/list of DataFrames as its input parameter.
Using this to our advantage, we can simplify the code to just three lines (noting that the first line creates our path to the data).
# Create a list of DataFrames for each Excel file in the "path" data directory
df_temp = (pd.read_excel(f) for f in glob.glob(path))
# Now concat the list of DataFrames together.
df = pd.concat(df_temp, ignore_index=True)
Ideally you can chain it all together into just two lines of code.
df_temp = (pd.read_excel(f) for f in glob.glob(os.path.join("Raw_data","*.xls")))
df = pd.concat(df_temp, ignore_index=True)
While we are only playing with an extremely small number of files, with only ~1000 rows in each file, there is no major advantage to any of the above methods, efficiency-wise.
But if you have many more, or vastly larger, files then you will find that these are presented from slowest to fastest methods.
Note: These are not the only methods to merge multiple files together. Other methods include using the map() function and/or lambda functions. If you are sharing your code with others, beware the use of lambda functions. While they are fast, efficient, and extremely powerful, many people see lambda in a line of code and, well, unhappy thoughts stream through their heads.
Also note that if you are playing with truly large data (hundreds or more files, or specialized data formats such as HDF5, netCDF, and many others) then there are vastly superior ways to work with these data. We will discuss those methods this afternoon!
Now that we have all our data in one DataFrame, we should be smart and save that data out.
Again, being just a small DataFrame, we can save it out to pretty much anything we want with no concerns about efficiency.
So we will just save it out to a common .csv file.
Later we will look at better formats for larger data.
First let's create a new directory to store all our processed data into.
One of the great things about DataFrames, regardless of the language you are coding in, is that the DataFrame is just a massive data object.
Meaning it does not matter what format the data started out in; you can store it to any other available file format, or even more than one.
So for our case we will save our merged Excel files into a single csv file.
We save it using the Pandas function to_csv().
We will also make use of the special parameter index=False; this tells to_csv() to NOT save the built-in DataFrame index into the file. If we did not do this, then when we load it later the index would appear as a new column!
# Name of our new directory
proc_dir = 'Processed_data'
# If it does not already exists then create it
if not os.path.exists(proc_dir):
os.makedirs(proc_dir)
df.to_csv(os.path.join(proc_dir, 'Combined_lists.csv'), index=False)
Question: Is csv the best file format for our needs?
Answer
No! It in fact is not, as we shall see in Part 2 and discuss in more detail in Part 3.
But hey, why not - it's the one most people use most of the time, so it's good to see what happens when 'stuff' happens!
NOTE: If you are playing with much larger datasets then other formats would be vastly better, and we will discuss those in Part 3.
There are numerous ways to explore the data, but first you want a semi-decent idea of what's in the data so you know how to explore it.
There are a great many ways to do this, but for now we will look at the quick solutions Pandas has to offer us.
The simplest method is to just have the Notebook display the info, kind of like printing it out would.
Note you can only do it this way in a Notebook of some sort (yes, there are many different types of Notebooks by various vendors!)
df
Publication Type | Authors | Book Authors | Book Editors | Book Group Authors | Author Full Names | Book Author Full Names | Group Authors | Article Title | Source Title | ... | Web of Science Index | Research Areas | IDS Number | Pubmed Id | Open Access Designations | Highly Cited Status | Hot Paper Status | Date of Export | UT (Unique WOS ID) | Web of Science Record | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | J | Miles, M | NaN | NaN | NaN | Miles, Malcolm | NaN | NaN | Representing nature: art and climate change | CULTURAL GEOGRAPHIES | ... | Social Science Citation Index (SSCI); Arts &am... | Environmental Sciences & Ecology; Geography | 549EU | NaN | NaN | NaN | NaN | 2022-12-21 | WOS:000274029500002 | 0 |
1 | J | Dal Farra, R; Suarez, P | NaN | NaN | NaN | Dal Farra, Ricardo; Suarez, Pablo | NaN | NaN | RED CROSS/RED CRESCENT CLIMATE CENTRE AND BALA... | LEONARDO | ... | Arts & Humanities Citation Index (A&HCI) | Art | AP6YE | NaN | NaN | NaN | NaN | 2022-12-21 | WOS:000342223700017 | 0 |
2 | J | Chen, MH | NaN | NaN | NaN | Chen, Mei-Hsin | NaN | NaN | The Contribution of Art to Climate Change Comm... | REVISTA HUMANIDADES | ... | Emerging Sources Citation Index (ESCI) | Arts & Humanities - Other Topics | 2T9FT | NaN | gold, Green Submitted | NaN | NaN | 2022-12-21 | WOS:000822772700002 | 0 |
3 | J | Guy, S; Henshaw, V; Heidrich, O | NaN | NaN | NaN | Guy, Simon; Henshaw, Victoria; Heidrich, Oliver | NaN | NaN | Climate change, adaptation and Eco-Art in Sing... | JOURNAL OF ENVIRONMENTAL PLANNING AND MANAGEMENT | ... | Social Science Citation Index (SSCI); Arts &am... | Development Studies; Public Administration | AS7TP | NaN | NaN | NaN | NaN | 2022-12-21 | WOS:000344457900003 | 0 |
4 | J | Baztan, J; Vanderlinden, JP; Jaffres, L; Jorge... | NaN | NaN | NaN | Baztan, Juan; Vanderlinden, Jean-Paul; Jaffres... | NaN | NaN | Facing climate injustices: Community trust-bui... | CLIMATE RISK MANAGEMENT | ... | Science Citation Index Expanded (SCI-EXPANDED)... | Environmental Sciences & Ecology; Meteorology ... | PK4RQ | 33106769.0 | Green Published, gold | NaN | NaN | 2022-12-21 | WOS:000602434600001 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
12681 | J | Seregin, AP; Bochkov, DA; Shner, JV; Garin, EV... | NaN | NaN | NaN | Seregin, Alexey P.; Bochkov, Dmitriy A.; Shner... | NaN | NaN | Flora of Russia on iNaturalist: a dataset | BIODIVERSITY DATA JOURNAL | ... | Science Citation Index Expanded (SCI-EXPANDED) | Biodiversity & Conservation | OT0QT | 33244292.0 | Green Submitted, gold, Green Published | NaN | NaN | 2022-12-21 | WOS:000590560300001 | 0 |
12682 | J | Keith, P; Mennesson, MI | NaN | NaN | NaN | Keith, Philippe; Mennesson, Marion, I | NaN | NaN | Review of Ophiocara (Teleostei: Butidae) from ... | CYBIUM | ... | Science Citation Index Expanded (SCI-EXPANDED) | Zoology | ST1IC | NaN | NaN | NaN | NaN | 2022-12-21 | WOS:000662202900002 | 0 |
12683 | J | Kattge, J; Diaz, S; Lavorel, S; Prentice, C; L... | NaN | NaN | NaN | Kattge, J.; Diaz, S.; Lavorel, S.; Prentice, C... | NaN | NaN | TRY - a global database of plant traits | GLOBAL CHANGE BIOLOGY | ... | Science Citation Index Expanded (SCI-EXPANDED) | Biodiversity & Conservation; Environmental Sci... | 800WS | NaN | Green Published, Green Submitted, Bronze, Gree... | NaN | NaN | 2022-12-21 | WOS:000293399000011 | 0 |
12684 | J | Narasimhan, VM; Patterson, N; Moorjani, P; Roh... | NaN | NaN | NaN | Narasimhan, Vagheesh M.; Patterson, Nick; Moor... | NaN | NaN | The formation of human populations in South an... | SCIENCE | ... | Science Citation Index Expanded (SCI-EXPANDED)... | Science & Technology - Other Topics | IW1MK | 31488661.0 | Green Submitted, Green Accepted | Y | N | 2022-12-21 | WOS:000484732700038 | 0 |
12685 | J | Potzelsberger, E; Gossner, MM; Beenken, L; Gaz... | NaN | NaN | NaN | Poetzelsberger, Elisabeth; Gossner, Martin M.;... | NaN | NaN | Biotic threats for 23 major non-native tree sp... | SCIENTIFIC DATA | ... | Science Citation Index Expanded (SCI-EXPANDED) | Science & Technology - Other Topics | TY6PT | 34362931.0 | Green Published, gold | NaN | NaN | 2022-12-21 | WOS:000683905100002 | 0 |
12686 rows × 72 columns
Notice that it only displays the header and first five rows of data, then the last five rows.
Also notice that after the "Source Title" column there is another ellipsis followed by more data.
Just like rows, if there's too much information Jupyter will fold/hide data.
Usually you only want to look at just the first bit of info. For this you can just use the .head() function.
df.head()
Publication Type | Authors | Book Authors | Book Editors | Book Group Authors | Author Full Names | Book Author Full Names | Group Authors | Article Title | Source Title | ... | Web of Science Index | Research Areas | IDS Number | Pubmed Id | Open Access Designations | Highly Cited Status | Hot Paper Status | Date of Export | UT (Unique WOS ID) | Web of Science Record | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | J | Miles, M | NaN | NaN | NaN | Miles, Malcolm | NaN | NaN | Representing nature: art and climate change | CULTURAL GEOGRAPHIES | ... | Social Science Citation Index (SSCI); Arts &am... | Environmental Sciences & Ecology; Geography | 549EU | NaN | NaN | NaN | NaN | 2022-12-21 | WOS:000274029500002 | 0 |
1 | J | Dal Farra, R; Suarez, P | NaN | NaN | NaN | Dal Farra, Ricardo; Suarez, Pablo | NaN | NaN | RED CROSS/RED CRESCENT CLIMATE CENTRE AND BALA... | LEONARDO | ... | Arts & Humanities Citation Index (A&HCI) | Art | AP6YE | NaN | NaN | NaN | NaN | 2022-12-21 | WOS:000342223700017 | 0 |
2 | J | Chen, MH | NaN | NaN | NaN | Chen, Mei-Hsin | NaN | NaN | The Contribution of Art to Climate Change Comm... | REVISTA HUMANIDADES | ... | Emerging Sources Citation Index (ESCI) | Arts & Humanities - Other Topics | 2T9FT | NaN | gold, Green Submitted | NaN | NaN | 2022-12-21 | WOS:000822772700002 | 0 |
3 | J | Guy, S; Henshaw, V; Heidrich, O | NaN | NaN | NaN | Guy, Simon; Henshaw, Victoria; Heidrich, Oliver | NaN | NaN | Climate change, adaptation and Eco-Art in Sing... | JOURNAL OF ENVIRONMENTAL PLANNING AND MANAGEMENT | ... | Social Science Citation Index (SSCI); Arts &am... | Development Studies; Public Administration | AS7TP | NaN | NaN | NaN | NaN | 2022-12-21 | WOS:000344457900003 | 0 |
4 | J | Baztan, J; Vanderlinden, JP; Jaffres, L; Jorge... | NaN | NaN | NaN | Baztan, Juan; Vanderlinden, Jean-Paul; Jaffres... | NaN | NaN | Facing climate injustices: Community trust-bui... | CLIMATE RISK MANAGEMENT | ... | Science Citation Index Expanded (SCI-EXPANDED)... | Environmental Sciences & Ecology; Meteorology ... | PK4RQ | 33106769.0 | Green Published, gold | NaN | NaN | 2022-12-21 | WOS:000602434600001 | 0 |
5 rows × 72 columns
.head() can actually be told how many lines you wish to look at.
Well, to an extent anyway. You can't tell it an enormous number of lines and expect to see all of them.
df.head(10)
Publication Type | Authors | Book Authors | Book Editors | Book Group Authors | Author Full Names | Book Author Full Names | Group Authors | Article Title | Source Title | ... | Web of Science Index | Research Areas | IDS Number | Pubmed Id | Open Access Designations | Highly Cited Status | Hot Paper Status | Date of Export | UT (Unique WOS ID) | Web of Science Record | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | J | Miles, M | NaN | NaN | NaN | Miles, Malcolm | NaN | NaN | Representing nature: art and climate change | CULTURAL GEOGRAPHIES | ... | Social Science Citation Index (SSCI); Arts &am... | Environmental Sciences & Ecology; Geography | 549EU | NaN | NaN | NaN | NaN | 2022-12-21 | WOS:000274029500002 | 0 |
1 | J | Dal Farra, R; Suarez, P | NaN | NaN | NaN | Dal Farra, Ricardo; Suarez, Pablo | NaN | NaN | RED CROSS/RED CRESCENT CLIMATE CENTRE AND BALA... | LEONARDO | ... | Arts & Humanities Citation Index (A&HCI) | Art | AP6YE | NaN | NaN | NaN | NaN | 2022-12-21 | WOS:000342223700017 | 0 |
2 | J | Chen, MH | NaN | NaN | NaN | Chen, Mei-Hsin | NaN | NaN | The Contribution of Art to Climate Change Comm... | REVISTA HUMANIDADES | ... | Emerging Sources Citation Index (ESCI) | Arts & Humanities - Other Topics | 2T9FT | NaN | gold, Green Submitted | NaN | NaN | 2022-12-21 | WOS:000822772700002 | 0 |
3 | J | Guy, S; Henshaw, V; Heidrich, O | NaN | NaN | NaN | Guy, Simon; Henshaw, Victoria; Heidrich, Oliver | NaN | NaN | Climate change, adaptation and Eco-Art in Sing... | JOURNAL OF ENVIRONMENTAL PLANNING AND MANAGEMENT | ... | Social Science Citation Index (SSCI); Arts &am... | Development Studies; Public Administration | AS7TP | NaN | NaN | NaN | NaN | 2022-12-21 | WOS:000344457900003 | 0 |
4 | J | Baztan, J; Vanderlinden, JP; Jaffres, L; Jorge... | NaN | NaN | NaN | Baztan, Juan; Vanderlinden, Jean-Paul; Jaffres... | NaN | NaN | Facing climate injustices: Community trust-bui... | CLIMATE RISK MANAGEMENT | ... | Science Citation Index Expanded (SCI-EXPANDED)... | Environmental Sciences & Ecology; Meteorology ... | PK4RQ | 33106769.0 | Green Published, gold | NaN | NaN | 2022-12-21 | WOS:000602434600001 | 0 |
5 | J | Burke, M; Tickwell, D; Whitmarsh, L | NaN | NaN | NaN | Burke, Miriam; Tickwell, David; Whitmarsh, Lor... | NaN | NaN | Participatory arts and affective engagement wi... | GLOBAL ENVIRONMENTAL CHANGE-HUMAN AND POLICY D... | ... | Science Citation Index Expanded (SCI-EXPANDED)... | Environmental Sciences & Ecology; Geography | GC1AN | NaN | Green Accepted | NaN | NaN | 2022-12-21 | WOS:000429509100010 | 0 |
6 | J | Rodder, S | NaN | NaN | NaN | Roedder, Simone | NaN | NaN | The Climate of Science-Art and the Art-Science... | MINERVA | ... | Social Science Citation Index (SSCI); Arts &am... | Education & Educational Research; History & Ph... | EL1GB | NaN | NaN | NaN | NaN | 2022-12-21 | WOS:000394367400005 | 0 |
7 | J | Bentz, J; O'Brien, K | NaN | NaN | NaN | Bentz, Julia; O'Brien, Karen | NaN | NaN | ART FOR CHANGE: Transformative learning and yo... | ELEMENTA-SCIENCE OF THE ANTHROPOCENE | ... | Science Citation Index Expanded (SCI-EXPANDED)... | Environmental Sciences & Ecology; Meteorology ... | JZ5OV | NaN | gold, Green Published | NaN | NaN | 2022-12-21 | WOS:000505152300002 | 0 |
8 | J | Ture, C | NaN | NaN | NaN | Ture, Cengiz | NaN | NaN | THE ROLE OF VISUAL ARTS ON SOCIAL PERCEPTION A... | ANADOLU UNIVERSITESI SANAT & TASARIM DERGISI-A... | ... | Emerging Sources Citation Index (ESCI) | Art | VC9DG | NaN | NaN | NaN | NaN | 2022-12-21 | WOS:000435155100015 | 0 |
9 | J | Kim, S | NaN | NaN | NaN | Kim, Sunhee | NaN | NaN | Art therapy development in Korea: The current ... | ARTS IN PSYCHOTHERAPY | ... | Social Science Citation Index (SSCI); Arts &am... | Psychology; Rehabilitation | 414XY | NaN | NaN | NaN | NaN | 2022-12-21 | WOS:000263899700001 | 0 |
10 rows × 72 columns
Likewise you can look at the end of the DataFrame using the .tail() function.
And just like head(), you can specify a number of lines you want to look at.
df.tail()
Publication Type | Authors | Book Authors | Book Editors | Book Group Authors | Author Full Names | Book Author Full Names | Group Authors | Article Title | Source Title | ... | Web of Science Index | Research Areas | IDS Number | Pubmed Id | Open Access Designations | Highly Cited Status | Hot Paper Status | Date of Export | UT (Unique WOS ID) | Web of Science Record | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12681 | J | Seregin, AP; Bochkov, DA; Shner, JV; Garin, EV... | NaN | NaN | NaN | Seregin, Alexey P.; Bochkov, Dmitriy A.; Shner... | NaN | NaN | Flora of Russia on iNaturalist: a dataset | BIODIVERSITY DATA JOURNAL | ... | Science Citation Index Expanded (SCI-EXPANDED) | Biodiversity & Conservation | OT0QT | 33244292.0 | Green Submitted, gold, Green Published | NaN | NaN | 2022-12-21 | WOS:000590560300001 | 0 |
12682 | J | Keith, P; Mennesson, MI | NaN | NaN | NaN | Keith, Philippe; Mennesson, Marion, I | NaN | NaN | Review of Ophiocara (Teleostei: Butidae) from ... | CYBIUM | ... | Science Citation Index Expanded (SCI-EXPANDED) | Zoology | ST1IC | NaN | NaN | NaN | NaN | 2022-12-21 | WOS:000662202900002 | 0 |
12683 | J | Kattge, J; Diaz, S; Lavorel, S; Prentice, C; L... | NaN | NaN | NaN | Kattge, J.; Diaz, S.; Lavorel, S.; Prentice, C... | NaN | NaN | TRY - a global database of plant traits | GLOBAL CHANGE BIOLOGY | ... | Science Citation Index Expanded (SCI-EXPANDED) | Biodiversity & Conservation; Environmental Sci... | 800WS | NaN | Green Published, Green Submitted, Bronze, Gree... | NaN | NaN | 2022-12-21 | WOS:000293399000011 | 0 |
12684 | J | Narasimhan, VM; Patterson, N; Moorjani, P; Roh... | NaN | NaN | NaN | Narasimhan, Vagheesh M.; Patterson, Nick; Moor... | NaN | NaN | The formation of human populations in South an... | SCIENCE | ... | Science Citation Index Expanded (SCI-EXPANDED)... | Science & Technology - Other Topics | IW1MK | 31488661.0 | Green Submitted, Green Accepted | Y | N | 2022-12-21 | WOS:000484732700038 | 0 |
12685 | J | Potzelsberger, E; Gossner, MM; Beenken, L; Gaz... | NaN | NaN | NaN | Poetzelsberger, Elisabeth; Gossner, Martin M.;... | NaN | NaN | Biotic threats for 23 major non-native tree sp... | SCIENTIFIC DATA | ... | Science Citation Index Expanded (SCI-EXPANDED) | Science & Technology - Other Topics | TY6PT | 34362931.0 | Green Published, gold | NaN | NaN | 2022-12-21 | WOS:000683905100002 | 0 |
5 rows × 72 columns
df.tail(10)
Publication Type | Authors | Book Authors | Book Editors | Book Group Authors | Author Full Names | Book Author Full Names | Group Authors | Article Title | Source Title | ... | Web of Science Index | Research Areas | IDS Number | Pubmed Id | Open Access Designations | Highly Cited Status | Hot Paper Status | Date of Export | UT (Unique WOS ID) | Web of Science Record | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12676 | J | Dorigo, W; Himmelbauer, I; Aberer, D; Schremme... | NaN | NaN | NaN | Dorigo, Wouter; Himmelbauer, Irene; Aberer, Da... | NaN | NaN | The International Soil Moisture Network: servi... | HYDROLOGY AND EARTH SYSTEM SCIENCES | ... | Science Citation Index Expanded (SCI-EXPANDED) | Geology; Water Resources | WU3PD | NaN | Green Submitted, gold, Green Published | Y | N | 2022-12-21 | WOS:000716459500001 | 0 |
12677 | J | Cooper, A; Turney, CSM; Palmer, J; Hogg, A; Mc... | NaN | NaN | NaN | Cooper, Alan; Turney, Chris S. M.; Palmer, Jon... | NaN | NaN | Response to Comment on A global environmental ... | SCIENCE | ... | Science Citation Index Expanded (SCI-EXPANDED)... | Science & Technology - Other Topics | XA6YJ | 34793228.0 | Green Accepted | NaN | NaN | 2022-12-21 | WOS:000720789200002 | 0 |
12678 | J | Cooper, A; Turney, CSM; Palmer, J; Hogg, A; Mc... | NaN | NaN | NaN | Cooper, Alan; Turney, Chris S. M.; Palmer, Jon... | NaN | NaN | Response to Comment on A global environmental ... | SCIENCE | ... | Science Citation Index Expanded (SCI-EXPANDED)... | Science & Technology - Other Topics | XA6YJ | 34793203.0 | Green Accepted | NaN | NaN | 2022-12-21 | WOS:000720789200004 | 0 |
12679 | J | Jennings, B | NaN | NaN | NaN | Jennings, Bruce | NaN | NaN | Solidarity and care as relational practices | BIOETHICS | ... | Science Citation Index Expanded (SCI-EXPANDED)... | Social Sciences - Other Topics; Medical Ethics... | HA5QZ | 30264873.0 | NaN | NaN | NaN | 2022-12-21 | WOS:000450332600003 | 0 |
12680 | J | Malaspinas, AS; Westaway, MC; Muller, C; Sousa... | NaN | NaN | NaN | Malaspinas, Anna-Sapfo; Westaway, Michael C.; ... | NaN | NaN | A genomic history of Aboriginal Australia | NATURE | ... | Science Citation Index Expanded (SCI-EXPANDED) | Science & Technology - Other Topics | EA5OS | 27654914.0 | Green Submitted | NaN | NaN | 2022-12-21 | WOS:000386671000038 | 0 |
12681 | J | Seregin, AP; Bochkov, DA; Shner, JV; Garin, EV... | NaN | NaN | NaN | Seregin, Alexey P.; Bochkov, Dmitriy A.; Shner... | NaN | NaN | Flora of Russia on iNaturalist: a dataset | BIODIVERSITY DATA JOURNAL | ... | Science Citation Index Expanded (SCI-EXPANDED) | Biodiversity & Conservation | OT0QT | 33244292.0 | Green Submitted, gold, Green Published | NaN | NaN | 2022-12-21 | WOS:000590560300001 | 0 |
12682 | J | Keith, P; Mennesson, MI | NaN | NaN | NaN | Keith, Philippe; Mennesson, Marion, I | NaN | NaN | Review of Ophiocara (Teleostei: Butidae) from ... | CYBIUM | ... | Science Citation Index Expanded (SCI-EXPANDED) | Zoology | ST1IC | NaN | NaN | NaN | NaN | 2022-12-21 | WOS:000662202900002 | 0 |
12683 | J | Kattge, J; Diaz, S; Lavorel, S; Prentice, C; L... | NaN | NaN | NaN | Kattge, J.; Diaz, S.; Lavorel, S.; Prentice, C... | NaN | NaN | TRY - a global database of plant traits | GLOBAL CHANGE BIOLOGY | ... | Science Citation Index Expanded (SCI-EXPANDED) | Biodiversity & Conservation; Environmental Sci... | 800WS | NaN | Green Published, Green Submitted, Bronze, Gree... | NaN | NaN | 2022-12-21 | WOS:000293399000011 | 0 |
12684 | J | Narasimhan, VM; Patterson, N; Moorjani, P; Roh... | NaN | NaN | NaN | Narasimhan, Vagheesh M.; Patterson, Nick; Moor... | NaN | NaN | The formation of human populations in South an... | SCIENCE | ... | Science Citation Index Expanded (SCI-EXPANDED)... | Science & Technology - Other Topics | IW1MK | 31488661.0 | Green Submitted, Green Accepted | Y | N | 2022-12-21 | WOS:000484732700038 | 0 |
12685 | J | Potzelsberger, E; Gossner, MM; Beenken, L; Gaz... | NaN | NaN | NaN | Poetzelsberger, Elisabeth; Gossner, Martin M.;... | NaN | NaN | Biotic threats for 23 major non-native tree sp... | SCIENTIFIC DATA | ... | Science Citation Index Expanded (SCI-EXPANDED) | Science & Technology - Other Topics | TY6PT | 34362931.0 | Green Published, gold | NaN | NaN | 2022-12-21 | WOS:000683905100002 | 0 |
10 rows × 72 columns
You can also slice the DataFrame to see a specific set of rows of data.
Say we want to look at just rows 100-104.
To do this we use the following format: [start_row : end_row]
You can even use a step counter if you want. Why? I have no idea, but if you do, follow this format:
[start_row : end_row : step]
df[100:101]
Publication Type | Authors | Book Authors | Book Editors | Book Group Authors | Author Full Names | Book Author Full Names | Group Authors | Article Title | Source Title | ... | Web of Science Index | Research Areas | IDS Number | Pubmed Id | Open Access Designations | Highly Cited Status | Hot Paper Status | Date of Export | UT (Unique WOS ID) | Web of Science Record | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100 | J | Stassen, C; Dommenget, D; Chadwick, R | NaN | NaN | NaN | Stassen, Christian; Dommenget, Dietmar; Chadwi... | NaN | NaN | Conceptual deconstruction of the simulated pre... | CLIMATE DYNAMICS | ... | Science Citation Index Expanded (SCI-EXPANDED) | Meteorology & Atmospheric Sciences | ML7MZ | NaN | Green Accepted | NaN | NaN | 2022-12-21 | WOS:000535336100001 | 0 |
1 rows × 72 columns
df[100:115:5]
Publication Type | Authors | Book Authors | Book Editors | Book Group Authors | Author Full Names | Book Author Full Names | Group Authors | Article Title | Source Title | ... | Web of Science Index | Research Areas | IDS Number | Pubmed Id | Open Access Designations | Highly Cited Status | Hot Paper Status | Date of Export | UT (Unique WOS ID) | Web of Science Record | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100 | J | Stassen, C; Dommenget, D; Chadwick, R | NaN | NaN | NaN | Stassen, Christian; Dommenget, Dietmar; Chadwi... | NaN | NaN | Conceptual deconstruction of the simulated pre... | CLIMATE DYNAMICS | ... | Science Citation Index Expanded (SCI-EXPANDED) | Meteorology & Atmospheric Sciences | ML7MZ | NaN | Green Accepted | NaN | NaN | 2022-12-21 | WOS:000535336100001 | 0 |
105 | J | Stucchi, L; Bignami, DF; Bocchiola, D; Del Cur... | NaN | NaN | NaN | Stucchi, Leonardo; Bignami, Daniele Fabrizio; ... | NaN | NaN | Assessment of Climate-Driven Flood Risk and Ad... | CLIMATE | ... | Emerging Sources Citation Index (ESCI) | Meteorology & Atmospheric Sciences | QN4FA | NaN | gold | NaN | NaN | 2022-12-21 | WOS:000622416300001 | 0 |
110 | J | Grose, MR; Narsey, S; Delage, FP; Dowdy, AJ; B... | NaN | NaN | NaN | Grose, M. R.; Narsey, S.; Delage, F. P.; Dowdy... | NaN | NaN | Insights From CMIP6 for Australia's Future Cli... | EARTHS FUTURE | ... | Science Citation Index Expanded (SCI-EXPANDED) | Environmental Sciences & Ecology; Geology; Met... | LU4NR | NaN | gold, Green Accepted, Green Published | Y | N | 2022-12-21 | WOS:000537734300001 | 0 |
3 rows × 72 columns
We can look at just a particular column in a similar manner, by specifying the column by its header name.
df['Article Title']
0 Representing nature: art and climate change 1 RED CROSS/RED CRESCENT CLIMATE CENTRE AND BALA... 2 The Contribution of Art to Climate Change Comm... 3 Climate change, adaptation and Eco-Art in Sing... 4 Facing climate injustices: Community trust-bui... ... 12681 Flora of Russia on iNaturalist: a dataset 12682 Review of Ophiocara (Teleostei: Butidae) from ... 12683 TRY - a global database of plant traits 12684 The formation of human populations in South an... 12685 Biotic threats for 23 major non-native tree sp... Name: Article Title, Length: 12686, dtype: object
df['Article Title'].head(3)
0 Representing nature: art and climate change 1 RED CROSS/RED CRESCENT CLIMATE CENTRE AND BALA... 2 The Contribution of Art to Climate Change Comm... Name: Article Title, dtype: object
We can even mix any of these methods.
As an example say we want to look at just rows 100-104 in the "Article Title" column
df['Article Title'][100:105]
100 Conceptual deconstruction of the simulated pre... 101 Art for a Future Planet Beyond Apocalypse 102 Assessing climate model projections: State of ... 103 Climate consensus: A multilevel study testing ... 104 Art, climate change and (other) eco materials:... Name: Article Title, dtype: object
You Try It: Look at the top 3 lines, using head() of the 'Abstract' column
Solution:
df['Abstract'].head(3)
0 Climate change is now an established scientifi... 1 The art! (sic) climate contest used art as a c... 2 This article examines how and why the climate ... Name: Abstract, dtype: object
We can find the dimensions of our data using the .shape attribute.
Note that unlike most Python functions, we do not use the () to access it. Pandas exposes this as a decorated property rather than a method.
Don't worry about what that means; that's a discussion for an intermediate to advanced class in Python.
df.shape
(12686, 72)
If we want just the number of rows of data we can slice it out.
df.shape[0]
12686
or just the number of columns
df.shape[1]
72
info() is a great function to help you begin to understand your data.
In essence it gives you the name of each column, the number of rows with non-null data, and its data type.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 12686 entries, 0 to 12685 Data columns (total 72 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Publication Type 12686 non-null object 1 Authors 12681 non-null object 2 Book Authors 58 non-null object 3 Book Editors 1244 non-null object 4 Book Group Authors 309 non-null object 5 Author Full Names 12681 non-null object 6 Book Author Full Names 58 non-null object 7 Group Authors 28 non-null object 8 Article Title 12686 non-null object 9 Source Title 12686 non-null object 10 Book Series Title 1182 non-null object 11 Book Series Subtitle 0 non-null float64 12 Language 12686 non-null object 13 Document Type 12686 non-null object 14 Conference Title 1518 non-null object 15 Conference Date 1518 non-null object 16 Conference Location 1518 non-null object 17 Conference Sponsor 1147 non-null object 18 Conference Host 256 non-null object 19 Author Keywords 8945 non-null object 20 Keywords Plus 10357 non-null object 21 Abstract 12356 non-null object 22 Addresses 12402 non-null object 23 Affiliations 11887 non-null object 24 Reprint Addresses 12388 non-null object 25 Email Addresses 11236 non-null object 26 Researcher Ids 7693 non-null object 27 ORCIDs 9244 non-null object 28 Funding Orgs 7732 non-null object 29 Funding Name Preferred 7714 non-null object 30 Funding Text 7607 non-null object 31 Cited References 0 non-null float64 32 Cited Reference Count 12686 non-null int64 33 Times Cited, WoS Core 12686 non-null int64 34 Times Cited, All Databases 12686 non-null int64 35 180 Day Usage Count 12686 non-null int64 36 Since 2013 Usage Count 12686 non-null int64 37 Publisher 12686 non-null object 38 Publisher City 12686 non-null object 39 Publisher Address 12686 non-null object 40 ISSN 10955 non-null object 41 eISSN 8921 non-null object 42 ISBN 1321 non-null object 43 Journal Abbreviation 12253 non-null object 44 Journal ISO Abbreviation 11165 non-null object 45 Publication Date 9661 non-null object 46 
Publication Year 12527 non-null float64 47 Volume 11501 non-null object 48 Issue 8053 non-null object 49 Part Number 181 non-null object 50 Supplement 77 non-null object 51 Special Issue 671 non-null object 52 Meeting Abstract 1 non-null object 53 Start Page 8959 non-null object 54 End Page 8959 non-null object 55 Article Number 3559 non-null object 56 DOI 11473 non-null object 57 DOI Link 11473 non-null float64 58 Book DOI 197 non-null object 59 Early Access Date 1249 non-null object 60 Number of Pages 12686 non-null int64 61 WoS Categories 12683 non-null object 62 Web of Science Index 12686 non-null object 63 Research Areas 12683 non-null object 64 IDS Number 12686 non-null object 65 Pubmed Id 1525 non-null float64 66 Open Access Designations 6455 non-null object 67 Highly Cited Status 317 non-null object 68 Hot Paper Status 317 non-null object 69 Date of Export 12686 non-null object 70 UT (Unique WOS ID) 12686 non-null object 71 Web of Science Record 12686 non-null int64 dtypes: float64(5), int64(7), object(60) memory usage: 7.0+ MB
describe() gives you a quick statistical summary of your data.
This can be really useful for many types of data.
However, for our data it's really not much help at all!
df.describe()
Book Series Subtitle | Cited References | Cited Reference Count | Times Cited, WoS Core | Times Cited, All Databases | 180 Day Usage Count | Since 2013 Usage Count | Publication Year | DOI Link | Number of Pages | Pubmed Id | Web of Science Record | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 0.0 | 0.0 | 12686.000000 | 12686.000000 | 12686.000000 | 12686.000000 | 12686.000000 | 12527.000000 | 11473.0 | 12686.000000 | 1.525000e+03 | 12686.0 |
mean | NaN | NaN | 66.851569 | 33.764071 | 34.873404 | 4.211729 | 32.252168 | 2016.078471 | 0.0 | 15.629040 | 2.955808e+07 | 0.0 |
std | NaN | NaN | 66.088802 | 238.393534 | 253.445908 | 11.660664 | 93.612961 | 5.758761 | 0.0 | 10.831938 | 5.745922e+06 | 0.0 |
min | NaN | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1956.000000 | 0.0 | 0.000000 | 1.832098e+06 | 0.0 |
25% | NaN | NaN | 31.000000 | 1.000000 | 1.000000 | 0.000000 | 4.000000 | 2013.000000 | 0.0 | 10.000000 | 2.637171e+07 | 0.0 |
50% | NaN | NaN | 53.000000 | 8.000000 | 8.000000 | 1.000000 | 12.000000 | 2018.000000 | 0.0 | 14.000000 | 3.120301e+07 | 0.0 |
75% | NaN | NaN | 81.000000 | 27.000000 | 28.000000 | 4.000000 | 31.000000 | 2020.000000 | 0.0 | 19.000000 | 3.407992e+07 | 0.0 |
max | NaN | NaN | 1295.000000 | 22838.000000 | 24472.000000 | 335.000000 | 4666.000000 | 2023.000000 | 0.0 | 359.000000 | 3.650743e+07 | 0.0 |
You Try It: Look at the stats, using describe, for the 'Number of Pages' column.
Solution:
df['Number of Pages'].describe()
count 12686.000000 mean 15.629040 std 10.831938 min 0.000000 25% 10.000000 50% 14.000000 75% 19.000000 max 359.000000 Name: Number of Pages, dtype: float64
Note: Look at the 'max' number of pages - that sucker will come back to haunt us later!!!
The last thing we will look at is creating a list of column header names.
This is actually a very valuable tip which you may find yourself using frequently, especially when you start plotting your data!
To do this we will make use of a wonderful Python and thus Pandas trick called chaining.
In essence what we will be doing is chaining several different functions together.
This not only saves a lot of coding time and lines of code, it can often execute faster.
What we need to do here is get the columns, get the value (which is a string representing the name) and then send that information to a list.
Furthermore, we will save the resultant list to a new variable which we will wisely, for now, call "header_names".
# List of all column names
header_names = df.columns.values.tolist()
header_names
['Publication Type', 'Authors', 'Book Authors', 'Book Editors', 'Book Group Authors', 'Author Full Names', 'Book Author Full Names', 'Group Authors', 'Article Title', 'Source Title', 'Book Series Title', 'Book Series Subtitle', 'Language', 'Document Type', 'Conference Title', 'Conference Date', 'Conference Location', 'Conference Sponsor', 'Conference Host', 'Author Keywords', 'Keywords Plus', 'Abstract', 'Addresses', 'Affiliations', 'Reprint Addresses', 'Email Addresses', 'Researcher Ids', 'ORCIDs', 'Funding Orgs', 'Funding Name Preferred', 'Funding Text', 'Cited References', 'Cited Reference Count', 'Times Cited, WoS Core', 'Times Cited, All Databases', '180 Day Usage Count', 'Since 2013 Usage Count', 'Publisher', 'Publisher City', 'Publisher Address', 'ISSN', 'eISSN', 'ISBN', 'Journal Abbreviation', 'Journal ISO Abbreviation', 'Publication Date', 'Publication Year', 'Volume', 'Issue', 'Part Number', 'Supplement', 'Special Issue', 'Meeting Abstract', 'Start Page', 'End Page', 'Article Number', 'DOI', 'DOI Link', 'Book DOI', 'Early Access Date', 'Number of Pages', 'WoS Categories', 'Web of Science Index', 'Research Areas', 'IDS Number', 'Pubmed Id', 'Open Access Designations', 'Highly Cited Status', 'Hot Paper Status', 'Date of Export', 'UT (Unique WOS ID)', 'Web of Science Record']
There are numerous other methods we can use to gain meaningful insight into our data, and we shall explore more of those as we proceed today.
For now it's time to stop wasting time and look at how to clean the data and extract new data that we want, for reasons we will explore as we go.
Now that we have our data all in one nice DataFrame, and safely saved, it is now time to start cleaning and munging the data.
The next step is setting just the columns we want to work with, we have 72 and don't need them all.
There are two basic ways of getting a DataFrame with just the columns we want.
We will look at both ways!
First we will just create a new DataFrame which extracts the columns of interest.
How did I create the following lists?
Remember our handy variable we created called header_names?
Well, I just copied, pasted, and edited!
keeper_cols = [
'Publication Type',
'Authors',
'Author Full Names',
'Article Title',
'Source Title',
'Book Series Title',
'Book Series Subtitle',
'Language',
'Document Type',
'Conference Title',
'Conference Date',
'Conference Location',
'Author Keywords',
'Keywords Plus',
'Abstract',
'Addresses',
'Affiliations',
'Email Addresses',
'Funding Orgs',
'Funding Name Preferred',
'Cited Reference Count',
'Times Cited, WoS Core',
'Times Cited, All Databases',
'180 Day Usage Count',
'Since 2013 Usage Count',
'Publisher',
'Publisher City',
'Journal Abbreviation',
'Journal ISO Abbreviation',
'Publication Date',
'Publication Year',
'Volume',
'Issue',
'Start Page',
'DOI',
'Number of Pages',
'WoS Categories',
'Web of Science Index',
'Research Areas']
df_keepers = df[keeper_cols]
df_keepers.head()
Publication Type | Authors | Author Full Names | Article Title | Source Title | Book Series Title | Book Series Subtitle | Language | Document Type | Conference Title | ... | Publication Date | Publication Year | Volume | Issue | Start Page | DOI | Number of Pages | WoS Categories | Web of Science Index | Research Areas | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | J | Miles, M | Miles, Malcolm | Representing nature: art and climate change | CULTURAL GEOGRAPHIES | NaN | NaN | English | Article | NaN | ... | JAN | 2010.0 | 17.0 | 1 | 19 | 10.1177/1474474009349997 | 17 | Environmental Studies; Geography | Social Science Citation Index (SSCI); Arts &am... | Environmental Sciences & Ecology; Geography |
1 | J | Dal Farra, R; Suarez, P | Dal Farra, Ricardo; Suarez, Pablo | RED CROSS/RED CRESCENT CLIMATE CENTRE AND BALA... | LEONARDO | NaN | NaN | English | Article | NaN | ... | OCT | 2014.0 | 47.0 | 5 | 493 | 10.1162/LEON_a_00818 | 1 | Art | Arts & Humanities Citation Index (A&HCI) | Art |
2 | J | Chen, MH | Chen, Mei-Hsin | The Contribution of Art to Climate Change Comm... | REVISTA HUMANIDADES | NaN | NaN | Spanish | Article | NaN | ... | JUL-DEC | 2022.0 | 12.0 | 2 | NaN | 10.15517/h.v12i2.51060 | 18 | Humanities, Multidisciplinary | Emerging Sources Citation Index (ESCI) | Arts & Humanities - Other Topics |
3 | J | Guy, S; Henshaw, V; Heidrich, O | Guy, Simon; Henshaw, Victoria; Heidrich, Oliver | Climate change, adaptation and Eco-Art in Sing... | JOURNAL OF ENVIRONMENTAL PLANNING AND MANAGEMENT | NaN | NaN | English | Article | NaN | ... | JAN 2 | 2015.0 | 58.0 | 1 | 39 | 10.1080/09640568.2013.839446 | 16 | Development Studies; Regional & Urban Planning | Social Science Citation Index (SSCI); Arts &am... | Development Studies; Public Administration |
4 | J | Baztan, J; Vanderlinden, JP; Jaffres, L; Jorge... | Baztan, Juan; Vanderlinden, Jean-Paul; Jaffres... | Facing climate injustices: Community trust-bui... | CLIMATE RISK MANAGEMENT | NaN | NaN | English | Article | NaN | ... | NaN | 2020.0 | 30.0 | NaN | NaN | 10.1016/j.crm.2020.100253 | 15 | Environmental Sciences; Environmental Studies;... | Science Citation Index Expanded (SCI-EXPANDED)... | Environmental Sciences & Ecology; Meteorology ... |
5 rows × 39 columns
df_keepers.shape
(12686, 39)
QUESTION: Why did I bother creating a new DataFrame? (Yes, this is really a trick question.)
Answer
Because I also want to show you the drop() process and don't want to have to rerun numerous cells to get back to this point.
Meaning we could have just reassigned df = df blah blah blah.
Likewise we can simply remove/drop them using the drop() function, like so.
There is an important bit of witchery in Pandas: there are numerous times when you are extremely wise to tell Pandas what you actually want to work against, rows or columns!!!
To do this we use a special parameter called axis, and it has two expected states: a 0 (rows) or a 1 (columns).
For those that have taken linear algebra, you have probably rehearsed row, column, row, column, row, column in your head to remember the order of events for any operation you desire to make (well, unless you're coding in Fortran!). For Pandas just remember 0, 1, 0, 1, 0, 1 :)
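To see the axis parameter in action, here is a tiny sketch using a made-up throwaway DataFrame (the toy data is purely illustrative):

```python
import pandas as pd

# A throwaway DataFrame purely to demonstrate the axis parameter.
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

rows_dropped = toy.drop(0, axis=0)    # axis=0: drop the ROW with index label 0
cols_dropped = toy.drop('b', axis=1)  # axis=1: drop the COLUMN named 'b'

print(rows_dropped.shape)  # (2, 2)
print(cols_dropped.shape)  # (3, 1)
```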
Note this time we will go ahead and just commit this to our active DataFrame (df) using the handy inplace=True parameter.
delete_cols = [
'Book Authors',
'Book Editors',
'Book Group Authors',
'Book Author Full Names',
'Group Authors',
'Conference Sponsor',
'Conference Host',
'Reprint Addresses',
'Researcher Ids',
'ORCIDs',
'Funding Text',
'Cited References',
'Publisher Address',
'ISSN',
'eISSN',
'ISBN',
'Part Number',
'Supplement',
'Special Issue',
'Meeting Abstract',
'End Page',
'Article Number',
'DOI Link',
'Book DOI',
'Early Access Date',
'IDS Number',
'Pubmed Id',
'Open Access Designations',
'Highly Cited Status',
'Hot Paper Status',
'Date of Export',
'UT (Unique WOS ID)',
'Web of Science Record']
df.drop(delete_cols, axis=1, inplace=True)
df.head()
Publication Type | Authors | Author Full Names | Article Title | Source Title | Book Series Title | Book Series Subtitle | Language | Document Type | Conference Title | ... | Publication Date | Publication Year | Volume | Issue | Start Page | DOI | Number of Pages | WoS Categories | Web of Science Index | Research Areas | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | J | Miles, M | Miles, Malcolm | Representing nature: art and climate change | CULTURAL GEOGRAPHIES | NaN | NaN | English | Article | NaN | ... | JAN | 2010.0 | 17.0 | 1 | 19 | 10.1177/1474474009349997 | 17 | Environmental Studies; Geography | Social Science Citation Index (SSCI); Arts &am... | Environmental Sciences & Ecology; Geography |
1 | J | Dal Farra, R; Suarez, P | Dal Farra, Ricardo; Suarez, Pablo | RED CROSS/RED CRESCENT CLIMATE CENTRE AND BALA... | LEONARDO | NaN | NaN | English | Article | NaN | ... | OCT | 2014.0 | 47.0 | 5 | 493 | 10.1162/LEON_a_00818 | 1 | Art | Arts & Humanities Citation Index (A&HCI) | Art |
2 | J | Chen, MH | Chen, Mei-Hsin | The Contribution of Art to Climate Change Comm... | REVISTA HUMANIDADES | NaN | NaN | Spanish | Article | NaN | ... | JUL-DEC | 2022.0 | 12.0 | 2 | NaN | 10.15517/h.v12i2.51060 | 18 | Humanities, Multidisciplinary | Emerging Sources Citation Index (ESCI) | Arts & Humanities - Other Topics |
3 | J | Guy, S; Henshaw, V; Heidrich, O | Guy, Simon; Henshaw, Victoria; Heidrich, Oliver | Climate change, adaptation and Eco-Art in Sing... | JOURNAL OF ENVIRONMENTAL PLANNING AND MANAGEMENT | NaN | NaN | English | Article | NaN | ... | JAN 2 | 2015.0 | 58.0 | 1 | 39 | 10.1080/09640568.2013.839446 | 16 | Development Studies; Regional & Urban Planning | Social Science Citation Index (SSCI); Arts &am... | Development Studies; Public Administration |
4 | J | Baztan, J; Vanderlinden, JP; Jaffres, L; Jorge... | Baztan, Juan; Vanderlinden, Jean-Paul; Jaffres... | Facing climate injustices: Community trust-bui... | CLIMATE RISK MANAGEMENT | NaN | NaN | English | Article | NaN | ... | NaN | 2020.0 | 30.0 | NaN | NaN | 10.1016/j.crm.2020.100253 | 15 | Environmental Sciences; Environmental Studies;... | Science Citation Index Expanded (SCI-EXPANDED)... | Environmental Sciences & Ecology; Meteorology ... |
5 rows × 39 columns
QUESTION: Which is the best method to use?
ANSWER
Easy: the one that is fastest/easiest for you to generate your list of columns!
df.columns.values.tolist()
The first column of data is called "Publication Type" and uses symbols to represent the various types of publication found in the Web of Science.
The definitions for the symbols are as follows.
You Try It: Take a look at the 'Publication Type' column and add on the .unique() function just like you would with .head().
We will come back and discuss .unique() in just a couple of minutes.
df['Publication Type'].unique()
array(['J', 'C', 'S', 'B'], dtype=object)
Solution:
df['Publication Type'].unique()
array(['J', 'C', 'S', 'B'], dtype=object)
Publication Types:
B = Book
J = Journal
P = Patent
S = Book in Series
Source: https://images.webofknowledge.com/images/help/WOS/hs_wos_fieldtags.html
What we would like to do is to replace those symbols with their actual meaning.
In the non-Pandas world we would utilize the insane power of regex.
Insane, you ask?
Yes, because it drives you insane trying to figure out what parameterizations you need to get things done.
I've used it for decades and still find myself googling, or opening a massive book just on regex which I have had since before many of you were born!
But Pandas often allows us to bypass the need for regex. In slightly more complex cases we can use Pandas with its built-in regex support, or even with pure regex itself (just make sure to bring a comfy, wrap-around, strap-on white jacket!)
For our purposes we will create a dictionary to aid us.
For those new to this, a dictionary is one of the most powerful data structures in Python, next to a list itself.
A dictionary is a specialized collection which maps a key to an associated value.
Note: while each key must be unique, a key's value can hold multiple items. Meaning a key can have a value which consists of a list, a tuple, or even another dictionary!!!
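For example, here's a quick sketch of that fancier case, using made-up field and journal names purely for illustration:

```python
# Illustrative only: each key maps to a LIST of values.
journals_by_field = {
    'ecology': ['GLOBAL CHANGE BIOLOGY', 'BIODIVERSITY DATA JOURNAL'],
    'art': ['LEONARDO'],
}

print(journals_by_field['ecology'])     # the whole list for that key
print(journals_by_field['ecology'][0])  # GLOBAL CHANGE BIOLOGY
```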
We will stay simple and create a dictionary as such symbols (the key) and their meaning (the value).
pubtype = {'B':'Book', 'J':'Journal', 'P':'Patent', 'S':'Series'}
You can now find an associated value from a desired key. Likewise you can find a key from an associated value, although that takes a little more work.
We will make use of this by iterating through the keys in our dictionary and using Pandas replace() function to make our desired changes.
To find a value from a key we just slice it from the dictionary by the key of interest, like this;
myvalue = dict_name['desired_key_name']
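To make the lookup concrete, here is a tiny self-contained sketch using our pubtype mapping (toy code, separate from the DataFrame):

```python
pubtype = {'B': 'Book', 'J': 'Journal', 'P': 'Patent', 'S': 'Series'}

# Slice a value out by its key:
myvalue = pubtype['J']
print(myvalue)  # Journal

# .get() is the forgiving cousin: it hands back a default instead of
# raising a KeyError when the key is missing.
print(pubtype.get('Z', 'Unknown'))  # Unknown
```

Asking for a key that isn't there with plain square brackets raises a KeyError, which is why .get() is worth knowing about.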
for key in pubtype:
    print(f'key: {key} value: {pubtype[key]}') # This is here just for show and tell!
    df['Publication Type'].replace(key, pubtype[key], inplace=True)
key: B value: Book key: J value: Journal key: P value: Patent key: S value: Series
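As an aside, replace() also accepts an entire dictionary in one call, which removes the Python loop entirely. A minimal sketch on a toy Series (not our real DataFrame):

```python
import pandas as pd

# Toy Series standing in for our 'Publication Type' column.
s = pd.Series(['J', 'C', 'S', 'B', 'J'])
pubtype = {'B': 'Book', 'J': 'Journal', 'P': 'Patent', 'S': 'Series'}

# replace() happily takes the whole mapping at once.
# Symbols not in the mapping pass through untouched.
result = s.replace(pubtype)
print(result.tolist())  # ['Journal', 'C', 'Series', 'Book', 'Journal']
```

Either style produces the same result here; the loop version just makes each substitution easier to watch.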
Are we good to go? Let's take a look and find out.
To do this we will use a handy Pandas function called unique().
This will catalog all the unique occurrences of items in whatever part of the DataFrame, row, or column we desire to look at.
df['Publication Type'].unique()
array(['Journal', 'C', 'Series', 'Book'], dtype=object)
So now we see that there is some mysterious 'C' symbol which is NOT listed in the Web of Science documentation!
Is this just a typo?
To check this we would like to know the count for each unique occurrence in our column.
To do that we use the Pandas value_counts() function.
df['Publication Type'].value_counts()
Journal 11057 C 1269 Series 196 Book 164 Name: Publication Type, dtype: int64
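If you also want the share each symbol takes of the whole column, value_counts() has a handy normalize parameter. A quick sketch on toy data:

```python
import pandas as pd

# Toy column with the same flavor of imbalance we just saw.
s = pd.Series(['Journal', 'Journal', 'Journal', 'C', 'Series'])

print(s.value_counts())                # raw counts, sorted descending
print(s.value_counts(normalize=True))  # the same thing, as proportions
```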
So not only is it not a typo, it's the second most used symbol.
In the real world we would have to investigate this further, go back and adjust our dictionary accordingly, and rerun that bit of code.
But let's ignore this, at least for now, and move on to a vastly dirtier bit of data munging.
First let us be safe and save our DataFrame to our csv file.
proc_dir = 'Processed_data'
df.to_csv(os.path.join(proc_dir, 'Combined_lists.csv'), index=False)
Extracting Location Data
For purposes we will utilize later, we want to create a new column which contains the working countries of the authors.
The column we will use for this is the "Addresses" column.
proc_dir = 'Processed_data'
df = pd.read_csv(os.path.join(proc_dir, 'Combined_lists.csv'))
#df.head()
df['Addresses'].head(8)
0 Univ Plymouth, Sch Art & Media, Plymouth PL4 8... 1 [Dal Farra, Ricardo] Concordia Univ, Hexagram,... 2 [Chen, Mei-Hsin] Univ Navarra, Pamplona, Spain 3 [Guy, Simon] Univ Manchester, Sch Environm & D... 4 [Baztan, Juan; Vanderlinden, Jean-Paul; Zhu, Z... 5 [Burke, Miriam] Royal Holloway Univ London, De... 6 [Roedder, Simone] Univ Hamburg, Inst Sociol, D... 7 [Bentz, Julia] Univ Lisbon, Fac Sci, Ctr Ecol ... Name: Addresses, dtype: object
Let's take a closer look at the first entry, which will demonstrate a massive problem really quickly!
df['Addresses'][0]
'Univ Plymouth, Sch Art & Media, Plymouth PL4 8AA, Devon, England'
Now let's look at an even more challenging example
df['Addresses'][118]
'[Mauritsen, Thorsten] Max Planck Inst Meteorol, D-20146 Hamburg, Germany; [Graversen, Rune G.] Stockholm Univ, Dept Meteorol, S-10691 Stockholm, Sweden; [Klocke, Daniel] European Ctr Medium Range Weather Forecasts, Reading RG2 9AX, Berks, England; [Langen, Peter L.; Stevens, Bjorn; Tomassini, Lorenzo] DMI, Copenhagen, Denmark'
So our issue is, well one of many, that country names are embedded throughout each cell and somehow we need to extract them.
pycountry is a handy package which provides an ISO-compliant dataset of all countries.
Let me repeat that, it's an ISO-compliant set of country information! Trust me, this will become important shortly!!!
Let's first look at what pycountry has available for us so we know how to work with it.
for c in pycountry.countries:
    print(c)
Country(alpha_2='AW', alpha_3='ABW', name='Aruba', numeric='533') Country(alpha_2='AF', alpha_3='AFG', name='Afghanistan', numeric='004', official_name='Islamic Republic of Afghanistan') Country(alpha_2='AO', alpha_3='AGO', name='Angola', numeric='024', official_name='Republic of Angola') Country(alpha_2='AI', alpha_3='AIA', name='Anguilla', numeric='660') Country(alpha_2='AX', alpha_3='ALA', name='Åland Islands', numeric='248') Country(alpha_2='AL', alpha_3='ALB', name='Albania', numeric='008', official_name='Republic of Albania') Country(alpha_2='AD', alpha_3='AND', name='Andorra', numeric='020', official_name='Principality of Andorra') Country(alpha_2='AE', alpha_3='ARE', name='United Arab Emirates', numeric='784') Country(alpha_2='AR', alpha_3='ARG', name='Argentina', numeric='032', official_name='Argentine Republic') Country(alpha_2='AM', alpha_3='ARM', name='Armenia', numeric='051', official_name='Republic of Armenia') Country(alpha_2='AS', alpha_3='ASM', name='American Samoa', numeric='016') Country(alpha_2='AQ', alpha_3='ATA', name='Antarctica', numeric='010') Country(alpha_2='TF', alpha_3='ATF', name='French Southern Territories', numeric='260') Country(alpha_2='AG', alpha_3='ATG', name='Antigua and Barbuda', numeric='028') Country(alpha_2='AU', alpha_3='AUS', name='Australia', numeric='036') Country(alpha_2='AT', alpha_3='AUT', name='Austria', numeric='040', official_name='Republic of Austria') Country(alpha_2='AZ', alpha_3='AZE', name='Azerbaijan', numeric='031', official_name='Republic of Azerbaijan') Country(alpha_2='BI', alpha_3='BDI', name='Burundi', numeric='108', official_name='Republic of Burundi') Country(alpha_2='BE', alpha_3='BEL', name='Belgium', numeric='056', official_name='Kingdom of Belgium') Country(alpha_2='BJ', alpha_3='BEN', name='Benin', numeric='204', official_name='Republic of Benin') Country(alpha_2='BQ', alpha_3='BES', name='Bonaire, Sint Eustatius and Saba', numeric='535', official_name='Bonaire, Sint Eustatius and Saba') 
Country(alpha_2='BF', alpha_3='BFA', name='Burkina Faso', numeric='854') Country(alpha_2='BD', alpha_3='BGD', name='Bangladesh', numeric='050', official_name="People's Republic of Bangladesh") Country(alpha_2='BG', alpha_3='BGR', name='Bulgaria', numeric='100', official_name='Republic of Bulgaria') Country(alpha_2='BH', alpha_3='BHR', name='Bahrain', numeric='048', official_name='Kingdom of Bahrain') Country(alpha_2='BS', alpha_3='BHS', name='Bahamas', numeric='044', official_name='Commonwealth of the Bahamas') Country(alpha_2='BA', alpha_3='BIH', name='Bosnia and Herzegovina', numeric='070', official_name='Republic of Bosnia and Herzegovina') Country(alpha_2='BL', alpha_3='BLM', name='Saint Barthélemy', numeric='652') Country(alpha_2='BY', alpha_3='BLR', name='Belarus', numeric='112', official_name='Republic of Belarus') Country(alpha_2='BZ', alpha_3='BLZ', name='Belize', numeric='084') Country(alpha_2='BM', alpha_3='BMU', name='Bermuda', numeric='060') Country(alpha_2='BO', alpha_3='BOL', common_name='Bolivia', name='Bolivia, Plurinational State of', numeric='068', official_name='Plurinational State of Bolivia') Country(alpha_2='BR', alpha_3='BRA', name='Brazil', numeric='076', official_name='Federative Republic of Brazil') Country(alpha_2='BB', alpha_3='BRB', name='Barbados', numeric='052') Country(alpha_2='BN', alpha_3='BRN', name='Brunei Darussalam', numeric='096') Country(alpha_2='BT', alpha_3='BTN', name='Bhutan', numeric='064', official_name='Kingdom of Bhutan') Country(alpha_2='BV', alpha_3='BVT', name='Bouvet Island', numeric='074') Country(alpha_2='BW', alpha_3='BWA', name='Botswana', numeric='072', official_name='Republic of Botswana') Country(alpha_2='CF', alpha_3='CAF', name='Central African Republic', numeric='140') Country(alpha_2='CA', alpha_3='CAN', name='Canada', numeric='124') Country(alpha_2='CC', alpha_3='CCK', name='Cocos (Keeling) Islands', numeric='166') Country(alpha_2='CH', alpha_3='CHE', name='Switzerland', numeric='756', 
official_name='Swiss Confederation') Country(alpha_2='CL', alpha_3='CHL', name='Chile', numeric='152', official_name='Republic of Chile') Country(alpha_2='CN', alpha_3='CHN', name='China', numeric='156', official_name="People's Republic of China") Country(alpha_2='CI', alpha_3='CIV', name="Côte d'Ivoire", numeric='384', official_name="Republic of Côte d'Ivoire") Country(alpha_2='CM', alpha_3='CMR', name='Cameroon', numeric='120', official_name='Republic of Cameroon') Country(alpha_2='CD', alpha_3='COD', name='Congo, The Democratic Republic of the', numeric='180') Country(alpha_2='CG', alpha_3='COG', name='Congo', numeric='178', official_name='Republic of the Congo') Country(alpha_2='CK', alpha_3='COK', name='Cook Islands', numeric='184') Country(alpha_2='CO', alpha_3='COL', name='Colombia', numeric='170', official_name='Republic of Colombia') Country(alpha_2='KM', alpha_3='COM', name='Comoros', numeric='174', official_name='Union of the Comoros') Country(alpha_2='CV', alpha_3='CPV', name='Cabo Verde', numeric='132', official_name='Republic of Cabo Verde') Country(alpha_2='CR', alpha_3='CRI', name='Costa Rica', numeric='188', official_name='Republic of Costa Rica') Country(alpha_2='CU', alpha_3='CUB', name='Cuba', numeric='192', official_name='Republic of Cuba') Country(alpha_2='CW', alpha_3='CUW', name='Curaçao', numeric='531', official_name='Curaçao') Country(alpha_2='CX', alpha_3='CXR', name='Christmas Island', numeric='162') Country(alpha_2='KY', alpha_3='CYM', name='Cayman Islands', numeric='136') Country(alpha_2='CY', alpha_3='CYP', name='Cyprus', numeric='196', official_name='Republic of Cyprus') Country(alpha_2='CZ', alpha_3='CZE', name='Czechia', numeric='203', official_name='Czech Republic') Country(alpha_2='DE', alpha_3='DEU', name='Germany', numeric='276', official_name='Federal Republic of Germany') Country(alpha_2='DJ', alpha_3='DJI', name='Djibouti', numeric='262', official_name='Republic of Djibouti') Country(alpha_2='DM', alpha_3='DMA', 
name='Dominica', numeric='212', official_name='Commonwealth of Dominica') Country(alpha_2='DK', alpha_3='DNK', name='Denmark', numeric='208', official_name='Kingdom of Denmark') Country(alpha_2='DO', alpha_3='DOM', name='Dominican Republic', numeric='214') Country(alpha_2='DZ', alpha_3='DZA', name='Algeria', numeric='012', official_name="People's Democratic Republic of Algeria") Country(alpha_2='EC', alpha_3='ECU', name='Ecuador', numeric='218', official_name='Republic of Ecuador') Country(alpha_2='EG', alpha_3='EGY', name='Egypt', numeric='818', official_name='Arab Republic of Egypt') Country(alpha_2='ER', alpha_3='ERI', name='Eritrea', numeric='232', official_name='the State of Eritrea') Country(alpha_2='EH', alpha_3='ESH', name='Western Sahara', numeric='732') Country(alpha_2='ES', alpha_3='ESP', name='Spain', numeric='724', official_name='Kingdom of Spain') Country(alpha_2='EE', alpha_3='EST', name='Estonia', numeric='233', official_name='Republic of Estonia') Country(alpha_2='ET', alpha_3='ETH', name='Ethiopia', numeric='231', official_name='Federal Democratic Republic of Ethiopia') Country(alpha_2='FI', alpha_3='FIN', name='Finland', numeric='246', official_name='Republic of Finland') Country(alpha_2='FJ', alpha_3='FJI', name='Fiji', numeric='242', official_name='Republic of Fiji') Country(alpha_2='FK', alpha_3='FLK', name='Falkland Islands (Malvinas)', numeric='238') Country(alpha_2='FR', alpha_3='FRA', name='France', numeric='250', official_name='French Republic') Country(alpha_2='FO', alpha_3='FRO', name='Faroe Islands', numeric='234') Country(alpha_2='FM', alpha_3='FSM', name='Micronesia, Federated States of', numeric='583', official_name='Federated States of Micronesia') Country(alpha_2='GA', alpha_3='GAB', name='Gabon', numeric='266', official_name='Gabonese Republic') Country(alpha_2='GB', alpha_3='GBR', name='United Kingdom', numeric='826', official_name='United Kingdom of Great Britain and Northern Ireland') Country(alpha_2='GE', alpha_3='GEO', 
name='Georgia', numeric='268') Country(alpha_2='GG', alpha_3='GGY', name='Guernsey', numeric='831') Country(alpha_2='GH', alpha_3='GHA', name='Ghana', numeric='288', official_name='Republic of Ghana') Country(alpha_2='GI', alpha_3='GIB', name='Gibraltar', numeric='292') Country(alpha_2='GN', alpha_3='GIN', name='Guinea', numeric='324', official_name='Republic of Guinea') Country(alpha_2='GP', alpha_3='GLP', name='Guadeloupe', numeric='312') Country(alpha_2='GM', alpha_3='GMB', name='Gambia', numeric='270', official_name='Republic of the Gambia') Country(alpha_2='GW', alpha_3='GNB', name='Guinea-Bissau', numeric='624', official_name='Republic of Guinea-Bissau') Country(alpha_2='GQ', alpha_3='GNQ', name='Equatorial Guinea', numeric='226', official_name='Republic of Equatorial Guinea') Country(alpha_2='GR', alpha_3='GRC', name='Greece', numeric='300', official_name='Hellenic Republic') Country(alpha_2='GD', alpha_3='GRD', name='Grenada', numeric='308') Country(alpha_2='GL', alpha_3='GRL', name='Greenland', numeric='304') Country(alpha_2='GT', alpha_3='GTM', name='Guatemala', numeric='320', official_name='Republic of Guatemala') Country(alpha_2='GF', alpha_3='GUF', name='French Guiana', numeric='254') Country(alpha_2='GU', alpha_3='GUM', name='Guam', numeric='316') Country(alpha_2='GY', alpha_3='GUY', name='Guyana', numeric='328', official_name='Republic of Guyana') Country(alpha_2='HK', alpha_3='HKG', name='Hong Kong', numeric='344', official_name='Hong Kong Special Administrative Region of China') Country(alpha_2='HM', alpha_3='HMD', name='Heard Island and McDonald Islands', numeric='334') Country(alpha_2='HN', alpha_3='HND', name='Honduras', numeric='340', official_name='Republic of Honduras') Country(alpha_2='HR', alpha_3='HRV', name='Croatia', numeric='191', official_name='Republic of Croatia') Country(alpha_2='HT', alpha_3='HTI', name='Haiti', numeric='332', official_name='Republic of Haiti') Country(alpha_2='HU', alpha_3='HUN', name='Hungary', numeric='348', 
official_name='Hungary') Country(alpha_2='ID', alpha_3='IDN', name='Indonesia', numeric='360', official_name='Republic of Indonesia') Country(alpha_2='IM', alpha_3='IMN', name='Isle of Man', numeric='833') Country(alpha_2='IN', alpha_3='IND', name='India', numeric='356', official_name='Republic of India') Country(alpha_2='IO', alpha_3='IOT', name='British Indian Ocean Territory', numeric='086') Country(alpha_2='IE', alpha_3='IRL', name='Ireland', numeric='372') Country(alpha_2='IR', alpha_3='IRN', name='Iran, Islamic Republic of', numeric='364', official_name='Islamic Republic of Iran') Country(alpha_2='IQ', alpha_3='IRQ', name='Iraq', numeric='368', official_name='Republic of Iraq') Country(alpha_2='IS', alpha_3='ISL', name='Iceland', numeric='352', official_name='Republic of Iceland') Country(alpha_2='IL', alpha_3='ISR', name='Israel', numeric='376', official_name='State of Israel') Country(alpha_2='IT', alpha_3='ITA', name='Italy', numeric='380', official_name='Italian Republic') Country(alpha_2='JM', alpha_3='JAM', name='Jamaica', numeric='388') Country(alpha_2='JE', alpha_3='JEY', name='Jersey', numeric='832') Country(alpha_2='JO', alpha_3='JOR', name='Jordan', numeric='400', official_name='Hashemite Kingdom of Jordan') Country(alpha_2='JP', alpha_3='JPN', name='Japan', numeric='392') Country(alpha_2='KZ', alpha_3='KAZ', name='Kazakhstan', numeric='398', official_name='Republic of Kazakhstan') Country(alpha_2='KE', alpha_3='KEN', name='Kenya', numeric='404', official_name='Republic of Kenya') Country(alpha_2='KG', alpha_3='KGZ', name='Kyrgyzstan', numeric='417', official_name='Kyrgyz Republic') Country(alpha_2='KH', alpha_3='KHM', name='Cambodia', numeric='116', official_name='Kingdom of Cambodia') Country(alpha_2='KI', alpha_3='KIR', name='Kiribati', numeric='296', official_name='Republic of Kiribati') Country(alpha_2='KN', alpha_3='KNA', name='Saint Kitts and Nevis', numeric='659') Country(alpha_2='KR', alpha_3='KOR', name='Korea, Republic of', 
numeric='410') Country(alpha_2='KW', alpha_3='KWT', name='Kuwait', numeric='414', official_name='State of Kuwait') Country(alpha_2='LA', alpha_3='LAO', name="Lao People's Democratic Republic", numeric='418') Country(alpha_2='LB', alpha_3='LBN', name='Lebanon', numeric='422', official_name='Lebanese Republic') Country(alpha_2='LR', alpha_3='LBR', name='Liberia', numeric='430', official_name='Republic of Liberia') Country(alpha_2='LY', alpha_3='LBY', name='Libya', numeric='434', official_name='Libya') Country(alpha_2='LC', alpha_3='LCA', name='Saint Lucia', numeric='662') Country(alpha_2='LI', alpha_3='LIE', name='Liechtenstein', numeric='438', official_name='Principality of Liechtenstein') Country(alpha_2='LK', alpha_3='LKA', name='Sri Lanka', numeric='144', official_name='Democratic Socialist Republic of Sri Lanka') Country(alpha_2='LS', alpha_3='LSO', name='Lesotho', numeric='426', official_name='Kingdom of Lesotho') Country(alpha_2='LT', alpha_3='LTU', name='Lithuania', numeric='440', official_name='Republic of Lithuania') Country(alpha_2='LU', alpha_3='LUX', name='Luxembourg', numeric='442', official_name='Grand Duchy of Luxembourg') Country(alpha_2='LV', alpha_3='LVA', name='Latvia', numeric='428', official_name='Republic of Latvia') Country(alpha_2='MO', alpha_3='MAC', name='Macao', numeric='446', official_name='Macao Special Administrative Region of China') Country(alpha_2='MF', alpha_3='MAF', name='Saint Martin (French part)', numeric='663') Country(alpha_2='MA', alpha_3='MAR', name='Morocco', numeric='504', official_name='Kingdom of Morocco') Country(alpha_2='MC', alpha_3='MCO', name='Monaco', numeric='492', official_name='Principality of Monaco') Country(alpha_2='MD', alpha_3='MDA', common_name='Moldova', name='Moldova, Republic of', numeric='498', official_name='Republic of Moldova') Country(alpha_2='MG', alpha_3='MDG', name='Madagascar', numeric='450', official_name='Republic of Madagascar') Country(alpha_2='MV', alpha_3='MDV', name='Maldives', 
numeric='462', official_name='Republic of Maldives') Country(alpha_2='MX', alpha_3='MEX', name='Mexico', numeric='484', official_name='United Mexican States') Country(alpha_2='MH', alpha_3='MHL', name='Marshall Islands', numeric='584', official_name='Republic of the Marshall Islands') Country(alpha_2='MK', alpha_3='MKD', name='North Macedonia', numeric='807', official_name='Republic of North Macedonia') Country(alpha_2='ML', alpha_3='MLI', name='Mali', numeric='466', official_name='Republic of Mali') Country(alpha_2='MT', alpha_3='MLT', name='Malta', numeric='470', official_name='Republic of Malta') Country(alpha_2='MM', alpha_3='MMR', name='Myanmar', numeric='104', official_name='Republic of Myanmar') Country(alpha_2='ME', alpha_3='MNE', name='Montenegro', numeric='499', official_name='Montenegro') Country(alpha_2='MN', alpha_3='MNG', name='Mongolia', numeric='496') Country(alpha_2='MP', alpha_3='MNP', name='Northern Mariana Islands', numeric='580', official_name='Commonwealth of the Northern Mariana Islands') Country(alpha_2='MZ', alpha_3='MOZ', name='Mozambique', numeric='508', official_name='Republic of Mozambique') Country(alpha_2='MR', alpha_3='MRT', name='Mauritania', numeric='478', official_name='Islamic Republic of Mauritania') Country(alpha_2='MS', alpha_3='MSR', name='Montserrat', numeric='500') Country(alpha_2='MQ', alpha_3='MTQ', name='Martinique', numeric='474') Country(alpha_2='MU', alpha_3='MUS', name='Mauritius', numeric='480', official_name='Republic of Mauritius') Country(alpha_2='MW', alpha_3='MWI', name='Malawi', numeric='454', official_name='Republic of Malawi') Country(alpha_2='MY', alpha_3='MYS', name='Malaysia', numeric='458') Country(alpha_2='YT', alpha_3='MYT', name='Mayotte', numeric='175') Country(alpha_2='NA', alpha_3='NAM', name='Namibia', numeric='516', official_name='Republic of Namibia') Country(alpha_2='NC', alpha_3='NCL', name='New Caledonia', numeric='540') Country(alpha_2='NE', alpha_3='NER', name='Niger', numeric='562', 
official_name='Republic of the Niger') Country(alpha_2='NF', alpha_3='NFK', name='Norfolk Island', numeric='574') Country(alpha_2='NG', alpha_3='NGA', name='Nigeria', numeric='566', official_name='Federal Republic of Nigeria') Country(alpha_2='NI', alpha_3='NIC', name='Nicaragua', numeric='558', official_name='Republic of Nicaragua') Country(alpha_2='NU', alpha_3='NIU', name='Niue', numeric='570', official_name='Niue') Country(alpha_2='NL', alpha_3='NLD', name='Netherlands', numeric='528', official_name='Kingdom of the Netherlands') Country(alpha_2='NO', alpha_3='NOR', name='Norway', numeric='578', official_name='Kingdom of Norway') Country(alpha_2='NP', alpha_3='NPL', name='Nepal', numeric='524', official_name='Federal Democratic Republic of Nepal') Country(alpha_2='NR', alpha_3='NRU', name='Nauru', numeric='520', official_name='Republic of Nauru') Country(alpha_2='NZ', alpha_3='NZL', name='New Zealand', numeric='554') Country(alpha_2='OM', alpha_3='OMN', name='Oman', numeric='512', official_name='Sultanate of Oman') Country(alpha_2='PK', alpha_3='PAK', name='Pakistan', numeric='586', official_name='Islamic Republic of Pakistan') Country(alpha_2='PA', alpha_3='PAN', name='Panama', numeric='591', official_name='Republic of Panama') Country(alpha_2='PN', alpha_3='PCN', name='Pitcairn', numeric='612') Country(alpha_2='PE', alpha_3='PER', name='Peru', numeric='604', official_name='Republic of Peru') Country(alpha_2='PH', alpha_3='PHL', name='Philippines', numeric='608', official_name='Republic of the Philippines') Country(alpha_2='PW', alpha_3='PLW', name='Palau', numeric='585', official_name='Republic of Palau') Country(alpha_2='PG', alpha_3='PNG', name='Papua New Guinea', numeric='598', official_name='Independent State of Papua New Guinea') Country(alpha_2='PL', alpha_3='POL', name='Poland', numeric='616', official_name='Republic of Poland') Country(alpha_2='PR', alpha_3='PRI', name='Puerto Rico', numeric='630') Country(alpha_2='KP', alpha_3='PRK', name="Korea, 
Democratic People's Republic of", numeric='408', official_name="Democratic People's Republic of Korea") Country(alpha_2='PT', alpha_3='PRT', name='Portugal', numeric='620', official_name='Portuguese Republic') Country(alpha_2='PY', alpha_3='PRY', name='Paraguay', numeric='600', official_name='Republic of Paraguay') Country(alpha_2='PS', alpha_3='PSE', name='Palestine, State of', numeric='275', official_name='the State of Palestine') Country(alpha_2='PF', alpha_3='PYF', name='French Polynesia', numeric='258') Country(alpha_2='QA', alpha_3='QAT', name='Qatar', numeric='634', official_name='State of Qatar') Country(alpha_2='RE', alpha_3='REU', name='Réunion', numeric='638') Country(alpha_2='RO', alpha_3='ROU', name='Romania', numeric='642') Country(alpha_2='RU', alpha_3='RUS', name='Russian Federation', numeric='643') Country(alpha_2='RW', alpha_3='RWA', name='Rwanda', numeric='646', official_name='Rwandese Republic') Country(alpha_2='SA', alpha_3='SAU', name='Saudi Arabia', numeric='682', official_name='Kingdom of Saudi Arabia') Country(alpha_2='SD', alpha_3='SDN', name='Sudan', numeric='729', official_name='Republic of the Sudan') Country(alpha_2='SN', alpha_3='SEN', name='Senegal', numeric='686', official_name='Republic of Senegal') Country(alpha_2='SG', alpha_3='SGP', name='Singapore', numeric='702', official_name='Republic of Singapore') Country(alpha_2='GS', alpha_3='SGS', name='South Georgia and the South Sandwich Islands', numeric='239') Country(alpha_2='SH', alpha_3='SHN', name='Saint Helena, Ascension and Tristan da Cunha', numeric='654') Country(alpha_2='SJ', alpha_3='SJM', name='Svalbard and Jan Mayen', numeric='744') Country(alpha_2='SB', alpha_3='SLB', name='Solomon Islands', numeric='090') Country(alpha_2='SL', alpha_3='SLE', name='Sierra Leone', numeric='694', official_name='Republic of Sierra Leone') Country(alpha_2='SV', alpha_3='SLV', name='El Salvador', numeric='222', official_name='Republic of El Salvador') Country(alpha_2='SM', alpha_3='SMR', 
name='San Marino', numeric='674', official_name='Republic of San Marino') Country(alpha_2='SO', alpha_3='SOM', name='Somalia', numeric='706', official_name='Federal Republic of Somalia') Country(alpha_2='PM', alpha_3='SPM', name='Saint Pierre and Miquelon', numeric='666') Country(alpha_2='RS', alpha_3='SRB', name='Serbia', numeric='688', official_name='Republic of Serbia') Country(alpha_2='SS', alpha_3='SSD', name='South Sudan', numeric='728', official_name='Republic of South Sudan') Country(alpha_2='ST', alpha_3='STP', name='Sao Tome and Principe', numeric='678', official_name='Democratic Republic of Sao Tome and Principe') Country(alpha_2='SR', alpha_3='SUR', name='Suriname', numeric='740', official_name='Republic of Suriname') Country(alpha_2='SK', alpha_3='SVK', name='Slovakia', numeric='703', official_name='Slovak Republic') Country(alpha_2='SI', alpha_3='SVN', name='Slovenia', numeric='705', official_name='Republic of Slovenia') Country(alpha_2='SE', alpha_3='SWE', name='Sweden', numeric='752', official_name='Kingdom of Sweden') Country(alpha_2='SZ', alpha_3='SWZ', name='Eswatini', numeric='748', official_name='Kingdom of Eswatini') Country(alpha_2='SX', alpha_3='SXM', name='Sint Maarten (Dutch part)', numeric='534', official_name='Sint Maarten (Dutch part)') Country(alpha_2='SC', alpha_3='SYC', name='Seychelles', numeric='690', official_name='Republic of Seychelles') Country(alpha_2='SY', alpha_3='SYR', name='Syrian Arab Republic', numeric='760') Country(alpha_2='TC', alpha_3='TCA', name='Turks and Caicos Islands', numeric='796') Country(alpha_2='TD', alpha_3='TCD', name='Chad', numeric='148', official_name='Republic of Chad') Country(alpha_2='TG', alpha_3='TGO', name='Togo', numeric='768', official_name='Togolese Republic') Country(alpha_2='TH', alpha_3='THA', name='Thailand', numeric='764', official_name='Kingdom of Thailand') Country(alpha_2='TJ', alpha_3='TJK', name='Tajikistan', numeric='762', official_name='Republic of Tajikistan') 
Country(alpha_2='TK', alpha_3='TKL', name='Tokelau', numeric='772') Country(alpha_2='TM', alpha_3='TKM', name='Turkmenistan', numeric='795') Country(alpha_2='TL', alpha_3='TLS', name='Timor-Leste', numeric='626', official_name='Democratic Republic of Timor-Leste') Country(alpha_2='TO', alpha_3='TON', name='Tonga', numeric='776', official_name='Kingdom of Tonga') Country(alpha_2='TT', alpha_3='TTO', name='Trinidad and Tobago', numeric='780', official_name='Republic of Trinidad and Tobago') Country(alpha_2='TN', alpha_3='TUN', name='Tunisia', numeric='788', official_name='Republic of Tunisia') Country(alpha_2='TR', alpha_3='TUR', name='Turkey', numeric='792', official_name='Republic of Turkey') Country(alpha_2='TV', alpha_3='TUV', name='Tuvalu', numeric='798') Country(alpha_2='TW', alpha_3='TWN', common_name='Taiwan', name='Taiwan, Province of China', numeric='158', official_name='Taiwan, Province of China') Country(alpha_2='TZ', alpha_3='TZA', common_name='Tanzania', name='Tanzania, United Republic of', numeric='834', official_name='United Republic of Tanzania') Country(alpha_2='UG', alpha_3='UGA', name='Uganda', numeric='800', official_name='Republic of Uganda') Country(alpha_2='UA', alpha_3='UKR', name='Ukraine', numeric='804') Country(alpha_2='UM', alpha_3='UMI', name='United States Minor Outlying Islands', numeric='581') Country(alpha_2='UY', alpha_3='URY', name='Uruguay', numeric='858', official_name='Eastern Republic of Uruguay') Country(alpha_2='US', alpha_3='USA', name='United States', numeric='840', official_name='United States of America') Country(alpha_2='UZ', alpha_3='UZB', name='Uzbekistan', numeric='860', official_name='Republic of Uzbekistan') Country(alpha_2='VA', alpha_3='VAT', name='Holy See (Vatican City State)', numeric='336') Country(alpha_2='VC', alpha_3='VCT', name='Saint Vincent and the Grenadines', numeric='670') Country(alpha_2='VE', alpha_3='VEN', common_name='Venezuela', name='Venezuela, Bolivarian Republic of', numeric='862', 
official_name='Bolivarian Republic of Venezuela') Country(alpha_2='VG', alpha_3='VGB', name='Virgin Islands, British', numeric='092', official_name='British Virgin Islands') Country(alpha_2='VI', alpha_3='VIR', name='Virgin Islands, U.S.', numeric='850', official_name='Virgin Islands of the United States') Country(alpha_2='VN', alpha_3='VNM', common_name='Vietnam', name='Viet Nam', numeric='704', official_name='Socialist Republic of Viet Nam') Country(alpha_2='VU', alpha_3='VUT', name='Vanuatu', numeric='548', official_name='Republic of Vanuatu') Country(alpha_2='WF', alpha_3='WLF', name='Wallis and Futuna', numeric='876') Country(alpha_2='WS', alpha_3='WSM', name='Samoa', numeric='882', official_name='Independent State of Samoa') Country(alpha_2='YE', alpha_3='YEM', name='Yemen', numeric='887', official_name='Republic of Yemen') Country(alpha_2='ZA', alpha_3='ZAF', name='South Africa', numeric='710', official_name='Republic of South Africa') Country(alpha_2='ZM', alpha_3='ZMB', name='Zambia', numeric='894', official_name='Republic of Zambia') Country(alpha_2='ZW', alpha_3='ZWE', name='Zimbabwe', numeric='716', official_name='Republic of Zimbabwe')
So that is what pycountry makes available for each country.
We will run into several issues as we develop this, and we will work through them to produce an 'acceptable' (for us today) solution.
Basically what we want to do is something similar to what we did for 'Publication Type'.
We will create a new 'Country' column derived from the countries we extract from the 'Addresses' column.
We will approach this with a common-sense methodology and ignore the more 'pythonic' ways of doing things. While faster and more efficient, they are to be reserved for a more advanced discussion.
To kick this off we will iterate through each row of data (I told you we were not going pythonic, right!?!)
To iterate through rows of data we use the iterrows() function.
We will send the cell for each row to a new function we will call find_country().
This new function will be responsible for finding the correct country (or countries) in 'Addresses' and adding them to the 'Country' column we will create.
We will start off by creating the new, empty, 'Country' column like so;
df['Country'] = ''
Next we will start working on our new function find_country()
To it we will pass the text of the current row's 'Addresses' cell.
Since we already demonstrated that some entries have authors from multiple countries, we will store each of them in a list called clist (for country list).
Remember that we replaced all those empty and nan cells with the string 'None'? Well, we will make partial use of that right now. If the input text is 'None' we just return the string 'None'.
Then we will iterate through each country in pycountry and, if its name exists in the text, we will store/append that country's ISO name into the list.
Finally we return clist.
Below is our first iteration of our function.
def find_country(txt):
    clist = []
    if txt == 'None':
        return 'None'
    for c in pycountry.countries:
        #print(f'\nc: {c} txt: {txt}')
        if c.name in txt:
            clist.append(c.name)
    return clist
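To see the matching logic in isolation, here is a self-contained sketch that swaps pycountry out for a tiny hand-made list of country names (an illustration only, not the real ISO dataset):

```python
# Tiny stand-in for pycountry.countries -- just a few names.
COUNTRY_NAMES = ['Germany', 'Sweden', 'Denmark', 'France']

def find_country(txt):
    clist = []
    if txt == 'None':
        return 'None'
    for name in COUNTRY_NAMES:
        # A plain substring check, exactly like `c.name in txt` above.
        if name in txt:
            clist.append(name)
    return clist

addr = ('[Mauritsen, Thorsten] Max Planck Inst Meteorol, Hamburg, Germany; '
        '[Graversen, Rune G.] Stockholm Univ, Stockholm, Sweden')
print(find_country(addr))  # ['Germany', 'Sweden']
```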
Now we need to utilize our function via iterrows(), which will return both the index of the row we are in and its values.
For now we will just print the result out and not worry about storing them in the new column.
for index, row in df.iterrows():
    addr = df['Addresses'][index]
    #print(f'INDEX: {index} ADDRESS: {addr}')
    c = find_country(addr)
    #print(f'index: {index} c: {c}')
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) ~\AppData\Local\Temp\ipykernel_46064\4012802089.py in <cell line: 1>() 2 addr = df['Addresses'][index] 3 #print(f'INDEX: {index} ADDRESS: {addr}') ----> 4 c = find_country(addr) 5 #print(f'index: {index} c: {c}') 6 ~\AppData\Local\Temp\ipykernel_46064\1686651311.py in find_country(txt) 5 for c in pycountry.countries: 6 #print(f'\nc: {c} txt: {txt}') ----> 7 if c.name in txt: 8 clist.append(c.name) 9 return clist TypeError: argument of type 'float' is not iterable
So right away on our very first row we run into problems.
Then later on we run into some sort of float error. Wait, what? What float? We are passing text, are we not?
Let's deal with the error first.
We see that our last successful index value was 21, so let's see what's hiding in cell 22 that caused our error.
df['Addresses'][22]
nan
nan stands for Not a Number and often when you intentionally use/test for it you will want to use the nan supplied with numpy!
But we don't need to worry about that right now.
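To see why our check blew up, a minimal sketch: a missing cell arrives as the float value nan, and the `in` operator needs an iterable on its right-hand side:

```python
cell = float('nan')    # numpy's np.nan is this very same float value

print(type(cell))      # <class 'float'> -- not a string at all!

try:
    'Germany' in cell  # the same membership test find_country() performs
except TypeError as e:
    print(f'TypeError: {e}')
```

This is exactly the "argument of type 'float' is not iterable" we hit above.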
What we need is a string replacement for those nan's (yes, there are actually many).
The first thing we will do is fill in all the missing data with "None" (as in the string 'None'!)
Why "None" instead of something like NaN or other common things?
Something we will do later will fail painfully if we don't have a string to work with.
Can we get around haveing to pipe a string in? Sure its Python theres a million ways to do everything! But we will make it simple.
fillna() is a function which replaces any missing cell(s) with whatever we desire.
We can change them to literal Nulls, NaN's (almost always you will use numpy NaNs), 0's, or some rational default value like -666666 (yes, that's actually used frequently by certain scientists, some of whom are represented here today!).
For our purposes we want to replace all the empty cells in the 'Addresses' column with "None".
Since we want the change applied directly to our DataFrame we will use a handy function parameter called inplace, setting it to True.
What this does is make all changes active in our current DataFrame without having to reassign it.
df['Addresses'].fillna('None', inplace=True)
For those curious or confused, without using inplace=True the above bit of code would have to look like this:
df['Addresses'] = df['Addresses'].fillna('None')
A hint for those with larger datasets: use this feature wherever it's available, as it's faster!
Sadly it does not exist for all functions, and that is a major bummer really!
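To see the two styles side by side, here is a throwaway example (the column name and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# A tiny DataFrame with one missing cell (invented data)
a = pd.DataFrame({'addr': ['somewhere', np.nan, 'elsewhere']})

# Style 1: reassignment - fillna() returns a new, patched object
b = a.fillna('None')

# Style 2: in place - the original DataFrame is modified, nothing returned
a.fillna('None', inplace=True)

print(a.equals(b))      # True: both routes end at the same result
```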
Let's rerun our loop, taking out the print statement for now, and make sure that clears the error for us.
for index, row in df.iterrows():
    addr = df['Addresses'][index]
    #print(f'INDEX: {index} ADDRESS: {addr}')
    c = find_country(addr)
Let's take a look at that very first row's cell and figure out what the heck happened there. We never saved the country to clist, or failed to return it, or what???
df['Addresses'][0]
'Univ Plymouth, Sch Art & Media, Plymouth PL4 8AA, Devon, England'
OK, so 'England' is the country we are interested in. So uhmmm, yeah???
Let's take a look at our pycountry listing again.
Uhm, there's no England in pycountry, but we are all pretty sure it exists, right? Or maybe it ceased to exist with the death of the Queen, or???
Actually, remember all the verbiage I vomited about ISO names?
Yep, 'England' is NOT an ISO-named country!
The correct ISO name is 'United Kingdom'.
Well, we are not going to fix pycountry; trust me, others have pushed for that, hence why some countries have the 'common name' entry. And we surely don't want to create our own package, at least not today, to deal with this. So?
What we will do is create a sort of preprocessor for pycountry.
We will do this by building a dictionary called ccode (for country code) which will replace all occurrences of 'England' with 'United Kingdom' in the 'Addresses' column, like so.
ccode = {'England':'United Kingdom'}

for key in ccode:
    df['Addresses'] = df['Addresses'].str.replace(key, ccode.get(key))
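The same dictionary-driven substitution is easy to see on a plain string; this sketch mirrors what the .str.replace() loop does across the whole column (using the Plymouth address we saw above):

```python
ccode = {'England': 'United Kingdom'}

addr = 'Univ Plymouth, Sch Art & Media, Plymouth PL4 8AA, Devon, England'

# Apply each key -> value substitution in turn, just like the
# df['Addresses'].str.replace(...) loop does for the whole column
for key in ccode:
    addr = addr.replace(key, ccode[key])

print(addr)  # ... Devon, United Kingdom
```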
Cool, we should be good to go!
Let's rerun our loop and find out.
for index, row in df.iterrows():
    addr = df['Addresses'][index]
    #print(f'INDEX: {index} ADDRESS: {addr}')
    c = find_country(addr)
    #print(f'index: {index} c: {c}')
Buggers! OK, now what?!? I thought we just fixed the problem.
OK, let's make life a little easier and get a list of all the bad indices so we can look at all these issues.
c_index = []

for index, row in df.iterrows():
    addr = df['Addresses'][index]
    #print(f'INDEX: {index} ADDRESS: {addr}')
    c = find_country(addr)
    if c == []:
        c_index.append(index)

c_index[0:10]
[9, 15, 18, 23, 35, 36, 38, 39, 44, 45]
Wait, just how bad is it?
Let's find the length (number of indices) of c_index.
len(c_index)
2095
Holy flippin' moo cows, Batman, that's one HUGE list of issues, and I so do not have time to go through each one individually.
What to do, what to do?!?
OK, what if we create a new DataFrame with the index values and the actual addresses?
QUESTION: Why build a new DataFrame instead of just using print statements or something?
Answer
Well simply put, just to teach you new powerful and important tricks!
test_df = pd.DataFrame(columns=['c_index', 'addresses'])
test_df['c_index'] = c_index

for index, value in enumerate(c_index):
    test_df['addresses'][index] = df['Addresses'][value]

test_df
C:\Users\tdunn\AppData\Local\Temp\ipykernel_46064\2048375063.py:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['addresses'][index] = df['Addresses'][value]
| | c_index | addresses |
|---|---|---|
| 0 | 9 | [Kim, Sunhee] Seoul Womens Univ, Seoul, South ... |
| 1 | 15 | [Lee, Keunhye] Gachon Univ, Dept Interior Arch... |
| 2 | 18 | [Evans, Eleri] Swansea Univ, Swansea, W Glam, ... |
| 3 | 23 | [Gonzalez, Jorge E.; Ramamurthy, Prathap] CUNY... |
| 4 | 35 | [Tosca, Mika G.; Gilbert, Ilai; Walls, Kelvin ... |
| ... | ... | ... |
| 2090 | 12633 | [Fields, Lindsey; Nixon, Scott W.; Oviatt, Can... |
| 2091 | 12655 | [Farrara, John D.; Chao, Yi; Li, Zhijin; Wang,... |
| 2092 | 12657 | [Spangler, Lee H.; Dobeck, Laura M.; Gullickso... |
| 2093 | 12675 | [Kilb, Debi; Yang, Alan] Univ Calif San Diego,... |
| 2094 | 12679 | [Jennings, Bruce] Vanderbilt Univ, Med Sch, Ct... |

2095 rows × 2 columns
Ignore the warning. If we were serious about this bit of code then we would deal with it, but since we are just exploring our issues and plan to drop this bit from our final code, don't waste time on it!
Looking at all the entries in test_df['addresses'] we can see 3 major issues:
1. More non-ISO country names (e.g. 'Wales', 'South Korea', 'USA').
2. Country names in a different case (e.g. 'ENGLAND').
3. US addresses that carry only a state abbreviation and no country at all.
#for i in test_df['addresses']:
#    print(i)
Good news: we know how to deal with the first one; we just need to expand our ccode dictionary, like so.
ccode = {'England':'United Kingdom',
'Wales':'United Kingdom',
'South Korea':'Korea, Republic of',
'USA':'United States',
'Czech Republic':'Czechia',
'Scotland':'United Kingdom',
'Russia':'Russian Federation',
'Iran':'Iran, Islamic Republic of',
'U Arab Emirates':'United Arab Emirates',
'Taiwan':'Taiwan, Province of China',
'Venezuela':'Venezuela, Bolivarian Republic of',
'Vietnam':'Viet Nam',
'ENGLAND':'United Kingdom',
'VENEZUELA':'Venezuela, Bolivarian Republic of',
'CZECH REPUBLIC':'Czechia',
'SCOTLAND':'United Kingdom'}
For the case issue we need to cast everything to lower() inside our find_country() function, like so:
def find_country(txt):
    clist = []
    if txt == 'None':
        return 'None'
    for c in pycountry.countries:
        if c.name.lower() in txt.lower():
            clist.append(c.name)
    return clist
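A quick check of why lowercasing both sides matters; the sample address is invented:

```python
name = 'Czechia'
addr = 'Charles Univ, Prague, CZECHIA'   # invented address

print(name in addr)                   # False: the case mismatch defeats the match
print(name.lower() in addr.lower())   # True: lowercasing both sides fixes it
```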
The state issue is slightly trickier, but we can deal with it via a new function we will call find_state().
Basically it will do what we did with find_country(), but this time look for the culprit states (stored in a list) and tag them as 'United States', as it were.

def find_state(txt):
    states = ['CA', 'NJ', 'CO', 'WA', 'NM', 'DC', 'OR', 'MA', 'MD', 'AZ', 'NE']
    clist = []
    for c in states:
        if c in txt:
            clist.append('United States')
    return clist
IMPORTANT NOTE: I intentionally skipped an Item 4.
What Item 4, you wisely ask?
Well, if you were to use this new code as is and then thoroughly check the new results, you would find two (2) cells with no indication whatsoever of a country.
So we will make one last modification to find_state(): return the string 'None' if no state is found to apply 'United States' to.
Like so:
def find_state(txt):
    states = ['CA', 'NJ', 'CO', 'WA', 'NM', 'DC', 'OR', 'MA', 'MD', 'AZ', 'NE']
    clist = []
    for c in states:
        if c in txt:
            clist.append('United States')
            return clist
    else:
        # for/else: this branch only runs if the loop finished with no match
        return 'None'
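A quick sanity check (the function is repeated here so the cell stands alone; the sample addresses are invented):

```python
def find_state(txt):
    states = ['CA', 'NJ', 'CO', 'WA', 'NM', 'DC', 'OR', 'MA', 'MD', 'AZ', 'NE']
    clist = []
    for c in states:
        if c in txt:
            clist.append('United States')
            return clist
    else:
        # for/else: only runs when the loop finishes without a match
        return 'None'

print(find_state('Jet Prop Lab, Pasadena, CA 91109'))  # ['United States']
print(find_state('Univ Toulouse, France'))             # 'None'
```

Note it matches raw two-letter substrings, so it is only as good (and as cautious) as the states list.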
We can now rewrite our base loop to deal with both find_country() and find_state().
Here we will also go ahead and add the countries returned from our functions into our df['Country'] column.
for index, row in df.iterrows():
    addr = df['Addresses'][index]
    c = find_country(addr)
    if c:
        df['Country'][index] = c
    else:
        df['Country'][index] = find_state(addr)
IMPORTANT NOTE: Remember that "A value is trying to be set on a copy of a slice from a DataFrame" warning we received in our test_df code?
If you ran the code above you would receive the exact same warning!
To fix that, we assign the desired value to the cell using the .at indexer!
Like so:
for index, row in df.iterrows():
    addr = df['Addresses'][index]
    c = find_country(addr)
    if c:
        df.at[index, 'Country'] = c
    else:
        df.at[index, 'Country'] = find_state(addr)
Putting all of that together now we can actually create the final df['Country'] column!
ccode = {'England':'United Kingdom',
'Wales':'United Kingdom',
'South Korea':'Korea, Republic of',
'USA':'United States',
'Czech Republic':'Czechia',
'Scotland':'United Kingdom',
'Russia':'Russian Federation',
'Iran':'Iran, Islamic Republic of',
'U Arab Emirates':'United Arab Emirates',
'Taiwan':'Taiwan, Province of China',
'Venezuela':'Venezuela, Bolivarian Republic of',
'Vietnam':'Viet Nam',
'ENGLAND':'United Kingdom',
'VENEZUELA':'Venezuela, Bolivarian Republic of',
'CZECH REPUBLIC':'Czechia',
'SCOTLAND':'United Kingdom'}
for key in ccode:
    df['Addresses'] = df['Addresses'].str.replace(key, ccode.get(key))

def find_state(txt):
    states = ['CA', 'NJ', 'CO', 'WA', 'NM', 'DC', 'OR', 'MA', 'MD', 'AZ', 'NE']
    clist = []
    for c in states:
        if c in txt:
            clist.append('United States')
            return clist
    else:
        return 'None'

def find_country(txt):
    clist = []
    if txt == 'None':
        return 'None'
    for c in pycountry.countries:
        if c.name.lower() in txt.lower():
            clist.append(c.name)
    return clist

df['Country'] = ''

for index, row in df.iterrows():
    addr = df['Addresses'][index]
    c = find_country(addr)
    if c:
        df.at[index, 'Country'] = c
    else:
        df.at[index, 'Country'] = find_state(addr)
df['Country']
0                                         [United Kingdom]
1                         [Argentina, Canada, Netherlands]
2                                                  [Spain]
3                                         [United Kingdom]
4                                  [France, United States]
                               ...
12681                         [Canada, Russian Federation]
12682                                             [France]
12683    [Argentina, Australia, Austria, Brazil, Canada...
12684    [Afghanistan, Austria, Canada, China, Czechia,...
12685    [Austria, Belgium, Bulgaria, Switzerland, Czec...
Name: Country, Length: 12686, dtype: object
Let's make sure we save all the changes.
df.to_csv(os.path.join('Processed_data', 'Combined_lists.csv'))
Due to time constraints we will call this section done - for now.
We will continue to explore our data, as well as manipulate it and create new data derived from it, in the next section:
Part_02_Data_Viz.