Bee dataViz journal ..



“The greatest value of a picture is when it forces us to notice what we never expected to see.”

― John Tukey



The purpose of this journal: recording little snips of my DataViz learning.

I wish to create better pictures with data.

Each small gap is only a baby step away; each step forward will bring me closer to where I want to be.

They are not glamorous strides, and may even be clumsy.

But happy with each small step of progress.



Dated: 11 May 2017

R Corrgrams - "Considering the pairwise relationship in a car"


One form of analysis that I look at very often in my work is Correlation Matrices.

Thus, I am always happy to find new ways to explore Correlation.


About Correlation:

  • What is correlation?

    Correlation is the statistical relationship between a pair of variables. Note that correlation does not imply causation.

  • How to measure correlation?

    A correlation coefficient measures the strength and direction of the correlation. Its value ranges from -1 to 1:

    • 1 means a perfect positive correlation;
    • -1 means a perfect negative correlation;
    • 0 means there is no linear correlation.

  • What is a correlation matrix?

    A correlation matrix is a table that shows the correlation coefficients between many pairs of variables.

  • What is a corrgram?

    A corrgram is a graphical representation of the correlation matrix.
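
To make the definition of the coefficient concrete, here is a minimal sketch that computes it by hand for two columns of the built-in mtcars data and compares the result with R's cor(). The notes above do not say which coefficient is meant; Pearson's r is assumed here, since it is the default in cor().

```r
# Pearson's r computed by hand for two mtcars columns (base R only).
# Assumption: "correlation coefficient" means Pearson's r.
x <- mtcars$wt   # weight (1000 lbs)
y <- mtcars$mpg  # miles per (US) gallon

# Deviations of each value from its column mean.
dx <- x - mean(x)
dy <- y - mean(y)

# Sum of cross-products, scaled by the spread of both variables.
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# cor() uses Pearson's method by default, so the two should agree.
r_builtin <- cor(x, y)

print(round(c(manual = r_manual, builtin = r_builtin), 2))
```

The sign of the result gives the direction of the relationship, and its distance from 0 gives the strength.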



Some R packages have beautiful options for corrgrams. One example is the corrgram package and its corrgram() function.

The following is an example using the "mtcars" dataset from base R.


About "mtcars" dataset:

  • mtcars: "Motor Trend Car Road Tests"

  • Data Source:

    "The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models)."

  • Original Data Source:

    "Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411."

  • Data:

    32 observations on 11 variables:

    variable  meaning
    mpg       Miles/(US) gallon
    cyl       Number of cylinders
    disp      Displacement (cu.in.)
    hp        Gross horsepower
    drat      Rear axle ratio
    wt        Weight (1000 lbs)
    qsec      1/4 mile time
    vs        Engine shape (0 = V-shaped, 1 = straight)
    am        Transmission (0 = automatic, 1 = manual)
    gear      Number of forward gears
    carb      Number of carburetors


Typically, a correlation matrix looks something like this. It is useful, but tedious to go through when the number of variables grows large.

              ##       mpg   cyl  disp    hp   drat    wt   qsec    vs     am  gear   carb
              ## mpg   1.00 -0.85 -0.85 -0.78  0.681 -0.87  0.419  0.66  0.600  0.48 -0.551
              ## cyl  -0.85  1.00  0.90  0.83 -0.700  0.78 -0.591 -0.81 -0.523 -0.49  0.527
              ## disp -0.85  0.90  1.00  0.79 -0.710  0.89 -0.434 -0.71 -0.591 -0.56  0.395
              ## hp   -0.78  0.83  0.79  1.00 -0.449  0.66 -0.708 -0.72 -0.243 -0.13  0.750
              ## drat  0.68 -0.70 -0.71 -0.45  1.000 -0.71  0.091  0.44  0.713  0.70 -0.091
              ## wt   -0.87  0.78  0.89  0.66 -0.712  1.00 -0.175 -0.55 -0.692 -0.58  0.428
              ## qsec  0.42 -0.59 -0.43 -0.71  0.091 -0.17  1.000  0.74 -0.230 -0.21 -0.656
              ## vs    0.66 -0.81 -0.71 -0.72  0.440 -0.55  0.745  1.00  0.168  0.21 -0.570
              ## am    0.60 -0.52 -0.59 -0.24  0.713 -0.69 -0.230  0.17  1.000  0.79  0.058
              ## gear  0.48 -0.49 -0.56 -0.13  0.700 -0.58 -0.213  0.21  0.794  1.00  0.274
              ## carb -0.55  0.53  0.39  0.75 -0.091  0.43 -0.656 -0.57  0.058  0.27  1.000
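
Reading a matrix like this by eye gets tedious, so one option is to let R rank the pairs. The sketch below uses only base R; the name `ranked` is my own choice for illustration, not something from the corrgram package.

```r
# Rank variable pairs in mtcars by absolute correlation (base R only).
cm <- cor(mtcars)

# Keep each pair once by taking only the upper triangle of the matrix.
idx <- which(upper.tri(cm), arr.ind = TRUE)
ranked <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Strongest relationships first, regardless of sign.
ranked <- ranked[order(-abs(ranked$r)), ]
print(head(ranked, 5), row.names = FALSE)
```

This gives the same information as the shading in a corrgram, just as a sorted list instead of a picture.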
              

A corrgram gives this correlation matrix a beautiful visual make-over so you can spot important information at a glance. The shading tells you where to look for pairs that are highly correlated; the colors tell you which pairs are positively or negatively correlated.

              library(corrgram)  # provides corrgram() and the panel.* functions
              corrgram(mtcars, order=TRUE,
                 main="Corrgram of mtcars intercorrelations Sample1",
                 lower.panel=panel.shade,
                 upper.panel=panel.cor,
                 text.panel=panel.txt)
              
[Figure: Corrgram of mtcars intercorrelations, Sample 1]

The lower panel with the color shading shows how strong the correlation is: the darker the color, the stronger the correlation. Blue marks the pairs with positive correlation, red the pairs with negative correlation.

The upper panel displays the correlation coefficient for each pair.

For example, at the top left corner, the first pair of variables, gear (number of forward gears) vs am (transmission, automatic or manual), is highly correlated (dark blue) with a correlation coefficient of 0.79.



And in case you want to view the visual plot in different ways, there are various other options you can interchange for both the lower panel and upper panel.

Or, say you want to emphasize the importance of being "green" and portray a more nature friendly automobile plot, the color combination can be altered using "col.regions" option.

Below is an autumn-inspired plot using the corrgram package.

              library(corrgram)
              cols <- colorRampPalette(c("darkgoldenrod4", "burlywood1","darkkhaki","darkgreen"))
              corrgram(mtcars, order=TRUE, col.regions=cols,
                 main="Corrgram of mtcars intercorrelations Sample2 - with autumn colors",
                 lower.panel=panel.shade,
                 upper.panel=panel.ellipse,
                 text.panel=panel.txt)
              
[Figure: Corrgram of mtcars intercorrelations, Sample 2, with autumn colors]

In addition to changing the colors for the lower panel, here I also replace the correlation coefficient with a confidence ellipse and smoothed line.

With the confidence ellipse and smoothed line in the upper panel, you can see that the relationships between the variables are not always linear.


For those of you who want to learn more on this topic, the book I learned this from is:

"R In Action, Data analysis and graphics with R." - by Robert I. Kabacoff (ISBN: 9781617291388)



Dated: 10 May 2017

Plot a cloud - "Just me being random"


Data analysts always advocate that all graphs should convey a meaning or explain some data fact.

But what if I just want to "paint" a cloud at random?

Why can't I paint a plot just like I paint on canvas, and create abstract art just to have fun with numbers?

My name has a "cloud" in it, so here is me, scattered at random.

Enhanced scatter-plot


Dated: 8 May 2017

Coloring your plots - "Vitamins Trial"


I have been taking a few online courses, and one of my recent favourites is the Udemy course "Colors for Data Science A-Z: Data Visualization Color Theory".

The course explains the effect of the color scheme on your data visualisation. I like the course; it is worth going through.

The course includes a data project that lets us test what we have learned about colors and color schemes. I am using Tableau to present my tests here.

The colors for the different color schemes can be chosen using either Paletton or Adobe Color.


This sample project from the course is related to vitamin trials: "a new vitamin was prescribed to people of different ages and genders in different dosages. At the end of the trial they were all given the same control exam to test their cognitive ability." Project data can be downloaded from Super DataScience webpage.


A word of caution, this vitamin trials dataset is probably mock-up data prepared by the instructor, so it's really not a good idea to start pouring vitamins into your kids after looking at the result.


The following are 4 charts using the same dataset, with 4 different color schemes. Can you see the difference? Go take the course. :)



Test 1 - Version 1

Not the best color strategy of course. Too many groups, too many colors, too confusing.


Test 2 - Using Monochromatic Colour Scheme

Similar to Version 1, but focusing on only one age group so the trend becomes apparent.
Monochromatic color schemes are derived from a single base hue.
Because the colors share the same base hue, the shade can deepen as the data value increases, which suits sequential data.


Test 3 - Using Triadic Colour Scheme

A triadic color scheme uses three colors equally spaced around the color wheel.
This scheme produces strong contrast but still retains harmony.
Thus, to show that 3 groups of data are of equal importance, this scheme is a good choice.
The only challenge is to choose the right balance so that no one color dominates.


Test 4 - Using Analogous Colour Scheme

Analogous color schemes use colors that are next to each other on the color wheel. These color schemes feel comfortable and pleasing to the eye.
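
The three schemes can also be sketched in code as simple arithmetic on the hue. This is only a rough illustration in base R using hsv(); the base hue, the saturation and brightness values, and the one-twelfth step for "next to each other" are my own assumptions, not values from the course, Paletton or Adobe Color.

```r
# Build small palettes from a base hue using grDevices::hsv().
# Hue, saturation and value are all given on a 0-1 scale.
base_hue <- 0.58  # assumed base hue (a blue), chosen only for illustration

# Wrap a hue back into [0, 1) after shifting it around the color wheel.
wrap <- function(h) h %% 1

# Monochromatic: one hue, increasing saturation for sequential data.
monochromatic <- hsv(h = base_hue, s = c(0.25, 0.5, 0.75, 1), v = 0.9)

# Triadic: three hues spaced one third of the wheel apart.
triadic <- hsv(h = wrap(base_hue + c(0, 1/3, 2/3)), s = 0.8, v = 0.9)

# Analogous: neighbouring hues, here one twelfth of the wheel either side.
analogous <- hsv(h = wrap(base_hue + c(-1/12, 0, 1/12)), s = 0.8, v = 0.9)

print(list(monochromatic = monochromatic,
           triadic = triadic,
           analogous = analogous))
```

Each vector holds hex color strings, so it can be passed to the col argument of a base R plot to preview the scheme.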


Dated: 5 April 2017

Scatter-plot using R - "Relationship between SIBOR Rate vs SOR Rate"


Still exploring the R "car" package. The previous sample dataset ("women" from base R) is too "slim"; the scatter-plot doesn't look scattered enough.

So I tested the same scatterplot() function from the "car" package with Singapore SIBOR and SOR interest rate data.

Gentle reminder, the "car" here is not referring to your Nissan or Toyota car;

"car" stands for "Companion to Applied Regression".


SIBOR stands for Singapore Interbank Offered Rate. It is based on the interest rates banks in Singapore charge when lending unsecured funds to each other. So, it reflects the demand and supply of funds between banks in Singapore.

SOR stands for Singapore Swap Offer Rate. It is based on the forward exchange rate between the US dollar and Singapore dollar. Thus, it is responsive to the current state of the US economy.

Please refer to The Association of Banks in Singapore (ABS) for more information.

Or, refer to the website moneysmart.sg for more explanation.


The dataset I am using here is extracted from the following website (Sibor.sg).

I find the site useful, but since this site (Sibor.sg) is not an official government site, please do not use the information obtained from this analysis for your financial decision. Quoting from their site disclaimer: "Any reliance you place on such information is therefore strictly at your own risk."

For more official information about SIBOR or SOR, it's best to refer to The Association of Banks in Singapore (ABS).


The dataset extracted consists of variables:

  1. 3 months SIBOR Rate
  2. 3 months SOR Rate

Dataset Date Range: Apr 2015 - Mar 2017 (24 months).


Our new house is going to be ready soon, so we are researching the various home loan options.

Home loans with interest rates pegged to SIBOR used to be very popular, so I am looking at the trend of SIBOR over the last two years.

Although SIBOR mainly reflects the demand and supply of funds between Singapore banks, it is still indirectly (and heavily) affected by the global economy, especially the US economy.

Thus I am plotting SIBOR against SOR to observe the pairwise relationship between the two interest rates.


The scatterplot seems to indicate that although SIBOR and SOR generally move in the same direction, SIBOR is definitely much less volatile than SOR. SOR seems more skewed towards the lower rate and has a wider value range for the higher rates.


For someone with a low risk appetite, SOR is definitely not suitable. However, the plot gives us a feel of how SIBOR has moved with SOR over the last two years, and so somewhat shows how "sensitive" SIBOR has been to the global economy over that period. The upper spread worries me. :{


Enhanced scatter-plot using scatterplot() function in the "car" package:

library(car)  # provides scatterplot()

# loan: data frame holding the monthly sibor3m and sor3m rates
scatterplot(sibor3m~sor3m,data=loan,
               spread=TRUE, smoother.args=list(lty=2),
               pch=19,cex=1.6,
               xlab="3 months Sor(%)",
               ylab="3 months Sibor(%)",
               main="Enhanced Scatterplot using car Package, Sibor3m vs Sor3m"
               )
   
Enhanced scatter-plot

Data Source: Sibor.sg

Dataset: "3 months SIBOR Rate" vs "3 months SOR Rate"

Dataset date range: Apr 2015 - Mar 2017 (24 months).



Dated: 3 April 2017

Scatter-plot using R - "Sample Data from R - Relationship between Women's Weight vs Height "


It's been a while since I practiced R coding.

I just tested some scatter-plots using the base R plot() function and the scatterplot() function from the "car" package.

The R "car" package is very useful for regression modeling. Please note that the "car" here does not mean that machine with 4 wheels; "car" means "Companion to Applied Regression".

The scatterplot() function in the "car" package provides nice enhancements. Instead of using the par() function to superimpose figures, consider using scatterplot() from the "car" package.

A good reference: "R In Action, Data analysis and graphics with R." - by Robert I. Kabacoff (ISBN: 9781617291388).


This is a sample dataset from Base R installation. This dataset "women" gives the average heights and weight for American women aged 30-39.

Original Data Source: The World Almanac and Book of Facts, 1975.


Like many ladies, I am very self-conscious of my weight. I understand that certain factors that affect weight, like height or genes, are really beyond my control.

I am not very sure how reliable this set of data is, but if it is a true representation of the real world, then it says that taller people tend to get heavier more easily? (Since the relationship is more quadratic than linear.)

In other words, my lack of height finally has an advantage. :D
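
One way to check the "more quadratic than linear" impression is to fit both models and compare them. This is a minimal sketch in base R on the same "women" dataset; the model names are my own, and the comparison is only one of several reasonable ways to test for curvature.

```r
# Compare a straight-line fit with a quadratic fit on the built-in women data.
fit_linear    <- lm(weight ~ height, data = women)
fit_quadratic <- lm(weight ~ height + I(height^2), data = women)

# R-squared: share of the variation in weight explained by each model.
r2 <- c(
  linear    = summary(fit_linear)$r.squared,
  quadratic = summary(fit_quadratic)$r.squared
)
print(round(r2, 4))

# F-test for whether the squared term adds anything beyond the straight line.
model_comparison <- anova(fit_linear, fit_quadratic)
print(model_comparison)
```

Both fits explain most of the variation, so the interesting part is whether the squared term is a significant improvement, which is what the F-test in the last step reports.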


For the R codes of this Simple Linear Model Analysis and its plots, refer to this link.

Enhanced scatter-plot using scatterplot() function in the "car" package:

library(car)  # provides scatterplot()

# women (base R): height in inches, weight in lbs
scatterplot(weight~height,data=women,
                       spread=FALSE, smoother.args=list(lty=2),
                       pch=19,cex=1.6,
                       xlab="Height (in)",
                       ylab="Weight (lbs)",
                       main="Enhanced Scatterplot using car Package, Weight (lbs) vs Height (in)"
                       )

Enhanced scatter-plot

Data Source: Sample dataset "women" from Base R installation

Original Data Source: The World Almanac and Book of Facts, 1975.

Dataset: This dataset "women" gives the average heights and weight for American women aged 30-39, dated 1975.



Dated: 29 March 2017

Fine-tuning my first D3 - "SG - Total Monthly Rainfall, 2010 - 2016"


Still working on my first D3 project.

I was confused why some of my CSS rules didn't work, and then finally realized that CSS properties and SVG attributes are not always the same.

For example, to set color for text in CSS, we use:

  p {color: blue;} 

but for SVG, to set color for text, we use:

  text {fill: blue;} 

I also somewhat figured out how to change the axis label with the data, so now the x-axis label changes according to the year displayed. I suppose I still need more practice to get familiar with the method.

I extended the year range for this Singapore Total Rainfall plot, which now displays 2010 - 2016. Data collected from Changi Climate Station, from NEA.


From the data, it seems that the total rainfall per month is less consistent year to year than I thought. Also, the data affirmed my mum's advice: always have an umbrella in your bag.





Dated: 25 March 2017

My first D3 - "SG - Total Monthly Rainfall, 2014 - 2016"


This is my very first D3 project.

D3 stands for Data-Driven Documents; it is a vehicle to drive your data into a document (web page); it is a JavaScript library for creating beautiful interactive data visualizations on the web.


Someone first showed me D3 last year, and then I saw some astonishingly stunning D3 visual art on the web and I fell in love.

I don't have a web coding background, so I started learning all the necessary basics: CSS, HTML, JavaScript, jQuery etc., just so that I can get closer to D3.

This is really a very simple plot. In fact, when I showed it to a friend, she was not impressed, even with the interactive year buttons and dynamically colored bars (with colors based on rainfall volume)! She gently hinted that it is not that much different from an Excel chart.

Hopefully, I can create more interesting D3 plots in the near future, that can really make others go wow.


This is a set of Singapore Total Rainfall record for year 2014 - 2016, collected from Changi Climate Station, from NEA.


Recently, a cute little Sunbird came to our balcony to build a nest and is currently bringing up two baby birds at our balcony. I hope coming April won't rain too much so the baby birds will be safe. That is why I am so obsessed with rainfall "history" recently.






Dated: 20 March 2017

Using Chart.js and CodePen - "SG - Total Monthly Rainfall, 2016"


Using Chart.js, a JavaScript charting library, within CodePen to test Line Tension.

The diagram below shows four different plots, using the same data, with exactly the same settings, except for Colors and Tension.

In Chart.js, Line Tension sets the Bezier curve tension of the line. If the Line Tension is set to 0, we get straight lines. Usually, we set the Line Tension to about 0.5 to get a smoothed line joining the data points. Above 0.5, the lines get more artistic. For example, if the Line Tension is set to, say, 5, the plot becomes an absolutely beautiful piece of abstract art that tells you nothing about the data.

So, be careful of Tension.

Downloaded this set of Singapore Total Rainfall records for 2016, collected from Changi Climate Station, from NEA.


It is now 20 March 2017, and I am wondering why it is raining so much recently, whereas March last year (2016) was so dry. I think it will be interesting to plot the last few years of monthly rainfall together, to see how they differ over the years. Will try that one of these days.



See the Pen ChartJS Test Tension0.5 by Bee (@tbeehoon) on CodePen.



Dated: 17 March 2017

Using Plotly - "LakeVille New Transacted Unit Sale ($psf) Over Time, for 2 bedroom units"


Plotly is interactive: click on the trend lines in the legend box to view the trend for each group.

There are many other Plotly API libraries and online chart settings that I'd like to try. But many of the online chart types or analysis options are not available in the free version. (For example, I wanted to try a time-series chart and a logistic curve fit.)

Downloaded another new private residential project's transacted sale price dataset from the Singapore URA website. The development project is "The LakeVille Condo" by MCL Land.


This time, I looked at the data for 2-bedroom units only, and grouped them by floor: level 1-5, level 6-10, level 11-15, level 16-20.
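
For anyone doing this kind of floor grouping in R rather than in a spreadsheet, base R's cut() can assign the bands. This is only a sketch: the floor numbers below are made-up values for illustration, not the URA data, and the variable names are my own.

```r
# Hypothetical floor levels, for illustration only (not the URA data).
floor_level <- c(1, 5, 6, 10, 11, 15, 16, 20)

# cut() uses right-closed intervals by default:
# (0,5], (5,10], (10,15], (15,20].
floor_group <- cut(
  floor_level,
  breaks = c(0, 5, 10, 15, 20),
  labels = c("level 1-5", "level 6-10", "level 11-15", "level 16-20")
)

# Count how many units fall into each band.
print(table(floor_group))
```

The result is a factor, so it can be used directly as the grouping variable for one trend line per floor band.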

I noticed that despite the softening of the property market, the sale prices for this project's 2-bedroom units remain strong. The trend for the lower floors is quite flat, but the higher-floor units show a general upward trend over the period covered by this dataset. I suppose the sentiment for this property is optimistic due to its proximity to an international school (next door) and the proposed SG-KL High-Speed Rail.

This little analysis was done for a friend, to assure her that her investment for the small unit in this project was not a bad decision.




Dated: 11 March 2017

Using Plotly - "My Dream House New Transacted Unit Sale ($psf) Over Time"


I love Plotly's ease of use, and the plots can be so nice. Plotly is an online analytics and data visualization tool.

Plotly is also interactive: you can click on an option in the legend box to enable or disable it.

(However, there are some limitations for the free version.)

Here, I downloaded a particular new private residential project's transacted sale price dataset from Singapore URA website.


The focus is on 3-bedroom units of this project. This is my dream house.


In the plot, I am using the Triangular Moving Average, which means it is double smoothed. Triangular moving averages are most often applied to the price of an asset.

I am also using Right Alignment for my Moving Average. This is also known as Trailing Moving Average, it means that this moving average is aligned with the last observation. Right Aligned Moving Average is often preferred if you want to use it as a forecast or as a decision point.
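
To show what "double smoothed" and "right aligned" mean in practice, here is a rough sketch in base R. It assumes one common definition of the triangular moving average (a simple moving average of a simple moving average, with the same window for both passes); Plotly's own calculation may differ in detail. The function names and the price series are hypothetical.

```r
# Trailing (right-aligned) simple moving average over the last n points.
trailing_sma <- function(x, n) {
  # sides = 1 makes stats::filter() use only current and past values.
  as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
}

# Triangular moving average, taken here as an SMA of an SMA
# (assumed definition, same window n for both passes).
trailing_tma <- function(x, n) {
  trailing_sma(trailing_sma(x, n), n)
}

price <- 1:10  # hypothetical series, for illustration only
sma3 <- trailing_sma(price, 3)
tma3 <- trailing_tma(price, 3)

print(sma3)
print(tma3)
```

Because each average only looks backwards, the first few values are NA, and the double-smoothed line starts later than the single-smoothed one. That is the price of aligning the average with the last observation.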




Dated: 5 March 2017

Using Chart.js and jsFiddle - "SG Housing Loan Interest Rate 2014-2016"


jsFiddle works in a similar way to CodePen. Comparing the two, I feel that jsFiddle may be better for collaborative work.

I like Chart.js too, which is an HTML5-based JavaScript charting library. The flexibility for customization is nice, but there is quite a bit of learning to get familiar with the parameters available.

Downloaded Singapore Housing Loan Interest Rate data from the Singapore MAS website. The average rates are compiled from the rates quoted by 10 leading Singapore banks and finance companies.

Looking at the past 3 years of Housing Loan Interest Rates, we notice the big jump from 2014's 2.93% to 2016's 3.41%. Hopefully, this rate will remain stable for the next few years.
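
The size of that jump can be read two ways: as a change in percentage points, or as a relative increase in the rate. A small sketch of the arithmetic, using only the two rates quoted above (the variable names are my own):

```r
# Average housing loan rates quoted in the text, in percent.
rate_2014 <- 2.93
rate_2016 <- 3.41

# Absolute change, in percentage points.
abs_change <- rate_2016 - rate_2014

# Relative change, as a percentage of the 2014 rate.
rel_change <- abs_change / rate_2014 * 100

print(round(c(percentage_points = abs_change,
              relative_percent  = rel_change), 2))
```

The difference matters when talking to a banker: a rise of about half a percentage point is a much bigger relative increase in the interest paid.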




Using the same set of data as before, I have included buttons to allow the user to select a specific year.

Chart.js is a JavaScript charting library, so "animation" like the buttons can be added more easily to allow more "interactive" features.


As we click the button to select year to year, the changes between the yearly Interest Rate feel more prominent. You can almost feel the heart pulse going up and down correspondingly.




Dated: 4 March 2017

Using Chart.js and CodePen - "SG Waste Disposed Of And Recycled, Annual"


Chart.js is an HTML5-based JavaScript charting library. It's easy to use and it's free. And it's quite pretty too.

CodePen is a nice way to test out smaller chunks of code. Useful when we are learning a new tool like Chart.js that involves many parameters and options.

Downloaded this set of Singapore Waste Disposed of and Recycled data from NEA.


Happy to see that our country has been Recycling more for the past few years; but waste disposed of has not really reduced. So, besides recycling, we probably need to do more Reducing and Reusing.


See the Pen ChartJS Test1 by Bee (@tbeehoon) on CodePen.