# Getting Started in Python3

EdgeLake is an open data exploration service for data science, analytics, and technology experimentation.

This tutorial assumes some programming experience.

## Uploading Data

A private data lake stores the data that the applications use. To upload data to the data lake, you would typically:
1. Upload data to the home directory of the data lake using commands like "-put".
2. Run commands like "-ls" to confirm that the data was uploaded to the data lake.

```
bdl -ls /
16/09/26 17:18:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
drwxrwxrwx   - hdfs supergroup          0 2018-09-15 13:12 bdl://demo.demoproject/baseball
drwxrwxrwt   - hdfs supergroup          0 2018-09-15 12:08 bdl:///demo.demoproject/tmp
drwxr-xr-x   - hdfs supergroup          0 2018-09-15 12:08 bdl://demo.demoproject/tmp/user
```

Data can also be uploaded to the data lake through the File Browser tab in the user interface.

To download data from the internet to local storage, run the following command:

In [None]:
%%bash
wget http://www.exploredata.net/ftp/MLB2008.csv

The "-put" command uploads a file from local storage to the cloud object storage associated with a particular project. The object storage path has the form bdl://data_pool_name.project_name/

In [None]:
%%bash
/opt/bigstepdatalake-0.10.4/bin/bdl -put MLB2008.csv bdl://demo.demoproject/
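
The upload and verify steps above can also be assembled from Python. The following is a minimal sketch, assuming the `bdl` binary lives at the path used in this tutorial. The helper names `bdl_uri`, `put_command`, and `ls_command` are hypothetical and not part of any EdgeLake API. The sketch only builds the argument lists and does not execute anything.

```python
BDL_BIN = "/opt/bigstepdatalake-0.10.4/bin/bdl"  # path used in this tutorial


def bdl_uri(data_pool, project, path=""):
    """Build bdl://data_pool_name.project_name/<path> (hypothetical helper)."""
    return "bdl://{}.{}/{}".format(data_pool, project, path.lstrip("/"))


def put_command(local_file, data_pool, project):
    """Argument list that uploads a local file to the project's root."""
    return [BDL_BIN, "-put", local_file, bdl_uri(data_pool, project)]


def ls_command(path="/"):
    """Argument list that lists a path, to confirm the upload."""
    return [BDL_BIN, "-ls", path]


print(" ".join(put_command("MLB2008.csv", "demo", "demoproject")))
print(" ".join(ls_command()))
```

On a machine where `bdl` is installed, each list could be passed to `subprocess.run(cmd, check=True)` so that a failed upload raises an error instead of passing silently.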

## Initialize Spark Context

For Spark functions to be available, a Spark context has to be initialized in the current notebook. Set a name for your Spark application and choose the Spark Master that should coordinate the application's execution. You can retrieve the Spark Master address from the Spark tab of the Spark cluster or from the Spark Web UI.

In [None]:
from pyspark import SparkContext, SparkConf
master = "spark://demoproject-test-spark-bdl-spark-master.demoproject.svc.cluster.local:7077"
conf = SparkConf().setAppName("test-app").setMaster(master)
sc = SparkContext.getOrCreate(conf=conf)

## RDDs

A Resilient Distributed Dataset (RDD) is a collection of records that is partitioned across multiple servers. It allows the programmer to abstract away the complexity of transforming large volumes of distributed data.
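
To make the idea of a partitioned dataset concrete, here is a toy model in plain Python. It is not Spark: each inner list stands in for the slice of data held by one server, the sample rows are made up, and the function names `rdd_filter`, `rdd_count`, and `rdd_collect` are hypothetical. It only illustrates that an operation is applied to every partition independently and the partial results are then combined.

```python
# Toy model only: each inner list plays the role of one server's partition.
partitions = [
    ["ruthba01,1933", "gomezle01,1933"],
    ["gehrilo01,1933", "ruthba01,1934"],
    ["foxxji01,1934"],
]


def rdd_filter(parts, predicate):
    """Apply the predicate inside every partition independently."""
    return [[row for row in part if predicate(row)] for part in parts]


def rdd_count(parts):
    """Each partition counts its own rows; the partial counts are summed."""
    return sum(len(part) for part in parts)


def rdd_collect(parts):
    """Bring every partition's rows back into one local list."""
    return [row for part in parts for row in part]


with_ruth = rdd_filter(partitions, lambda row: "ruth" in row)
print(rdd_count(partitions))   # 5
print(rdd_count(with_ruth))    # 2
print(rdd_collect(with_ruth))  # ['ruthba01,1933', 'ruthba01,1934']
```

With a real RDD the caller writes only the lambda; Spark decides how the data is split and where each piece of work runs.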

In [None]:
%%bash
wget http://seanlahman.com/files/database/baseballdatabank-master_2016-03-02.zip

apt-get install -y unzip
unzip baseballdatabank-master_2016-03-02.zip
rm -rf baseballdatabank-master_2016-03-02.zip

/opt/bigstepdatalake-0.10.4/bin/bdl -put /lentiq/notebooks/notebooks/baseballdatabank-master/core/AllstarFull.csv \
bdl://demo.demoproject/

In [None]:
textFile = sc.textFile("bdl://demo.demoproject/AllstarFull.csv")

In [None]:
textFile.count()

In [None]:
textFile.first()

In [None]:
linesWithRuth = textFile.filter(lambda line: "ruth" in line)

In [None]:
linesWithRuth.count()

In [None]:
linesWithRuth.collect()

In [None]:
import time
linesWithRuth.saveAsTextFile("/lines_with_ruth-"+str(time.time()))

## DataFrames and SparkSQL

Using the SparkSQL module, SQL queries can be run directly on files stored in the data lake in various formats, such as Parquet, once a SparkSQL context has been initialized. The type that holds data for SQL is the DataFrame, which is built on top of RDDs.

In [None]:
# initialize SparkSQL context

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("demoNotebookSparkSQL").getOrCreate()
sc = spark.sparkContext

Spark 2.4.0 has a built-in CSV reader:

In [None]:
allstar = spark.read.csv("bdl://demo.demoproject/AllstarFull.csv", header=True)

In [None]:
type(allstar)

In [None]:
allstar.show()

In [None]:
# Register the DataFrame as a temporary view so it can be queried with SQL.
allstar.createOrReplaceTempView("allstar")

# SQL can be run over DataFrames that have been registered as a table.
player = spark.sql("SELECT * FROM allstar WHERE playerID like '%ruth%' and yearID<1935")

In [None]:
player.show()
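
To see what the query selects without a cluster, the same SQL can be tried on a few rows with Python's built-in `sqlite3` module. This is a stand-in for SparkSQL, not the engine used in this notebook, and the sample rows are made up for illustration. Only the table name and the column names `playerID` and `yearID` come from the query above.

```python
import sqlite3

# Made-up sample rows for illustration only.
rows = [
    ("ruthba01", 1933),
    ("ruthba01", 1934),
    ("ruthxx01", 1950),   # hypothetical ID: matches '%ruth%' but fails the year test
    ("gehrilo01", 1933),  # fails the playerID test
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE allstar (playerID TEXT, yearID INTEGER)")
conn.executemany("INSERT INTO allstar VALUES (?, ?)", rows)

result = conn.execute(
    "SELECT * FROM allstar "
    "WHERE playerID LIKE '%ruth%' AND yearID < 1935 "
    "ORDER BY yearID"
).fetchall()
conn.close()

print(result)  # [('ruthba01', 1933), ('ruthba01', 1934)]
```

One difference to keep in mind: `LIKE` in SQLite ignores ASCII case by default, while `LIKE` in SparkSQL is case-sensitive, so the two engines can disagree on mixed-case data.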

## Parquet Files

Parquet files are typically much faster to read and take up less space than CSVs, and the data lake supports them as well. Because Spark is a cluster computing system, a Parquet dataset is made up of many fragments generated independently by the workers. SparkSQL operates on the collection of files as a single big table.

To write the DataFrame:

In [None]:
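import time

# A timestamp in the path keeps repeated runs from colliding with earlier output.
path = "bdl://demo.demoproject/allstar-" + str(time.time()) + ".parquet"
# save() writes in Spark's default format, which is Parquet unless configured otherwise.
player.write.save(path)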

To read the DataFrame back:

In [None]:
dfParquet = spark.read.parquet(path)
dfParquet.createOrReplaceTempView("player")
spark.sql("SELECT playerID,YearID FROM player").show()

## External packages

Many packages are included in the standard offering. However, different use cases require different packages for advanced analytics, machine learning operations, or complex visualizations.

As a result, EdgeLake offers a simple way to add custom packages to a specific notebook, either by downloading them from the included library or by installing them as follows:

In [None]:
%%bash 
pip install plotly
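
After installing a package, it can be useful to confirm that the notebook's interpreter can actually find it. The following is a small sketch using the standard library. The helper name `is_installed` is hypothetical, and it is meant for top-level module names only.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the current interpreter can locate the top-level module."""
    return importlib.util.find_spec(module_name) is not None


for name in ("json", "plotly"):
    print(name, "available" if is_installed(name) else "missing")
```

If a freshly installed package still shows as missing, the `pip` on the shell path may belong to a different Python environment than the notebook kernel.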

## Advanced visualization using Plotly

Graphs help every data science team create and disseminate a powerful story built on the data they are analyzing.

Beautiful graphs can be created with either the default visualization libraries or external ones. As an example, a standard Plotly example gives a sneak peek at US flight paths from February 2011:

In [None]:
import pandas as pd
import plotly
from plotly.offline import iplot

plotly.offline.init_notebook_mode()

df_airports = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/2011_february_us_airport_traffic.csv')
df_airports.head()

df_flight_paths = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/2011_february_aa_flight_paths.csv')
df_flight_paths.head()

airports = [ dict(
        type = 'scattergeo',
        locationmode = 'USA-states',
        lon = df_airports['long'],
        lat = df_airports['lat'],
        hoverinfo = 'text',
        text = df_airports['airport'],
        mode = 'markers',
        marker = dict( 
            size=2, 
            color='rgb(255, 0, 0)',
            line = dict(
                width=3,
                color='rgba(68, 68, 68, 0)'
            )
        ))]
        
flight_paths = []
for i in range( len( df_flight_paths ) ):
    flight_paths.append(
        dict(
            type = 'scattergeo',
            locationmode = 'USA-states',
            lon = [ df_flight_paths['start_lon'][i], df_flight_paths['end_lon'][i] ],
            lat = [ df_flight_paths['start_lat'][i], df_flight_paths['end_lat'][i] ],
            mode = 'lines',
            line = dict(
                width = 1,
                color = 'red',
            ),
            opacity = float(df_flight_paths['cnt'][i])/float(df_flight_paths['cnt'].max()),
        )
    )
    
layout = dict(
        title = 'Feb. 2011 American Airline flight paths<br>(Hover for airport names)',
        showlegend = False, 
        height = 800,
        geo = dict(
            scope='north america',
            projection=dict( type='azimuthal equal area' ),
            showland = True,
            landcolor = 'rgb(243, 243, 243)',
            countrycolor = 'rgb(204, 204, 204)',
        ),
    )
    
fig = dict( data=flight_paths + airports, layout=layout )

plotly.offline.iplot(fig)


## Resources

[Apache Spark 2.4.0 Programming Guide](http://spark.apache.org/docs/latest/sql-programming-guide.html)
