# Understanding the Kiva Dataset

Before predicting anything from the data Kiva makes public, we want a clearer picture of what the dataset looks like.

Our first step is to look at the schema. Spark SQL will make the data easy to query later, but first we need to know which fields are available.

In [1]:
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session for the exploration.
sparkSql = (SparkSession.builder
         .master("local")
         .appName("Kiva Exploration")
         .getOrCreate())

loans = sparkSql.read.format('json').load('kiva-data/loans.json')
lenders = sparkSql.read.format('json').load('kiva-data/lenders.json')
loans_lenders = sparkSql.read.format('json').load('kiva-data/loans_lenders.json')

In [2]:
loans.printSchema()

root
 |-- activity: string (nullable = true)
 |-- basket_amount: long (nullable = true)
 |-- bonus_credit_eligibility: boolean (nullable = true)
 |-- borrowers: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- first_name: string (nullable = true)
 |    |    |-- gender: string (nullable = true)
 |    |    |-- last_name: string (nullable = true)
 |    |    |-- pictured: boolean (nullable = true)
 |-- currency_exchange_loss_amount: double (nullable = true)
 |-- delinquent: boolean (nullable = true)
 |-- description: struct (nullable = true)
 |    |-- languages: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- texts: struct (nullable = true)
 |    |    |-- ar: string (nullable = true)
 |    |    |-- en: string (nullable = true)
 |    |    |-- es: string (nullable = true)
 |    |    |-- fr: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- mn: string (nullable = true)
 |    |    |--
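
The printed schema is deeply nested, so it can help to flatten it into a list of dotted column paths before deciding what to query. The following is a minimal sketch, not part of the original notebook. It assumes the schema is available as the plain dictionary that `loans.schema.jsonValue()` returns; `flatten_schema` and `sample_schema` are hypothetical names, and the sample covers only a handful of the fields shown above so that it runs without Spark.

```python
def flatten_schema(dtype, prefix=""):
    """Return (path, type) pairs for every leaf field in a schema dict."""
    # Primitive types appear as plain strings such as "string" or "long".
    if isinstance(dtype, str):
        return [(prefix, dtype)]
    if dtype["type"] == "struct":
        paths = []
        for field in dtype["fields"]:
            name = prefix + "." + field["name"] if prefix else field["name"]
            paths.extend(flatten_schema(field["type"], name))
        return paths
    if dtype["type"] == "array":
        # Mark array nesting with [] to show that the field repeats.
        return flatten_schema(dtype["elementType"], prefix + "[]")
    # Other complex types (e.g. maps) are reported without descending.
    return [(prefix, dtype["type"])]


# Hand-written excerpt mirroring a few fields from the printed schema.
# With a live session this would be: sample_schema = loans.schema.jsonValue()
sample_schema = {
    "type": "struct",
    "fields": [
        {"name": "activity", "type": "string", "nullable": True, "metadata": {}},
        {"name": "borrowers", "nullable": True, "metadata": {},
         "type": {"type": "array", "containsNull": True,
                  "elementType": {"type": "struct", "fields": [
                      {"name": "first_name", "type": "string",
                       "nullable": True, "metadata": {}},
                      {"name": "pictured", "type": "boolean",
                       "nullable": True, "metadata": {}},
                  ]}}},
        {"name": "description", "nullable": True, "metadata": {},
         "type": {"type": "struct", "fields": [
             {"name": "languages", "nullable": True, "metadata": {},
              "type": {"type": "array", "containsNull": True,
                       "elementType": "string"}},
         ]}},
    ],
}

columns = flatten_schema(sample_schema)
for path, type_name in columns:
    print(path, type_name)
```

Struct paths such as `description.languages` can be used directly in Spark column expressions; the `[]` marker is only a reading aid for array fields and is not valid Spark syntax.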

In [3]:
loans.groupby(loans.status).count().collect()

[Row(status=u'refunded', count=5504),
 Row(status=u'defaulted', count=21776),
 Row(status=u'in_repayment', count=155749),
 Row(status=u'reviewed', count=3),
 Row(status=u'deleted', count=2721),
 Row(status=u'paid', count=775330),
 Row(status=u'issue', count=199),
 Row(status=u'inactive_expired', count=12421),
 Row(status=u'fundraising', count=3986),
 Row(status=u'expired', count=33773),
 Row(status=u'inactive', count=2493),
 Row(status=u'funded', count=173),
 Row(status=u'', count=2)]

In [4]:
loans.groupby(loans.delinquent).count().collect()

[Row(delinquent=None, count=970465), Row(delinquent=True, count=43665)]

In [6]:
loans.where(loans.delinquent == True).groupby(loans.status).count().collect()

[Row(status=u'refunded', count=156),
 Row(status=u'defaulted', count=20116),
 Row(status=u'in_repayment', count=23393)]
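
To put these counts in proportion, the delinquent rows can be compared with the per-status totals from the earlier cell. The snippet below is a small worked calculation in plain Python, using only the numbers printed above; the variable names are illustrative and not part of the original notebook.

```python
# Per-status totals copied from the output of In [3].
status_counts = {"refunded": 5504, "defaulted": 21776, "in_repayment": 155749}

# Delinquent loans per status, copied from the output of In [6].
delinquent_by_status = {"refunded": 156, "defaulted": 20116, "in_repayment": 23393}

# Overall totals from the output of In [4] (None plus True).
total_loans = 970465 + 43665
total_delinquent = sum(delinquent_by_status.values())

# Share of all loans that carry the delinquent flag.
overall_share = total_delinquent / float(total_loans)

# Share of each status that carries the delinquent flag.
delinquent_share = {
    status: delinquent_by_status[status] / float(status_counts[status])
    for status in delinquent_by_status
}

print("overall: {:.1%}".format(overall_share))
for status, share in sorted(delinquent_share.items()):
    print("{}: {:.1%}".format(status, share))
```

On these figures, roughly 4% of all loans are flagged delinquent, and the flag covers about 92% of `defaulted` loans but only about 15% of `in_repayment` loans.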

In [19]:
# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it.
loans.createOrReplaceTempView('loans')
sparkSql.sql('''
SELECT loans.status
FROM loans
LIMIT 1
''').collect()

[Row(status=u'in_repayment')]