Aula 08: Numpy/Scipy/Pandas/Matplotlib¶

Python se tornou uma linguagem de programação amplamente usada em meios científicos. Em algumas áreas se tornou a lingua franca. Porque e como isso ocorreu.

Tradicionalmente, a linguagem usada em computação científica é o FORTRAN, mais especificamente o FORTRAN 77. Também é usado o C/C++ e até hoje essa é a base de quase todas as ferramentas que se utiliza em computação científica. Essas são linguagens compiladas sem interatividade (não estritamente verdade) e o fluxo de trabalho era (ou é...) o seguinte:

Escreva um programa
Compile o programa
Escreva um arquivo com os dados de entrada
Execute o programa com os dados de saída
O programa escreve seus resultados em um arquivo de saída
Use um outro programa para fazer gráficos e analisar os dados
Volte para 1 até dar certo

Estamos na última aula do curso de python e nunca fizemos nada parecido. Com python e seus ambientes interativos (tanto o terminal quanto o jupyter) fazem esta tarefa ser muito mais fácil, rápido e integrado.

Mas aí temos alguns problemas:

Python, sozinho, não tem nenhuma das ferramentas necessárias:
- Matrizes
- Álgebra linear
- Equações diferenciais
- Plotar gráficos
- etc, etc, etc
Python é lento

Este era o estado das coisas em 2000 (ou um pouco antes).

Qualquer aplicação científica ou de ciência de dados precisa tratar de matrizes e álgebra linear. Isso é tão importante que existem bibliotecas básicas para isso com implementações padrão:

TODO mundo usa as bibliotecas acima. Se não usa deveria usar.

MATLAB e outros¶

Um pessoal, por volta de 1980 criou um software que fazia uma interface interativa para essas bibliotecas. Esse software é o MATLAB - MATrix LABoratory.

Mas o MATLAB, apesar de todo o sucesso, tem alguns problemas

É caro
É muito caro
Também é lento (para algumas coisas)
A linguagem de programação parece uma gambiarra
Existem limitações (legais) ao que se pode fazer

Outras ferramentas semelhantes surgiram que competem com MATLAB ou focam em outras áreas

Mathematica
IDL
S-PLUS
e outras, muitas outras...

Python como uma linguagem para aplicações matemáticas¶

No final dos anos 90, python estava se tornando uma linguagem de script popular por ser uma linguagem simples, amigável e fácil de aprender. Algumas pessoas acharam interessante usar python para aplicações matemáticas e numéricas. Com isso, começaram a criar extensões que permitissem isso. Após alguns anos e alguns traumas (numeric vs numarray vs numpy) chegou-se a algumas ferramentas básicas que permitiu todo o desenvolvimento posterior:

Numpy: operação com arrays uniformes multidimensionais
Scipy: algoritmos numéricos básicos
Matplotlib: gráficos com qualidade
Pandas: manipulação de dados e tabelas

Porque python se tornou tão popular nessa área? Porque resolve problemas! Na minha opinião, python se tornou popular pelos seguintes motivos

Software livre: você usa como quiser e onde quiser e não precisa pagar preços exorbitantes
Base sólida (numpy+scipy+matplotlib+pandas)
Linguagem de programação de verdade (não uma gambiarra)
Linguagem de programação agradável
Extensível: dá para você interfacear com C/C++/Fortran sem grandes dificuldades
Interage bem com a Internet - importante para a ciência de dados

Numpy¶

Numpy é uma biblioteca Python para trabalhar com matrizes multidimensionais.

Instalação¶

Se você está usando a distribuição de Python Anaconda, numpy já vem. Abra um terminal, inicie o Python e execute o seguinte comando:

import numpy

Se não deu erro, quer dizer o Numpy está instalado.

Caso contrário, execute o comando a seguir num terminal (não do Python, o prompt ou powershell)

python -m pip install numpy

Isso vai baixar o numpy e instalar.

Executando o numpy¶

Para usar o numpy, você primeiro precisa carregar a biblioteca. Para isso existem algumas opções:

import numpy
import numpy as np
from numpy import *

A primeira opção tem um problema: tem que digitar numpy. na frente de tudo. A última opção você acessa diretamente a funcionalidade mas vai poluir o teu namespace. Eu prefiro a opção do meio e é o que vou utilizar.

In [1]:

import numpy as np

O tipo básico do numpy é o ndarray

In [2]:

?np.ndarray

O ndarray é usado para representar uma matriz multidimensional homogênea com ítens de tamanho fixo.

Criando objetos `ndarray`¶

Dificilmente vamos usar o ndarray diretamente.

A função array cria um ndarray a partir de listas tuplas e coisas parecidas:

In [3]:

x = np.array(range(30))

In [4]:

type(x)

Out[4]:

numpy.ndarray

In [5]:

x[0]

Out[5]:

In [6]:

x[10:15]

Out[6]:

array([10, 11, 12, 13, 14])

In [7]:

x.size

Out[7]:

In [8]:

# Dá para fatiar!
y = x[3:15:3]
y

Out[8]:

array([ 3,  6,  9, 12])

In [9]:

# y faz referência à mesma memória que x
y[0] = 123
y[0], x[3]

Out[9]:

(123, 123)

In [10]:

x.strides

Out[10]:

(8,)

In [11]:

y.strides

Out[11]:

(24,)

In [12]:

x.dtype

Out[12]:

dtype('int64')

Usando a mesma sintaxe, é possível criar matrizes:

In [13]:

mat = np.array([[1,2,3], [4,5,6]])  # Lista de listas

In [14]:

mat

Out[14]:

array([[1, 2, 3],
       [4, 5, 6]])

In [15]:

mat[0,0]

Out[15]:

In [16]:

mat[1,0]

Out[16]:

In [17]:

mat[0,:]

Out[17]:

array([1, 2, 3])

In [18]:

mat[:,0]

Out[18]:

array([1, 4])

In [19]:

mat.size

Out[19]:

In [20]:

mat.ndim

Out[20]:

In [21]:

mat.shape

Out[21]:

(2, 3)

In [22]:

mat.dtype

Out[22]:

dtype('int64')

In [23]:

# Array 3D
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

In [24]:

arr

Out[24]:

array([[[1, 2, 3],
        [4, 5, 6]],

       [[1, 2, 3],
        [4, 5, 6]]])

In [25]:

arr.ndim

Out[25]:

In [26]:

arr.size

Out[26]:

In [27]:

arr.shape

Out[27]:

(2, 2, 3)

In [28]:

arr

Out[28]:

array([[[1, 2, 3],
        [4, 5, 6]],

       [[1, 2, 3],
        [4, 5, 6]]])

In [29]:

# Indexando
arr[0,0,0]

Out[29]:

In [30]:

arr[1,0,0]

Out[30]:

In [31]:

arr[0,1,0]

Out[31]:

In [32]:

arr[0]

Out[32]:

array([[1, 2, 3],
       [4, 5, 6]])

In [33]:

arr[0,0]

Out[33]:

array([1, 2, 3])

In [34]:

arr.strides

Out[34]:

(48, 24, 8)

Melhor jeito para criar os `ndarray`s¶

In [35]:

np.zeros( (3,4) )

Out[35]:

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [36]:

np.zeros( (2,3,4), dtype='int64')

Out[36]:

array([[[0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]],

       [[0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]])

In [37]:

np.ones((2,3,4))

Out[37]:

array([[[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]])

In [38]:

np.ones((2,3,4), dtype='int64')

Out[38]:

array([[[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]],

       [[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]]])

Arrays igualmente espaçados¶

In [39]:

np.arange(0, 1, 0.01)

Out[39]:

array([0.  , 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ,
       0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 , 0.21,
       0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 , 0.31, 0.32,
       0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 , 0.41, 0.42, 0.43,
       0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5 , 0.51, 0.52, 0.53, 0.54,
       0.55, 0.56, 0.57, 0.58, 0.59, 0.6 , 0.61, 0.62, 0.63, 0.64, 0.65,
       0.66, 0.67, 0.68, 0.69, 0.7 , 0.71, 0.72, 0.73, 0.74, 0.75, 0.76,
       0.77, 0.78, 0.79, 0.8 , 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87,
       0.88, 0.89, 0.9 , 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98,
       0.99])

In [40]:

np.linspace(0,1,num=11)

Out[40]:

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])

In [41]:

np.logspace(1,10, num=11)

Out[41]:

array([1.00000000e+01, 7.94328235e+01, 6.30957344e+02, 5.01187234e+03,
       3.98107171e+04, 3.16227766e+05, 2.51188643e+06, 1.99526231e+07,
       1.58489319e+08, 1.25892541e+09, 1.00000000e+10])

In [42]:

np.geomspace(1,100,num=4)

Out[42]:

array([  1.        ,   4.64158883,  21.5443469 , 100.        ])

Os objetos `ndarray` têm um monte de métodos interessantes¶

Soma de tudo quanto é tipo¶

In [43]:

x = np.array(range(1,13))
x

Out[43]:

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

In [44]:

y = x.reshape(3,4)
y

Out[44]:

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [45]:

y.sum()

Out[45]:

In [46]:

y.sum(0)

Out[46]:

array([15, 18, 21, 24])

In [47]:

y.sum(1)

Out[47]:

array([10, 26, 42])

In [48]:

z = np.array(range(1,25)).reshape(2,3,4)
z

Out[48]:

array([[[ 1,  2,  3,  4],
        [ 5,  6,  7,  8],
        [ 9, 10, 11, 12]],

       [[13, 14, 15, 16],
        [17, 18, 19, 20],
        [21, 22, 23, 24]]])

In [49]:

z.sum()

Out[49]:

In [50]:

z.sum(0)

Out[50]:

array([[14, 16, 18, 20],
       [22, 24, 26, 28],
       [30, 32, 34, 36]])

In [51]:

z.sum(1)

Out[51]:

array([[15, 18, 21, 24],
       [51, 54, 57, 60]])

In [52]:

z.sum(2)

Out[52]:

array([[10, 26, 42],
       [58, 74, 90]])

In [53]:

z.sum((0,1))

Out[53]:

array([66, 72, 78, 84])

In [54]:

z.sum((0,2))

Out[54]:

array([ 68, 100, 132])

In [55]:

z.sum((1,2))

Out[55]:

array([ 78, 222])

Média e desvio padrão¶

In [56]:

z.mean()

Out[56]:

12.5

In [57]:

z.mean(0)

Out[57]:

array([[ 7.,  8.,  9., 10.],
       [11., 12., 13., 14.],
       [15., 16., 17., 18.]])

In [58]:

z.mean(1)

Out[58]:

array([[ 5.,  6.,  7.,  8.],
       [17., 18., 19., 20.]])

In [59]:

z.std()

Out[59]:

6.922186552431729

In [60]:

z.std(0)

Out[60]:

array([[6., 6., 6., 6.],
       [6., 6., 6., 6.],
       [6., 6., 6., 6.]])

Máximos e mínimos¶

In [61]:

z.max()

Out[61]:

In [62]:

z.max(0)

Out[62]:

array([[13, 14, 15, 16],
       [17, 18, 19, 20],
       [21, 22, 23, 24]])

In [63]:

z.min(1)

Out[63]:

array([[ 1,  2,  3,  4],
       [13, 14, 15, 16]])

Números aleatórios¶

In [64]:

import numpy.random as rnd

In [65]:

x = rnd.randn(1000)

In [66]:

x.mean()

Out[66]:

0.014018602934028006

In [67]:

x.std()

Out[67]:

0.9904797535277062

In [68]:

import matplotlib.pyplot as plt

In [69]:

plt.plot(x)

Out[69]:

[<matplotlib.lines.Line2D at 0x7f051936ea60>]

In [70]:

plt.plot(x, "o")

Out[70]:

[<matplotlib.lines.Line2D at 0x7f0519254490>]

In [71]:

plt.hist(x)

Out[71]:

(array([  7.,  27.,  85., 161., 235., 237., 165.,  64.,  15.,   4.]),
 array([-3.09730862, -2.46438094, -1.83145326, -1.19852558, -0.56559791,
         0.06732977,  0.70025745,  1.33318513,  1.9661128 ,  2.59904048,
         3.23196816]),
 <BarContainer object of 10 artists>)

In [72]:

y = rnd.gumbel(size=1000)

In [73]:

plt.hist(y)

Out[73]:

(array([ 38., 218., 322., 217., 115.,  50.,  28.,   6.,   4.,   2.]),
 array([-2.1885189 , -1.25522222, -0.32192555,  0.61137113,  1.54466781,
         2.47796448,  3.41126116,  4.34455783,  5.27785451,  6.21115118,
         7.14444786]),
 <BarContainer object of 10 artists>)

Operações vetorizadas e broadcasting¶

In [74]:

x = np.linspace(0,1,num=51)
y = np.sin(2*np.pi*x)

In [75]:

plt.plot(x,y)

Out[75]:

[<matplotlib.lines.Line2D at 0x7f051911c1f0>]

In [76]:

plt.plot(x, x**2*np.cos(6*x) + np.exp(x))

Out[76]:

[<matplotlib.lines.Line2D at 0x7f05190f3d90>]

In [77]:

x + 1

Out[77]:

array([1.  , 1.02, 1.04, 1.06, 1.08, 1.1 , 1.12, 1.14, 1.16, 1.18, 1.2 ,
       1.22, 1.24, 1.26, 1.28, 1.3 , 1.32, 1.34, 1.36, 1.38, 1.4 , 1.42,
       1.44, 1.46, 1.48, 1.5 , 1.52, 1.54, 1.56, 1.58, 1.6 , 1.62, 1.64,
       1.66, 1.68, 1.7 , 1.72, 1.74, 1.76, 1.78, 1.8 , 1.82, 1.84, 1.86,
       1.88, 1.9 , 1.92, 1.94, 1.96, 1.98, 2.  ])

In [78]:

x * 2

Out[78]:

array([0.  , 0.04, 0.08, 0.12, 0.16, 0.2 , 0.24, 0.28, 0.32, 0.36, 0.4 ,
       0.44, 0.48, 0.52, 0.56, 0.6 , 0.64, 0.68, 0.72, 0.76, 0.8 , 0.84,
       0.88, 0.92, 0.96, 1.  , 1.04, 1.08, 1.12, 1.16, 1.2 , 1.24, 1.28,
       1.32, 1.36, 1.4 , 1.44, 1.48, 1.52, 1.56, 1.6 , 1.64, 1.68, 1.72,
       1.76, 1.8 , 1.84, 1.88, 1.92, 1.96, 2.  ])

Outros tipos de indexações¶

Indexando usando variáveis lógicas¶

In [79]:

x = rnd.randn(10)
x

Out[79]:

array([ 0.77997741,  1.04347792, -2.54628676,  1.24286073, -1.69926903,
       -1.24987317,  2.38368852, -0.19999258,  1.58995438,  0.4589341 ])

In [80]:

positivo = x > 0
positivo

Out[80]:

array([ True,  True, False,  True, False, False,  True, False,  True,
        True])

In [81]:

x[positivo]

Out[81]:

array([0.77997741, 1.04347792, 1.24286073, 2.38368852, 1.58995438,
       0.4589341 ])

In [82]:

x[~positivo]

Out[82]:

array([-2.54628676, -1.69926903, -1.24987317, -0.19999258])

In [83]:

y = rnd.randn(5,8)
y[y > 0]

Out[83]:

array([1.51323665e+00, 6.60425522e-01, 7.81835392e-02, 1.76947247e-01,
       2.71410749e-01, 3.87662914e-01, 6.00594630e-01, 7.12319894e-01,
       1.44725779e+00, 1.34976924e+00, 8.69802835e-02, 9.76557367e-01,
       2.84019557e-05, 5.88129534e-01, 1.10595408e+00, 4.94789833e-01,
       2.29261008e-01, 1.37328334e+00])

Já tínhamos visto fatiamento¶

In [84]:

x = np.array(range(50))

In [85]:

x[ [10,15,20]]

Out[85]:

array([10, 15, 20])

Álgebra linear¶

In [86]:

A0 = rnd.randn(5,5)
A0

Out[86]:

array([[ 0.61369072, -0.74715209, -1.18625723, -0.09583723, -1.42886673],
       [-0.67965437,  1.63275252, -2.5878914 ,  0.59475804, -1.02750672],
       [ 1.07161319, -0.51377731, -0.45160769,  0.59791653, -2.45569912],
       [-0.94591804, -0.0476593 ,  0.7482287 ,  2.31135166,  0.00738173],
       [ 0.06073121, -1.49248878,  1.10102178,  1.19930112, -1.74138814]])

In [87]:

# Matriz diagonal com diagonal = 10
A1 = np.diag(10*np.ones(5))
A1

Out[87]:

array([[10.,  0.,  0.,  0.,  0.],
       [ 0., 10.,  0.,  0.,  0.],
       [ 0.,  0., 10.,  0.,  0.],
       [ 0.,  0.,  0., 10.,  0.],
       [ 0.,  0.,  0.,  0., 10.]])

In [88]:

A = A0 + A1

In [89]:

b = np.ones(5)

In [90]:

import numpy.linalg as linalg

In [91]:

# Resolvendo o sistema linear A*x = b
x = linalg.solve(A, b)
x

Out[91]:

array([0.13280483, 0.12639701, 0.12080103, 0.08450851, 0.11457434])

In [92]:

np.dot(A,x)

Out[92]:

array([1., 1., 1., 1., 1.])

In [93]:

A.dot(x)

Out[93]:

array([1., 1., 1., 1., 1.])

In [94]:

# Matriz transposta
A.T

Out[94]:

array([[ 1.06136907e+01, -6.79654373e-01,  1.07161319e+00,
        -9.45918036e-01,  6.07312127e-02],
       [-7.47152094e-01,  1.16327525e+01, -5.13777312e-01,
        -4.76593049e-02, -1.49248878e+00],
       [-1.18625723e+00, -2.58789140e+00,  9.54839231e+00,
         7.48228700e-01,  1.10102178e+00],
       [-9.58372280e-02,  5.94758040e-01,  5.97916531e-01,
         1.23113517e+01,  1.19930112e+00],
       [-1.42886673e+00, -1.02750672e+00, -2.45569912e+00,
         7.38173448e-03,  8.25861186e+00]])

In [95]:

# Inversa da matriz
iA = linalg.inv(A)

In [96]:

np.dot(iA,A)

Out[96]:

array([[ 1.00000000e+00,  0.00000000e+00,  6.93889390e-18,
         0.00000000e+00, -5.55111512e-17],
       [-7.37257477e-18,  1.00000000e+00, -3.12250226e-17,
        -6.93889390e-18, -2.77555756e-17],
       [ 1.30104261e-18, -1.38777878e-17,  1.00000000e+00,
         6.93889390e-18, -2.77555756e-17],
       [-4.73322011e-18, -6.50521303e-19, -1.31459513e-17,
         1.00000000e+00, -2.16840434e-19],
       [-8.67361738e-19,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  1.00000000e+00]])

In [97]:

# Determinante da matriz
linalg.det(A)

Out[97]:

119920.46580368283

In [98]:

1 / linalg.det(iA)

Out[98]:

119920.46580368285

In [99]:

# Vamos criar uma matriz simétrica
B = A + A.T

In [100]:

Out[100]:

array([[21.22738143, -1.42680647, -0.11464403, -1.04175526, -1.36813551],
       [-1.42680647, 23.26550503, -3.10166871,  0.54709873, -2.5199955 ],
       [-0.11464403, -3.10166871, 19.09678463,  1.34614523, -1.35467734],
       [-1.04175526,  0.54709873,  1.34614523, 24.62270331,  1.20668286],
       [-1.36813551, -2.5199955 , -1.35467734,  1.20668286, 16.51722371]])

In [101]:

linalg.cholesky(B)

Out[101]:

array([[ 4.60731825,  0.        ,  0.        ,  0.        ,  0.        ],
       [-0.30968264,  4.81348124,  0.        ,  0.        ,  0.        ],
       [-0.02488303, -0.64597209,  4.32190762,  0.        ,  0.        ],
       [-0.22610881,  0.09911262,  0.32498218,  4.94531509,  0.        ],
       [-0.29694834, -0.54263331, -0.39625842,  0.26734372,  3.98824536]])

Scipy ¶

Pacote construído sobre o numpy que tem um monte de funcionalidade.

Em particular essa biblioteca tem as seguintes áreas:

Álgebra linear¶

Bem mais completo que o numpy!

In [102]:

import scipy.linalg as lin

In [103]:

# Problema de autovalor
λ, ϕ = lin.eig(B)

In [104]:

Λ = np.diag(λ)

In [105]:

i = 3
np.abs(B.dot(ϕ[:,i]) - λ[i] * ϕ[:,i]).max()

Out[105]:

3.552713678800501e-15

In [106]:

# Decomposição LU
P,L,U = lin.lu(A)

In [107]:

Out[107]:

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

In [108]:

Out[108]:

array([[ 1.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [-0.06403563,  1.        ,  0.        ,  0.        ,  0.        ],
       [ 0.10096518, -0.03783724,  1.        ,  0.        ,  0.        ],
       [-0.08912244, -0.00986174,  0.06441021,  1.        ,  0.        ],
       [ 0.00572197, -0.12846141,  0.08002273,  0.09985789,  1.        ]])

In [109]:

Out[109]:

array([[10.61369072, -0.74715209, -1.18625723, -0.09583723, -1.42886673],
       [ 0.        , 11.58490816, -2.66385413,  0.58862104, -1.11900511],
       [ 0.        ,  0.        ,  9.5673701 ,  0.62986455, -2.3537734 ],
       [ 0.        ,  0.        ,  0.        , 12.26804553,  0.02060935],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  8.30933618]])

In [110]:

lu, piv = lin.lu_factor(A)

In [111]:

lu

Out[111]:

array([[ 1.06136907e+01, -7.47152094e-01, -1.18625723e+00,
        -9.58372280e-02, -1.42886673e+00],
       [-6.40356301e-02,  1.15849082e+01, -2.66385413e+00,
         5.88621042e-01, -1.11900511e+00],
       [ 1.00965180e-01, -3.78372414e-02,  9.56737010e+00,
         6.29864550e-01, -2.35377340e+00],
       [-8.91224420e-02, -9.86173757e-03,  6.44102112e-02,
         1.22680455e+01,  2.06093500e-02],
       [ 5.72196932e-03, -1.28461406e-01,  8.00227288e-02,
         9.98578873e-02,  8.30933618e+00]])

In [112]:

lin.lu_solve( (lu,piv), b)

Out[112]:

array([0.13280483, 0.12639701, 0.12080103, 0.08450851, 0.11457434])

In [ ]:

Integração numérica (funções)¶

In [113]:

import scipy.integrate as integr

Quero calcular a seguinte integral:

$$ \int_0^1 \cos(x) \: dx = \sin(1) $$

In [114]:

I1 = integr.romberg(np.cos, 0, 1)
I1

Out[114]:

0.841470984807879

In [115]:

Ie = np.sin(1)
Ie

Out[115]:

0.8414709848078965

In [116]:

erro = I1 - Ie
erro

Out[116]:

-1.7541523789077473e-14

In [117]:

# Funçãosinha bem chata para integrar na mão...
def minhafun(x):
    return np.cos(x**2)

In [118]:

integr.romberg(minhafun, 0, 1)

Out[118]:

0.9045242379007594

Exemplo Camada limite¶

A camada limite foi medida no túnel de vento. Vamos analisar estes dados. Os dados estão armazenados em um arquivo no format HDF5, um formato binário adequado para armazenar arrays de grandes dimensões

Ler os dados de um arquivo HDF5¶

In [119]:

# Carregar a biblioteca python para ler arquivos HDF5:
import h5py

In [120]:

# Abrir o arquivo:
f = h5py.File("camada-limite.h5", "r")

In [121]:

# Vamos ver o que tem dentro deste arquivo
f.keys()

Out[121]:

<KeysViewHDF5 ['U', 'z']>

In [122]:

U = f['U'][:]
z = f['z'][:]

In [123]:

U.shape

Out[123]:

(39, 80000)

In [124]:

z.shape

Out[124]:

(39,)

In [125]:

# Vamos ler a taxa de amostragem a partir dos atributos de U
fs = f['U'].attrs['rate'][0]

Plotar os perfis de velocidade¶

Vamos plotar o perfil de velocidades e intensidade de turbulência:

In [126]:

Um = U.mean(1)
sU = U.std(1)
I = sU / Um * 100

In [127]:

plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
plt.plot(Um, z, "o")
plt.xlabel("Velocidade (m/s)")
plt.ylabel("Altura (m)")
plt.subplot(1,2,2)
plt.plot(I, z, "o")
plt.xlabel("Intensidade de turb. (%)")
plt.ylabel("Altura (m)")

Out[127]:

Text(0, 0.5, 'Altura (m)')

In [128]:

fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(12,8))
ax1.plot(Um, z, "o")
ax1.grid()
ax1.set_xlabel("Velocidade (m/s)")
ax1.set_ylabel("Altura (m)")
ax1.set_title("Perfil de velocidade")
ax2.plot(I, z, "o")
ax2.grid()
ax2.set_xlabel("Intensidade de turb. (%)")
ax2.set_title("Turbulência")
#plt.ylabel("Altura (m)")

Out[128]:

Text(0.5, 1.0, 'Turbulência')

Espectro de turbulência¶

In [129]:

iref = 14
zref = z[iref]
u = U[iref]
N = len(u)
dt = 1/fs

In [130]:

t = np.arange(N) * dt
ū = u.mean()
σ = u.std()

In [131]:

plt.figure(figsize=(12,6))
plt.plot(t, u, label="$U(t)$", color='k')
plt.axhline(y=ū, linewidth=3, color='g', label="$\\bar{U}$")
plt.axhline(y = ū+2*σ, color='r', linestyle='--', linewidth=2, label='$\\bar{U} + 2\\sigma$')
plt.axhline(y = ū-2*σ, color='b', linestyle='--', linewidth=2, label='$\\bar{U} - 2\\sigma$')

plt.legend(loc='upper right')
plt.xlabel("Tempo (s)")
plt.ylabel("Velocidade (m/s)")

Out[131]:

Text(0, 0.5, 'Velocidade (m/s)')

In [132]:

plt.figure(figsize=(12,6))
plt.xlim( (18,20))
plt.plot(t, u, label="$U(t)$", color='k')
plt.axhline(y=ū, linewidth=3, color='g', label="$\\bar{U}$")
plt.axhline(y = ū+2*σ, color='r', linestyle='--', linewidth=2, label='$\\bar{U} + 2\\sigma$')
plt.axhline(y = ū-2*σ, color='b', linestyle='--', linewidth=2, label='$\\bar{U} - 2\\sigma$')

plt.legend(loc='upper right')
plt.xlabel("Tempo (s)")
plt.ylabel("Velocidade (m/s)")

Out[132]:

Text(0, 0.5, 'Velocidade (m/s)')

Análise espectral da velocidade¶

In [133]:

import scipy.signal as signal

In [134]:

f, S = signal.welch(u, fs, scaling='spectrum', nperseg=1000)

In [135]:

f1 = f[1:]
S1 = S[1:]

In [136]:

plt.figure(figsize=(12,6))
plt.grid()
plt.loglog(f1, S1, label="Medido")
plt.xlabel("Frequência (Hz)")
plt.ylabel("PSD")
ff = 10.0
ss = 1.0
a = ss / (ff**(-5/3))
plt.plot(f1, a*f1**(-5/3), "r--", label="Espectro de Kolmogorov $f^{-5/3}$")
plt.legend()

Out[136]:

<matplotlib.legend.Legend at 0x7f050c9ff5e0>

Exemplo: Equações de Lorenz¶

https://jupyter.org/try-jupyter/retro/notebooks/?path=notebooks/Lorenz.ipynb

$$ \begin{aligned} \dot{x} & = \sigma(y-x) \\ \dot{y} & = \rho x - y - xz \\ \dot{z} & = -\beta z + xy \end{aligned} $$

In [137]:

import numpy as np
from matplotlib import pyplot as plt
from scipy import integrate
np.random.seed(1)
x0 = -15 + 30 * np.random.random((N, 3))

In [ ]:

In [138]:

def lorenz_deriv(x_y_z, t0, sigma=10.0, beta=8./3., rho=28.0):
    """Compute the time-derivative of a Lorenz system."""
    x, y, z = x_y_z
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

In [139]:

max_time = 100.0
t = np.linspace(0, max_time, int(250*max_time))

In [140]:

x1 = integrate.odeint(lorenz_deriv, [0.5, 0.4, 0.6], t)
x2 = integrate.odeint(lorenz_deriv, [0.50000001, 0.4, 0.6], t)

In [141]:

plt.figure(figsize=(18,10))

plt.plot(t, x1[:,2])
plt.plot(t, x2[:,2])

Out[141]:

[<matplotlib.lines.Line2D at 0x7f050c7d67c0>]

Pandas¶

Biblioteca para análise de dados. Dados em forma de tabelas. Pode-se pensar nisso como se fosse uma planilha.

Referência: Análise de dados com Python e Pandas de Daniel Y. Chen

In [ ]:

In [142]:

import pandas as pd

In [143]:

df = pd.read_csv('gapminder.tsv', sep='\t')

In [144]:

df

Out[144]:

	country	continent	year	lifeExp	pop	gdpPercap
0	Afghanistan	Asia	1952	28.801	8425333	779.445314
1	Afghanistan	Asia	1957	30.332	9240934	820.853030
2	Afghanistan	Asia	1962	31.997	10267083	853.100710
3	Afghanistan	Asia	1967	34.020	11537966	836.197138
4	Afghanistan	Asia	1972	36.088	13079460	739.981106
...	...	...	...	...	...	...
3308	Zimbabwe	Africa	1987	62.351	9216418	706.157306
3309	Zimbabwe	Africa	1992	60.377	10704340	693.420786
3310	Zimbabwe	Africa	1997	46.809	11404948	792.449960
3311	Zimbabwe	Africa	2002	39.989	11926563	672.038623
3312	Zimbabwe	Africa	2007	43.487	12311143	469.709298

3313 rows × 6 columns

In [145]:

type(df)

Out[145]:

pandas.core.frame.DataFrame

In [146]:

df.shape

Out[146]:

(3313, 6)

In [147]:

df.columns

Out[147]:

Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')

In [148]:

df.dtypes

Out[148]:

country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object

In [149]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3313 entries, 0 to 3312
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    3313 non-null   object 
 1   continent  3313 non-null   object 
 2   year       3313 non-null   int64  
 3   lifeExp    3313 non-null   float64
 4   pop        3313 non-null   int64  
 5   gdpPercap  3313 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 155.4+ KB

In [150]:

df.head()

Out[150]:

	country	continent	year	lifeExp	pop	gdpPercap
0	Afghanistan	Asia	1952	28.801	8425333	779.445314
1	Afghanistan	Asia	1957	30.332	9240934	820.853030
2	Afghanistan	Asia	1962	31.997	10267083	853.100710
3	Afghanistan	Asia	1967	34.020	11537966	836.197138
4	Afghanistan	Asia	1972	36.088	13079460	739.981106

In [151]:

df.tail()

Out[151]:

	country	continent	year	lifeExp	pop	gdpPercap
3308	Zimbabwe	Africa	1987	62.351	9216418	706.157306
3309	Zimbabwe	Africa	1992	60.377	10704340	693.420786
3310	Zimbabwe	Africa	1997	46.809	11404948	792.449960
3311	Zimbabwe	Africa	2002	39.989	11926563	672.038623
3312	Zimbabwe	Africa	2007	43.487	12311143	469.709298

In [152]:

country_df = df['country']

In [153]:

country_df

Out[153]:

0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...     
3308       Zimbabwe
3309       Zimbabwe
3310       Zimbabwe
3311       Zimbabwe
3312       Zimbabwe
Name: country, Length: 3313, dtype: object

In [154]:

type(country_df)

Out[154]:

pandas.core.series.Series

In [155]:

# Pegar linhas de acordo com o rótulo
df.loc[0]

Out[155]:

country      Afghanistan
continent           Asia
year                1952
lifeExp           28.801
pop              8425333
gdpPercap        779.445
Name: 0, dtype: object

In [156]:

df.loc[10:20]

Out[156]:

	country	continent	year	lifeExp	pop	gdpPercap
10	Afghanistan	Asia	2002	42.129	25268405	726.734055
11	Afghanistan	Asia	2007	43.828	31889923	974.580338
12	Albania	Europe	1952	55.230	1282697	1601.056136
13	Albania	Europe	1957	59.280	1476505	1942.284244
14	Albania	Europe	1962	64.820	1728137	2312.888958
15	Albania	Europe	1967	66.220	1984060	2760.196931
16	Albania	Europe	1972	67.690	2263554	3313.422188
17	Albania	Europe	1977	68.930	2509048	3533.003910
18	Albania	Europe	1982	70.420	2780097	3630.880722
19	Albania	Europe	1987	72.000	3075321	3738.932735
20	Albania	Europe	1992	71.581	3326498	2497.437901

In [157]:

# Pegar linhas de acordo com o índice
df.iloc[0]

Out[157]:

country      Afghanistan
continent           Asia
year                1952
lifeExp           28.801
pop              8425333
gdpPercap        779.445
Name: 0, dtype: object

In [158]:

df.iloc[10:20]

Out[158]:

	country	continent	year	lifeExp	pop	gdpPercap
10	Afghanistan	Asia	2002	42.129	25268405	726.734055
11	Afghanistan	Asia	2007	43.828	31889923	974.580338
12	Albania	Europe	1952	55.230	1282697	1601.056136
13	Albania	Europe	1957	59.280	1476505	1942.284244
14	Albania	Europe	1962	64.820	1728137	2312.888958
15	Albania	Europe	1967	66.220	1984060	2760.196931
16	Albania	Europe	1972	67.690	2263554	3313.422188
17	Albania	Europe	1977	68.930	2509048	3533.003910
18	Albania	Europe	1982	70.420	2780097	3630.880722
19	Albania	Europe	1987	72.000	3075321	3738.932735

In [159]:

df.loc[-1]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/usr/lib/python3/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
    354                 try:
--> 355                     return self._range.index(new_key)
    356                 except ValueError as err:

ValueError: -1 is not in range

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_10036/1145300078.py in <module>
----> 1 df.loc[-1]

/usr/lib/python3/dist-packages/pandas/core/indexing.py in __getitem__(self, key)
    877 
    878             maybe_callable = com.apply_if_callable(key, self.obj)
--> 879             return self._getitem_axis(maybe_callable, axis=axis)
    880 
    881     def _is_scalar_access(self, key: Tuple):

/usr/lib/python3/dist-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1108         # fall thru to straight lookup
   1109         self._validate_key(key, axis)
-> 1110         return self._get_label(key, axis=axis)
   1111 
   1112     def _get_slice_axis(self, slice_obj: slice, axis: int):

/usr/lib/python3/dist-packages/pandas/core/indexing.py in _get_label(self, label, axis)
   1057     def _get_label(self, label, axis: int):
   1058         # GH#5667 this will fail if the label is not present in the axis.
-> 1059         return self.obj.xs(label, axis=axis)
   1060 
   1061     def _handle_lowerdim_multi_index_axis0(self, tup: Tuple):

/usr/lib/python3/dist-packages/pandas/core/generic.py in xs(self, key, axis, level, drop_level)
   3491             loc, new_index = self.index.get_loc_level(key, drop_level=drop_level)
   3492         else:
-> 3493             loc = self.index.get_loc(key)
   3494 
   3495             if isinstance(loc, np.ndarray):

/usr/lib/python3/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
    355                     return self._range.index(new_key)
    356                 except ValueError as err:
--> 357                     raise KeyError(key) from err
    358             raise KeyError(key)
    359         return super().get_loc(key, method=method, tolerance=tolerance)

KeyError: -1

In [160]:

df.iloc[-1]

Out[160]:

country      Zimbabwe
continent      Africa
year             2007
lifeExp        43.487
pop          12311143
gdpPercap     469.709
Name: 3312, dtype: object

Agrupando e agregando dados¶

Perguntas.

Para cada ano qual era a expectativa de vida média
Podemos ver isso por continente?
Quantos países estão listados por continente?

In [161]:

df.groupby('year')['lifeExp'].mean()

Out[161]:

year
1950    62.002568
1951    65.904167
1952    49.170708
1953    66.674563
1954    67.459817
1955    67.806757
1956    67.950637
1957    51.614590
1958    68.815936
1959    68.226579
1960    68.470837
1961    68.862480
1962    54.035234
1963    69.595735
1964    70.063105
1965    70.259881
1966    70.447526
1967    56.263629
1968    70.689081
1969    70.653896
1970    70.961141
1971    71.103976
1972    58.474481
1973    71.500338
1974    71.778504
1975    71.939218
1976    72.158050
1977    60.429090
1978    72.717567
1979    73.018717
1980    73.064524
1981    73.337399
1982    62.365871
1983    73.787778
1984    74.100741
1985    74.112222
1986    74.452222
1987    63.984860
1988    74.760000
1989    74.924444
1990    74.283437
1991    74.374848
1992    65.008443
1993    74.324545
1994    74.456667
1995    74.552727
1996    75.029394
1997    65.873799
1998    75.569697
1999    75.703636
2000    76.026364
2001    76.257879
2002    66.835695
2003    76.586667
2004    76.921563
2005    76.718667
2006    77.887778
2007    67.868557
Name: lifeExp, dtype: float64

Vamos detalhar isso.

In [162]:

grouped_by_year_df = df.groupby('year')
type(grouped_by_year_df)

Out[162]:

pandas.core.groupby.generic.DataFrameGroupBy

In [163]:

grouped_by_year_lifeExp = grouped_by_year_df['lifeExp']
type(grouped_by_year_lifeExp)

Out[163]:

pandas.core.groupby.generic.SeriesGroupBy

In [164]:

b = df['year'] == 1960
b

Out[164]:

0       False
1       False
2       False
3       False
4       False
        ...  
3308    False
3309    False
3310    False
3311    False
3312    False
Name: year, Length: 3313, dtype: bool

In [165]:

df60 = df[b]

In [166]:

df60['lifeExp']

Out[166]:

82      70.89000
138     68.78000
261     69.65000
405     69.18000
510     71.04000
612     31.63176
756     70.46000
814     72.21000
996     69.01000
1054    70.40000
1296    68.09000
1353    74.09000
1484    69.21000
1552    67.80000
1776    69.07000
2041    73.37000
2116    71.26000
2207    73.58000
2341    67.86000
2399    64.25000
2613    70.35000
2736    69.27000
2839    73.04000
2897    71.44000
2967    64.40000
3171    69.91000
Name: lifeExp, dtype: float64

In [167]:

df60['lifeExp'].mean()

Out[167]:

68.47083692307693

Vamos fazer por continente

In [168]:

multi_group_var = df.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']].mean()
multi_group_var

Out[168]:

		lifeExp	gdpPercap
year	continent
1950	Africa	41.361500	1422.081643
	Americas	59.405286	5567.245762
	Asia	53.675000	1363.645814
	Europe	65.755916	6804.996873
	FSU	59.950000	3638.203164
...	...	...	...
2007	Americas	73.400485	11940.901644
	Asia	70.660233	15338.057326
	Europe	77.299059	24174.152973
	FSU	69.457333	9522.539346
	Oceania	71.315000	13156.978742

286 rows × 2 columns

Python na ASSIPT

Aula 08: Numpy, Scipy, Matplotlib e Pandas

Aula 08: Numpy/Scipy/Pandas/Matplotlib¶

MATLAB e outros¶

Python como uma linguagem para aplicações matemáticas¶

Numpy¶

Instalação¶

Executando o numpy¶

Criando objetos `ndarray`¶

Melhor jeito para criar os `ndarray`s¶

Arrays igualmente espaçados¶

Os objetos `ndarray` têm um monte de métodos interessantes¶

Soma de tudo quanto é tipo¶

Média e desvio padrão¶

Máximos e mínimos¶

Números aleatórios¶

Operações vetorizadas e broadcasting¶

Outros tipos de indexações¶

Indexando usando variáveis lógicas¶

Já tínhamos visto fatiamento¶

Álgebra linear¶

Scipy ¶

Álgebra linear¶

Integração numérica (funções)¶

Exemplo Camada limite¶

Ler os dados de um arquivo HDF5¶

Plotar os perfis de velocidade¶

Espectro de turbulência¶

Análise espectral da velocidade¶

Exemplo: Equações de Lorenz¶

Pandas¶

Agrupando e agregando dados¶

Aula 08: Numpy/Scipy/Pandas/Matplotlib¶

MATLAB e outros¶

Python como uma linguagem para aplicações matemáticas¶

Numpy¶

Instalação¶

Executando o numpy¶

Criando objetos ndarray¶

Melhor jeito para criar os ndarrays¶

Arrays igualmente espaçados¶

Os objetos ndarray têm um monte de métodos interessantes¶

Soma de tudo quanto é tipo¶

Média e desvio padrão¶

Máximos e mínimos¶

Números aleatórios¶

Operações vetorizadas e broadcasting¶

Outros tipos de indexações¶

Indexando usando variáveis lógicas¶

Já tínhamos visto fatiamento¶

Álgebra linear¶

Scipy¶

Álgebra linear¶

Integração numérica (funções)¶

Exemplo Camada limite¶

Ler os dados de um arquivo HDF5¶

Plotar os perfis de velocidade¶

Espectro de turbulência¶

Análise espectral da velocidade¶

Exemplo: Equações de Lorenz¶

Pandas¶

Agrupando e agregando dados¶

Comentários

Criando objetos `ndarray`¶

Melhor jeito para criar os `ndarray`s¶

Os objetos `ndarray` têm um monte de métodos interessantes¶

Scipy ¶