#encoding: utf-8
---
title: Pivotal Cloud Foundry® Troubleshooting Guide
owner: Ops Manager
---
<strong><%= modified_date %></strong>
This guide provides help with diagnosing and resolving issues encountered
during a [Pivotal Cloud Foundry®](https://network.pivotal.io/products/pivotal-cf) (PCF) installation.
For help troubleshooting issues that are specific to PCF deployments on
VMware vSphere, refer to the topic on [Troubleshooting Ops Manager for VMware vSphere](./troubleshooting-vsphere.html).
An install or update can fail for many reasons.
Fortunately, the system tends to heal or work around hardware or network faults.
By the time you click the `Install` or `Apply Changes` button again, the
problem may be resolved.
Some failures produce only generic errors like `Exited with 1`.
In cases like this, where a failure is not accompanied by useful information,
click `Install` or `Apply Changes` again.
When the system does provide informative evidence, review the [Common Issues](#common_probs) section later in this guide to see if your
problem is covered there.
Beyond whether products install successfully, an important area to
consider when troubleshooting is communication between VMs deployed by Pivotal
Cloud Foundry®.
Depending on which products you install, communication takes the form of
messaging, routing, or both.
If either goes wrong, an installation can fail.
For example, in an Elastic Runtime installation the PCF VM tries to push
a test application to the cloud during post-installation testing.
The installation fails if the resulting traffic cannot be routed to the HA
Proxy load balancer.
## <a id='debug'></a>Viewing the Debug Endpoint ##
The debug endpoint is a web page that provides information useful in
troubleshooting.
If you have superuser privileges and can view the Ops Manager Installation
Dashboard, you can access the debug endpoint.
* In a browser, open the URL:
`https://OPS-MANAGER-IP-ADDRESS/debug`
The debug endpoint offers three links:
* *Files* allows you to view the YAML files that Ops Manager uses to configure
products that you install.
The most important YAML file, `installation.yml`, provides networking settings
and describes `microbosh`.
In this case, `microbosh` is the VM whose BOSH Director component is used by
Ops Manager to perform installations and updates of Elastic Runtime and other
products.
* *Components* describes the components in detail.
* *Rails log* shows errors thrown by the VM where the Ops Manager web
application (a Rails application) is running, as recorded in the
`production.log` file.
See the next section to learn how to explore other logs.
## <a id='tips'> </a>Logging Tips ##
### <a id='starting'> </a>Identifying Where to Start ###
This section contains general tips for locating where a particular problem is called out in the log files. Refer to the later sections for tips regarding specific logs (such as those for Elastic Runtime Components).
* Start with the largest and most recently updated files in the job log
* Identify logs that contain 'err' in the name
* Scan the file contents for a "failed" or "error" string
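The tips above can be sketched as a few shell helpers to run against an unzipped job log directory. This is a minimal sketch, not part of PCF: the function names are hypothetical, and the one-day window for "recently updated" is an assumption you can adjust.

```shell
# Hypothetical helpers for triaging an unzipped job log directory.
# Pass the directory as the first argument to each function.

# List the five largest files (size in KB, largest first).
largest_logs() {
  find "$1" -type f -exec du -k {} + | sort -rn | head -n 5
}

# List files modified within the last day (assumed meaning of "recent").
recent_logs() {
  find "$1" -type f -mtime -1
}

# List files whose names contain "err", such as *.stderr.log.
err_named_logs() {
  find "$1" -type f -name '*err*'
}

# List files containing "failed" or "error", ignoring case.
logs_with_failures() {
  find "$1" -type f -exec grep -liE 'failed|error' {} +
}
```

For example, `logs_with_failures ./dea.0.job` prints the path of each log under that directory that mentions a failure.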
### <a id='component_logs'></a>Viewing Logs for Elastic Runtime Components ###
To troubleshoot specific Elastic Runtime components by viewing their log files, browse to the Ops Manager interface and follow the procedure below.
1. In Ops Manager, browse to the **Pivotal Elastic Runtime > Status** tab. In the **Job** column, locate the component of interest.
1. In the **Logs** column for the component, click the download icon.
<%= image_tag("troubleshooting/status.png") %>
1. Browse to the **Pivotal Elastic Runtime > Logs** tab.
<%= image_tag("troubleshooting/logs.png") %>
1. Once the zip file corresponding to the component of interest moves to the **Downloaded** list, click the linked file path to download the zip file.
1. Once the download completes, unzip the file.
The contents of the log directory vary depending on which component you view. For example, the DEA log directory contains subdirectories for the `dea_logging_agent`, `dea_next`, `monit`, and `warden` processes. To view the standard error stream for `warden`, download the DEA logs and open `dea.0.job > warden > warden.stderr.log`.
### <a id='view_in_terminal'></a>Viewing Web Application and BOSH Failure Logs in a Terminal Window ###
You can obtain diagnostic information from the Operations Manager by logging
in to the VM where it is running. To log in to the Operations Manager VM, you need the following information:
* The IP address of the PCF VM shown in the `Settings` tab of the Ops
Manager Director tile.
* Your **import credentials**. Import credentials are the username and password
used to import the PCF `.ova` or `.ovf` file into your virtualization
system.
Complete the following steps to log in to the Operations Manager VM:
1. Open a terminal window.
1. Run `ssh IMPORT-USERNAME@PCF-VM-IP-ADDRESS` to connect to the PCF installation VM.
1. Enter your import password when prompted.
1. Change directories to the home directory of the web application:
`cd /home/tempest-web/tempest/web/`
1. From this directory, you can inspect the web application to verify that it
is configured and running as expected.
<br /><br />
You can also verify that the `microbosh` component is successfully
installed.
A successful MicroBOSH installation is required to install Elastic Runtime and any products like databases and messaging services.
1. Change directories to the BOSH installation log home:
`cd /var/tempest/workspaces/default/deployments/micro`
1. You may want to begin by running a tail command on the `current` log:
`tail -f current`
If you are unable to resolve an issue by viewing configurations, exploring
logs, or reviewing common problems, you can troubleshoot further by running
BOSH diagnostic commands with the BOSH Command-Line Interface (CLI).
<p class="note"><strong>Note</strong>: Do not manually modify the deployment manifest. Operations Manager will overwrite manual changes to this manifest.
In addition, manually changing the manifest may cause future deployments to
fail.</p>
## <a id='console-logs'></a>Viewing Apps Manager Logs in a Terminal Window ##
The [Apps Manager](../console/dev-console.html) provides a graphical user
interface to help manage organizations, users, applications, and spaces.
When troubleshooting Apps Manager performance, you might want to view the
Apps Manager application logs.
To view the Apps Manager application logs, follow these steps:
1. Run `cf login -a api.MY-SYSTEM-DOMAIN -u admin` from a command line to log
in to PCF using the UAA Administrator credentials. In Pivotal Ops
Manager, refer to **Pivotal Elastic Runtime > Credentials** for these
credentials.
<pre class='terminal'>
$ cf login -a api.example.com -u admin
API endpoint: api.example.com
Password>******
Authenticating...
OK
</pre>
1. Run `cf target -o system -s apps-manager` to target the `system` org and the
`apps-manager` space.
<pre class='terminal'>
$ cf target -o system -s apps-manager
</pre>
1. Run `cf logs apps-manager` to tail the Apps Manager logs.
<pre class='terminal'>
$ cf logs apps-manager
Connected, tailing logs for app apps-manager in org system / space apps-manager as
admin...
</pre>
### <a id='changing-log-level'></a>Changing Logging Levels for the Apps Manager ###
The Apps Manager recognizes the `LOG_LEVEL` environment variable.
The `LOG_LEVEL` environment variable allows you to filter the messages
reported in the Apps Manager log files by severity level. The Apps Manager defines severity levels using the Ruby standard library [Logger class](http://www.ruby-doc.org/stdlib-1.9.3/libdoc/logger/rdoc/Logger.html).
By default, the Apps Manager `LOG_LEVEL` is set to `info`.
The logs show more verbose messaging when you set the `LOG_LEVEL` to `debug`.
To change the Apps Manager `LOG_LEVEL`, run `cf set-env apps-manager LOG_LEVEL` with the desired severity level.
<pre class='terminal'>
$ cf set-env apps-manager LOG_LEVEL debug
</pre>
You can set `LOG_LEVEL` to one of the six severity levels defined by the Ruby
Logger class:
* **Level 5**: `unknown` -- An unknown message that should always be logged
* **Level 4**: `fatal` -- An unhandleable error that results in a program crash
* **Level 3**: `error` -- A handleable error condition
* **Level 2**: `warn` -- A warning
* **Level 1**: `info` -- General information about system operation
* **Level 0**: `debug` -- Low-level information for developers
Once set, the Apps Manager log files only include messages at the set
severity level and above.
For example, if you set `LOG_LEVEL` to `fatal`, the log includes `fatal` and
`unknown` level messages only.
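The threshold rule above can be illustrated with a short shell sketch. It only mirrors the numeric levels listed in this section; it is not how the Apps Manager or the Ruby Logger class is implemented, and the function names are hypothetical.

```shell
# Map a severity name to the numeric level listed above.
level_number() {
  case "$1" in
    debug)   echo 0 ;;
    info)    echo 1 ;;
    warn)    echo 2 ;;
    error)   echo 3 ;;
    fatal)   echo 4 ;;
    unknown) echo 5 ;;
    *)       return 1 ;;
  esac
}

# Succeed (exit status 0) when a message of severity $2 would be logged
# with LOG_LEVEL set to $1. Exit status 2 means an unrecognized level name.
is_logged() {
  threshold=$(level_number "$1") || return 2
  severity=$(level_number "$2") || return 2
  [ "$severity" -ge "$threshold" ]
}
```

With `LOG_LEVEL` set to `fatal`, `is_logged fatal unknown` succeeds and `is_logged fatal error` fails, matching the example above.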
## <a id='common_probs'></a>Common Issues ##
Compare evidence that you have gathered to the descriptions below.
If your issue is covered, try the recommended remediation procedures.
### <a id='reinstall_bosh'></a>BOSH Does Not Reinstall ###
You might want to reinstall BOSH for troubleshooting purposes.
However, if PCF does not detect any changes, BOSH does not reinstall.
To force a reinstall of BOSH, select **Ops Manager Director > Resource Sizes**
and change a resource value.
For example, you could increase the amount of RAM by 4 MB.
### <a id='time_out'></a>Creating Bound Missing VMs Times Out ###
This task happens immediately following package compilation, but before job
assignment to agents.
For example:
<pre class="terminal">
cloud_controller/0: Timed out pinging to f690db09-876c-475e-865f-2cece06aba79 after 600 seconds (00:10:24)
</pre>
This is most likely a NATS issue with the VM in question.
To identify a NATS issue, inspect the agent log for the VM.
Since the BOSH director is unable to reach the BOSH agent, you must access the
VM using another method.
You will likely also be unable to access the VM using TCP.
In this case, access the VM using your virtualization console.
To diagnose:
1. Navigate to the **Credentials** tab of the **Elastic Runtime** tile and
locate the VM in question to find the **VM credentials**.
1. Access the VM using your virtualization console and log in with these
credentials.
1. Become root.
1. Run `cd /var/vcap/bosh/log`.
1. Open the file `current`.
1. First, determine whether the BOSH agent and director have successfully
completed a handshake, represented in the logs as a “ping-pong”:
<pre class="terminal">
2013-10-03\_14:35:48.58456 #[608] INFO: Message: {"method"=>"ping", "arguments"=>[],
"reply\_to"=>"director.f4b7df14-cb8f.19719508-e0dd-4f53-b755-58b6336058ab"}
2013-10-03\_14:35:48.60182 #[608] INFO: reply\_to: director.f4b7df14-cb8f.19719508-e0dd-4f53-b755-58b6336058ab:
payload: {:value=>"pong"}
</pre>
This handshake must complete for the agent to receive instructions from the
director.
1. If you do not see the handshake, look for another line near the beginning of
the file, prefixed `INFO: loaded new infrastructure settings`.
For example:
<pre class="terminal">
2013-10-03\_14:35:21.83222 #[608] INFO: loaded new infrastructure settings:
{"vm"=>{"name"=>"vm-4d80ede4-b0a5-4992-aea6a0386e18e", "id"=>"vm-360"},
"agent\_id"=>"56aea4ef-6aa9-4c39-8019-7024ccfdde4",
"networks"=>{"default"=>{"ip"=>"192.168.86.19",
"netmask"=>"255.255.255.0", "cloud\_properties"=>{"name"=>"VMNetwork"},
"default"=>["dns", "gateway"],
"dns"=>["192.168.86.2", "192.168.86.17"], "gateway"=>"192.168.86.2",
"dns\_record\_name"=>"0.nats.default.cf-d729343071061.microbosh",
"mac"=>"00:50:56:9b:71:67"}}, "disks"=>{"system"=>0, "ephemeral"=>1,
"persistent"=>{}}, "ntp"=>[], "blobstore"=>{"provider"=>"dav",
"options"=>{"endpoint"=>"http://192.168.86.17:25250",
"user"=>"agent", "password"=>"agent"}},
"mbus"=>"nats://nats:nats@192.168.86.17:4222",
"env"=>{"bosh"=>{"password"=>"$6$40ftQ9K4rvvC/8ADZHW0"}}}
</pre>
This is a blob of key/value pairs, printed in Ruby hash notation, representing
the expected infrastructure for the BOSH agent.
For this issue, the following section is the most important:
`"mbus"=>"nats://nats:nats@192.168.86.17:4222"`
This key/value pair represents where the agent expects the NATS server to be.
One diagnostic tactic is to try pinging this NATS IP address from the VM to
determine whether you are experiencing routing issues.
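The checks in this procedure can be sketched as two shell helpers that read the agent log. This is a minimal sketch under assumptions: the function names are hypothetical, and the patterns are based only on the sample log lines shown above, so a different agent version may need different patterns.

```shell
# Succeed when the agent log in file $1 contains the "pong" reply
# that marks a completed handshake with the director.
has_handshake() {
  grep -q '"pong"' "$1"
}

# Print the NATS host taken from the first "mbus" entry in file $1.
nats_host() {
  sed -n 's|.*"mbus"=>"nats://[^@]*@\([^:"]*\):[0-9]*".*|\1|p' "$1" | head -n 1
}

# Example use on the VM, as root (not run here):
#   has_handshake /var/vcap/bosh/log/current || echo "no handshake found"
#   ping -c 3 "$(nats_host /var/vcap/bosh/log/current)"
```

If the ping fails while the address matches the one you expect for NATS, the likely cause is routing between the VM and the NATS server rather than the agent settings.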
### <a id='RSA-cert'></a>Install Exits With a Creates/Updates/Deletes App Failure or With a 403 Error ###
**Scenario 1:**
Your PCF install exits with the following 403 error when you attempt to
log in to the Apps Manager:
<pre>
{"type": "step_finished", "id": "apps-manager.deploy"}
/home/tempest-web/tempest/web/vendor/bundle/ruby/1.9.1/gems/mechanize-2.7.2/lib/mechanize/http/agent.rb:306:in
`fetch': 403 => Net::HTTPForbidden for https://login.api.x.y/oauth/authorizeresponse_type=code&client_id=portal&redirect_uri=https%3...
-- unhandled response (Mechanize::ResponseCodeError)
</pre>
**Scenario 2:**
Your PCF install exits with a `creates/updates/deletes an app (FAILED - 1)` error message with the following stack trace:
<pre>
1) App CRUD creates/updates/deletes an app
Failure/Error: Unable to find matching line from backtrace
CFoundry::TargetRefused:
Connection refused - connect(2)
</pre>
In either of the above scenarios, ensure that you have correctly entered your
domains in wildcard format:
1. Browse to the Operations Manager interface IP.
1. Click the **Elastic Runtime** tile.
1. Select **HAProxy** and click **Generate Self-Signed RSA Certificate**.
1. Enter your system and app domains in wildcard format, as well as optionally
any custom domains, and click **Save**.
Refer to **Elastic Runtime > Cloud Controller** for explanations of these
domain values.
<%= image_tag('rsa_cert.png') %>
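Before you enter the domains, a quick shell check can catch a missing wildcard prefix. This is a sketch that assumes "wildcard format" means a leading `*.` followed by a domain with at least two labels, such as `*.example.com`; the function name is hypothetical.

```shell
# Succeed when $1 looks like a wildcard domain, for example *.example.com.
# Assumption: a leading "*." followed by at least two dot-separated labels.
is_wildcard_domain() {
  case "$1" in
    \*.?*.?*) return 0 ;;
    *)        return 1 ;;
  esac
}
```

For example, `is_wildcard_domain '*.example.com'` succeeds, while `is_wildcard_domain 'example.com'` fails.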
### <a id='runtime_depend'></a>Install Fails When Gateway Instances Exceed Zero ###
If you configure the number of Gateway instances to be greater than zero for a
given product, you create a dependency on Elastic Runtime for that product
installation.
If you attempt to install a product tile with an Elastic Runtime dependency
before installing Elastic Runtime, the install fails.
To change the number of Gateway instances, click the product
tile, then select **Settings > Resource sizes > INSTANCES** and change the
value next to the product Gateway job.
To remove the Elastic Runtime dependency, change the value of this field to `0`.
### <a id='out_of_disk_space'></a>Out of Disk Space Error ###
PCF displays an `Out of Disk Space` error if log files expand to fill
all available disk space.
If this happens, rebooting the PCF installation VM clears the tmp
directory of these log files and resolves the error.
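To confirm that a full disk is the cause before rebooting, you can check disk usage on the VM. The following sketch filters `df -P` output for file systems at or above a usage threshold; the function name and the 90 percent threshold in the example are assumptions, not PCF defaults.

```shell
# Read `df -P` output on stdin and print the mount point and usage of
# every file system whose usage is at or above the percentage in $1.
over_threshold() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= limit + 0) print $6, $5
  }'
}

# Example use on the VM (not run here):
#   df -P | over_threshold 90
```

Mount points that contain spaces are not handled by this sketch.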
### <a id='vsphere_fails'></a>Installing Ops Manager Director Fails ###
If the DNS information for the PCF VM is incorrectly specified when
deploying the PCF .ova file, installing Ops Manager Director fails at the "Installing Micro BOSH" step.
To resolve this issue, correct the DNS settings in the PCF Virtual
Machine properties.
### <a id='delete_om_fails'></a>Deleting Ops Manager Fails ###
Ops Manager displays an error message when it cannot delete your installation. This scenario might happen if the Ops Manager Director cannot access the VMs or is experiencing other issues. To manually delete your installation and all VMs, you must do the following:
1. Use your IaaS dashboard to manually delete the VMs for all installed products, with the exception of the Ops Manager VM.
1. SSH into your Ops Manager VM and remove the `installation.yml` file from `/var/tempest/workspaces/default/`.
<p class="note"><strong>Note</strong>: Deleting the <code>installation.yml</code> file does not prevent you from reinstalling Ops Manager. For future deploys, Ops Manager regenerates this file when you click <strong>Save</strong> on any page in the Ops Manager Director.</p>
Your installation is now deleted.
### <a id='elastic_runtime_fails'></a>Installing Elastic Runtime Fails ###
If the DNS information for the PCF VM becomes incorrect after Ops Manager Director has been installed, installing Elastic Runtime with Pivotal Operations
Manager fails at the "Verifying app push" step.
To resolve this issue, correct the DNS settings in the PCF Virtual
Machine properties.
### <a id='disk-attach'></a>Cannot Attach Disk During MicroBOSH Deploy to vCloud ###
When attempting to attach a disk to a MicroBOSH VM, you might receive the following error:
`The requested operation cannot be performed because disk XXXXXXXXX was not created properly.`
Possible causes and recommendations:
* If the account used during deployment lacks permission to access the default storage profile, attaching the disk might fail.
* vCloud Director can incorrectly report a successful disk creation even if the operation fails, resulting in subsequent error messages. To resolve this issue, redeploy MicroBOSH.
### <a id='ip_address_taken'></a>Ops Manager Hangs During MicroBOSH Install or HAProxy States "IP Address Already Taken" ###
During an Ops Manager installation, you might receive the following errors:
* The Ops Manager GUI shows that the installation stops at the "Setting MicroBOSH deployment manifest" task.
* When you set the IP address for the HAProxy, the "IP Address Already Taken" message appears.
When you install Ops Manager, you assign it an IP address. Ops Manager then takes the next two consecutive IP addresses, assigns the first to MicroBOSH, and reserves the second. For example:
```
10.17.108.1 - Ops Manager (User assigned)
10.17.108.2 - MicroBOSH (Ops Manager assigned)
10.17.108.3 - Reserved (Ops Manager reserved)
```
To resolve this issue, ensure that the two IP addresses immediately following the manually assigned address are unassigned.
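To see which addresses Ops Manager will take, you can compute the two addresses that follow the one you assigned. This is a minimal sketch for IPv4 addresses; the function name is hypothetical, and it assumes "consecutive" means numerically next, including a carry into the next octet.

```shell
# Print the two IPv4 addresses that follow the address in $1, one per line.
next_two_ips() {
  echo "$1" | awk -F. '{
    n = (($1 * 256 + $2) * 256 + $3) * 256 + $4
    for (i = 1; i <= 2; i++) {
      m = n + i
      printf "%d.%d.%d.%d\n", int(m / 16777216) % 256, int(m / 65536) % 256, int(m / 256) % 256, m % 256
    }
  }'
}
```

For the example above, `next_two_ips 10.17.108.1` prints `10.17.108.2` and `10.17.108.3`. Check that neither address is in use before you install.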
## <a id='common_probs_firewalls'></a>Common Issues Caused by Firewalls ##
This section describes various issues you might encounter when installing
Elastic Runtime in an environment that uses a strong firewall.
### <a id='DNS_res_fails'></a>DNS Resolution Fails ###
When you install PCF in an environment that uses a strong firewall, the firewall might block DNS resolution. To resolve this issue, refer to the [Troubleshooting DNS Resolution Issues](./config_firewall.html#DNS_fails) section of the Preparing Your Firewall for Deploying PCF topic.